XML files often contain aggregate data that can be disaggregated by splitting it into multiple smaller documents rooted at a recurring element. Disaggregating large XML files consumes fewer resources during loading and improves performance when searching and retrieving content. This guide describes how to use the connector to read aggregate XML files and produce many rows from specific child elements.

Usage

The connector supports this use case via the spark.marklogic.read.aggregates.xml.element option and the optional spark.marklogic.read.aggregates.xml.namespace option. When these options are used, the connector returns rows with the same schema as Spark’s Binary data source, and it knows how to write rows adhering to this schema as documents in MarkLogic.

The examples/getting-started directory in this repository contains a small XML file with multiple occurrences of the element Employee in the namespace org:example. The following command demonstrates how to read this file so that each occurrence of the Employee element becomes a separate row in Spark (the namespace option is optional, and is used here because Employee is in a namespace):

df = spark.read.format("marklogic") \
  .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
  .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
  .load("data/employees.xml")
df.show()
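
As described above, each row has the same schema as Spark’s Binary data source. The following minimal sketch inspects that schema and prints the XML content of the first row; the column names (path, modificationTime, length, content) are those of Spark’s binaryFile source, and decoding the content as UTF-8 is an assumption about the file’s encoding:

# Print the schema of the rows produced from the aggregate XML file.
df.printSchema()

# Peek at the first row; the "content" column holds the raw bytes of the
# extracted Employee element (UTF-8 decoding is an assumption).
first_row = df.limit(1).collect()[0]
print(first_row["path"])
print(bytes(first_row["content"]).decode("utf-8"))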

You can then write each of the rows as separate XML documents in MarkLogic:

df.write.format("marklogic") \
  .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
  .option("spark.marklogic.write.collections", "aggregate-xml") \
  .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
  .option("spark.marklogic.write.uriReplace", ".*/data,'/xml'") \
  .mode("append") \
  .save()

The above will produce 3 XML documents, each with a root element of Employee in the org:example namespace, in the aggregate-xml collection in MarkLogic.
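
Because each row is written as a single document, a quick way to confirm how many documents a write will produce is to count the rows before calling save(). A minimal check:

# Each row becomes one document; for the example employees.xml file this
# prints 3.
print(df.count())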

Generating a URI via an element

Some XML documents contain a particular element whose value is useful for generating a unique URI for each document. You can specify the element name and optional namespace via the spark.marklogic.read.aggregates.xml.uriElement and spark.marklogic.read.aggregates.xml.uriNamespace options, as shown below:

df = spark.read.format("marklogic") \
  .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
  .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
  .option("spark.marklogic.read.aggregates.xml.uriElement", "name") \
  .option("spark.marklogic.read.aggregates.xml.uriNamespace", "org:example") \
  .load("data/employees.xml")
df.show()
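
The sketch below is a hedged way to inspect the URIs derived from the name element before writing; it assumes the derived URI is exposed via each row’s path column, and the exact URI format may vary by connector version:

# Inspect the URI associated with each row (assumption: the value derived from
# the "name" element appears in the "path" column).
df.select("path").show(truncate=False)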

Reading compressed files

The connector supports reading GZIP and ZIP compressed files via the spark.marklogic.read.files.compression option.

For a GZIP compressed file, set the option to a value of gzip:

.option("spark.marklogic.read.files.compression", "gzip")

Each gzipped aggregate XML file will be decompressed first and then processed normally.
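
For example, reading a gzipped copy of the example file combines the compression option with the element options shown earlier; the employees.xml.gz file name below is hypothetical:

# Hypothetical gzipped aggregate XML file; only the compression option differs
# from the uncompressed read shown earlier.
df = spark.read.format("marklogic") \
  .option("spark.marklogic.read.files.compression", "gzip") \
  .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
  .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
  .load("data/employees.xml.gz")
df.show()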

For a ZIP compressed file, which may contain one or more aggregate XML files, set the option to a value of zip:

.option("spark.marklogic.read.files.compression", "zip")

Each entry in the ZIP file must be an aggregate XML file. The same element and namespace options, along with the URI element and namespace options, are applied to every entry in the ZIP.
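
For example, a hedged sketch for reading a ZIP file containing one or more aggregate XML files (the employees.zip file name is hypothetical):

# Hypothetical ZIP containing one or more aggregate XML files; the same element
# and namespace options are applied to every entry in the ZIP.
df = spark.read.format("marklogic") \
  .option("spark.marklogic.read.files.compression", "zip") \
  .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
  .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
  .load("data/employees.zip")
df.show()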

Error handling

By default, the connector throws any error that occurs while reading an aggregate XML file. You can set the spark.marklogic.read.files.abortOnFailure option to false to have each error logged instead; the connector will then continue processing the remaining aggregate XML files.

If an aggregate element is missing the child element specified by spark.marklogic.read.aggregates.xml.uriElement, the connector will log the error and continue processing the remaining elements in the aggregate XML file.
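
A minimal sketch combining these options is shown below; the data/employee-files path is hypothetical and stands in for a set of aggregate XML files in which some entries may be invalid:

# With abortOnFailure set to false, errors (including a missing uriElement
# child) are logged and the connector continues with the remaining files and
# elements. The path below is hypothetical.
df = spark.read.format("marklogic") \
  .option("spark.marklogic.read.files.abortOnFailure", "false") \
  .option("spark.marklogic.read.aggregates.xml.element", "Employee") \
  .option("spark.marklogic.read.aggregates.xml.namespace", "org:example") \
  .option("spark.marklogic.read.aggregates.xml.uriElement", "name") \
  .load("data/employee-files")
df.show()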