The MarkLogic connector extends Spark’s support for reading files to include file types that benefit from special handling when importing them into MarkLogic. This page describes those connector features along with the generic file-reading support inherited from Spark.
Table of contents
- Selecting files to read
- Generic Spark file source options
- Reading and writing large binary files
- Reading any file
Selecting files to read
Use Spark’s standard `load()` function or the `path` option:

```
df = spark.read.format("marklogic") \
    .option("spark.marklogic.read.files.compression", "zip") \
    .load("path/to/zipfiles")
```
Or:
```
df = spark.read.format("marklogic") \
    .option("spark.marklogic.read.files.compression", "zip") \
    .option("path", "path/to/zipfiles") \
    .load()
```
Generic Spark file source options
The connector also supports the following generic Spark file source options:
- Use `pathGlobFilter` to only include files with file names matching the given pattern.
- Use `recursiveFileLookup` to include files in child directories.
- Use `modifiedBefore` and `modifiedAfter` to select files based on their modification time.
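As a sketch, these generic options can be combined with the connector’s own options in a single read; the glob pattern, timestamp, and directory path below are illustrative:

```
# Read ZIP files modified after the given time, searching child directories.
df = spark.read.format("marklogic") \
    .option("spark.marklogic.read.files.compression", "zip") \
    .option("pathGlobFilter", "*.zip") \
    .option("recursiveFileLookup", "true") \
    .option("modifiedAfter", "2024-01-01T00:00:00") \
    .load("path/to/zipfiles")
```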
Reading and writing large binary files
The 2.4.0 connector introduces support for reading and writing large binary files, allowing the contents of each file to be streamed from its source to MarkLogic. This avoids the Spark environment running out of memory while trying to fit the contents of a large file into an in-memory row.
To enable this, include the following in the set of options passed to your reader:

```
.option("spark.marklogic.streamFiles", "true")
```
As a result of this option, the `content` column in each row will not contain the contents of the file. Instead, it will contain a serialized object intended to be used during the write phase to read the contents of the file as a stream.
Files read from the MarkLogic Spark connector with the above option can then be written as documents to MarkLogic by passing the same option to the writer. The connector will then stream the contents of each file to MarkLogic, submitting one request to MarkLogic per document.
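As a minimal sketch, a streamed read-and-write pipeline might look like the following, where the source directory and connection details are illustrative and should be replaced with your own:

```
# Stream the contents of each file from its source directly into MarkLogic.
spark.read.format("marklogic") \
    .option("spark.marklogic.streamFiles", "true") \
    .load("path/to/large-binaries") \
    .write.format("marklogic") \
    .option("spark.marklogic.streamFiles", "true") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
    .mode("append") \
    .save()
```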
The `spark.marklogic.streamFiles` option can also be used when reading GZIP, ZIP, and archive files.
Reading any file
If you wish to read files without any special handling provided by the connector, you can use Spark’s binary file data source. If you then write these rows as documents, the connector will recognize the binary file schema and write each row as a separate document. For example, the following will write each file in the `examples/getting-started/data` directory in this repository without any special handling:
spark.read.format("binaryFile") \
.option("recursiveFileLookup", True) \
.load("data") \
.write.format("marklogic") \
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
.option("spark.marklogic.write.collections", "binary-example") \
.option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
.option("spark.marklogic.write.uriReplace", ".*data,'/binary-example'") \
.mode("append") \
.save()
The above will result in each file in the `data` directory being written as a document to MarkLogic. MarkLogic will determine each document’s type based on the file extension.
If you are writing files with extensions that MarkLogic does not recognize based on its configured set of MIME types, you can force a document type for each file with an unrecognized extension:
.option("spark.marklogic.write.fileRows.documentType", "JSON")
The `spark.marklogic.write.fileRows.documentType` option supports values of `JSON`, `XML`, and `TEXT`.
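For instance, adding this option to the binary file write shown above would force any file with an unrecognized extension to be stored as a JSON document; the connection details remain illustrative:

```
# Force files with unrecognized extensions to be written as JSON documents.
spark.read.format("binaryFile") \
    .load("data") \
    .write.format("marklogic") \
    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003") \
    .option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update") \
    .option("spark.marklogic.write.fileRows.documentType", "JSON") \
    .mode("append") \
    .save()
```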
Please see the guide on writing data for information on how “file rows” can then be written to MarkLogic as documents.