The MarkLogic connector has three sets of configuration options - connection options, read options, and write options. Each set of options is described below.
Referencing the connector
Starting with the 2.2.0 release, you can reference the MarkLogic connector via the short name “marklogic”:
session.read.format("marklogic")
Prior to 2.2.0, you must use the full name:
session.read.format("com.marklogic.spark")
Connection options
These options define how the connector connects and authenticates with MarkLogic.
Option | Description |
---|---|
spark.marklogic.client.host | Required; the host name to connect to; this can be the name of a host in your MarkLogic cluster or the host name of a load balancer. |
spark.marklogic.client.port | Required; the port of the app server in MarkLogic to connect to. |
spark.marklogic.client.basePath | Base path to prefix on each request to MarkLogic. |
spark.marklogic.client.database | Name of the database to interact with; only needs to be set if it differs from the content database assigned to the app server that the connector will connect to. |
spark.marklogic.client.connectionType | Either gateway when connecting to a load balancer, or direct when connecting directly to MarkLogic. Defaults to gateway, which works in either scenario. |
spark.marklogic.client.authType | Type of authentication to use; one of basic, digest, cloud, kerberos, certificate, or saml. Defaults to digest. |
spark.marklogic.client.username | Required for basic and digest authentication. |
spark.marklogic.client.password | Required for basic and digest authentication. |
spark.marklogic.client.certificate.file | Required for certificate authentication; the path to a certificate file. |
spark.marklogic.client.certificate.password | Required for certificate authentication; the password for accessing the certificate file. |
spark.marklogic.client.cloud.apiKey | Required for MarkLogic cloud authentication. |
spark.marklogic.client.kerberos.principal | Required for kerberos authentication. |
spark.marklogic.client.saml.token | Required for saml authentication. |
spark.marklogic.client.sslEnabled | If true, an SSL connection is created using the JVM's default SSL context. |
spark.marklogic.client.sslHostnameVerifier | Either any, common, or strict; see the MarkLogic Java Client documentation for more information on these choices. |
spark.marklogic.client.ssl.keystore.path | File path to a Java keystore for 2-way SSL; since 2.1.0. |
spark.marklogic.client.ssl.keystore.password | Optional password for a Java keystore for 2-way SSL; since 2.1.0. |
spark.marklogic.client.ssl.keystore.type | Java keystore type for 2-way SSL; defaults to "JKS"; since 2.1.0. |
spark.marklogic.client.ssl.keystore.algorithm | Java keystore algorithm for 2-way SSL; defaults to "SunX509"; since 2.1.0. |
spark.marklogic.client.uri | Shortcut for setting the host, port, username, and password when using basic or digest authentication. See below for more information. |
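For example, the connection options can be set individually on a reader. The host, port, credentials, and query below are placeholders for illustration only:
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.host", "localhost")\
.option("spark.marklogic.client.port", "8003")\
.option("spark.marklogic.client.authType", "digest")\
.option("spark.marklogic.client.username", "spark-example-user")\
.option("spark.marklogic.client.password", "password")\
.option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')")\
.load()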
Connecting with a client URI
The spark.marklogic.client.uri option is a convenience for the common case of using basic or digest authentication. It allows you to specify the username, password, host, and port via the following syntax:
username:password@host:port
This avoids the need to set the four individual options for those values. You may also configure a database name via the following syntax:
username:password@host:port/database
A database name is only needed when you wish to work with a database other than the one associated with the app server that the connector will connect to via the port value.
Using this convenience can provide a much more succinct set of options - for example:
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")\
.option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')")\
.load()
Note that if the username or password contains either a @ or a : character, you must percent-encode those characters. For example, a password of sp@r:k must appear in the spark.marklogic.client.uri string as sp%40r%3Ak.
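If you build the connection string programmatically, the percent encoding can be done with Python's standard urllib.parse.quote function; the credentials, host, and port below are placeholders:
from urllib.parse import quote

username = "spark-example-user"
password = "sp@r:k"
# quote() with safe="" percent-encodes reserved characters such as @ and :
client_uri = "{}:{}@localhost:8003".format(quote(username, safe=""), quote(password, safe=""))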
Configuring SSL
If the MarkLogic app server that the connector will connect to requires SSL but does not require that the client present a certificate, set the spark.marklogic.client.sslEnabled option to true. This causes the associated JVM's certificate store - typically the $JAVA_HOME/jre/lib/security/cacerts file - to be used for establishing an SSL connection. The certificate store should contain the public certificate associated with the SSL certificate template used by the MarkLogic app server.
Starting in 2.1.0, if the MarkLogic app server requires the client to present a certificate, set the spark.marklogic.client.ssl.keystore.path option to point to a Java keystore containing the client certificate. Set spark.marklogic.client.ssl.keystore.password if the keystore requires a password. The keystore is also used as the truststore, so it must also contain the public certificate associated with the SSL certificate template used by the MarkLogic app server. A future release of the connector will allow the truststore to be a separate file.
If you receive an error containing a message of "PKIX path building failed", the most likely issue is that your JVM's certificate store does not contain the public certificate associated with the MarkLogic app server, or your Spark environment is using a different JVM than you expect. This guide describes some common solutions for this error.
If you receive a javax.net.ssl.SSLPeerUnverifiedException error, you will need to adjust the spark.marklogic.client.sslHostnameVerifier option. A value of ANY disables hostname verification, which may be appropriate in a development or test environment. The MarkLogic Java Client documentation describes the other choices for this option.
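As a sketch, a development or test configuration that uses the JVM's default certificate store and disables hostname verification might look like the following; the connection string, port, and query are placeholders:
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8004")\
.option("spark.marklogic.client.sslEnabled", "true")\
.option("spark.marklogic.client.sslHostnameVerifier", "ANY")\
.option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')")\
.load()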
Read options
See the guide on reading for more information on how data is read from MarkLogic.
Read options for Optic queries
The following options control how the connector reads rows from MarkLogic via an Optic query:
Option | Description |
---|---|
spark.marklogic.read.batchSize | Approximate number of rows to retrieve in each call to MarkLogic; defaults to 100000. |
spark.marklogic.read.numPartitions | The number of Spark partitions to create; defaults to spark.default.parallelism . |
spark.marklogic.read.opticQuery | Required; the Optic DSL query to run for retrieving rows; must use op.fromView as the accessor. |
spark.marklogic.read.pushDownAggregates | Whether to push down aggregate operations to MarkLogic; defaults to true. Set to false to prevent aggregates from being pushed down to MarkLogic. |
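For example, a read that tunes the batch size and partition count might look like the following; the connection string, query, and values are illustrative only:
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")\
.option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee')")\
.option("spark.marklogic.read.batchSize", "50000")\
.option("spark.marklogic.read.numPartitions", "8")\
.load()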
Read options for custom code
The following options control how the connector reads rows from MarkLogic via custom code:
Option | Description |
---|---|
spark.marklogic.read.invoke | The path to a module to invoke; the module must be in your application’s modules database. |
spark.marklogic.read.javascript | JavaScript code to execute. |
spark.marklogic.read.javascriptFile | Local file path containing JavaScript code to execute. |
spark.marklogic.read.xquery | XQuery code to execute. |
spark.marklogic.read.xqueryFile | Local file path containing XQuery code to execute. |
spark.marklogic.read.vars. | Prefix for user-defined variables to be sent to the custom code. |
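As an illustrative sketch, a read via custom JavaScript with a user-defined variable might look like the following; the server-side code, collection name, and variable name are placeholders, with the variable made available to the code via the spark.marklogic.read.vars. prefix:
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")\
.option("spark.marklogic.read.javascript", "cts.uris(null, null, cts.collectionQuery(COLLECTION))")\
.option("spark.marklogic.read.vars.COLLECTION", "employee")\
.load()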
If you are using Spark’s streaming support with custom code, or you need to break up your custom code query into multiple queries, the following options can also be used to control how partitions are defined:
Option | Description |
---|---|
spark.marklogic.read.partitions.invoke | The path to a module to invoke; the module must be in your application’s modules database. |
spark.marklogic.read.partitions.javascript | JavaScript code to execute. |
spark.marklogic.read.partitions.javascriptFile | Local file path containing JavaScript code to execute. |
spark.marklogic.read.partitions.xquery | XQuery code to execute. |
spark.marklogic.read.partitions.xqueryFile | Local file path containing XQuery code to execute. |
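For example, forest IDs are a common choice of partition value. The sketch below assumes each partition value is exposed to the main query as an external PARTITION variable; that variable name is an assumption for illustration, so consult the reading guide for the exact contract:
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")\
.option("spark.marklogic.read.partitions.javascript", "xdmp.databaseForests(xdmp.database())")\
.option("spark.marklogic.read.javascript", "cts.uris(null, null, cts.trueQuery(), null, [PARTITION])")\
.load()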
Read options for documents
The following options control how the connector reads document rows from MarkLogic via search queries:
Option | Description |
---|---|
spark.marklogic.read.documents.stringQuery | A MarkLogic string query for selecting documents. |
spark.marklogic.read.documents.query | A JSON or XML representation of a structured query, serialized CTS query, or combined query. |
spark.marklogic.read.documents.categories | Controls which metadata is returned for each document. Defaults to content. Allowable values are content, metadata, collections, permissions, quality, properties, and metadatavalues. |
spark.marklogic.read.documents.collections | Comma-delimited string of zero to many collections to constrain the query. |
spark.marklogic.read.documents.directory | Database directory - e.g. "/company/employees/" - to constrain the query. |
spark.marklogic.read.documents.filtered | Set to true for filtered searches. Defaults to false as unfiltered searches are significantly faster and will produce accurate results when your application indexes are sufficient for your query. |
spark.marklogic.read.documents.options | Name of a set of MarkLogic search options to be applied against a string query. |
spark.marklogic.read.documents.partitionsPerForest | Number of Spark partition readers to create per forest; defaults to 4. |
spark.marklogic.read.documents.transform | Name of a MarkLogic REST transform to apply to each matching document. |
spark.marklogic.read.documents.transformParams | Comma-delimited sequence of transform parameter names and values - e.g. param1,value1,param2,value2. |
spark.marklogic.read.documents.transformParamsDelimiter | Delimiter for transform parameters; defaults to a comma. |
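For example, a document read constrained by a string query and a collection, returning both content and collections metadata, might look like the following; the query and collection name are placeholders:
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")\
.option("spark.marklogic.read.documents.stringQuery", "Engineering")\
.option("spark.marklogic.read.documents.collections", "employee")\
.option("spark.marklogic.read.documents.categories", "content,collections")\
.load()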
Read options for files
As of the 2.3.0 release, the connector supports reading aggregate XML files, RDF files, and ZIP files. The following options control how the connector reads files:
Option | Description |
---|---|
spark.marklogic.read.aggregates.xml.element | Required when reading aggregate XML files; defines the name of the element for selecting elements to convert into Spark rows. |
spark.marklogic.read.aggregates.xml.namespace | Optional namespace for the element identified by spark.marklogic.read.aggregates.xml.element . |
spark.marklogic.read.aggregates.xml.uriElement | Optional element name for constructing a URI based on an element value. |
spark.marklogic.read.aggregates.xml.uriNamespace | Optional namespace for the element identified by spark.marklogic.read.aggregates.xml.uriElement . |
spark.marklogic.read.files.abortOnFailure | Set to false so that the connector logs errors and continues processing files. Defaults to true . |
spark.marklogic.read.files.compression | Set to gzip or zip when reading compressed files. |
spark.marklogic.read.files.type | Set to rdf when reading RDF files. This option only needs to be set when the connector is otherwise unable to detect that it should perform some sort of handling for the file. |
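For example, splitting an aggregate XML file on an assumed Employee element might look like the following; the element name, URI element, and file path are placeholders:
df = spark.read.format("marklogic")\
.option("spark.marklogic.read.aggregates.xml.element", "Employee")\
.option("spark.marklogic.read.aggregates.xml.uriElement", "id")\
.load("data/employees.xml")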
Write options
See the guide on writing for more information on how data is written to MarkLogic.
Writing rows as documents to MarkLogic
The following options control how the connector writes rows as documents to MarkLogic:
Option | Description |
---|---|
spark.marklogic.write.abortOnFailure | Whether the Spark job should abort if a batch fails to be written; defaults to true . |
spark.marklogic.write.batchSize | The number of documents written in a call to MarkLogic; defaults to 100. |
spark.marklogic.write.collections | Comma-delimited string of collection names to add to each document. |
spark.marklogic.write.permissions | Comma-delimited string of role names and capabilities to add to each document - e.g. role1,read,role2,update,role3,execute . |
spark.marklogic.write.fileRows.documentType | Forces a document type when MarkLogic does not recognize a URI extension; must be one of JSON, XML, or TEXT. |
spark.marklogic.write.jsonRootName | As of 2.3.0, specifies a root field name when writing JSON documents based on arbitrary rows. |
spark.marklogic.write.temporalCollection | Name of a temporal collection to assign each document to. |
spark.marklogic.write.threadCount | The number of threads used across all partitions to send documents to MarkLogic; defaults to 4. |
spark.marklogic.write.threadCountPerPartition | New in 2.3.0; the number of threads used per partition to send documents to MarkLogic. |
spark.marklogic.write.transform | Name of a REST transform to apply to each document. |
spark.marklogic.write.transformParams | Comma-delimited string of transform parameter names and values - e.g. param1,value1,param2,value2 . |
spark.marklogic.write.transformParamsDelimiter | Delimiter to use instead of a comma for the transformParams option. |
spark.marklogic.write.uriPrefix | String to prepend to each document URI, where the URI defaults to a UUID. |
spark.marklogic.write.uriReplace | Modify the initial URI for a row via a comma-delimited list of regular expression and replacement string pairs - e.g. regex,'value',regex,'value'. Each replacement string must be enclosed by single quotes. |
spark.marklogic.write.uriSuffix | String to append to each document URI, where the URI defaults to a UUID. |
spark.marklogic.write.uriTemplate | String defining a template for constructing each document URI. See Writing data for more information. |
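For example, writing rows as documents with collections, permissions, and a URI template might look like the following; the connection string, collection, roles, and template (which assumes an id column in the data) are placeholders:
df.write.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")\
.option("spark.marklogic.write.collections", "employee")\
.option("spark.marklogic.write.permissions", "rest-reader,read,rest-writer,update")\
.option("spark.marklogic.write.uriTemplate", "/employee/{id}.json")\
.mode("append")\
.save()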
Processing rows via custom code
The following options control how rows can be processed with custom code in MarkLogic:
Option | Description |
---|---|
spark.marklogic.write.abortOnFailure | Whether the Spark job should abort if a batch fails to be written; defaults to true . |
spark.marklogic.write.batchSize | The number of rows sent in a call to MarkLogic; defaults to 1. |
spark.marklogic.write.invoke | The path to a module to invoke; the module must be in your application’s modules database. |
spark.marklogic.write.javascript | JavaScript code to execute. |
spark.marklogic.write.javascriptFile | Local file path containing JavaScript code to execute. |
spark.marklogic.write.xquery | XQuery code to execute. |
spark.marklogic.write.xqueryFile | Local file path containing XQuery code to execute. |
spark.marklogic.write.externalVariableName | Name of the external variable in custom code that is populated with row values; defaults to URI . |
spark.marklogic.write.externalVariableDelimiter | Delimiter used when multiple row values are sent in a single call; defaults to a comma. |
spark.marklogic.write.vars. | Prefix for user-defined variables to be sent to the custom code. |
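As a sketch, processing rows via custom JavaScript might look like the following, where each call receives a batch of values in the default URI external variable, joined by the configured delimiter; the server-side code is a placeholder that simply logs each value:
df.write.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8003")\
.option("spark.marklogic.write.batchSize", "100")\
.option("spark.marklogic.write.externalVariableDelimiter", ",")\
.option("spark.marklogic.write.javascript", "var URI; for (var uri of URI.split(',')) { console.log(uri); }")\
.mode("append")\
.save()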