The custom-export-rows and custom-export-documents commands allow you to read rows and documents respectively from MarkLogic and write the results to a custom target.
Usage
With the required --target option, you can specify any Spark data source or the name of a third-party Spark connector. For a third-party Spark connector, you must include the necessary JAR files for the connector in the ./ext directory of your Flux installation. Note that if the connector is not available as a single “uber” jar, you will need to ensure that the connector and all of its dependencies are added to the ./ext directory.
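For example, installing a third-party connector could look like the following, where the JAR file names are placeholders for whatever the connector actually ships with:

```
# Hypothetical file names; copy the connector JAR and every JAR it depends on into ./ext.
cp spark-example-connector.jar ./ext/
cp example-connector-dependency.jar ./ext/
```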
As an example, Flux does not provide an out-of-the-box command that uses the Spark Text data source. You can use this data source via custom-export-rows:
Unix:

```
./bin/flux custom-export-rows \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('schema', 'view')" \
    --target text \
    -Ppath=export
```
Windows:

```
bin\flux custom-export-rows ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('schema', 'view')" ^
    --target text ^
    -Ppath=export
```
Exporting rows
When using custom-export-rows with an Optic query to select rows from MarkLogic, each row sent to the connector or data source defined by --target will have a schema based on the output of the Optic query. You may find the --preview and --preview-schema options helpful in understanding what data will be in these rows. See Common Options for more information.
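As a sketch, reusing the connection details and Optic query from the earlier example, and assuming --preview accepts the number of rows to display as described in Common Options, you could inspect a sample of rows before committing to a target:

```
./bin/flux custom-export-rows \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('schema', 'view')" \
    --target text \
    -Ppath=export \
    --preview 10
```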
Exporting documents
When using custom-export-documents, each document returned by MarkLogic will be represented as a Spark row with the following column definitions:
- URI containing a string.
- content containing a byte array.
- format containing a string.
- collections containing an array of strings.
- permissions containing a map of strings and arrays of strings representing roles and permissions.
- quality containing an integer.
- properties containing an XML document serialized to a string.
- metadataValues containing a map of string keys and string values.
These are normal Spark rows that can be written via Spark data sources like Parquet and ORC. If you use a third-party Spark connector, you will likely need to understand how that connector handles rows with the above schema in order to get the results you want.
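For example, the following sketch writes documents to Parquet files via Spark's built-in parquet data source. It assumes custom-export-documents supports the same document-selection options as the other export commands; the --collections option and the collection name used here are illustrative:

```
./bin/flux custom-export-documents \
    --connection-string "flux-example-user:password@localhost:8004" \
    --collections example-collection \
    --target parquet \
    -Ppath=export
```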