Flux can export rows retrieved via the MarkLogic Optic API, writing them either to files or to another database via JDBC.
Table of contents
- Querying for rows
- Exporting to JDBC
- Exporting to files
- Controlling the save mode
- Tuning query performance
Querying for rows
The following commands support executing an Optic query and exporting the matching rows to an external data source:
- export-avro-files
- export-delimited-files
- export-jdbc
- export-json-lines-files
- export-orc-files
- export-parquet-files
An Optic query is specified via the --query option. The query must be defined using the Optic DSL and must begin with the op.fromView data accessor. The MarkLogic Spark connector documentation provides additional guidance on how to write an Optic query.
You must also specify connection information for the MarkLogic database you wish to query. Please see the guide on common options for instructions on doing so.
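For example, assuming a TDE view named employee in a schema named example (the schema, view, column names, and connection details here are purely illustrative), a query passed via --query can use additional Optic operations such as select and limit to shape the rows that are exported:

./bin/flux export-parquet-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '').select(['employee_id', 'last_name']).limit(10)" \
    --path destination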
Exporting to JDBC
The export-jdbc command writes rows retrieved by an Optic query to a table in an external database via a JDBC driver specific to the database.
JDBC driver installation
To export rows to a database, you must obtain the database’s JDBC driver JAR file and add it to the ./ext directory in the Flux installation directory. Any JAR file placed in the ./ext directory is added to the Flux classpath.
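For example, if you were exporting to PostgreSQL, installing the driver might look like the following sketch; the JAR file name and the Flux installation path are placeholders for your own environment:

# Copy the database's JDBC driver JAR into the Flux installation's ./ext directory.
# The file name and installation path below are placeholders.
cp postgresql-42.7.3.jar /opt/flux/ext/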
Configuring a JDBC connection
The export-jdbc command requires that you specify connection details for the database you wish to write to via JDBC. Connection details are specified via the following options:

- --jdbc-url is required and specifies the JDBC connection URL.
- --jdbc-driver is required and specifies the main class name of the JDBC driver.
- --jdbc-user specifies an optional user to authenticate as (this may already be specified via --jdbc-url).
- --jdbc-password specifies an optional password to authenticate with (this may already be specified via --jdbc-url).
Exporting to a table
Once you have installed your database’s JDBC driver and determined your JDBC connection details, you can use export-jdbc to export all rows matching an Optic query to a table in the external database. The following shows a notional example of doing so:
Unix:

./bin/flux export-jdbc \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --jdbc-url "jdbc:postgresql://localhost/example?user=postgres&password=postgres" \
    --jdbc-driver "org.postgresql.Driver" \
    --table "marklogic-employee-data"

Windows:

bin\flux export-jdbc ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('example', 'employee', '')" ^
    --jdbc-url "jdbc:postgresql://localhost/example?user=postgres&password=postgres" ^
    --jdbc-driver "org.postgresql.Driver" ^
    --table "marklogic-employee-data"
Exporting to files
Rows selected via an Optic query can be exported to any of the below file formats.
In Flux 1.0, exporting rows to files defaulted to overwriting data in the selected path; as of Flux 1.1.0, the default is to append so that data in an existing file path is not accidentally deleted. The “Controlling the save mode” section below describes the options for controlling this behavior.
Avro
The export-avro-files command writes one or more Avro files to the directory specified by the --path option. This command reuses Spark’s support for writing Avro files:
Unix:

./bin/flux export-avro-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination

Windows:

bin\flux export-avro-files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('example', 'employee', '')" ^
    --path destination
You can include any of the Spark Avro data source options via the -P option to control how Avro content is written. These options are expressed as -PoptionName=optionValue.

For configuration options listed in the above Spark Avro guide, use the -C option instead. For example, -Cspark.sql.avro.compression.codec=deflate would change the type of compression used for writing Avro files.
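For example, the following sketch, which reuses the notional connection details and view from the earlier examples, applies the configuration option shown above to write deflate-compressed Avro files:

./bin/flux export-avro-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    -Cspark.sql.avro.compression.codec=deflate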
Delimited text
The export-delimited-files command writes one or more delimited text (commonly CSV) files to the directory specified by the --path option:
Unix:

./bin/flux export-delimited-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination

Windows:

bin\flux export-delimited-files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('example', 'employee', '')" ^
    --path destination
This command reuses Spark’s support for writing delimited text files. You can include any of the Spark CSV options via the -P option to control how delimited text is written. These options are expressed as -PoptionName=optionValue.
The command defaults to setting the Spark CSV header option to true so that column names from your Optic query for selecting rows from MarkLogic are included in each output file. You can override this via -Pheader=false if desired.

The command also defaults to setting the Spark CSV inferSchema option to true. This results in Flux, via Spark CSV, attempting to determine a type for each column in the delimited text file. To disable this behavior, resulting in every column having a type of string, include -PinferSchema=false.
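As an illustration using the same notional connection details and view as the earlier examples, the following command disables both defaults so that files are written without a header row and without schema inference:

./bin/flux export-delimited-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    -Pheader=false \
    -PinferSchema=false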
By default, each file will be written using the UTF-8 encoding. You can specify an alternate encoding via the --encoding option, e.g.:
Unix:

./bin/flux export-delimited-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    --encoding ISO-8859-1

Windows:

bin\flux export-delimited-files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('example', 'employee', '')" ^
    --path destination ^
    --encoding ISO-8859-1
JSON Lines
The export-json-lines-files command writes one or more JSON Lines files to the directory specified by the --path option:
Unix:

./bin/flux export-json-lines-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination

Windows:

bin\flux export-json-lines-files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('example', 'employee', '')" ^
    --path destination
This command reuses Spark’s support for writing JSON files. You can include any of the Spark JSON options via the -P option to control how JSON Lines files are written. These options are expressed as -PoptionName=optionValue.
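For example, the Spark JSON options include an ignoreNullFields option for controlling whether null-valued columns appear in the output. The sketch below, which reuses the notional connection details from the earlier examples, shows how such an option is passed:

./bin/flux export-json-lines-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    -PignoreNullFields=false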
By default, each file will be written using the UTF-8 encoding. You can specify an alternate encoding via the --encoding option, e.g.:
Unix:

./bin/flux export-json-lines-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    --encoding ISO-8859-1

Windows:

bin\flux export-json-lines-files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('example', 'employee', '')" ^
    --path destination ^
    --encoding ISO-8859-1
ORC
The export-orc-files command writes one or more ORC files to the directory specified by the --path option:
Unix:

./bin/flux export-orc-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination

Windows:

bin\flux export-orc-files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('example', 'employee', '')" ^
    --path destination
This command reuses Spark’s support for writing ORC files. You can include any of the Spark ORC data source options via the -P option to control how ORC content is written. These options are expressed as -PoptionName=optionValue.
For configuration options listed in the above Spark ORC guide, use the -C option instead. For example, -Cspark.sql.orc.impl=hive would change the ORC implementation that is used.
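For example, the following sketch, again with notional connection details, applies the configuration option shown above:

./bin/flux export-orc-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    -Cspark.sql.orc.impl=hive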
Parquet
The export-parquet-files command writes one or more Parquet files to the directory specified by the --path option:
Unix:

./bin/flux export-parquet-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination

Windows:

bin\flux export-parquet-files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --query "op.fromView('example', 'employee', '')" ^
    --path destination
This command reuses Spark’s support for writing Parquet files. You can include any of the Spark Parquet data source options via the -P option to control how Parquet content is written. These options are expressed as -PoptionName=optionValue.
For configuration options listed in the above Spark Parquet guide, use the -C option instead. For example, -Cspark.sql.parquet.compression.codec=gzip would change the compression codec used for writing Parquet files.
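For example, the following sketch, with notional connection details, applies the configuration option shown above to write gzip-compressed Parquet files:

./bin/flux export-parquet-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    -Cspark.sql.parquet.compression.codec=gzip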
Controlling the save mode
Each of the commands for exporting rows supports a --mode option that controls how data is written to a location where data already exists. This option supports the following values:

- Append = append data if the destination already exists.
- Overwrite = replace data if the destination already exists.
- ErrorIfExists = throw an error if the destination already exists.
- Ignore = do not write any data if the destination already exists.
For convenience, the above values are case-insensitive, so you do not need to match the casing shown when choosing a value.
As of the 1.1.0 release of Flux, --mode defaults to Append for commands that write to a filesystem. In the 1.0.0 release, these commands defaulted to Overwrite. The export-jdbc command defaults to ErrorIfExists to avoid altering an existing table in any way.
For further information on each mode, please see the Spark documentation.
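For example, the following sketch, with notional connection details, explicitly requests Overwrite so that data in an existing destination directory is replaced:

./bin/flux export-parquet-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    --mode Overwrite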
Tuning query performance
The --batch-size
and --partitions
options are used to tune performance by controlling how many rows are retrieved in a single call to MarkLogic and how many requests are made in parallel to MarkLogic. It is recommended to first test your command without setting these options to see if the performance is acceptable. When you are ready to attempt to optimize the performance of your export command, please see the this guide on Optic query performance.
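When you do experiment with these options, the usage looks like the following sketch; the values shown are arbitrary illustrations, not recommendations, and the connection details are notional:

./bin/flux export-parquet-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --query "op.fromView('example', 'employee', '')" \
    --path destination \
    --batch-size 10000 \
    --partitions 16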