Exporting RDF data

Flux can export semantic triples to a variety of RDF file formats, allowing you to easily exchange large numbers of triples and quads with other systems and users.

Usage

The export-rdf-files command requires a query that selects the documents to export and a directory path where the RDF files will be written:

Unix:

./bin/flux export-rdf-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --collections example \
    --path destination

Windows:

bin\flux export-rdf-files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --collections example ^
    --path destination

Similar to exporting documents, the export-rdf-files command supports the following options for selecting the documents that contain the triples you wish to export:

Option	Description
`--collections`	Comma-delimited sequence of collection names.
`--directory`	A database directory for constraining on URIs.
`--graphs`	Comma-delimited sequence of MarkLogic graph names.
`--options`	Name of a REST API search options document; typically used with a string query.
`--query`	A structured, serialized CTS, or combined query expressed as JSON or XML.
`--string-query`	A string query utilizing MarkLogic’s search grammar.
`--uris`	Newline-delimited sequence of document URIs to retrieve.

You may specify any combination of these options, with the exception that --query will be ignored if --uris is specified.

Prior to Flux 1.2.0, Flux required at least one of --collections, --directory, --graphs, --query, --string-query, or --uris to be specified. Starting in Flux 1.2.0, if you do not specify any of those options, then Flux will select all documents that the configured MarkLogic user is able to read.

For each document matching the query specified by your options above, Flux will retrieve the triples from the document and write them to a file. You must specify a --path option for where files should be written. See Specifying a path for more information on paths.

By default, Flux writes files in the standard Turtle (TTL) format. You can change this via the --format option, which supports the following values:

nq = N-Quads
nt = N-Triples
rdfthrift = RDF Binary
trig = TriG
trix = Triples in XML
ttl = Turtle, the default format.
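
As a minimal sketch, the Unix command from the Usage section can be extended with --format to write N-Quads instead of Turtle. The connection string, collection, and path are the placeholder values used earlier in this guide. The snippet only assembles and prints the command so that it can be reviewed first; the comment at the end shows how it would be run.

```shell
# Assemble the arguments for export-rdf-files. The values are the placeholders
# from the Usage example, plus --format nq to select N-Quads.
set -- export-rdf-files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --collections example \
    --path destination \
    --format nq

# Print the command for review. To execute it, run: ./bin/flux "$@"
flux_command="./bin/flux $*"
echo "$flux_command"
```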

Specifying the number of files to write

The --file-count option controls how many files Flux writes. Its default equals the number of partitions used for reading triples from MarkLogic. That number is the count of forests in the database multiplied by the value of the --partitions-per-forest option, which defaults to 4. For example, if the database you are querying has 3 forests, Flux will write 12 files by default.
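
The arithmetic behind the example above can be sketched in a few lines of shell. The variable names are illustrative only and are not part of Flux.

```shell
# Values from the example above: a database with 3 forests and the
# default --partitions-per-forest value of 4.
forests=3
partitions_per_forest=4

# The default file count equals the number of partitions.
default_file_count=$(( $forests * $partitions_per_forest ))
echo "Default number of files: $default_file_count"
```

To write a different number of files, pass the option explicitly, for example --file-count 1 to request a single file.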

Specifying a base IRI

With the --base-iri option, you can specify a base IRI to prepend to the graph associated with each triple when the graph is relative. If the graph for a triple is absolute, the base IRI will not be prepended.

Overriding the graph

For some use cases involving exporting triples with their graphs to files containing quads, it may not be desirable to reference the graph that each triple belongs to in MarkLogic. You can use --graph-override to specify an alternative graph value that will then be associated with every triple that Flux writes to a file.

gzip compression

To compress each file written by Flux using gzip, simply include --gzip as an option.

Exporting consistent results

By default, Flux uses MarkLogic’s support for point-in-time queries when querying for documents, thus ensuring a consistent snapshot of data. Point-in-time queries depend on the same MarkLogic system timestamp being used for each query. Because system timestamps can be deleted when MarkLogic merges data, you may encounter the following error that causes an export command to fail:

Server Message: XDMP-OLDSTAMP: Timestamp too old for forest

To resolve this issue, you must enable point-in-time queries for your database by configuring the merge timestamp setting. The recommended practice is to use a negative value that exceeds the expected duration of the export operation. For example, a value of -864,000,000,000 for the merge timestamp would give the export operation 24 hours to complete.
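
The 24-hour figure above can be reproduced with a short calculation. This sketch assumes that merge timestamp values count 10,000,000 units per second, which is what the example implies; confirm the unit against the MarkLogic documentation for your version before relying on it. The variable names are illustrative.

```shell
# Assumption: merge timestamp values count 10,000,000 units per second,
# as implied by the 24-hour example above.
units_per_second=10000000
export_window_hours=24

# A negative value is used, as recommended above.
merge_timestamp=$(( -1 * $export_window_hours * 3600 * $units_per_second ))
echo "merge timestamp: $merge_timestamp"
```

Change export_window_hours to a value comfortably longer than your longest expected export.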

Alternatively, you can disable the use of point-in-time queries by including the following option:

--no-snapshot

With the above option, Flux does not use a snapshot for queries and instead queries for data at multiple points in time. As noted in the guide for consistent snapshots, you may get unpredictable results if your query matches data that changes during the export operation. If your data is not changing, this approach is recommended because it avoids the need to configure the merge timestamp.
