Common import features

Each Flux import command will write one or more documents to MarkLogic, regardless of the data source. The sections below detail the common features for writing documents that are available for every import command, unless otherwise noted by the documentation for that command.

Controlling document URIs
- Replacing URI contents
- Configuring URIs via a template
Configuring document metadata
Building a RAG data pipeline
Transforming content

Controlling document URIs

Each import command will generate an initial URI for each document, typically based on the data source from which the command reads. The following command line options offer further control over each URI:

Option	Description
`--uri-prefix`	A prefix to apply to each URI.
`--uri-suffix`	A suffix to apply to each URI.
`--uri-replace`	Comma-delimited list of regular expressions and replacement values, with each replacement value surrounded by single quotes.
`--uri-template`	Template for each URI containing one or more column names.

Replacing URI contents

The --uri-replace option replaces one or more parts of an initial URI. Each part is identified by a regular expression, and each replacement value is surrounded by single quotes. Replacing parts of the URI is often useful when importing data from files, where the initial URI is based on an absolute file path. For example, if you import files from a path of /path/to/my/data and you only want to include /data in your URIs, you would include the following option:

--uri-replace ".*/data,'/data'"
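Before running an import, it can help to preview what a regular expression does to a sample URI. The following is a minimal sketch that uses sed to approximate the replacement above; it is not how Flux itself applies the option. The sample path and the variable names are hypothetical, and sed's extended regular expressions differ from other regex dialects, so only simple patterns such as this one can be expected to behave the same way.

```shell
# Hypothetical initial URI for a file read from /path/to/my/data.
initial_uri="/path/to/my/data/employees/1.json"

# Apply the same pattern and replacement as --uri-replace ".*/data,'/data'".
result=$(printf '%s\n' "$initial_uri" | sed -E "s#.*/data#/data#")

echo "$result"
```

Everything up to and including the last /data in the path is replaced, so the previewed URI begins with /data.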

Configuring URIs via a template

The --uri-template option supports configuring a URI based on a JSON representation of each record that a command reads from its associated data source. This option is supported for the following commands:

- import-aggregate-json-files
- import-avro-files
- import-delimited-files
- import-files, but only for JSON files and JSON entries in ZIP files
- import-jdbc
- import-orc-files
- import-parquet-files

By default, each of the above commands will write each record that it reads as a JSON document to MarkLogic. A URI template is applied against that JSON representation of each record. This is true even when electing to write XML documents to MarkLogic instead.

A URI template is a string containing any text you wish to include in every URI along with one or more expressions surrounded by single braces. Each expression must refer to either a top-level field name in the JSON representation of a record, or it must be a JSON Pointer expression that points to a non-empty value in the JSON representation.

For example, consider an employee data source where the JSON representation of each record from that data source has top-level fields of id and last_name. You could configure a URI for each document using the following option:

--uri-template "/employee/{id}/{last_name}.json"

A JSON Pointer expression is useful in conjunction with the optional --json-root-name option, which defines a root field name in each JSON document. Continuing the above example, you may want each employee document to have a single root field of “employee” so that each document is more self-describing. The URI template is evaluated against the JSON document with this root field applied, so you would need to use JSON Pointer expressions to refer to the id and last_name values:

--json-root-name employee --uri-template "/employee/{/employee/id}/{/employee/last_name}.json"

The following techniques can assist you with writing a URI template:

- Run the import command with --limit 1 to write a single JSON document to MarkLogic. You can then see the JSON fields that can be referenced in your template.
- Run the import command with --preview 1 to see a tabular representation of a single record read from the command’s data source. This also helps you understand the fields that can be referenced in your template.
- Consider using an options file, as sequences such as "{/ can be misinterpreted by some shell environments.
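As a sketch of the options file technique, the following writes the template options from the earlier example to a file. The file name uri-options.txt is hypothetical, and the layout of one option name or value per line is an assumption; check the Flux documentation on options files for the format your version expects.

```shell
# The quoted 'EOF' delimiter stops the shell from expanding or
# interpreting anything in the template, including the "{/" sequences.
cat > uri-options.txt <<'EOF'
--json-root-name
employee
--uri-template
/employee/{/employee/id}/{/employee/last_name}.json
EOF

# Show the file so the template can be checked before running an import.
cat uri-options.txt
```

Because the template never passes through shell parsing when it is read from a file, no quoting or escaping is needed inside the file. Flux is understood to accept an options file as an argument prefixed with @, but treat that syntax as an assumption to verify.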

Configuring document metadata

When writing documents, you can configure any number of collections, any number of permissions, and a temporal collection. Collections are useful for organizing documents into related sets and provide a convenient mechanism for queries. You will typically want to configure at least one set of read and update permissions for your documents to ensure that non-admin users can access your data. A temporal collection is only necessary when leveraging MarkLogic’s support for querying bi-temporal data.

Starting with Flux 1.3.0, you can also configure metadata values and document properties.

Each of the above types of metadata can be configured via the following options:

Option	Description
`--collections`	Comma-delimited list of collection names to add to each document.
`--permissions`	Comma-delimited list of MarkLogic role names and capabilities - e.g. `rest-reader,read,rest-writer,update`.
`--temporal-collection`	Name of a MarkLogic temporal collection to assign to each document.
`-Mkey=value`	Key and value to add as a metadata value. Can be specified multiple times.
`-Rkey=value`	Key and value to add as a document property. Can be specified multiple times.

The following shows an example of each option:

--collections employees,imported-data \
--permissions my-reader-role,read,my-writer-role,update \
--temporal-collection my-temporal-data \
-Mmeta1=value1 -Mmeta2=value2 \
-Rprop1=value1 -Rprop2=value2
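The --permissions value alternates role names and capabilities, so a missing entry is easy to overlook. The following is a small sanity-check sketch, not part of Flux; the function name list_permission_pairs is hypothetical. It splits a value on commas and prints one role and capability per line, failing if the entries do not pair up.

```shell
# Print one "role capability" pair per line for a --permissions style value.
list_permission_pairs() {
  old_ifs=$IFS
  IFS=','
  set -f          # disable globbing while the value is split
  set -- $1       # split the value on commas into positional parameters
  set +f
  IFS=$old_ifs
  if [ $(( $# % 2 )) -ne 0 ]; then
    echo "expected alternating role,capability entries" >&2
    return 1
  fi
  while [ "$#" -gt 0 ]; do
    printf '%s %s\n' "$1" "$2"
    shift 2
  done
}

list_permission_pairs "my-reader-role,read,my-writer-role,update"
```

This only checks that the entries pair up. It does not verify that the roles exist in MarkLogic or that each capability name is valid.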

Building a RAG data pipeline

Retrieval-augmented generation, or RAG, with MarkLogic depends on preparing data so that the most relevant chunks of text for a user’s question can be sent to a Large Language Model, or LLM. Starting with release 1.2.0, Flux supports the construction of a data pipeline by splitting the text in a document into chunks and adding a vector embedding to each chunk while importing data. Please see the guide on splitting text and the guide on adding embeddings for more information.

Transforming content

For each import command, you can apply a MarkLogic REST transform to each document before it is written. A transform is configured via the following options:

Option	Description
`--transform`	Name of a MarkLogic REST transform to apply to the document before writing it.
`--transform-params`	Comma-delimited list of transform parameter names and values - e.g. `param1,value1,param2,value2`.
`--transform-params-delimiter`	Delimiter for `--transform-params`; typically set when a value contains a comma.

The following shows an example of each option:

--transform my-transform \
--transform-params "param1;value1;param2;value2" \
--transform-params-delimiter ";"

The MarkLogic documentation for REST transforms includes instructions on manually installing a transform. If you are using ml-gradle to deploy an application to MarkLogic, you can let ml-gradle automatically install your transform.

Note that a REST transform is executed within MarkLogic and is applied to each document sent by Flux to MarkLogic. If you only wish to transform a subset of documents sent to MarkLogic, you should add logic to your REST transform to determine if a document should be processed or not.
