Each Flux import command will write one or more documents to MarkLogic, regardless of the data source. The sections below detail the common features for writing documents that are available for every import command, unless otherwise noted by the documentation for that command.
Table of contents
- Controlling document URIs
- Configuring document metadata
- Building a RAG data pipeline
- Transforming content
Controlling document URIs
Each import command will generate an initial URI for each document, typically based on the data source from which the command reads. The following command line options offer further control over each URI:
Option | Description |
---|---|
--uri-prefix | A prefix to apply to each URI. |
--uri-suffix | A suffix to apply to each URI. |
--uri-replace | Comma-delimited list of regular expressions and replacement values, with each replacement value surrounded by single quotes. |
--uri-template | Template for each URI containing one or more column names. |
Replacing URI contents
The --uri-replace option supports replacing one or more parts of an initial URI. Each part is identified by a regular expression, and the replacement for each part is surrounded by single quotes. Replacing parts of the URI is often useful when importing data from files where the initial URI is based on an absolute file path. For example, if you import files from a path of /path/to/my/data and you only want to include /data in your URIs, you would include the following option:
--uri-replace ".*/data,'/data'"
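The effect of this option can be approximated outside of Flux. The following Python sketch is only an illustration and not Flux's implementation: the function name apply_uri_replace and the file name are hypothetical, the option value is split naively on commas, and Python's re module stands in for the regular expression engine Flux uses, which may differ in details.

```python
import re


def apply_uri_replace(uri: str, option_value: str) -> str:
    """Apply pattern,'replacement' pairs to a URI (illustrative only)."""
    # Naive split: assumes no pattern or replacement contains a comma.
    parts = option_value.split(",")
    if len(parts) % 2 != 0:
        raise ValueError("expected pairs of pattern,'replacement'")
    for pattern, replacement in zip(parts[0::2], parts[1::2]):
        # Strip the single quotes that surround each replacement value.
        uri = re.sub(pattern, replacement.strip("'"), uri)
    return uri


# Initial URI based on an absolute file path (hypothetical file name).
print(apply_uri_replace("/path/to/my/data/employee1.json", ".*/data,'/data'"))
```

Because .* is greedy, the pattern matches everything up to and including the last /data in the path, so the leading directories are dropped and the URI becomes /data/employee1.json.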
Configuring URIs via a template
The --uri-template option supports configuring a URI based on a JSON representation of each record that a command reads from its associated data source. This option is supported for the following commands:
- import-aggregate-json-files
- import-avro-files
- import-delimited-files
- import-files, but only for JSON files and JSON entries in ZIP files.
- import-jdbc
- import-orc-files
- import-parquet-files
By default, each of the above commands will write each record that it reads as a JSON document to MarkLogic. A URI template is applied against that JSON representation of each record. This is true even when electing to write XML documents to MarkLogic instead.
A URI template is a string containing any text you wish to include in every URI along with one or more expressions surrounded by single braces. Each expression must refer to either a top-level field name in the JSON representation of a record, or it must be a JSON Pointer expression that points to a non-empty value in the JSON representation.
For example, consider an employee data source where the JSON representation of each record from that data source has top-level fields of id and last_name. You could configure a URI for each document using the following option:
--uri-template "/employee/{id}/{last_name}.json"
A JSON Pointer expression is useful in conjunction with the optional --json-root-name option for defining a root field name in each JSON document. Continuing the above example, you may want each employee document to have a single root field of “employee” so that each document is more self-describing. The URI template will be evaluated against a JSON document with this root field applied, so you would need to use JSON Pointer expressions to refer to the id and last_name values:
--json-root-name employee --uri-template "/employee/{/employee/id}/{/employee/last_name}.json"
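The two template examples above can be sketched in Python to show how each expression resolves against a record. This is an illustration under stated assumptions, not Flux's implementation: the helper names resolve and apply_uri_template and the sample record are hypothetical, and Flux's handling of missing or empty values is not modeled.

```python
import re


def resolve(record, expression: str):
    """Look up a top-level field name or a JSON Pointer in a record."""
    if not expression.startswith("/"):
        return record[expression]  # plain top-level field name
    value = record
    for token in expression[1:].split("/"):
        # Undo JSON Pointer escaping (RFC 6901): ~1 is "/", ~0 is "~".
        token = token.replace("~1", "/").replace("~0", "~")
        value = value[int(token)] if isinstance(value, list) else value[token]
    return value


def apply_uri_template(template: str, record: dict) -> str:
    """Replace each {expression} in the template with its value."""
    return re.sub(
        r"\{([^}]+)\}",
        lambda match: str(resolve(record, match.group(1))),
        template,
    )


employee = {"id": 123, "last_name": "Smith"}  # hypothetical record

print(apply_uri_template("/employee/{id}/{last_name}.json", employee))

# With a root field applied, as --json-root-name employee would do.
wrapped = {"employee": employee}
print(apply_uri_template(
    "/employee/{/employee/id}/{/employee/last_name}.json", wrapped
))
```

Both calls produce /employee/123/Smith.json. Once the root field is present, the plain names id and last_name are no longer top-level fields, which is why the second template needs JSON Pointer expressions.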
The following techniques can assist you with writing a URI template:
- Run the import command with --limit 1 to write a single JSON document to MarkLogic. You can then see the JSON fields that can be referenced in your template.
- Run the import command with --preview 1 to see a tabular representation of a single record read from the command’s data source. This also helps you understand the fields that can be referenced in your template.
- Consider using an options file, as the inclusion of sequences such as "{/ can be mis-interpreted by some shell environments.
Configuring document metadata
When writing documents, you can configure any number of collections, any number of permissions, and a temporal collection. Collections are useful for organizing documents into related sets and provide a convenient mechanism for queries. You will typically want to configure at least one set of read and update permissions for your documents to ensure that non-admin users can access your data. A temporal collection is only necessary when leveraging MarkLogic’s support for querying bi-temporal data.
Each of the above types of metadata can be configured via the following options:
Option | Description |
---|---|
--collections | Comma-delimited list of collection names to add to each document. |
--permissions | Comma-delimited list of MarkLogic role names and capabilities - e.g. rest-reader,read,rest-writer,update. |
--temporal-collection | Name of a MarkLogic temporal collection to assign to each document. |
The following shows an example of each option:
--collections employees,imported-data \
--permissions my-reader-role,read,my-writer-role,update \
--temporal-collection my-temporal-data
Building a RAG data pipeline
Retrieval-augmented generation, or RAG, with MarkLogic depends on preparing data so that the most relevant chunks of text for a user’s question can be sent to a Large Language Model, or LLM. Starting with release 1.2.0, Flux supports the construction of a data pipeline by splitting the text in a document into chunks and adding a vector embedding to each chunk while importing data. Please see the guide on splitting text and the guide on adding embeddings for more information.
Transforming content
For each import command, you can apply a MarkLogic REST transform to each document before it is written. A transform is configured via the following options:
Option | Description |
---|---|
--transform | Name of a MarkLogic REST transform to apply to the document before writing it. |
--transform-params | Comma-delimited list of transform parameter names and values - e.g. param1,value1,param2,value2. |
--transform-params-delimiter | Delimiter for --transform-params; typically set when a value contains a comma. |
The following shows an example of each option:
--transform my-transform \
--transform-params "param1;value1;param2;value2" \
--transform-params-delimiter ";"
The MarkLogic documentation on REST transforms includes instructions on manually installing a transform. If you are using ml-gradle to deploy an application to MarkLogic, you can let ml-gradle automatically install your transform.
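To illustrate why --transform-params-delimiter matters, the following Python sketch splits a parameter list into name and value pairs. It is a simplified model rather than Flux's own parsing: the function name parse_transform_params and the sample parameter names and values are hypothetical.

```python
def parse_transform_params(value: str, delimiter: str = ",") -> dict:
    """Split name/value pairs on the delimiter (illustrative only)."""
    parts = value.split(delimiter)
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating parameter names and values")
    return dict(zip(parts[0::2], parts[1::2]))


# Default comma delimiter.
print(parse_transform_params("param1,value1,param2,value2"))

# A value containing a comma needs a different delimiter.
print(parse_transform_params("cities;Paris,Rome;mode;strict", delimiter=";"))
```

With the default delimiter, the value Paris,Rome would be split into two separate entries and the names and values would no longer line up. Choosing a delimiter that does not occur in any value avoids that ambiguity.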