Flux provides special handling for JSON “aggregate” files that either conform to the JSON Lines format or contain arrays of objects. If you wish to import JSON files as-is, you may find it simpler to import them as generic files instead.
Table of contents
- Usage
- Importing JSON Lines files
- Importing JSON Lines files as is
- Specifying a JSON root name
- Ignoring null fields
- Specifying an encoding
- Reading compressed files
- Advanced options
Usage
The `import-aggregate-json-files` command defaults to writing multiple documents to MarkLogic when a file contains an array of JSON objects, and a single document when a file contains a single JSON object. If you would rather have a file with an array of objects written as a single document, use the `import-files` command instead.
You must specify at least one `--path` option, along with connection information for the MarkLogic database you wish to write to:
Unix:

```
./bin/flux import-aggregate-json-files \
    --path /path/to/files \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```

Windows:

```
bin\flux import-aggregate-json-files ^
    --path path\to\files ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --permissions flux-example-role,read,flux-example-role,update
```
The URI of each document will default to a UUID followed by `.json`. To include the file path at the start of the URI, include the `--uri-include-file-path` option. You can also make use of the common import features for controlling document URIs.
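For example, the following notional command (reusing the example connection details from above) includes each source file’s path at the start of the document URI:

```
./bin/flux import-aggregate-json-files \
    --path /path/to/files \
    --uri-include-file-path \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```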
Note that Flux’s support for an array of objects requires the root of the JSON to be an array. For example, consider a file containing the following JSON:
```
{
  "items": [
    {"id": 1},
    {"id": 2}
  ]
}
```
Flux can only import the above JSON as an object, which becomes a single document. If you wish to instead import each object in the `items` array as a separate JSON document, consider pre-processing the file by removing the outer object so that the file contains only the array.
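For example, assuming the `jq` command-line tool is available (the filenames here are placeholders), the outer object can be stripped with a one-line pre-processing step:

```
# Emit only the "items" array so that the new file's root is a JSON array.
jq '.items' file-with-object-root.json > file-with-array-root.json
```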
Importing JSON Lines files
If your files conform to the JSON Lines format, include the `--json-lines` option with no value. Flux will then read each line in each file as a separate JSON object and write it to MarkLogic as a JSON document.
For example, consider a file with the following content:
{"first": "george", "last": "washington"}
{"id": 12345, "price": 8.99, "in-stock": true}
The file can be imported with the following notional command:
Unix:

```
./bin/flux import-aggregate-json-files \
    --json-lines \
    --path path/to/file.txt \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```

Windows:

```
bin\flux import-aggregate-json-files ^
    --json-lines ^
    --path path\to\file.txt ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --permissions flux-example-role,read,flux-example-role,update
```
Flux will write two separate JSON documents, each with a completely different schema.
The JSON Lines format is often useful for exporting data from MarkLogic as well. Please see this guide for more information on exporting data to JSON Lines files.
Importing JSON Lines files as is
When importing JSON Lines files, Flux uses the Spark JSON data source to read each line and conform the JSON objects to a common schema across the entire set of lines. As noted in the “Advanced options” section below, Spark JSON provides a number of configuration options for controlling how the lines are read. These features can result in changes to the JSON objects, such as keys being reordered and fields being added to match the common schema.
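As a hypothetical illustration of this behavior (the exact output depends on Spark’s schema inference and any configuration options you provide), two lines such as:

```
{"first": "george", "last": "washington"}
{"id": 12345}
```

could be conformed to a common schema and written as documents resembling:

```
{"first": "george", "last": "washington", "id": null}
{"first": null, "last": null, "id": 12345}
```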
For some use cases, you may wish to read each line “as is” without any modification to it. To do so, use the `--json-lines-raw` option instead of `--json-lines`. With the `--json-lines-raw` option, Flux will read each line as a JSON document and will not attempt to enforce any commonality across the lines. This option also has the following effects on the `import-aggregate-json-files` command:
- You cannot use any `-P` options as described in the “Advanced options” section below.
- The `--uri-include-file-path` option has no effect, as each JSON document will default to a URI that includes the file path.
- The following options also have no effect, as each JSON document is intentionally left as is: `--json-root-name`, `--xml-root-name`, `--xml-namespace`, and `--ignore-null-fields`.
- You can still read a gzipped file if its filename ends in `.gz`.
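A notional command mirroring the earlier examples:

```
./bin/flux import-aggregate-json-files \
    --json-lines-raw \
    --path path/to/file.txt \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```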
Specifying a JSON root name
It is often useful to have a single “root” field in a JSON document so that it is more self-describing; it can help with indexing in MarkLogic as well. To include a JSON root field in the JSON documents, use the `--json-root-name` option with a value for the name of the root field. The data read from a row will then be nested under this root field.
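As a sketch, with `--json-root-name customer` (where `customer` is a hypothetical root field name), a line such as `{"first": "george", "last": "washington"}` would produce a document resembling:

```
{
  "customer": {
    "first": "george",
    "last": "washington"
  }
}
```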
Ignoring null fields
By default, when creating JSON or XML documents, Flux will include any fields in a JSON Lines file that have a null value (a value containing only whitespace is not considered null). You can instead ignore fields with a null value via the `--ignore-null-fields` option:
Unix:

```
./bin/flux import-aggregate-json-files \
    --json-lines \
    --path path/to/file.txt \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update \
    --ignore-null-fields
```

Windows:

```
bin\flux import-aggregate-json-files ^
    --json-lines ^
    --path path\to\file.txt ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --permissions flux-example-role,read,flux-example-role,update ^
    --ignore-null-fields
```
The decision on whether to include null fields will depend on your application requirements. For example, if your documents have large numbers of null fields, you may find them to be noise and decide to ignore them. In another case, it may be important to query for documents that have a particular field with a value of null.
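As a hypothetical illustration, given an input line of:

```
{"id": 1, "color": null}
```

the default behavior preserves the `color` field in the written document, while adding `--ignore-null-fields` would produce a document resembling `{"id": 1}`.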
Specifying an encoding
MarkLogic stores all content in the UTF-8 encoding. If your files use a different encoding, you must specify it via the `--encoding` option so that the content can be correctly translated to UTF-8 when written to MarkLogic:
Unix:

```
./bin/flux import-aggregate-json-files \
    --path source \
    --encoding ISO-8859-1 \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```

Windows:

```
bin\flux import-aggregate-json-files ^
    --path source ^
    --encoding ISO-8859-1 ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --permissions flux-example-role,read,flux-example-role,update
```
Reading compressed files
Flux will automatically read files compressed with gzip when they have a filename ending in `.gz`; you do not need to specify a compression option. As noted in the “Advanced options” section below, you can use `-Pcompression=` to explicitly specify a compression algorithm if Flux is not able to read your compressed files automatically. Note that the use of `-Pcompression=` is only supported if the `--json-lines-raw` option is not used.
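As a sketch, assuming your gzipped files lack a `.gz` extension and that `gzip` is an accepted value for the underlying Spark JSON data source’s `compression` option:

```
./bin/flux import-aggregate-json-files \
    --path source \
    -Pcompression=gzip \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```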
Advanced options
The `import-aggregate-json-files` command reuses Spark’s support for reading JSON files. You can include any of the Spark JSON options via the `-P` option to control how JSON content is read. These options are expressed as `-PoptionName=optionValue`.
For example, if your files use a date format other than `yyyy-MM-dd`, you can specify that format as follows:
Unix:

```
./bin/flux import-aggregate-json-files \
    --path source \
    -PdateFormat=MM-dd-yyyy \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```

Windows:

```
bin\flux import-aggregate-json-files ^
    --path source ^
    -PdateFormat=MM-dd-yyyy ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --permissions flux-example-role,read,flux-example-role,update
```
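Other Spark JSON options can be passed the same way. For instance, assuming your files contain timestamps in a non-default format, the Spark JSON `timestampFormat` option could be set as follows (the `-P` argument is quoted because the format string contains a space):

```
./bin/flux import-aggregate-json-files \
    --path source \
    "-PtimestampFormat=MM-dd-yyyy HH:mm:ss" \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```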