When importing from tabular data sources, Flux provides support for aggregating rows together to produce hierarchical JSON or XML documents that often better resemble the logical entities associated with the documents. Flux supports this capability for each of the following data sources:
As an example, consider a scenario where you are importing customer data from a relational database via JDBC. Each customer record can have one or more related address records, which are stored in a separate table. You could simply import all customer rows and all address rows as separate documents in MarkLogic. But for a variety of use cases - including searching and security - you may want each customer document to contain all of its related addresses. Flux allows you to achieve that goal by aggregating related data such that hierarchical documents are written to MarkLogic containing one or more arrays of related records.
Usage
To aggregate rows together when using import-avro-files
, import-delimited-files
, import-jdbc
, import-orc-files
, or import-parquet-files
, use the following options:
--group-by
specifies a column name to group rows by; this is typically the column used in a join.--aggregate
specifies a string of the formnew_column_name=column1,column2,column3
. Thenew_column_name
column will contain an array of objects, with each object having columns ofcolumn1
,column2
, andcolumn3
.
For example, consider the Postgres tutorial database that features 15 tables with multiple relationships. One such relationship is between customers and payments. The following options would be used to write customer documents with each customer document containing an array of its related payments:
-
./bin/flux import-jdbc \ --query "select c.*, p.payment_id, p.amount, p.payment_date from customer c inner join payment p on c.customer_id = p.customer_id" \ --group-by customer_id \ --aggregate "payments=payment_id,amount,payment_date" \ --connection-string "flux-example-user:password@localhost:8004" \ --permissions flux-example-role,read,flux-example-role,update
-
bin\flux import-jdbc ^ --query "select c.*, p.payment_id, p.amount, p.payment_date from customer c inner join payment p on c.customer_id = p.customer_id" ^ --group-by customer_id ^ --aggregate "payments=payment_id,amount,payment_date" ^ --connection-string "flux-example-user:password@localhost:8004" ^ --permissions flux-example-role,read,flux-example-role,update
The options above result in the following aggregation being performed:
- Rows retrieved from the Postgres database are grouped together based on values in the
customer_id
column. - For each payment for a given customer, the values in a payment row -
payment_id
,amount
, andpayment_date
- are added to a struct that is then added to an array in a new column namedpayments
. - The
payment_id
,amount
, andpayment_date
columns are removed.
Each customer JSON document will as a result have a top-level payments
array containing one object for each related payment, as shown in the example below:
{
"customer_id": 1,
"first_name": "Mary",
"payments": [
{
"payment_id": 18495,
"amount": 5.99,
"payment_date": "2007-02-15T04:22:38.996Z"
},
{
"payment_id": 18496,
"amount": 0.99,
"payment_date": "2007-02-15T21:31:19.996Z"
}
]
}
The approach can be used for many joins as well, thus producing multiple top-level array fields containing related objects. You are restricted to a single --group-by
, but you can include many --aggregate
options, one for each join that you wish to aggregate rows from.