Flux provides two options for controlling which data is included when importing from structured data sources: --where filters rows, and --drop excludes columns.


Supported commands

Row and column filtering are available on Flux's import commands for structured data sources, such as import-delimited-files and import-jdbc shown in the examples below.

Filtering rows

Use the --where option with a SQL-like expression to filter rows. Only rows where the expression evaluates to true are imported.

For example, to import only customers with a specific status:

Unix:

    ./bin/flux import-delimited-files \
        --path customers.csv \
        --where "status = 'active'" \
        --connection-string "flux-example-user:password@localhost:8004" \
        --permissions flux-example-role,read,flux-example-role,update

Windows:

    bin\flux import-delimited-files ^
        --path customers.csv ^
        --where "status = 'active'" ^
        --connection-string "flux-example-user:password@localhost:8004" ^
        --permissions flux-example-role,read,flux-example-role,update

Expression syntax

The --where option accepts any valid SQL WHERE clause expression supported by Apache Spark, including comparison operators, logical operators, pattern matching, null checks, and more.

For example, to filter by a numeric range:

--where "age >= 18 AND age < 65"
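To see what a range predicate like this keeps, here is a minimal sketch using awk on a throwaway CSV. The data and file name are made up, and this is not how Flux applies the filter internally (Flux evaluates the expression via Spark SQL); it only mimics the predicate's effect:

```shell
# Sample data (hypothetical): a header row plus three customers.
printf 'name,age\nana,17\nbob,30\ncarol,70\n' > customers-sample.csv

# Keep the header plus any row satisfying age >= 18 AND age < 65.
awk -F, 'NR == 1 || ($2 >= 18 && $2 < 65)' customers-sample.csv
# prints:
# name,age
# bob,30
```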

For complete SQL expression syntax and additional examples, see the Apache Spark SQL WHERE clause reference.
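Pattern matching and null checks can be combined in a single expression. In this sketch, the email and phone column names are assumptions made for illustration; the rest of the command follows the earlier example:

```shell
./bin/flux import-delimited-files \
    --path customers.csv \
    --where "email LIKE '%@example.com' AND phone IS NOT NULL" \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```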

Combining with other features

The --where filter is applied before row aggregation, so only rows that pass the filter are grouped and aggregated. This is useful when you want to aggregate only a subset of your source data.

For example:

Unix:

    ./bin/flux import-jdbc \
        --query "SELECT c.*, p.payment_id, p.amount FROM customer c INNER JOIN payment p ON c.customer_id = p.customer_id" \
        --where "amount > 10.0" \
        --group-by customer_id \
        --aggregate "payments=payment_id,amount" \
        --connection-string "flux-example-user:password@localhost:8004" \
        --permissions flux-example-role,read,flux-example-role,update

Windows:

    bin\flux import-jdbc ^
        --query "SELECT c.*, p.payment_id, p.amount FROM customer c INNER JOIN payment p ON c.customer_id = p.customer_id" ^
        --where "amount > 10.0" ^
        --group-by customer_id ^
        --aggregate "payments=payment_id,amount" ^
        --connection-string "flux-example-user:password@localhost:8004" ^
        --permissions flux-example-role,read,flux-example-role,update

In this example, only payments with an amount greater than $10 are included in each customer’s payments array.

Performance considerations

The --where filter is processed by Apache Spark after the data is read from the source. For optimal performance:

  • When using import-jdbc, consider using SQL WHERE clauses in your --query option to filter data at the database level before Spark processes it.
  • For file-based sources (Avro, ORC, Parquet), the --where filter is applied during Spark’s data processing phase and can take advantage of file-level optimizations depending on the file format.
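As a sketch of the first recommendation, the filter from the earlier aggregation example can be moved into the --query option so the database discards rows before Spark reads them (table and column names follow that example):

```shell
./bin/flux import-jdbc \
    --query "SELECT c.*, p.payment_id, p.amount FROM customer c INNER JOIN payment p ON c.customer_id = p.customer_id WHERE p.amount > 10.0" \
    --group-by customer_id \
    --aggregate "payments=payment_id,amount" \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```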

Reference

For complete SQL expression syntax, see the Apache Spark SQL reference.

Dropping columns

Use --drop to exclude one or more columns from the imported documents, listing the column names separated by spaces. For example:

    --drop column1 column2
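In the context of a full command, following the earlier delimited-files example (column1 and column2 stand in for whatever columns you want to exclude):

```shell
./bin/flux import-delimited-files \
    --path customers.csv \
    --drop column1 column2 \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```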

