Flux provides two options for controlling which data is included when importing from structured data sources: use --where to filter rows, and --drop to exclude columns.
Supported commands
Row and column filtering are available for the following commands:
Filtering rows
Use the --where option with a SQL-like expression to filter rows. Only rows where the expression evaluates to true are imported.
For example, to import only customers with a specific status:
Unix:

```
./bin/flux import-delimited-files \
    --path customers.csv \
    --where "status = 'active'" \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```

Windows:

```
bin\flux import-delimited-files ^
    --path customers.csv ^
    --where "status = 'active'" ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --permissions flux-example-role,read,flux-example-role,update
```
Expression syntax
The --where option accepts any valid SQL WHERE clause expression supported by Apache Spark, including comparison operators, logical operators, pattern matching, null checks, and more.
For example, to filter by a numeric range:
```
--where "age >= 18 AND age < 65"
```
For complete SQL expression syntax and additional examples, see the Apache Spark SQL WHERE clause reference.
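These WHERE expressions behave like standard SQL predicates. As an illustration only, a minimal sketch using Python's built-in sqlite3 module (not Flux or Spark, and with hypothetical data) shows the same comparison and logical operators in action:

```python
import sqlite3

# In-memory table standing in for a structured data source (hypothetical data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (name TEXT, status TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", "active", 34), ("Bob", "inactive", 51), ("Cat", "active", 17)],
)

# Comparison and logical operators, analogous to:
#   --where "status = 'active' AND age >= 18 AND age < 65"
rows = con.execute(
    "SELECT name FROM customers WHERE status = 'active' AND age >= 18 AND age < 65"
).fetchall()
print(rows)  # [('Ann',)]
```

Only "Ann" satisfies all three predicates, just as only matching rows would be imported by Flux.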
Combining with other features
The --where filter is applied before row aggregation, allowing you to filter data before grouping. This is useful when you want to aggregate only a subset of your source data.
For example:
Unix:

```
./bin/flux import-jdbc \
    --query "SELECT c.*, p.payment_id, p.amount FROM customer c INNER JOIN payment p ON c.customer_id = p.customer_id" \
    --where "amount > 10.0" \
    --group-by customer_id \
    --aggregate "payments=payment_id,amount" \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```

Windows:

```
bin\flux import-jdbc ^
    --query "SELECT c.*, p.payment_id, p.amount FROM customer c INNER JOIN payment p ON c.customer_id = p.customer_id" ^
    --where "amount > 10.0" ^
    --group-by customer_id ^
    --aggregate "payments=payment_id,amount" ^
    --connection-string "flux-example-user:password@localhost:8004" ^
    --permissions flux-example-role,read,flux-example-role,update
```
In this example, only payments with an amount greater than $10 are included in each customer’s payments array.
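The filter-then-group behavior can be sketched in plain Python (hypothetical joined rows; this is not the Flux implementation, just an illustration of the ordering):

```python
from collections import defaultdict

# Hypothetical rows, as the SQL join query might return them.
rows = [
    {"customer_id": 1, "payment_id": 10, "amount": 25.0},
    {"customer_id": 1, "payment_id": 11, "amount": 5.0},
    {"customer_id": 2, "payment_id": 12, "amount": 40.0},
]

# Step 1: --where "amount > 10.0" filters rows first.
filtered = [r for r in rows if r["amount"] > 10.0]

# Step 2: --group-by / --aggregate then builds one document per customer
# with a "payments" array holding only the rows that survived the filter.
docs = defaultdict(list)
for r in filtered:
    docs[r["customer_id"]].append(
        {"payment_id": r["payment_id"], "amount": r["amount"]}
    )

print(dict(docs))
# {1: [{'payment_id': 10, 'amount': 25.0}], 2: [{'payment_id': 12, 'amount': 40.0}]}
```

The $5 payment never reaches the grouping step, so it does not appear in customer 1's payments array.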
Performance considerations
The --where filter is applied by Apache Spark after the data has been read from the source. For optimal performance:
- When using import-jdbc, consider using SQL WHERE clauses in your --query option to filter data at the database level before Spark processes it.
- For file-based sources (Avro, ORC, Parquet), the --where filter is applied during Spark's data processing phase and can take advantage of file-level optimizations depending on the file format.
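The difference between post-read filtering and database-level filtering can be sketched with Python's built-in sqlite3 module (hypothetical data; a stand-in for the source database, not Flux itself):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payment (payment_id INTEGER, amount REAL)")
con.executemany("INSERT INTO payment VALUES (?, ?)", [(1, 5.0), (2, 25.0), (3, 40.0)])

# Post-read filtering: fetch every row, then filter afterwards
# (analogous to --where, which Spark applies after reading the data).
all_rows = con.execute("SELECT payment_id, amount FROM payment").fetchall()
post_filtered = [r for r in all_rows if r[1] > 10.0]

# Database-level filtering: put the predicate in the query itself
# (analogous to a WHERE clause inside --query), so fewer rows
# ever leave the database.
pushed_down = con.execute(
    "SELECT payment_id, amount FROM payment WHERE amount > 10.0"
).fetchall()

assert post_filtered == pushed_down  # same result, less data transferred
```

Both approaches produce the same rows; pushing the predicate into the query simply avoids transferring rows that would be discarded anyway.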
Reference
For complete SQL expression syntax, see the Apache Spark SQL reference.
Dropping columns
Use --drop to exclude one or more columns from the imported documents. For example:

```
--drop column1 column2
```
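Conceptually, dropping columns removes the named fields from each row before it becomes a document. A minimal Python sketch (hypothetical rows and column names; not the Flux implementation):

```python
# Hypothetical rows read from a delimited file.
rows = [
    {"name": "Ann", "column1": "x", "column2": "y", "status": "active"},
]

# Columns to exclude, as in: --drop column1 column2
dropped = {"column1", "column2"}

# Each resulting document keeps only the columns that were not dropped.
docs = [{k: v for k, v in r.items() if k not in dropped} for r in rows]
print(docs)  # [{'name': 'Ann', 'status': 'active'}]
```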