Flux provides two options for controlling which data is included when importing from structured data sources: --where filters rows, and --drop excludes columns.


Supported commands

Row and column filtering are available on Flux's import commands for structured data sources, such as import-delimited-files and import-jdbc shown in the examples below.

Filtering rows

Use the --where option with a SQL-like expression to filter rows. Only rows where the expression evaluates to true are imported.

For example, to import only customers with a specific status:

Unix:

    ./bin/flux import-delimited-files \
        --path customers.csv \
        --where "status = 'active'" \
        --connection-string "flux-example-user:password@localhost:8004" \
        --permissions flux-example-role,read,flux-example-role,update

Windows:

    bin\flux import-delimited-files ^
        --path customers.csv ^
        --where "status = 'active'" ^
        --connection-string "flux-example-user:password@localhost:8004" ^
        --permissions flux-example-role,read,flux-example-role,update

Expression syntax

The --where option accepts any valid SQL WHERE clause expression supported by Apache Spark, including comparison operators, logical operators, pattern matching, null checks, and more.

For example, to filter by a numeric range:

--where "age >= 18 AND age < 65"
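To see what a range predicate like this keeps, here is a minimal sketch using awk on a throwaway CSV. The data and file name are made up, and this is not how Flux applies the filter internally (Flux evaluates the expression via Spark SQL); it only mimics the predicate's effect:

```shell
# Sample data (hypothetical): a header row plus three customers.
printf 'name,age\nana,17\nbob,30\ncarol,70\n' > customers-sample.csv

# Keep the header plus any row satisfying age >= 18 AND age < 65.
awk -F, 'NR == 1 || ($2 >= 18 && $2 < 65)' customers-sample.csv
# prints:
# name,age
# bob,30
```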

For complete SQL expression syntax and additional examples, see the Apache Spark SQL WHERE clause reference.
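Pattern matching and null checks can be combined in a single expression. In this sketch, the email and phone column names are assumptions made for illustration; the rest of the command follows the earlier example:

```shell
./bin/flux import-delimited-files \
    --path customers.csv \
    --where "email LIKE '%@example.com' AND phone IS NOT NULL" \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```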

Combining with other features

The --where filter is applied before row aggregation, so only rows that pass the filter are grouped and aggregated. This is useful when you want to aggregate only a subset of your source data.

For example:

Unix:

    ./bin/flux import-jdbc \
        --query "SELECT c.*, p.payment_id, p.amount FROM customer c INNER JOIN payment p ON c.customer_id = p.customer_id" \
        --where "amount > 10.0" \
        --group-by customer_id \
        --aggregate "payments=payment_id,amount" \
        --connection-string "flux-example-user:password@localhost:8004" \
        --permissions flux-example-role,read,flux-example-role,update

Windows:

    bin\flux import-jdbc ^
        --query "SELECT c.*, p.payment_id, p.amount FROM customer c INNER JOIN payment p ON c.customer_id = p.customer_id" ^
        --where "amount > 10.0" ^
        --group-by customer_id ^
        --aggregate "payments=payment_id,amount" ^
        --connection-string "flux-example-user:password@localhost:8004" ^
        --permissions flux-example-role,read,flux-example-role,update

In this example, only payments with an amount greater than $10 are included in each customer’s payments array.

Performance considerations

The --where filter is processed by Apache Spark after the data is read from the source. For optimal performance:

  • When using import-jdbc, consider using SQL WHERE clauses in your --query option to filter data at the database level before Spark processes it.
  • For file-based sources (Avro, ORC, Parquet), the --where filter is applied during Spark’s data processing phase and can take advantage of file-level optimizations depending on the file format.
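As a sketch of the first recommendation, the filter from the earlier aggregation example can be moved into the --query option so the database discards rows before Spark reads them (table and column names follow that example):

```shell
./bin/flux import-jdbc \
    --query "SELECT c.*, p.payment_id, p.amount FROM customer c INNER JOIN payment p ON c.customer_id = p.customer_id WHERE p.amount > 10.0" \
    --group-by customer_id \
    --aggregate "payments=payment_id,amount" \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```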

Reference

For complete SQL expression syntax, see the Apache Spark SQL reference.

Dropping columns

Use --drop to exclude one or more columns from the imported documents, listing the column names separated by spaces. For example:

    --drop column1 column2
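In the context of a full command, following the earlier delimited-files example (column1 and column2 stand in for whatever columns you want to exclude):

```shell
./bin/flux import-delimited-files \
    --path customers.csv \
    --drop column1 column2 \
    --connection-string "flux-example-user:password@localhost:8004" \
    --permissions flux-example-role,read,flux-example-role,update
```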

