Flux can be used with an existing Apache Spark cluster to support large workloads that require more system resources than are available when running Flux as a command-line application.

Spark security notice

As of August 2024 and the Flux 1.0.0 release, all public releases of Apache Spark 3.4.x through 3.5.1 depend on Apache Hadoop 3.3.4. This version of Hadoop has a CVE filed against it. The CVE applies when Spark runs with a YARN cluster manager and the YARN cluster “is accepting work from remote (authenticated) users”.

For normal Flux CLI usage, this CVE is a false positive as Flux does not use a YARN cluster manager. Flux uses Spark’s standalone cluster manager by default - see the Spark documentation for further information on Spark cluster managers.

If you use Flux with the Apache Spark spark-submit program as described below, you should consider the above CVE if you are also running a YARN cluster manager. In this scenario, you have full control and responsibility over your Spark cluster, as Flux does not include any Spark or Hadoop packages when it is used with spark-submit.

Submitting Flux commands

Flux integrates with spark-submit to allow you to submit a Flux command invocation to a remote Spark cluster. Every Flux command is a Spark application, and thus every Flux command, along with all of its options, can be invoked via spark-submit.

To use Flux with spark-submit, first download the marklogic-flux-1.0.0-all.jar file from the GitHub release page. This jar file includes Flux and all of its dependencies, excluding those of Spark itself, which are provided by the Spark cluster that you connect to via spark-submit.

You can now run any Flux command with spark-submit.
The following shows a notional example of running the Flux import-files command:

Linux/Mac:

    $SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
        --master spark://changeme:7077 \
        marklogic-flux-1.0.0-all.jar \
        import-files \
        --path path/to/data \
        --connection-string user:password@host:8000 \
        --preview 10

Windows:

    %SPARK_HOME%\bin\spark-submit --class com.marklogic.flux.spark.Submit ^
        --master spark://changeme:7077 ^
        marklogic-flux-1.0.0-all.jar ^
        import-files ^
        --path path/to/data ^
        --connection-string user:password@host:8000 ^
        --preview 10

Key points on the above example:

  1. You must provide a value of com.marklogic.flux.spark.Submit for the required --class option. This class provides the entry point that allows a Flux command to be run.
  2. Set the value of --master to the master URL of your Spark cluster.
  3. As noted in the Spark documentation, the path to the Flux jar - the “application jar” - must be globally visible inside your cluster.
  4. The “application arguments” consist of the name of the Flux command and the options you provide to that command.

Accessing S3 with spark-submit

If you wish to run a Flux command that accesses an S3 path, you must use the --packages option with spark-submit to include the dependencies that allow Spark to access S3, as shown below:

--packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-client:3.3.4

The above packages are not included in the Flux jar file, as the version required by your Spark cluster may differ.
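
As a sketch, a complete spark-submit invocation that reads from S3 might look like the following. The s3a:// bucket path, master URL, and connection details are placeholders, and you may also need to supply AWS credentials via environment variables or Spark configuration depending on how your cluster is set up:

    # Placeholder master URL, bucket path, and connection details; adjust the
    # hadoop-aws and hadoop-client versions to match your Spark cluster.
    $SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
        --master spark://changeme:7077 \
        --packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-client:3.3.4 \
        marklogic-flux-1.0.0-all.jar \
        import-files \
        --path s3a://changeme-bucket/path/to/data \
        --connection-string user:password@host:8000 \
        --preview 10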

Using Avro data with spark-submit

Per the Spark Avro documentation, the spark-avro dependency is not included in Spark by default. If you wish to run either import-avro-files or export-avro-files with Flux and spark-submit, you must include the following line in your spark-submit invocation:

--packages org.apache.spark:spark-avro_2.12:3.4.3

You should change the version number of this dependency to match that of your Spark cluster.
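
As a sketch, a spark-submit invocation of import-avro-files might look like the following. It assumes import-avro-files accepts the same --path, --connection-string, and --preview options shown in the earlier import-files example; the paths and connection details are placeholders:

    # Placeholder master URL, data path, and connection details; the spark-avro
    # version should match that of your Spark cluster.
    $SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
        --master spark://changeme:7077 \
        --packages org.apache.spark:spark-avro_2.12:3.4.3 \
        marklogic-flux-1.0.0-all.jar \
        import-avro-files \
        --path path/to/avro-data \
        --connection-string user:password@host:8000 \
        --preview 10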

Configuring Spark for local usage

When Flux is run via the command line, it creates a temporary local Spark cluster with a Spark master URL of local[*]. You can adjust this via the --spark-master-url option. Typically, this still creates a local Spark cluster, but with different settings. If you wish to connect to a remote Spark cluster, you will most likely want to use spark-submit as described above.
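
For example, the following sketch restricts the local Spark cluster to two worker threads. It assumes the flux executable from the bin directory of the Flux distribution, and the data path and connection details are placeholders:

    # Assumes the flux executable from the Flux distribution's bin directory;
    # "local[2]" limits the temporary local Spark cluster to two worker threads.
    ./bin/flux import-files \
        --path path/to/data \
        --connection-string user:password@host:8000 \
        --spark-master-url "local[2]"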