This repo is still an early-stage work in progress.
This is the PySpark plugin component.
There are several Spark config options needed to use the library. `tests/conftest.py` provides a basic example. The key settings to note are:
.config("spark.driver.userClassPathFirst", "true")
.config("spark.executor.userClassPathFirst", "true")
.config("spark.driver.extraClassPath", get_dependency_classpath())
.config("spark.executor.extraClassPath", get_dependency_classpath())
Starting with Spark 3.5, Spark bundles an older version of the DataSketches Java library, so Spark needs to be told to use the provided version.
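To show how these pieces fit together, here is a minimal sketch of building a `SparkSession` with those settings. It is not the exact contents of `tests/conftest.py`, and the import location of `get_dependency_classpath` is an assumption; adjust it to match how this repo exposes that helper.

```python
# Minimal sketch (not the exact tests/conftest.py contents): building a
# SparkSession with the settings above. The import below assumes
# get_dependency_classpath is exposed by the datasketches_spark package.
from pyspark.sql import SparkSession
from datasketches_spark import get_dependency_classpath  # assumed import path

spark = (
    SparkSession.builder
    .appName("datasketches-spark-example")
    .master("local[2]")
    # Prefer the user-supplied jars over Spark's bundled DataSketches version
    .config("spark.driver.userClassPathFirst", "true")
    .config("spark.executor.userClassPathFirst", "true")
    # Put the datasketches-spark jars on both driver and executor classpaths
    .config("spark.driver.extraClassPath", get_dependency_classpath())
    .config("spark.executor.extraClassPath", get_dependency_classpath())
    .getOrCreate()
)
```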
When using a datasketches-spark library compiled for Java 17 or newer, additional configuration options are needed. Specifically, you must enable the `jdk.incubator.foreign` module in both the Spark driver and the executors. For the driver in particular, this must be specified before the JVM is created; the module cannot be added once the JVM is initialized.
Looking again at `tests/conftest.py`, we need to add `.config('spark.executor.extraJavaOptions', java_opts)` with `java_opts` set to `--add-modules=jdk.incubator.foreign --add-exports=java.base/sun.nio.ch=ALL-UNNAMED`. We must also use that value to specify `--driver-java-options ${java_opts}`. That option may be specified either via the `PYSPARK_SUBMIT_ARGS` environment variable if running a Python script directly, or as a command-line argument to `spark-submit` if submitting to a cluster.
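As an illustration, the sketch below shows one way to wire this up when running a Python script directly, using the `PYSPARK_SUBMIT_ARGS` approach. The quoting and the trailing `pyspark-shell` token reflect how PySpark typically parses that variable, and the classpath settings from the earlier example are elided for brevity.

```python
# Hedged sketch: passing the Java 17 module flags when running a Python script
# directly. PYSPARK_SUBMIT_ARGS must be set before the SparkSession (and hence
# the driver JVM) is created; the module cannot be added afterwards.
import os
from pyspark.sql import SparkSession

java_opts = (
    "--add-modules=jdk.incubator.foreign "
    "--add-exports=java.base/sun.nio.ch=ALL-UNNAMED"
)

# The trailing "pyspark-shell" token is expected by PySpark when it parses
# this environment variable to launch the driver JVM.
os.environ["PYSPARK_SUBMIT_ARGS"] = f'--driver-java-options "{java_opts}" pyspark-shell'

spark = (
    SparkSession.builder
    # ... userClassPathFirst / extraClassPath settings from the earlier example ...
    .config("spark.executor.extraJavaOptions", java_opts)
    .getOrCreate()
)
```

When submitting to a cluster instead, the same `--driver-java-options` value would be passed on the `spark-submit` command line rather than through the environment variable.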
This component requires that the Scala library has already been built. The build process checks for the relevant jars and fails if they do not exist. It also updates the jars if the current Python module's copies are older.
The easiest way to build the library is with the `build` package:

```
python -m build --wheel
```

The resulting wheel can then be installed with:

```
python -m pip install dist/datasketches_spark_<version-info>.whl
```

Tests are run with `pytest` or `tox`.