This playground is a simple example of using the Delta Connect gRPC server to write to a Delta Lake table.
We need to build an image with the additional Python dependencies required for PySpark to connect to the gRPC server. The image extends the Delta Docker base image.
Take a peek at the `entrypoint.sh` file to see how the container runs. There are two available modes: `server` and `client`. These options are passed from the `docker-compose.yaml` via `command: ["server"]` or `command: ["client"]`.
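For orientation, the mode dispatch in `entrypoint.sh` amounts to something like the following sketch. The exact commands, including the Jupyter invocation in the client branch, are assumptions; read the real file for specifics.

```sh
#!/usr/bin/env bash
# Minimal sketch of the server/client dispatch; see entrypoint.sh for the real script.
# The jupyter command in the client branch is an assumption, not copied from the repo.
set -euo pipefail

case "${1:-server}" in
  server)
    # pull the extension jars and start the Spark Connect server,
    # then keep the container in the foreground
    "${SPARK_HOME}/sbin/start-connect-server.sh"
    tail -f "${SPARK_HOME}"/logs/*
    ;;
  client)
    # start the client environment that talks to the server on port 15002
    jupyter lab --ip=0.0.0.0 --no-browser
    ;;
  *)
    echo "usage: entrypoint.sh [server|client]" >&2
    exit 1
    ;;
esac
```

With the two modes in mind, build the image: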
```sh
docker buildx build \
  -t newfrontdocker/delta-connect-playground:latest \
  -f Dockerfile \
  --load \
  .
```

The image doesn't need to be built if you want to use what is in Docker Hub: `newfrontdocker/delta-connect-playground:4.0.0`
Create the shared Docker network for the server and client:

```sh
docker network create connect
```
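To confirm the network exists (standard Docker, nothing project-specific):

```sh
# print the network's name and driver to verify it was created
docker network inspect connect --format '{{.Name}} ({{.Driver}})'
```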
Note: The `docker-compose.yaml` will start both the server and the client automatically. If you want to run just the server, you can use the following command.
```sh
docker compose up --remove-orphans delta-connect-server
```

This is where the magic happens. Once the server is up and running, you can connect to it via the spark-connect protocol. You will see a lot of logging followed by this line:
```
INFO SparkConnectServer: Spark Connect server started at: 0:0:0:0:0:0:0:0:15002
```
Once the server is started, any interaction you have will be visible via the logs in the container.
View the Server UI: http://localhost:4040/connect/
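To follow those logs from the host (service name taken from the compose commands in this README):

```sh
# stream the Spark Connect server logs; Ctrl-C stops following
docker compose logs -f delta-connect-server
```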
Note: The `docker-compose.yaml` will start both the server and the client automatically. If you want to run just the client, you can use the following command.
```sh
docker compose up --remove-orphans delta-connect-playground
```

Note: the client is pretty useless without the server running :)
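If you want a shell inside the running client container, for example to run `pyspark` by hand, something like this should work (service name from the compose command above):

```sh
# open an interactive shell in the client container
docker compose exec delta-connect-playground bash
```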
Here is an example of connecting from within the Docker network, outside of Jupyter:

```sh
$SPARK_HOME/bin/pyspark --remote "sc://delta-connect-server"
```

Here is an example of connecting from localhost.
Note: the `docker-compose.yaml` file exposes port 15002 on the host.
```sh
$SPARK_HOME/bin/pyspark --remote "sc://localhost"
```

Note: The `scripts` directory starts the `delta-connect` server when you start up the Docker image. This will pull down the `spark.connect` extension jars before spinning up the server.
The command installs the Spark extensions that enable the gRPC server we'll use to write to our Delta Lake tables:
```sh
${SPARK_HOME}/sbin/start-connect-server.sh \
  --conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp -Dio.netty.tryReflectionSetAccessible=true" \
  --packages io.delta:delta-connect-server_2.13:4.0.0,com.google.protobuf:protobuf-java:3.25.1 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.connect.extensions.relation.classes=org.apache.spark.sql.connect.delta.DeltaRelationPlugin" \
  --conf "spark.connect.extensions.command.classes=org.apache.spark.sql.connect.delta.DeltaCommandPlugin"
```
Server Port: 15002 - we have to punch a hole in the Docker network so the connect client application can reach the server on this port. This is done via the `docker-compose.yaml` file.
Tip: To see if you can connect to the remote server:

```sh
% nc -z localhost 15002
Connection to localhost port 15002 [tcp/*] succeeded!
```
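Once the port check succeeds, you can smoke-test the whole round trip by piping a few PySpark statements into the connect shell. A minimal sketch, assuming the server is reachable on `localhost:15002`; the table path `/tmp/delta/numbers` is an arbitrary example and resolves on the server side:

```sh
# run a short non-interactive PySpark session against the connect server
$SPARK_HOME/bin/pyspark --remote "sc://localhost:15002" <<'PYEOF'
# write ten rows to a Delta table (example path, created on the server)
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/delta/numbers")
# read it back to confirm the round trip
spark.read.format("delta").load("/tmp/delta/numbers").show()
PYEOF
```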
- [Spark Connect](https://spark.apache.org/docs/latest/spark-connect-overview.html) - docs on Spark Connect
- [Delta Lake](https://docs.delta.io/) - main Delta Lake docs
- Delta Connect Server: https://mvnrepository.com/artifact/io.delta/delta-connect-server_2.13/4.0.0
- Delta Connect Client: https://mvnrepository.com/artifact/io.delta/delta-connect-client_2.13/4.0.0
```sh
export VERSION=latest
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t newfrontdocker/delta-connect-playground:${VERSION} \
  -f Dockerfile \
  --push \
  .
```
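After pushing, you can confirm that both platforms landed in the manifest with `buildx imagetools` (a standard Docker command):

```sh
# inspect the multi-arch manifest on Docker Hub
docker buildx imagetools inspect newfrontdocker/delta-connect-playground:${VERSION}
```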
Thanks to Buf for enabling me to work on this project for the Open Source Community. Check out the following links to learn more about Buf:
- Buf's Protobuf toolchain - with simple Protobuf compilation, linting, and breaking-change detection.
- Bufstream - our drop-in replacement for Kafka with Broker-Side Protobuf Semantic Validation.