Description
What is the new container you'd like to have?
Spark Connect introduces a decoupled client-server architecture that allows remote connectivity to a Spark server; the official documentation is here.
It is used by data engineers to distribute data transformation jobs across multiple clusters. Spark Connect is an addition to Spark that leverages the JVM.
Having this as a container would enable data engineers to:
- test their workflows without having to go through a cloud provider like Databricks (see the client sketch below)
- avoid the manual setup of the JVM, which can be quite cumbersome
The most commonly used Docker image is apache/spark.
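For context on the testing workflow, once such a container is running a test can talk to it through the standard PySpark Spark Connect client. This is a minimal sketch; it assumes pyspark[connect] is installed and that the server is bound to localhost:8081 as in the configuration shown further below.

```python
# Minimal client-side sketch (assumes pyspark[connect] is installed and the
# Spark Connect server from the container is reachable on localhost:8081).
from pyspark.sql import SparkSession

# Connect to the containerised Spark Connect server instead of a local JVM.
spark = SparkSession.builder.remote("sc://localhost:8081").getOrCreate()

# Run a trivial transformation to verify the round trip to the server.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
assert df.count() == 2
```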
Why not just use a generic container for this?
Implementing the Spark Connect server with the generic DockerContainer requires exposing extra configuration. On corporate projects, an implementation like the following is required:
```python
from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs

# Extra kwargs passed to the container: the entrypoint starts the Spark Connect
# server and pulls the matching spark-connect and delta-core JARs.
kwargs = {
    "entrypoint": "/opt/spark/sbin/start-connect-server.sh org.apache.spark.deploy.master.Master --packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 --conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' --conf spark.connect.grpc.binding.port=8081",
}

with (
    DockerContainer("apache/spark")
    .with_bind_ports(8081, 8081)
    .with_env("SPARK_NO_DAEMONIZE", "True")
    .with_volume_mapping(pytest_tmp_dir, pytest_tmp_dir, "rw")  # pytest_tmp_dir is provided by the test fixture
    .with_kwargs(**kwargs) as container
):
    _ = wait_for_logs(container, "SparkConnectServer: Spark Connect server started at")
    yield container
```
The added complexity comes from configuring the entrypoint: launching the server and exposing the proper ports requires Spark Connect expertise. There are also version compatibility constraints to satisfy between Spark and the delta-core JAR package.
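A dedicated module could hide that complexity behind a small API. The snippet below is only a hypothetical sketch of what such a class might look like; SparkConnectContainer, its constructor parameters, and its defaults do not exist in testcontainers today and are purely illustrative.

```python
# Hypothetical sketch only: SparkConnectContainer and its parameters are
# illustrative and not part of the current testcontainers API.
from typing import Optional

from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs


class SparkConnectContainer(DockerContainer):
    def __init__(self, image: str = "apache/spark", port: int = 15002, packages: Optional[str] = None) -> None:
        super().__init__(image)
        self.port = port
        # Build the same entrypoint a user would otherwise have to hand-craft.
        entrypoint = "/opt/spark/sbin/start-connect-server.sh"
        if packages:
            entrypoint += f" --packages {packages}"
        entrypoint += f" --conf spark.connect.grpc.binding.port={port}"
        self.with_bind_ports(port, port)
        self.with_env("SPARK_NO_DAEMONIZE", "True")
        self.with_kwargs(entrypoint=entrypoint)

    def start(self) -> "SparkConnectContainer":
        super().start()
        # Block until the server reports it is ready to accept gRPC clients.
        wait_for_logs(self, "Spark Connect server started")
        return self
```

With such a class, the fixture above would shrink to a few lines, and the Spark/Delta version pairing could be documented and validated in one place.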
Other references:
Some resources here