Description
What is the new container you'd like to have?
Spark Connect introduces a decoupled client-server architecture that allows remote connectivity to a Spark server; the official documentation is here.
It is used by data engineers to distribute data transformation jobs across multiple clusters. Spark Connect is an addition to Spark that leverages the JVM.
Having this as a container would enable data engineers to:
- test their workflows without having to go through a cloud provider like Databricks (see the client sketch below)
- avoid the manual setup of the JVM, which can be quite cumbersome
The most commonly used Docker image is apache/spark.
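For context on the testing workflow, once such a container is running a test can talk to it through the standard PySpark Spark Connect client. This is a minimal sketch; it assumes pyspark[connect] is installed and that the server is bound to localhost:8081 as in the configuration shown further below.

```python
# Minimal client-side sketch (assumes pyspark[connect] is installed and the
# Spark Connect server from the container is reachable on localhost:8081).
from pyspark.sql import SparkSession

# Connect to the containerised Spark Connect server instead of a local JVM.
spark = SparkSession.builder.remote("sc://localhost:8081").getOrCreate()

# Run a trivial transformation to verify the round trip to the server.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
assert df.count() == 2
```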
Why not just use a generic container for this?
Implementing the Spark Connect server with the generic DockerContainer requires exposing extra configuration. On corporate projects, an implementation like the following is required:
```python
from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs

# Extra kwargs passed to the container: the entrypoint starts the Spark Connect
# server and pulls the matching spark-connect and delta-core JARs.
kwargs = {
    "entrypoint": "/opt/spark/sbin/start-connect-server.sh org.apache.spark.deploy.master.Master --packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 --conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp' --conf spark.connect.grpc.binding.port=8081",
}

with (
    DockerContainer("apache/spark")
    .with_bind_ports(8081, 8081)
    .with_env("SPARK_NO_DAEMONIZE", "True")
    .with_volume_mapping(pytest_tmp_dir, pytest_tmp_dir, "rw")  # pytest_tmp_dir is provided by the test fixture
    .with_kwargs(**kwargs) as container
):
    _ = wait_for_logs(container, "SparkConnectServer: Spark Connect server started at")
    yield container
```
The added complexity comes from configuring the entrypoint: launching the server and exposing the proper ports requires Spark Connect expertise. There are also version compatibility constraints to satisfy between Spark and the delta-core JAR package.
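A dedicated module could hide that complexity behind a small API. The snippet below is only a hypothetical sketch of what such a class might look like; SparkConnectContainer, its constructor parameters, and its defaults do not exist in testcontainers today and are purely illustrative.

```python
# Hypothetical sketch only: SparkConnectContainer and its parameters are
# illustrative and not part of the current testcontainers API.
from typing import Optional

from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs


class SparkConnectContainer(DockerContainer):
    def __init__(self, image: str = "apache/spark", port: int = 15002, packages: Optional[str] = None) -> None:
        super().__init__(image)
        self.port = port
        # Build the same entrypoint a user would otherwise have to hand-craft.
        entrypoint = "/opt/spark/sbin/start-connect-server.sh"
        if packages:
            entrypoint += f" --packages {packages}"
        entrypoint += f" --conf spark.connect.grpc.binding.port={port}"
        self.with_bind_ports(port, port)
        self.with_env("SPARK_NO_DAEMONIZE", "True")
        self.with_kwargs(entrypoint=entrypoint)

    def start(self) -> "SparkConnectContainer":
        super().start()
        # Block until the server reports it is ready to accept gRPC clients.
        wait_for_logs(self, "Spark Connect server started")
        return self
```

With such a class, the fixture above would shrink to a few lines, and the Spark/Delta version pairing could be documented and validated in one place.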
Other references:
Some resources here