Scalable Spark deployment using Kubernetes

Scalable Spark
Deployment using
Kubernetes
Power of Containers For Big Data
https://github.com/phatak-dev/kubernetes-spark

● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at
datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com

Agenda
● Deploying Big Data Products on Scale
● Microservices and Containers
● Introduction to Kubernetes
● Kubernetes Abstractions
● Spark 2.0 Docker Images
● Building Spark Cluster
● Scaling Spark Cluster
● Multiple Clusters
● Resource Isolation

Problem Statement
Need of unified deployment platform to
deploy big data based products on
cloud and on-prem with support for non
big data tools at scale.

A brief about Tellius Product
● Advanced Analytics product with support for ETL, data
exploration , visualization and advanced machine
learning
● Uses mongodb, Akka, Memsql, Node.js,Angular apart
from the spark
● Supported on both on cloud and on-prem
● Scales from few gb data to TB’s

Challenges of deploying our product
● Should support both big data and non big data based
deployments
● Multiple frameworks need clustering support for
horizontal scaling Ex: Spark, Memsql,Akka etc
● Should support different cloud platforms : Aws, Azure
etc
● Should support on-prem deployments also
● Ability to scale on demand

Challenges of Resource Sharing
● As multiple parts of application need horizontal scaling
choosing the right machines becomes a challenge
● We need to define the clustering parameters in terms of
machines rather than resource usage
● Should we deploy spark and memsql , which memory
hungry, applications on same nodes or different nodes?
● If on same cluster, how to isolate the different
applications on their resource usage?
● Support for multi tenancy?

Current Options
● Amazon EMR only supports the big data tools
deployment on aws
● Databricks only supports spark based deployments
● Azure and Google Cloud has their own way of setting
up deployments and scaling the spark
● On-prem, cloudera and other distribution of hadoop
have their own way setting up cluster.
● Also none of the above option have automated way of
delivering non-big data tools.

Microservice
● Way of developing and deploying an application as
collection of multiple services which communicate to
each other with lightweight mechanisms, often an HTTP
resource API
● These services are built around business capabilities
and independently deployable by fully automated
deployment machinery
● These services can be written in different languages
and can have different deployment strategies

Containerisation
● Containerisation is os-level virtualization
● In VM world, each VM has it’s own copy of operating
system.
● Container share common kernel in a given machine
● Very light weight
● Supports resource isolation
● Most of the time, each micro service will be deployed as
independent container
● This gives ability to scale independently

Introduction to Docker
● Containers were available in some operating systems
like solaris over a decade
● Docker popularised the containers on linux
● Docker is container runtime for running containers on
multiple operating system
● Started at 2013 and now synonymous with container
● Rocket from Coreos and LXD from canonical are the
alternative ones

Challenges with Containers
● Containers makes individual services of application
scale independently, but make discovering and
consuming these services challenging
● Also monitoring these services across multiple hosts are
also challenging
● Ability to cluster multiple containers for big data
clustering is challenge by default docker tools
● So there need to be way to orchestrate these containers
when you run a lot of services on top of it

Container Orchestrators
● Container orchestration are the tools for orchestrating
the containers on scale
● They provide mainly
○ Declarative configurations
○ Rules and Constraints
○ Provisioning on multiple hosts
○ Service Discovery
○ Health Monitoring
● Support multiple container runtimes

Different Container Orchestrators
● Docker Compose - Not a orchestrator, but has basic
service discovery
● Docker Swarm by Docker Company
● Kubernetes by Google
● Apache Mesos with Docker integrations

Solution
● Deploy each part of the product as micro service
● Use a container orchestrator to scale each service
depending upon the needs
● Discover services using orchestrator capabilities
● Use the orchestrator to deploy on different cloud and
on-prem

Kubernetes
● Open source system for
○ Automating deployment
○ Scaling
○ Management
of containerized applications.
● Production Grade Container Orchestrator
● Based on Borg and Omega , the internal container
orchestrators used by Google for 15 years
● https://kubernetes.io/

Why Kubernetes
● Production Grade Container Orchestration
● Support for Cloud and On-Prem deployments
● Agnostic to Container Runtime
● Support for easy clustering and load balancing
● Support for service upgradation and rollback
● Effective Resource Isolation and Management
● Well defined storage management

Minikube
● Minikube is a tool that is used to run kubernetes locally
● It runs single node kubernetes cluster using
virtualization layers like virtual box, hyper-v etc
● In our example, we run minikube using virtualbox
● Very useful trying out kubernetes for development and
testing purpose
● For installation steps, refer
http://blog.madhukaraphatak.com/scaling-spark-with-kuber
netes-part-2/

Kubectl
● Kubectl is a command line utility to interact with
kubernetes REST API
● This allows us to create, manage and delete different
resources in kubernetes
● Kubectl can connect to any kubernetes cluster
irrespective where it’s running
● We need to install the kubectl with minikube for
interacting with kubernetes

Minikube Operations
● Starting minikube
minikube start
● Observe running VM in the virtualbox
● See kubernetes dashboard
minikube dashboard
● Run kubectl
kubectl get po

Different Types of Abstraction
● Compute Abstractions ( CPU)
Abstraction related to create and manage compute
entities. Ex : Pod, Deployment
● Service/Network Abstractions (Network)
Abstraction related to exposing service on network
● Storage Abstractions (Disk)
Disk related abstractions

Pod Abstraction
● Pod is a collection of one or more containers
● Smallest compute unit you can deploy on the
kubernetes
● Host Abstraction for Kubernetes
● All containers run in single node
● Provides the ability for containers to communicate to
each other using localhost

Defining Pod
● Kubernetes uses YAML/Json for defining resources in
its framework
● YAML is human readable serialization format mainly
used for configuration
● All our examples, uses the YAML.
● We are going to define a pod , where we create
container of nginx
● kube_examples/nginxpod.yaml

Creating and Running Pod
● Once we define the pod, we need create and run the
pod
kubectl create -f kube_examples/nginxpod.yaml
● See running pod
kubectl get po
● Observe same on dashboard
● Stop Pod
kubectl delete -f kube_examples/ngnixpod.yaml

Drawbacks of Pod Abstraction
● Pod abstraction allows to define only single copy
container at a time
● It’s good enough for monolithic web applications
● But for spark kind of applications, which we need
clustering, we need to define multiple copies of same
container for clustering purposes
● Also pod abstraction, doesn’t support high availability
and upgrade support

Deployment Abstraction
● Abstraction for end to end life cycle of pods
● Ability to
○ Create
○ Upgrade
○ Destroy
pods
● Support multiple replicas
● kube_examples/ngnixdeployment.yaml

Container Port
● containerPort exposes the specific port on the container
● Uses the underneath container runtime, like docker, to
implement this functionality
● Used for open up port for web container to listen on 80
etc
● kube_examples/ngnixdeployment.yaml

Service
● Service abstraction defines a set of logical pods.
● This is a network abstraction which defines a policy to
expose micro service using these pods to other parts of
the application.
● Separation of Concern for compute and service
● Ability to upgrade independent parts
● Labeling abstraction for connecting services and pods
● kube_examples/nginxservice.yaml

Creating and Running Service
● Create Service
kubectl create -f kube_examples/nginxservice.yaml
● List Services
kubectl get svc
● Describe Service Details
kubectl describe svc nginx-service

Service EndPoint
● By default, all the services defined in the kubernetes are
only accessible within the pods of the cluster
● This one make sure that only services needed has to be
exposed to the public explicitly
● So we need to know the end point to actually call this
service
● This can be retrieved using the below command
kubectl describe svc nginx-service

Testing Service With BusyBox
● Once we have the endpoint, we can test it by a pod
inside our cluster
● We create a pod of the image using busybox
● Busybox is a minimal linux distribution with shell utilities
● kubectl run -i --tty busybox --image=busybox
--restart=Never -- sh
● wget -0 - <end-point>

Building Spark 2.0 Docker Image

Need for Custom Spark Image
● All kubernetes deployments need a docker image to
create pod or deployment
● Default spark image and configuration provided in the
kubernetes uses old version of spark
● It also uses google cloud specific configuration which
we don’t need in our application
● Having custom image allows us to control the
upgradation of the spark in future

Docker File
● Dockerfile is a file format defined by docker to create
reproducible docker images
● We create single image for used in both spark master
and worker containers
● We are using spark 2.1.0 version with Java 8
● We will add external shell scripts for starting master and
starting worker
● docker/Dockerfile

Building Docker Image
● We need to connect to the docker daemon of the
minikube to build the image inside vm
eval $(minikube docker-env)
● Run docker ps
● Build the docker image
docker build -t spark-2.1.0-bin-hadoop2.6 .
● View docker images
docker images

Spark Master Deployment
● Spark Master deployment, defines the configuration for
running spark master as single pod
● We expose 7077 port as the master listens on that port
● Use start-master script inside the docker image to start
the spark-master
● We are using standalone cluster for cluster
● spark-master.yaml

Spark Master Service
● Once we define the spark-master, we need to expose it
using a service
● This service will be used for workers to connect to
master pod
● We will expose
○ 8080 - For Web UI
○ 7077 - For Connecting to master
● We also name the service as spark-master

Spark Worker Deployment
● Once we defined the spark-master, we need to define
the spark-worker deployment
● As it’s two node cluster, we will single worker as of now
● We will expose
○ 7078 - For UI communication purposes
● Uses start-worker.sh script to start the worker
● Doesn’t need the service as workers are not exposed

Testing Single Node Cluster
● We can verify the UI using port-forward
kubectl port-forward <spark-master-name> 8080:8080
● Login to the master
kubectl exec -it <spark-master-name> bash
● Run spark-shell and run spark code
/opt/spark/bin/spark-shell --master spark://spark-master:7077
sc.makeRDD(List(1,2,4,4)).count

Dynamically Scaling
● We can increase/decrease number of worker pods
without changing the configurations
● Increase
kubectl scale deployment spark-worker --replicas 2
● Decrease
kubectl scale deployment spark-worker --replicas 1
● Observe change in spark ui

Namespace Abstraction
● We can create multiple spark clusters on single
kubernetes cluster using namespace abstraction
● Namespace is a virtual cluster on physical kubernetes
cluster
● Namespace gives separate namespace for pods,
services etc
● We can also apply resource restriction on the
namespace for resource management

Multiple Cluster using Namespace
● Create namespace
kubectl create namespace cluster2
● Get all namespace
Kubectl get namespaces
● Set the namespace
export CONTEXT=$(kubectl config view | awk '/current-context/ {print $2}')
kubectl config set-context $CONTEXT --namespace=cluster2

Changing Version of Spark
● Now we have 2.1.0 version running
● We can change our deployment without changing our
configuration
● We have another image spark-1.6.3-bin-hadoop2.6
● We can use deployment abstraction lifecycle
management to set the different image to running pods
● This will make new pods up and then deletes the old
pods

Deployment Set Image
● kubectl set image deployment/spark-master
spark-master=spark-1.6.3-bin-hadoop2.6
● kubectl set image deployment/spark-worker
spark-worker=spark-1.6.3-bin-hadoop2.6
● kubectl rollout status deployment/spark-master
● kubectl rollout status deployment/spark-worker

Resource Isolation and Management

Controlling Resource Usage
● By default, pod can use unlimited memory and cpu
● We can set minimum and maximum resource usage per
pod
● In our example, we are going to set limits on spark
worker which will use 1GB RAM and 1 core
● We can same information to spark also, so that it will
reflect on spark UI
● spark-worker-resource.yaml

Summary
● Microservice based architecture to develop and deploy
spark with other tools
● Use container orchestrator kubernetes to deploy and
manage application lifecycle
● Make sure deployment and service abstractions for
clustering and scale
● Use resource isolation of docker and kubernetes for
better server density

References
● https://martinfowler.com/articles/microservices.html
● https://thenewstack.io/containers-container-orchestratio
n/
● http://blog.madhukaraphatak.com/categories/kubernete
s-series/
● https://kubernetes.io/docs/home/

Scalable Spark deployment using Kubernetes

More Related Content

What's hot (20)

Similar to Scalable Spark deployment using Kubernetes (20)

More from datamantra (20)

Recently uploaded (20)

Scalable Spark deployment using Kubernetes