Big Data w/Python
On Kubernetes & Apache Spark
with @holdenkarau
Slides will be at:
http://bit.ly/2FifqdM
CatLoversShow
Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC
● co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● SlideShare: http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau
● Spark Talk Videos http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
PySpark on Kubernetes @ Python Barcelona March Meetup
What is going to be covered:
● What is Spark
● What is Kubernetes
● How it’s different from YARN and other similar systems we use Spark on
● How “simple” it is to switch cluster managers
○ Plus the not so simple (where’s my HDFS and auto-scaling?)
● Links to recorded demos because I'm tired
● A brief detour into Kubeflow
● Future work and directions
Andrew
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when your data is too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
Why people come to Spark:
Well this MapReduce
job is going to take
16 hours - how long
could it take to learn
Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit
in memory on my cluster
anymore, let alone my
MacBook Pro :( Maybe
this Spark business will
solve that...
brownpau
Part of what led to the success of Spark
● Integrated different tools which traditionally required different systems
○ Mahout, hive, etc.
● e.g. can use same system to do ML and SQL
*Often written in Python!
[Diagram: the Apache Spark stack. Component labels: Apache Spark; SQL, DataFrames & Datasets; Structured Streaming; Spark ML; MLlib; Streaming; Bagel & GraphX; GraphFrames. Language labels: "Scala, Java, Python, & R" and "Scala, Java, Python".]
Paul Hudson
PySpark:
● The Python interface to Spark
● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API
● Has some serious performance hurdles from the design
● Same general technique used as the basis for the other non-JVM implementations in Spark
○ C#
○ R
○ Julia
○ Javascript - surprisingly different
Yes, we have wordcount! :p
# Assumes sc is an existing SparkContext; src and output are paths.
lines = sc.textFile(src)
# The flatMap and map steps are still combined and executed in one Python executor.
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x + y))
# No data has been read or processed up to this point.
# This is an "action", which forces Spark to evaluate the RDD.
word_count.saveAsTextFile(output)
Trish Hamme
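The lazy evaluation in the wordcount slide can be mimicked in plain Python with generators. This is only a rough analogy and not how Spark is implemented; `read_lines`, `log`, and the variable names below are hypothetical and exist just for this sketch.

```python
from collections import Counter


def read_lines(lines, log):
    """Yield lines one at a time and record each read (a stand-in for sc.textFile)."""
    for line in lines:
        log.append(line)
        yield line


log = []  # every line that has actually been read ends up here

# Like flatMap and map: generator expressions describe the work but run nothing yet.
words = (w for line in read_lines(["a b a", "b c"], log) for w in line.split(" "))
pairs = ((w, 1) for w in words)
reads_before_action = len(log)

# The "action": consuming the pipeline pulls every line through both steps.
counts = Counter()
for word, n in pairs:
    counts[word] += n
reads_after_action = len(log)

print(reads_before_action, reads_after_action, dict(counts))
```

Nothing is read while the pipeline is being described; the reads only happen once the final loop consumes it, which is the same shape as transformations versus actions on an RDD.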
So what does that look like?
[Diagram: the Driver connects to the JVM through py4j; on Worker 1 through Worker K, data moves between the JVM and Python through pipes.]
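A toy illustration of why the pipe hop is a performance hurdle: every record has to be serialized, pushed to a separate Python process, processed, and serialized back. This sketch uses the standard library only and does not reproduce PySpark's actual worker protocol; `CHILD_SOURCE` and `run_in_python_worker` are hypothetical names.

```python
import pickle
import subprocess
import sys

# Code run by the child process: read pickled records from stdin,
# count the words in each record, and write the pickled result to stdout.
CHILD_SOURCE = r"""
import pickle, sys
records = pickle.load(sys.stdin.buffer)
result = [len(r.split(" ")) for r in records]
pickle.dump(result, sys.stdout.buffer)
sys.stdout.buffer.flush()
"""


def run_in_python_worker(records):
    """Send records to a separate Python process over a pipe and return its answer."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=pickle.dumps(records),  # serialize on the way in
        stdout=subprocess.PIPE,
        check=True,
    )
    return pickle.loads(proc.stdout)  # deserialize on the way out


print(run_in_python_worker(["a b", "c"]))
```

The work done per record is tiny here, so most of the cost is the two rounds of pickling plus the process boundary, which is the general shape of the overhead the PySpark slide refers to.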
Kubernetes
“New” open-source cluster manager.
- github.com/kubernetes/kubernetes
Runs programs in Linux containers.
1600+ contributors and 60,000+ commits.
[Diagram: four app-plus-libs stacks running on one shared kernel.]
More isolation is good
Kubernetes provides each program with:
● a lightweight virtual file system -- Docker image
○ an independent set of S/W packages
● a virtual network interface
○ a unique virtual IP address
○ an entire range of ports
Aleksei I
Other isolation layers
● Separate process ID space
● Max memory limit
● CPU share throttling
● Mountable volumes
○ Config files -- ConfigMaps
○ Credentials -- Secrets
○ Local storage -- EmptyDir, HostPath
○ Network storage -- PersistentVolumes
Jarek Reiner
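As a rough sketch, the isolation layers above map onto fields of a Pod manifest. The example below builds one as a plain Python dict; the field names follow the Kubernetes Pod API, while `build_pod_manifest`, the image name, and the ConfigMap/Secret names (`app-config`, `app-creds`) are made up for illustration.

```python
import json


def build_pod_manifest(name, image, memory_limit="2Gi", cpu_limit="1"):
    """Return a minimal Pod manifest with resource limits and three mounted volumes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "main",
                "image": image,  # the container image: its own file system and packages
                "resources": {"limits": {"memory": memory_limit, "cpu": cpu_limit}},
                "volumeMounts": [
                    {"name": "config", "mountPath": "/etc/app"},
                    {"name": "creds", "mountPath": "/etc/creds", "readOnly": True},
                    {"name": "scratch", "mountPath": "/tmp/scratch"},
                ],
            }],
            "volumes": [
                {"name": "config", "configMap": {"name": "app-config"}},    # config files
                {"name": "creds", "secret": {"secretName": "app-creds"}},   # credentials
                {"name": "scratch", "emptyDir": {}},                        # local storage
            ],
        },
    }


manifest = build_pod_manifest("demo", "example.com/demo:latest")
print(json.dumps(manifest, indent=2))
```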
Dependencies
● Spark alone isn’t enough
● Think: spaCy, scikit-learn, TensorFlow, etc.
● YARN: Shared conda env, but supporting different versions is hard
Fuzzy Gerdes
Kubernetes architecture
[Diagram: node A (196.0.0.5) hosts Pod 1 (10.0.0.2) and Pod 2 (10.0.0.3); node B (196.0.0.6) hosts Pod 3 (10.0.1.2).]
Pod, a unit of scheduling and isolation.
● runs a user program in a primary container
● holds isolation layers like a virtual IP in an infra container
Robbt
Big Data on Kubernetes
Since Spark 2.3, the community has been working on a few
important new features that make Spark on Kubernetes more
usable and ready for a broader spectrum of use cases:
● non-JVM binding support and memory customization
● client-mode support for running interactive apps
● Kerberos support
● large framework refactors: rm init-container; scheduler
The Last Cookie
Spark on Kubernetes
[Diagram: Spark Core -> Kubernetes Scheduler Backend -> Kubernetes Cluster. The backend asks the cluster for new executors and to remove executors.]
Configuration:
• Resource Requests
• Authnz
• Communication with K8s
babbagecabbage
Spark on Kubernetes
[Diagram: two jobs sharing node A (196.0.0.5) and node B (196.0.0.6). Each job has its own Client, Driver Pod, Executor Pod 1, and Executor Pod 2. Job 1 pod IPs: 10.0.0.2, 10.0.0.3, 10.0.1.2. Job 2 pod IPs: 10.0.0.4, 10.0.0.5, 10.0.1.3.]
How to change to running on Kubernetes?
In theory “just”:
--master yarn to --master k8s://[...]
In practice:
● Build a container with your dependencies
● Possibly change your storage (HDFS to S3 or GCS)
● Change your cluster manager
● Re-do your tuning work
Hisashi
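A minimal sketch of what the "in theory" switch looks like on the command line, written as a small Python helper that assembles `spark-submit` arguments. The flags and configuration keys (`--master k8s://...`, `--deploy-mode`, `spark.executor.instances`, `spark.kubernetes.container.image`) come from the Spark on Kubernetes documentation; the helper name, API server address, image name, and application path are placeholders, not values from the talk.

```python
import shlex


def build_spark_submit(api_server, image, app, executors=2, deploy_mode="cluster"):
    """Assemble a spark-submit command line that targets a Kubernetes cluster."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # was: --master yarn
        "--deploy-mode", deploy_mode,
        "--conf", f"spark.executor.instances={executors}",
        # The container image is where your Python dependencies have to live.
        "--conf", f"spark.kubernetes.container.image={image}",
        app,
    ]


cmd = build_spark_submit(
    api_server="https://k8s.example.com:6443",       # placeholder address
    image="example.com/pyspark-deps:latest",         # placeholder image
    app="local:///opt/spark/app/wordcount.py",       # path inside the image
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The one-flag change is real, but the image argument is where most of the "in practice" work hides: building the container, and then redoing storage and tuning choices around it.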
Demo: Everyone loves wordcount!
It’s big data which means we have to do WordCount
Recorded demo - https://youtu.be/jaIU2VCTv88
Hisashi
Demo #2: Wordcount in client mode on K8s
Recorded demo - https://youtu.be/s2aU81Zyq9E
Luxus M
Demo #3: Wordcount in a notebook on K8s
Everyone loves notebooks, except ops, QA, and your very stressed-out data engineers.
Recorded demo - https://youtu.be/eMj0Pv1-Nfo
Tim (Timothy)
Pearce
What do we need to do next?
● Support dynamic scaling
● Storage?
● Better auth integration
● Better documentation (ugh client mode)
● Want to help? Check out my slides on how to contribute:
http://bit.ly/2OeQl7w
Hisashi
Dynamic Scaling:
● Need a separate shuffle service
● We could do smart scale down maybe - https://github.com/apache/spark/pull/19045
Jennifer C.
Related talks & blog posts
● Running custom Spark on GKE and Azure - https://www.oreilly.com/ideas/how-to-run-a-custom-version-of-spark-on-hosted-kubernetes
● Deploying Spark on Kubernetes - http://spark.apache.org/docs/latest/running-on-kubernetes.html
● Getting PySpark 2.4 working on GKE recorded livestream - https://www.youtube.com/watch?v=3j9D7B6PE60
Interested in OSS (especially Spark)?
● Check out my Twitch & YouTube for livestreams - http://twitch.tv/holdenkarau & https://www.youtube.com/user/holdenkarau
Becky Lai
Unrelated: Kubeflow workshop
● Kubeflow is a tool for building end-to-end machine learning workflows on
Kubernetes. It supports Spark*
● Trevor & I are doing a workshop @ Strata SF, and we'd love to offer you the option to try a self-guided version (free of charge, of course) and give us feedback. I can get cloud credits for you to try it
● What you might learn:
○ Installing Kubeflow
○ Setting up a project
○ Deploying that project to GCP / Azure / IBM
○ Monkeying around with a project and still having it work
● Please come and talk to me if interested. I'm wearing a dress with unicorns
*in the master branch as of February
fionasjournal
[Book covers:]
● Learning Spark
● Fast Data Processing with Spark (Out of Date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
High Performance Spark!
Available today, nothing on Kubernetes, but that should not stop you from
buying several copies (if you have an expense account).
Cats love it!
Amazon sells it: http://bit.ly/hkHighPerfSpark :D
Sign up for the mailing list @
http://www.distributedcomputing4kids.com
And some upcoming talks:
● March
○ Strata San Francisco -- next week
● April
○ Spark Summit
● May
○ KiwiCoda Mania
● June
○ Scala Days EU
● July
○ OSCON Portland
○ Skills Matter in London
Sparkling Pink Panda Scooter group photo by Kenzi
k thnx bye! (or questions…)
If you want to fill out a survey:
http://bit.ly/holdenTestingSpark
Give feedback on this presentation:
http://bit.ly/holdenTalkFeedback
I'll be in the hallway or you
can email me:
holden@pigscanfly.ca
