Brad Kaiser, IBM/TWC
Craig Ingram, IBM/TWC
Supporting Highly
Multitenant Spark
Notebook Workloads
#EUdev8
Hosting Multitenant Spark Notebooks Is
Hard, But It Doesn't Have to Be
• Our Journey
• Best Practices
• New Work
Our Journey
Who we are
IBM's Commitment to Open Source
• Contribute intellectual and technical
capital to the Apache Spark
community.
• Make the core technology enterprise-
and cloud-ready.
• Build data science skills to drive
intelligence into business applications
– https://cognitiveclass.ai/
• http://spark.tc
Mission
• Provide:
– a secure, performant, stable cluster.
– interactive analytics, visualizations, and reports.
– collaboration and sharing with other data scientists,
engineers, and consumers.
– job scheduling capabilities.
– a quick and easy way to get started with Spark.
Goals
• Support hundreds of analysts/data scientists
using Spark
– Quick kernel creation (<10 s from notebook creation to an
available SparkContext, sc).
– Utilize cluster resources efficiently.
– Elastically scale based on load.
Lessons Learned at TWC
One Big Cluster
• Spark collocated with
Cassandra
• Fast
• Stable
One Big Cluster - Cons
• Outside analysts started using
our cluster
• Provided notebook services
• Interrupted our perfectly
scheduled jobs
• Used a lot of resources
causing Cassandra to crash
Add Smaller Clusters
• We built some smaller
clusters
• Still platform agnostic
• Other teams couldn't
affect our production
cluster
Add Smaller Clusters - Cons
• Hassle to set up
• Required a lot of
maintenance
• Sat idle
EMR
• Analysts make ad hoc clusters for their needs
• No maintenance from us
• Learning curve for analysts
• They tend to leave them running
Lessons Learned - IBM
Data Science Experience
• Collaborative environment on the front end
– Collaboration Tools
– Shared Data Sets
– Flows
– GitHub Integration
• Multiple compute environments on the back end
– DSX on the cloud: compute runs on IBM cloud
– DSX Local: compute runs on private cloud or Z
Lessons Learned - IBM
• You need kernel remoting
– Allows advanced collaborative tools in the application tier
– Allows resource consolidation in the analytics tier
• Resource consolidation puts stress on the analytics
tier
– Starvation
– Management of cached data
– Performance bottlenecks (example: Spark web UI)
Best Practices
• Use kernel remoting
• Use fewer, bigger clusters
• Know your workloads
• Isolate users
• Schedule resources efficiently
Use Kernel Remoting
• Running all of your notebook kernels on the
same server is a bottleneck
• Run your kernels distributed on the cluster
• You can run a lot more notebooks
Jupyter Enterprise Gateway
• New Open Source project from IBM
• Goals:
– Allow hundreds of notebook users to share a single Spark
cluster…
– …with enterprise-level security and performance.
• Used in IBM Analytics Engine (GA)
• developer.ibm.com/code/openprojects/jupyter-enterprise-gateway-2
Jupyter Enterprise Gateway:
How it Works
[Diagram labels: Security Layer; Jupyter Enterprise Gateway (multitenant, remote kernel lifecycle management); Spark cluster with YARN workers. Each YARN container holds a Jupyter kernel and a Spark driver, alongside Spark executors. Links are marked Secure.]
Impersonation: Alice’s kernel runs under Alice’s user ID.
Use Fewer, Bigger Clusters
• Better resource utilization through statistical multiplexing
• Improved security and auditing due to centralization
– Hive, Ranger, and Atlas are common in the ecosystem.
– Many new, platform-specific solutions address this problem.
• Easier collaboration between users
– Shared notebooks with interactive visualizations and markdown
support.
– GitHub integration for versioning and external sharing.
– Catalog-based data discovery and sharing.
– Governance and auditing support.
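The statistical multiplexing point can be illustrated with a small simulation. This is only a sketch under invented assumptions: the team count, the bursty load shape, and the seed are made up for illustration and are not measurements from the talk.

```python
import random

def simulate_demand(num_teams=8, steps=1000, seed=42):
    """Return per-team demand traces (cores in use at each time step).

    Each team is mostly idle and occasionally bursts, a made-up stand-in
    for notebook users; the numbers are illustrative only.
    """
    rng = random.Random(seed)
    traces = []
    for _ in range(num_teams):
        trace = [rng.randint(40, 100) if rng.random() < 0.1 else rng.randint(0, 10)
                 for _ in range(steps)]
        traces.append(trace)
    return traces

def capacity_needed(traces):
    """Compare sizing one cluster per team with sizing one shared cluster."""
    # Separate clusters: each one must cover its own team's peak.
    separate = sum(max(trace) for trace in traces)
    # Shared cluster: it only has to cover the peak of the combined demand.
    shared = max(sum(step) for step in zip(*traces))
    return separate, shared

traces = simulate_demand()
separate, shared = capacity_needed(traces)
print(f"separate clusters: {separate} cores, shared cluster: {shared} cores")
```

The combined peak can never exceed the sum of the individual peaks, and when bursts rarely coincide it tends to be much smaller, which is the utilization gain the slide refers to.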
Know Your Workloads
• What is the main resource they use?
• Overprovision the hardware
• When to scale up and down?
– CPU load
– YARN/RM Queue Stats (depth, waiting jobs, available
CPU/mem, preemptions)
• If containers are getting preempted, it’s due to queues filling
up.
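One way to turn these signals into a scaling rule is sketched below. The `QueueStats` fields, the function name, and the thresholds are all hypothetical; they are not a YARN API, and in practice the numbers would come from whatever metrics source your cluster exposes.

```python
from dataclasses import dataclass

@dataclass
class QueueStats:
    """Snapshot of one YARN queue; field names are illustrative, not a YARN API."""
    pending_containers: int   # containers waiting to be scheduled (queue depth)
    available_vcores: int     # free CPU in the queue
    available_mem_mb: int     # free memory in the queue
    preemptions: int          # containers preempted since the last snapshot

def scaling_decision(stats, cpu_load, high_load=0.80, low_load=0.30):
    """Return 'scale_up', 'scale_down', or 'hold'.

    The thresholds are placeholders to tune against your own workload.
    """
    # Preemptions or a backlog mean queues are filling up: add capacity.
    if stats.preemptions > 0 or stats.pending_containers > 0:
        return "scale_up"
    if cpu_load >= high_load:
        return "scale_up"
    # Nothing is waiting and the cluster is mostly idle: release capacity.
    if cpu_load <= low_load and stats.available_vcores > 0:
        return "scale_down"
    return "hold"

print(scaling_decision(QueueStats(5, 0, 0, 0), cpu_load=0.5))
```

Checking preemptions and backlog before CPU load reflects the slide's point that preempted containers indicate full queues, even when average CPU looks moderate.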
Isolate Users
• Don't have a generic user account
• Catalog and Governance
– Hive
– Atlas
– Ranger
• You don't want users embedding keys and
passwords in their notebooks.
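A minimal sketch of the alternative to embedded keys: the notebook reads a per-user secret that the platform injects, here assumed to arrive as an environment variable. The variable name `OBJECT_STORE_KEY` and both helper functions are hypothetical.

```python
import os

def load_secret(name, env=None):
    """Fetch a secret from the environment instead of hard-coding it in a cell.

    `name` is whatever variable your platform injects per user; the name
    used in the example below is hypothetical.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set for this user")
    return value

def redact(value, visible=4):
    """Mask a secret so it can be displayed or logged without leaking it."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

# Example with an explicit mapping standing in for the real environment.
key = load_secret("OBJECT_STORE_KEY", {"OBJECT_STORE_KEY": "abcd1234"})
print(redact(key))
```

Because the secret never appears in a cell, sharing the notebook or pushing it to GitHub does not share the credential.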
Schedule Resources Efficiently
YARN Queues
• Take advantage of YARN’s hierarchical queue system to manage
and organize resources.
– Over-allocate queues for better resource utilization and sharing.
– Take advantage of node labels for users that have priority jobs
that require an SLA.
– Intra-queue preemption and asynchronous container allocation
should be available in YARN 3.0.
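The queue setup can be sketched as a set of Capacity Scheduler properties. The queue names and percentages below are invented, and the sketch reads "over-allocate" as letting each queue's maximum-capacity exceed its guaranteed capacity, which is one interpretation of the slide rather than the speakers' stated configuration.

```python
def capacity_scheduler_properties(queues, parent="root"):
    """Build Capacity Scheduler properties for the child queues of `parent`.

    `queues` maps queue name -> (capacity %, maximum-capacity %).
    """
    total = sum(capacity for capacity, _ in queues.values())
    if abs(total - 100) > 1e-9:
        raise ValueError(f"child capacities must sum to 100, got {total}")
    prefix = f"yarn.scheduler.capacity.{parent}"
    props = {f"{prefix}.queues": ",".join(queues)}
    for name, (capacity, maximum) in queues.items():
        if maximum < capacity:
            raise ValueError(f"{name}: maximum-capacity is below capacity")
        props[f"{prefix}.{name}.capacity"] = str(capacity)
        # A maximum above the guaranteed share lets a queue borrow idle resources.
        props[f"{prefix}.{name}.maximum-capacity"] = str(maximum)
    return props

# Hypothetical layout: guaranteed shares sum to 100, maximums overlap.
props = capacity_scheduler_properties({
    "production": (50, 100),
    "notebooks": (30, 80),
    "adhoc": (20, 60),
})
for key, value in props.items():
    print(f"{key}={value}")
```

Guaranteed capacities protect scheduled jobs, while the overlapping maximums let bursty notebook users borrow whatever is idle at the moment.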
Dynamic Allocation
• Dynamic allocation lets you take advantage of varying activity
– Proactively scales the number of executors based on the
scheduler's backlog.
– Removes idle executors after a timeout.
– Set a sensible initial executor count and a minimum
executor floor, then let executors ramp up on demand.
• Static allocation is best for known workloads
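The dynamic allocation settings can be collected as Spark properties. This is a sketch: the property names are Spark's, but the values are placeholders rather than recommendations from the talk, and the helper functions are hypothetical.

```python
def dynamic_allocation_conf(initial=2, minimum=1, maximum=20, idle_timeout="60s"):
    """Return Spark properties for dynamic allocation on YARN.

    The numbers are placeholders; pick them from your own workload.
    """
    if not minimum <= initial <= maximum:
        raise ValueError("expected minimum <= initial <= maximum")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files reachable
        # after an executor has been removed.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.initialExecutors": str(initial),
        "spark.dynamicAllocation.minExecutors": str(minimum),
        "spark.dynamicAllocation.maxExecutors": str(maximum),
        "spark.dynamicAllocation.executorIdleTimeout": idle_timeout,
    }

def to_submit_args(conf):
    """Render properties as spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

print(" ".join(to_submit_args(dynamic_allocation_conf())))
```

A small initial count keeps kernel startup cheap for notebooks that never run a heavy job, while the maximum caps how much of the cluster one user can take.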
New Work
Improvements to Spark
• Alleviate the tradeoffs inherent in current best
practices.
– Recover cached data when shutting down idle
executors
– Proactively shut down executors to prevent
starvation
Recovering Cached Data
• Replicates an idle executor's cached data to the
remaining executors before shutting it down.
• Ameliorates cache issues with dynamic
allocation
• Useful in shared Spark notebook environments
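The idea can be shown with a toy model. This is not the implementation in the Spark pull request: the data structures, the uniform cache capacity, and the placement rule (largest block first, onto the emptiest executor) are assumptions made for illustration.

```python
def decommission(executor_id, cached_blocks, capacity):
    """Move an idle executor's cached blocks to the remaining executors.

    cached_blocks: dict of executor id -> dict of block id -> size in MB
    capacity: cache capacity in MB per executor (assumed uniform)
    Returns the block ids that could not be placed and are therefore lost.
    """
    leaving = cached_blocks.pop(executor_id)
    lost = []
    # Place the largest blocks first, each on the executor with the least cache in use.
    for block_id, size in sorted(leaving.items(), key=lambda item: -item[1]):
        if not cached_blocks:
            lost.append(block_id)
            continue
        target = min(cached_blocks, key=lambda e: sum(cached_blocks[e].values()))
        if sum(cached_blocks[target].values()) + size <= capacity:
            cached_blocks[target][block_id] = size
        else:
            lost.append(block_id)
    return lost

cluster = {
    "e1": {"rdd_0_0": 100, "rdd_0_1": 50},
    "e2": {"rdd_0_2": 30},
    "e3": {},
}
lost = decommission("e1", cluster, capacity=120)
print(cluster, lost)
```

Without this step, removing executor e1 would drop both of its blocks and the next notebook cell would have to recompute them.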
Benchmarks
Check out my PR
– SPARK-21097
– github.com/apache/spark/pull/19041
Preventing Starvation
• Eliminate issues where users are unable to run
anything due to other users taking up all of the
cluster’s resources.
• Especially useful in shared Spark notebook
environments where idle resources can be
reclaimed easily.
• Preemption can solve this.
Enter Preemption
• Requests containers associated with over-allocated
queues to shut down.
• Handle YARN's PreemptionMessage in a way
that best suits the workload.
• Pick the right executors to terminate.
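"Picking the right executors" could look like the sketch below. The ranking rule (fewest running tasks, then least cached data) and the `Executor` fields are assumptions for illustration; they are not Spark's API and not necessarily the policy proposed under SPARK-21122.

```python
from dataclasses import dataclass

@dataclass
class Executor:
    """Minimal view of an executor; the fields are illustrative, not Spark's API."""
    executor_id: str
    running_tasks: int
    cached_mb: int

def pick_victims(executors, count):
    """Choose `count` executors to give up when a queue is asked to shrink.

    Preference order: fewest running tasks, then least cached data, so that
    as little work and cache as possible is thrown away.
    """
    ranked = sorted(executors, key=lambda e: (e.running_tasks, e.cached_mb))
    return [e.executor_id for e in ranked[:count]]

executors = [
    Executor("1", running_tasks=4, cached_mb=0),
    Executor("2", running_tasks=0, cached_mb=512),
    Executor("3", running_tasks=0, cached_mb=0),
]
print(pick_victims(executors, 2))
```

Idle executors with nothing cached go first, so an interactive user who is simply away from the notebook gives up resources before anyone's running job is interrupted.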
Keep up with the JIRA
• SPARK-21122
• PR coming soon
Call to action
• Look at our JIRAs
• Try out our PRs
Shout-outs!!!
• Our notebook workload simulator, benchmark,
and tracing tools.
– spark-bench - github.com/SparkTC/spark-bench
• Check out Emily Curtin’s talk tomorrow about spark-bench.
– spark-tracing - github.com/SparkTC/spark-tracing
• Matthew Schauer's baby awaiting open-source approval.
• Special thanks to Vijay Bommireddipalli and
Fred Reiss for their guidance and support!
Contact Info
Brad Kaiser
kaiserb@us.ibm.com
Craig Ingram
cingram@us.ibm.com
Extra Material
YARN Asynchronous Scheduling
• Enable asynchronous scheduling of containers in YARN.
– yarn.scheduler.capacity.schedule-asynchronously.enable
– YARN-7327 and YARN-5139
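For reference, the property can be rendered as a Hadoop-style configuration entry. The helper below is a hypothetical convenience, and the value "true" is an assumption about how you would switch the feature on; the property is typically set in capacity-scheduler.xml.

```python
import xml.etree.ElementTree as ET

def hadoop_property(name, value):
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")

print(hadoop_property(
    "yarn.scheduler.capacity.schedule-asynchronously.enable", "true"))
```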
References
• Some Icons provided by Icons8
Quick Settings
• Use spark.yarn.jars or
spark.yarn.archive.
• Running Spark from tmpfs did not improve
performance.
• Support for multiple/new versions of Spark.
Disable unused credential providers
• spark.yarn.security.credentials.hive.enabled
• spark.yarn.security.credentials.hbase.enabled
conf      avg     stdev
default   8.745   0.051
nohive    8.7     0.05
nohbase   7.87    0.05
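The table's figures can be compared with a little arithmetic, alongside the properties that disable the providers. This is a sketch: the numbers are copied from the slide, which does not state their unit, and the helper names are hypothetical.

```python
# Figures copied from the slide's table; the slide does not state the unit.
BENCHMARK = {
    "default": 8.745,
    "nohive": 8.7,
    "nohbase": 7.87,
}

def relative_change(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline

def disable_providers(*services):
    """Spark properties that switch off YARN credential providers by service name."""
    return {f"spark.yarn.security.credentials.{s}.enabled": "false" for s in services}

for name in ("nohive", "nohbase"):
    change = relative_change(BENCHMARK["default"], BENCHMARK[name])
    print(f"{name}: {change:.1%} lower than default")

print(disable_providers("hive", "hbase"))
```

On these figures, disabling the Hive provider changes the average by about 0.5%, while disabling the HBase provider lowers it by about 10%.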
Problem Domain
• Security
– UI and service protection
– Data governance and auditing
• Stability
• Performance
What’s next…
• spark-on-k8s
• Scheduler improvements
• Executor startup time reduction
• Hundreds of notebook users lead to a
highly multitenant and interactive workload.
• Challenge: Give each user the illusion of having
a large cluster all to herself
• Many users at once
• Bursty offered load
• Latency is important
What platforms are out there…
Hosted Solutions
• Data Science Experience
(DSX)/Watson Data Platform (WDP)
• Databricks
• Google Cloud Platform
• Microsoft Azure HDInsight
• and others!!!
What platforms are out there…
Self-Hosted Solutions
• HDP – Hortonworks Data Platform
• CDH – Cloudera Distribution
Including Hadoop
• MapR
• Mesosphere
Reasons to Self Host at TWC
• Cloud Agnostic
• Flexibility
• Sensitive Data
• Fewer Options in 2014
• Cassandra Colocation
• Cost…maybe?
Potential Downfalls
• In a self-hosted environment, everything is up to
you.
– Security
– Stability and Performance
– Scalability
• Compute
• Storage
– Monitoring
– Alerting
– Logging
