Building Efficient Pipelines in Apache Spark
Guru Medasani
Agenda
• Introduction
• Myself
• Cloudera
• Spark Pipeline Essentials
• Using Spark UI
• Resource Allocation
• Tuning
• Data Formats
• Streaming
• Questions
Introduction: Myself
• Current: Senior Solutions Architect at Cloudera (Chicago, IL)
• Past: Big Data Engineer at Monsanto Research & Development (St. Louis, MO)
Introduction: Cloudera
The modern platform for data management, machine learning and advanced analytics
• Founded in 2008 by former employees of Google, Yahoo!, Facebook, and Oracle
• Product: first commercial distribution of Hadoop (CDH), shipped 2009
• World-class support: 24x7 global staff, operations in 27 countries, proactive & predictive support programs using our EDH
• Mission critical: production deployments in run-the-business applications worldwide – Financial Services, Retail, Telecom, Media, Health Care, Energy, Government
• The largest ecosystem: 2,500+ partners
• Cloudera University: over 45,000 trained
• Open source leaders: Cloudera employees are leading developers & contributors across the complete Apache Hadoop ecosystem of projects
Spark Pipeline Essentials: Using Spark UI
UI: Event Timeline
UI: Job Details - DAG
UI: Stage Details
UI: Stage Metrics
UI: Skewed Data Metrics - Example
UI: Job Labels and Storage
UI: Job Labels and RDD Names
UI: DataFrame and Dataset Names
https://issues.apache.org/jira/browse/SPARK-8480
UI: Skipped Stages
http://stackoverflow.com/questions/34580662/what-does-stage-skipped-mean-in-apache-spark-web-ui
UI: Using Shuffle Metrics
Lot’s more in the UI
• SQL Queries
• Environment Variables
• Executor Aggregates
Spark Pipeline Essentials: Resource Allocation
Resources: Basics
• If running Spark on YARN
• First step: set up proper YARN resource queues and dynamic resource pools (then point each job at its queue, as sketched below)
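For a hedged illustration of the application side: once the queues exist, each pipeline can be pointed at its pool at submit time. The queue name below is a placeholder for whatever your YARN setup defines.

  import org.apache.spark.SparkConf

  // Submit this application into a specific YARN queue / dynamic resource pool.
  // "etl" is an assumed queue name, not a default.
  val conf = new SparkConf().set("spark.yarn.queue", "etl")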
Resources: Dynamic Allocation
• Dynamic allocation lets Spark scale the cluster resources allocated to your application up and down with the workload.
• Originally available only for Spark on YARN; now supported by all cluster managers.
Static Allocation vs Dynamic Allocation
• Static Allocation
• --num-executors NUM
• Dynamic Allocation
• Enabled by default in CDH
• Good starting point
• Not the final solution (a configuration sketch follows below)
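To make the two modes concrete, here is a minimal sketch; the app name, executor bounds, and timeout are illustrative assumptions, not recommendations.

  import org.apache.spark.SparkConf

  // Static allocation: fix resources up front on the command line, e.g.
  //   spark-submit --num-executors 17 --executor-cores 5 --executor-memory 19g ...

  // Dynamic allocation: let Spark grow and shrink the executor count with the
  // workload. On YARN the external shuffle service must be enabled so shuffle
  // files survive executor removal.
  val conf = new SparkConf()
    .setAppName("my-pipeline")                                 // placeholder name
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "2")          // assumed floor
    .set("spark.dynamicAllocation.maxExecutors", "50")         // assumed ceiling
    .set("spark.dynamicAllocation.executorIdleTimeout", "60s")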
Dynamic Allocation in Spark Streaming
• Enabled by default in CDH
• Cloudera recommends disabling dynamic allocation for Spark Streaming (configuration sketch below)
• Why?
• Dynamic allocation removes executors when they go idle.
• In streaming, data arrives every batch, and executors run whenever data is available.
• If the executor idle timeout is less than the batch duration, executors are constantly added and removed.
• If the executor idle timeout is greater than the batch duration, executors are never removed.
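A minimal sketch of the recommendation, assuming a hypothetical streaming job: turn dynamic allocation off and pin a fixed executor count instead.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("my-streaming-pipeline")             // placeholder name
    .set("spark.dynamicAllocation.enabled", "false") // override the CDH default
    .set("spark.executor.instances", "10")           // fixed count, same as --num-executors 10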
Resources: # Executors, cores, memory !?!
• 6 Nodes
• 16 cores each
• 64 GB of RAM each
Decisions, decisions, decisions
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)
• 6 nodes
• 16 cores each
• 64 GB of RAM
Spark Architecture recap
Answer #1 – Most granular
• Have the smallest-sized executors possible
• 1 core each
• 64 GB/node / 16 executors/node = 4 GB/executor
• Total of 16 cores x 6 nodes = 96 cores => 96 executors
[Diagram: a worker node packed with single-core executors]
Why?
• Forgoes the benefits of running multiple tasks inside the same executor JVM.
• Loses the benefit of shared broadcast variables: each of the 96 executors needs its own copy of the data.
Answer #2 – Least granular
• 6 executors in total => 1 executor per node
• 64 GB memory each
• 16 cores each
[Diagram: one executor occupying the entire worker node]
Why?
• You need to leave some memory and CPU headroom for the OS and Hadoop daemons
Answer #3 – with overhead
• 6 executors – 1 executor/node
• 63 GB memory each
• 15 cores each
[Diagram: one executor per worker node, with 1 GB and 1 core reserved as overhead]
Let’s assume…
• You are running Spark on YARN, from here on…
• 4 other things to keep in mind
#1 – Memory overhead
• --executor-memory controls the executor heap size
• You also need some overhead for off-heap memory, controlled by spark.yarn.executor.memoryOverhead
• Default is max(384 MB, 0.10 * spark.executor.memory) (worked example below)
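A small worked example of that formula, using the 19 GB executors this deck arrives at later; the numbers are illustrative.

  // YARN must grant heap + overhead for each executor container.
  val executorMemoryMb = 19 * 1024                       // --executor-memory 19g
  val overheadMb = math.max(384, (0.10 * executorMemoryMb).toInt)
  val containerMb = executorMemoryMb + overheadMb
  // 19456 MB heap + 1945 MB overhead = 21401 MB requested from YARN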
#2 - YARN AM needs a core: Client mode
#2 YARN AM needs a core: Cluster mode
#3 HDFS Throughput
• 15 cores per executor can hurt HDFS I/O throughput.
• Best to keep it to 5 or fewer cores per executor.
#4 Garbage Collection
• Too much executor memory could cause excessive garbage collection delays.
• 64GB is a rough guess as a good upper limit for a single executor.
• When you reach this level, start looking at GC tuning (a starting-point sketch follows)
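One hedged starting point, not a universal setting: move the executors to the G1 collector and turn on GC logging so pauses show up in the executor logs.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
      "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")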
Calculations
• 5 cores per executor
• For max HDFS throughput
• Cluster has 6 * 15 = 90 cores in total (after taking out a core per node for Hadoop/YARN daemons)
• 90 cores / 5 cores/executor = 18 executors
• Each node runs 3 executors
• 63 GB / 3 = 21 GB per executor; minus ~7% memory overhead: 21 x (1 - 0.07) ~ 19 GB heap
• Reserve 1 executor for the AM => 17 executors
[Diagram: worker node running three executors plus reserved overhead]
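The same arithmetic as a sketch, so it can be rerun for other node specs; the inputs are this example's 6-node cluster.

  val nodes = 6
  val usableCoresPerNode = 15     // 16 cores minus 1 for OS/Hadoop daemons
  val usableMemPerNodeGb = 63.0   // 64 GB minus 1 for OS/Hadoop daemons
  val coresPerExecutor = 5        // HDFS throughput guideline

  val executorsPerNode = usableCoresPerNode / coresPerExecutor            // 3
  val totalExecutors = nodes * executorsPerNode - 1                       // 17, one slot left for the AM
  val heapGb = (usableMemPerNodeGb / executorsPerNode * (1 - 0.07)).toInt // ~19 GB after 7% overhead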
Correct answer
• 17 executors in total
• 19 GB memory/executor
• 5 cores/executor
* Not etched in stone
Dynamic allocation helps with this though, right?
• Dynamic allocation only decides the number of executors for you (--num-executors)
• You still have to choose the cores for each executor (--executor-cores)
• ...and the memory for each executor (--executor-memory)
• 6 nodes, 16 cores each, 64 GB of RAM each
Spark Pipeline Essentials: Tuning
Memory: Unified Memory Management
https://issues.apache.org/jira/browse/SPARK-10000
Memory: Example
• Let's say you have a 64 GB executor.
• Default spark.memory.fraction = 0.6 => 0.6 * 64 = 38.4 GB for execution + storage
• Default spark.memory.storageFraction = 0.5 => 0.5 * 38.4 = 19.2 GB protected for storage
• Based on how much data is being spilled, GC pauses, and OutOfMemoryErrors, you can take the following actions (knobs 2-4 sketched below):
1. Increase the number of executors (increasing parallelism)
2. Tweak spark.yarn.executor.memoryOverhead (avoid OutOfMemoryErrors)
3. Tweak spark.memory.fraction (reduces memory pressure and spilling)
4. Tweak spark.memory.storageFraction (set what you actually need, nothing excessive)
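A sketch of knobs 2-4, with assumed values to be adjusted against your own spill, GC, and OOM observations rather than copied.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.yarn.executor.memoryOverhead", "4096") // MB of off-heap headroom (assumed)
    .set("spark.memory.fraction", "0.7")               // execution + storage share of heap (assumed)
    .set("spark.memory.storageFraction", "0.4")        // portion shielded for cached data (assumed)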
Memory: Hidden Caches (GraphX)
org.apache.spark.graphx.lib.PageRank
Memory: Hidden Caches (MLlib)
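These libraries call cache() internally, so RDDs appear in the UI's Storage tab even though your code never persisted anything. A small sketch for inspecting them programmatically, assuming sc is your SparkContext:

  // List every RDD currently persisted in this application.
  sc.getPersistentRDDs.foreach { case (id, rdd) =>
    println(s"RDD $id '${rdd.name}' persisted at ${rdd.getStorageLevel}")
  }

  // Free one you no longer need (the id is illustrative).
  sc.getPersistentRDDs.get(42).foreach(_.unpersist())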
Parallelism
• The number of tasks depends on the number of partitions
• Too many partitions is usually better than too few
• A very important parameter in determining performance
• For datasets read from HDFS, the partition count comes from the number of HDFS blocks
• Typically each HDFS block becomes one partition of the RDD
• The user can specify the number of partitions at input time or in transformations
• What should X be in the call below?
• The most straightforward answer is experimentation
• Look at the number of partitions in the parent RDD, then keep multiplying by 1.5 until performance stops improving
val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = X)
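A sketch of that experiment, assuming rdd1 is the pair RDD from the line above:

  val parentPartitions = rdd1.getNumPartitions  // e.g. one partition per HDFS block
  val x = (parentPartitions * 1.5).toInt        // first candidate for X; multiply by 1.5 again and re-measure
  val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = x)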
How about the cluster?
• The two main resources that Spark (and YARN) think about are CPU and memory
• Disk and network I/O, of course, play a part in Spark performance as well
• But neither Spark nor YARN currently do anything to actively manage them
Further Tuning
• Slim down your data structures
• The in-memory footprint of your data structures greatly affects performance
• Kryo serialization is preferred over the default Java serialization for custom objects
• Cache the data in memory to measure the dataset size and estimate record sizes
• Example: (total cached RDD size) / (number of records in the RDD)
• Gives a rough estimate of how much memory your records occupy
• If several transformations have produced a custom object, this is the easiest way to gauge its size
• You can also use SizeEstimator's estimate method to find an object's size (sketch below)
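A sketch of both ideas; MyRecord and record are placeholders for whatever your pipeline actually builds.

  import org.apache.spark.SparkConf
  import org.apache.spark.util.SizeEstimator

  case class MyRecord(id: Long, value: Double, tag: String)

  // Prefer Kryo and register custom classes so serialized data stays compact.
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyRecord]))

  // Rough in-memory size, in bytes, of one object graph.
  val record = MyRecord(1L, 3.14, "example")
  val bytes = SizeEstimator.estimate(record)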
Spark Pipeline Essentials: Data Formats
Data Formats
• Parquet
• Avro
• JSON
• Avoid it if you can
• Needless CPU cycles get spent parsing large text files again and again
Storage: Parquet
• Popular columnar format for analytical workloads
• Great performance
• Efficient compression
• Partition Discovery & Schema Merging
• Writes files into HDFS
• Small-files problem: needs monitoring and managed compactions
• Makes the ETL pipeline complex when handling updates
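A minimal Parquet round trip with the DataFrame API, assuming an existing SparkSession named spark; the paths and column names are placeholders. The idea: pay the text-parsing cost once, then serve every downstream read from the columnar copy.

  // One expensive text parse...
  val raw = spark.read.json("hdfs:///data/raw/events.json")

  // ...then a compressed, columnar copy for all later stages.
  raw.write.mode("overwrite").parquet("hdfs:///data/clean/events")

  // Downstream jobs read only the columns they need.
  val events = spark.read.parquet("hdfs:///data/clean/events")
  events.select("user_id", "ts").show(5)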
Storage: Kudu
• Open source distributed columnar data store
• Runs on native Linux filesystem
• Currently GA and ships with CDH
• Similar performance to Parquet
• Handles updates
• No need to worry about files anymore
• Scales well
• Accessible from Spark via KuduContext (usage sketch below)
https://www.cloudera.com/products/open-source/apache-hadoop/apache-kudu.html
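A hedged sketch of the KuduContext path; the master address and table name are placeholders, spark and df are assumed to already exist, and the exact KuduContext constructor has varied across kudu-spark versions.

  import org.apache.kudu.spark.kudu._

  val kuduMaster = "kudu-master:7051"
  val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

  // Upsert a DataFrame into an existing Kudu table; updates are applied in place.
  kuduContext.upsertRows(df, "impala::default.events")

  // Read the table back as a DataFrame.
  val events = spark.read
    .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> "impala::default.events"))
    .format("org.apache.kudu.spark.kudu")
    .load()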
Spark Pipeline Essentials: Streaming
Streaming: Spark & Kafka Integration
• Use the direct approach (sketch below)
• Simplified parallelism
• More efficient and more reliable
• Exactly-once semantics
• Requires offset management
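A sketch of the direct (receiver-less) approach with the Kafka 0.10 integration; the broker address, group id, topic, and batch interval are placeholders, and spark is an existing SparkSession.

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010._

  val ssc = new StreamingContext(spark.sparkContext, Seconds(30))
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker1:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "my-pipeline",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)  // we manage offsets ourselves
  )
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
  )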
Streaming: Kafka Offset Management
• Set the Kafka parameter 'auto.offset.reset'
• Spark Streaming checkpoints
• Storing offsets in HBase
• Storing offsets in ZooKeeper
• Kafka itself: commit offsets back to Kafka (sketch below)
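A sketch of the "Kafka itself" option, reusing the stream from the previous sketch: capture each batch's offset ranges and commit them back to Kafka only after that batch's output work has succeeded.

  import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process rdd and write this batch's results here ...
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }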
More Resources
• Top 5 Spark Mistakes
• https://spark-summit.org/2016/events/top-5-mistakes-when-writing-spark-applications/
• Self-paced spark workshop
• https://github.com/deanwampler/spark-workshop
• Tips for Better Spark Jobs
• http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
• Tuning & Debugging Spark (with another explanation of internals)
• http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spar
Questions?
Thank you
Email: gmedasani@cloudera.com
Twitter: @gurumedasani