Profiling & Testing with Spark
Apache Spark 2.0 Improvements, Flame Graphs & Testing
Outline
Overview
Spark 2.0 Improvements
Profiling with Flame Graphs
How-to Flame Graphs
Testing in Spark
Overview
Apache Spark™ is a fast and general engine for large-scale data processing
Speed: Computes in memory, up to 100x faster than Hadoop MapReduce
Ease of Use: APIs for Java, Scala, Python and R
Generality: Combines SQL, streaming and complex analytics (ML)
Portability: Runs on YARN, Mesos, standalone or in the cloud
Overview (Big Picture)
Overview (architecture)
Overview (code sample)
Monte Carlo π calculation
“This code estimates π by "throwing darts" at a circle. We pick random
points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit
circle. The fraction should be π / 4, so we use this to get our estimate.”
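The dart-throwing estimate described in the quote can be sketched in plain Python. This is a minimal single-process illustration, not the Spark sample from the slide; the function name `estimate_pi` and the fixed seed are assumptions made for the example.

```python
import random


def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi from the fraction of random points that land in the unit circle."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    inside = 0
    for _ in range(num_samples):
        # Pick a random point in the unit square (0, 0) to (1, 1).
        x, y = rng.random(), rng.random()
        # The point is inside the quarter circle when x^2 + y^2 < 1.
        if x * x + y * y < 1.0:
            inside += 1
    # The fraction inside approximates pi / 4.
    return 4.0 * inside / num_samples
```

In Spark the same loop would typically be expressed as a parallel filter and count over a range of sample indices, so that each executor throws its own share of the darts.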
Main Takeaway
Spark SQL:
Provides parallelism, affordable at scale
Scale out SQL on storage for Big Data volumes
Scale out on CPU for memory-intensive queries
Offloading reports from RDBMS becomes attractive
Spark 2.0 improvements:
Considerable speedup of CPU-intensive queries
Spark 2.0 Improvements
SQL Queries
sqlContext.sql("""
  SELECT a.bucket, sum(a.val2) tot
  FROM t1 a, t1 b
  WHERE a.bucket = b.bucket
    AND a.val1 + b.val1 < 1000
  GROUP BY a.bucket
  ORDER BY a.bucket""").show()
A complex, resource-intensive SELECT statement:
Use the EXPLAIN directive to see its execution plan
Execution Plan
The execution plan:
First instrumentation point for SQL tuning
Shows how Spark wants to execute the query (break-down)
Main players:
Catalyst: the query optimizer
Catalyst (query optimizer)
Logical Plan:
Describes computation on data sets without defining how to conduct it
Physical Plan:
Defines which computation to conduct on each dataset
Project Tungsten (Goal)
“Improves the memory and CPU efficiency of Spark backend
execution by pushing performance close to the limits of
modern hardware.”
Project Tungsten
Perform manual memory management instead of relying on Java objects:
Reduce memory footprint
Eliminate garbage collection overheads
Use sun.misc.Unsafe and off-heap memory
Code generation for expression evaluation:
Reduce virtual function calls and interpretation overhead (JVM)
Project Tungsten (Code-Gen)
The Volcano Iterator Model:
Standard for 30 years: almost all databases do it.
Each operator is an “iterator” that consumes records from its input operator
Project Tungsten (Code-Gen)
Downsides of the Volcano Iterator Model:
Too many virtual function calls
at least 3 calls for each row in the Aggregate phase
Can’t take advantage of modern CPU features
pipelining, prefetching, branch prediction, ...
Project Tungsten (Code-Gen)
What if we hire a college freshman to implement this query in Java in 10 mins?
Whole-stage Code-Gen: Spark as a “Compiler”
Project Tungsten (Code-Gen)
A student beating 30 years of science ...
Project Tungsten (Code-Gen)
Volcano
● Many virtual function calls
● Data in memory (or cache)
● No loop unrolling, SIMD
Hand-written code
● No virtual function calls
● Data in CPU registers
● Exploit compiler optimizations
○ loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
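The contrast between the two columns can be sketched with a toy query that counts the rows equal to a key. The query, the class names (`Scan`, `Filter`, `CountAggregate`) and both helper functions are hypothetical and chosen only for illustration. Python shows the structural difference, not the speedup, which comes from what a JIT compiler can do with the fused loop.

```python
from typing import Callable, Iterable, Optional


class Scan:
    """Leaf operator: hands out one row per next() call, None when exhausted."""

    def __init__(self, rows: Iterable[int]) -> None:
        self._rows = iter(rows)

    def next(self) -> Optional[int]:
        return next(self._rows, None)


class Filter:
    """Pulls rows from its child until one satisfies the predicate."""

    def __init__(self, child, predicate: Callable[[int], bool]) -> None:
        self._child = child
        self._predicate = predicate

    def next(self) -> Optional[int]:
        while True:
            row = self._child.next()
            if row is None or self._predicate(row):
                return row


class CountAggregate:
    """Root operator: counts every row its child produces."""

    def __init__(self, child) -> None:
        self._child = child

    def execute(self) -> int:
        count = 0
        while self._child.next() is not None:
            count += 1
        return count


def count_volcano(rows: Iterable[int], key: int) -> int:
    """Volcano style: each row crosses several next() calls between operators."""
    plan = CountAggregate(Filter(Scan(rows), lambda r: r == key))
    return plan.execute()


def count_fused(rows: Iterable[int], key: int) -> int:
    """Hand-written style: one tight loop, the running count stays in a local variable."""
    count = 0
    for item in rows:
        if item == key:
            count += 1
    return count
```

Both functions return the same count. Whole-stage code generation aims to emit something shaped like `count_fused` for each query instead of interpreting a tree of operators shaped like `count_volcano`.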
Execution plan comparison (legacy vs whole stage code-gen)
WholeStageCodeGen
Profiling with Flame Graphs
Root Cause Analysis
Benchmarking:
Run the workload and measure it with the relevant diagnostic tools
Goals: understand the bottleneck(s) and find root causes
Limitations:
Our tools & time available for analysis are limiting factors
Profiling CPU-Bound workloads
Flame graph visualization of stack profiles:
● Brainchild of Brendan Gregg (Dec 2011)
● Code: https://github.com/brendangregg/FlameGraph
● Now very popular, available for many languages, also for JVM
Shows which parts of the code are hot
● Very useful to understand where CPU cycles are spent
Flame Graph Visualization
Recipe:
● Gather multiple stack traces
● Aggregate them by sorting alphabetically by function/method name
● Visualize them as stacked colored boxes
● Width of each box is proportional to the time spent there
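The first two steps of the recipe can be sketched as follows. The sketch assumes each sampled stack is already available as a list of frame names ordered from root to leaf; the function names are hypothetical. The "frames joined by semicolons, then a count" text layout is the folded format that the FlameGraph scripts take as input.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def fold_stacks(samples: Sequence[Sequence[str]]) -> List[Tuple[str, int]]:
    """Merge identical stacks and sort them alphabetically by frame names."""
    counts = Counter(";".join(stack) for stack in samples)
    # Sorting places stacks that share a prefix next to each other,
    # so their common parent frames can be drawn as one wide box.
    return sorted(counts.items())


def to_folded_text(samples: Sequence[Sequence[str]]) -> str:
    """Render one 'frame;frame;frame count' line per distinct stack."""
    return "\n".join(f"{stack} {count}" for stack, count in fold_stacks(samples))


def width_fraction(samples: Sequence[Sequence[str]], frame: str) -> float:
    """Share of all samples in which the given frame appears anywhere on the stack."""
    if not samples:
        return 0.0
    hits = sum(1 for stack in samples if frame in stack)
    return hits / len(samples)
```

The box width for a frame corresponds to `width_fraction`: a frame present in half of the sampled stacks spans half of the graph.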
Flame Graph (Spark 1.6)
Flame Graph (Spark 2.0)
Spark CodeGen vs. Volcano
Code generation improves CPU-intensive workloads
● Replaces loops and virtual function calls with code generated for the query
● The use of vector operations (e.g. SIMD) is also beneficial
● Codegen is crucial for modern in-memory DBs
Commercial RDBMS engines
● Typically use the slower Volcano model (with loops and virtual function calls)
● In the past, optimizing for I/O latency was more important; now CPU cycles matter more
Flame Graphs
Pros: good to understand where CPU cycles are spent
● Useful for performance troubleshooting
● Functions at the top of the graph are the ones using CPU
● Parent methods/functions provide context
Limitations:
● Off-CPU and wait time are not charted (support is experimental)
● Interpreting flame graphs requires experience and knowledge
● Not included in the Spark monitoring suite
How-to Flame Graphs
CERN Java Flight Recorder Approach (1/2)
Enable Java Flight Recorder (JFR)
● Extra options in spark-defaults.conf or CLI. Example:
Collect data with jcmd:
● Example, sampling for 10 sec:
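The examples shown on the slide did not survive in this transcript. As a hedged sketch of what the two steps typically involve, the helpers below only build the configuration lines and the `jcmd` command; they do not run anything. The sketch assumes an Oracle JDK 8 JVM, where Flight Recorder is switched on with `-XX:+UnlockCommercialFeatures -XX:+FlightRecorder`. The helper names, the process id and the output path are hypothetical.

```python
from typing import List

# JVM flags that enable Java Flight Recorder on Oracle JDK 8 (assumption: that JDK is in use).
JFR_JAVA_OPTIONS = "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder"


def spark_defaults_lines() -> List[str]:
    """Lines to add to spark-defaults.conf so driver and executors start with JFR enabled."""
    return [
        f"spark.driver.extraJavaOptions {JFR_JAVA_OPTIONS}",
        f"spark.executor.extraJavaOptions {JFR_JAVA_OPTIONS}",
    ]


def jcmd_start_command(pid: int, seconds: int, output_path: str) -> List[str]:
    """Argument list for a jcmd call that records the given JVM for a fixed duration."""
    return [
        "jcmd",
        str(pid),
        "JFR.start",
        f"duration={seconds}s",
        f"filename={output_path}",
    ]
```

Calling `jcmd_start_command(12345, 10, "/tmp/executor.jfr")` yields the arguments for a 10-second recording of a hypothetical process 12345. The list can be passed to `subprocess.run` on the host where that JVM runs.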
CERN Java Flight Recorder Approach (2/2)
Process the jfr file:
● From .jfr to merged stacks
● Produce the .svg file with the flame graph
● Find details in Kay Ousterhout’s article:
https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4
PayPal Approach
https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/
CERN HProfiler Approach
HProfiler (CERN home-built tool)
● Automates collection and aggregation of stack traces into flame graphs for
distributed applications
● Integrates with YARN to identify the processes to trace across the cluster
Based on Linux perf_events stack sampling (bare metal)
Experimental tool
● Author Joeri Hermans @ CERN
● https://github.com/cerndb/Hadoop-Profiler
● Hadoop-performance-troubleshooting-stack-tracing
Testing in Spark
● Why run Spark outside of a cluster
● What to test
● Running Local
● Running as a Unit Test
● Data Structures
Testing in Spark
Why run Spark outside of a cluster
● Time
● Trusted Deployment
● Money
Testing in Spark
What to test
● Experiments
● Complex logic
● Data samples
● Business generated scenarios
Testing in Spark (Running Local)
Running Local
● A test doesn’t always need to be a unit test
● UIs like Zeppelin are OK for quick feedback but lack IDE features
● Running local in your IDE is priceless
Testing in Spark (Running Local)
Example
● Use runLocal flag to set a local SparkContext
● Separate out testable work from driver code
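The two points above can be sketched as follows, using PySpark in place of the Scala code on the slide. The `runLocal` argument handling, the function names and the sample data are assumptions made for the example. PySpark is imported inside `main` so that the testable function can be imported and exercised without a Spark installation.

```python
from typing import List, Optional


def count_tokens(line: str) -> int:
    """Testable work: pure logic with no dependency on Spark."""
    return len(line.split())


def master_for(run_local: bool) -> Optional[str]:
    """Return a local master URL when the (hypothetical) runLocal flag is set."""
    return "local[2]" if run_local else None


def main(argv: List[str]) -> None:
    """Driver code: only wires Spark to the testable function."""
    from pyspark import SparkConf, SparkContext  # lazy import; requires PySpark

    run_local = len(argv) > 1 and argv[1] == "runLocal"
    conf = SparkConf().setAppName("token-count")
    master = master_for(run_local)
    if master is not None:
        # Run inside this process with two worker threads instead of on a cluster.
        conf = conf.setMaster(master)
    sc = SparkContext(conf=conf)
    try:
        lines = sc.parallelize(["spark runs locally", "in your IDE"])
        print(lines.map(count_tokens).sum())
    finally:
        sc.stop()
```

With this split, `count_tokens` can be covered by ordinary unit tests, while `main(["app", "runLocal"])` exercises the full pipeline on a local SparkContext from the IDE.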
Testing in Spark (Unit Testing)
Example
FunSuite: ScalaTest's TDD-style unit testing suite for Scala
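The FunSuite example from the slide is not part of this transcript. An analogous structure, using Python's standard `unittest` module as a stand-in for FunSuite, might look like the sketch below. The function under test and the suite name are hypothetical.

```python
import unittest
from typing import Iterable, List


def dedupe_sorted(values: Iterable[int]) -> List[int]:
    """Logic under test: drop duplicates and return the values in ascending order."""
    return sorted(set(values))


class DedupeSortedSuite(unittest.TestCase):
    """One test method per behaviour, similar to one test(...) block per case in FunSuite."""

    def test_removes_duplicates(self) -> None:
        self.assertEqual(dedupe_sorted([3, 1, 3, 2]), [1, 2, 3])

    def test_empty_input(self) -> None:
        self.assertEqual(dedupe_sorted([]), [])


def run_suite() -> unittest.TestResult:
    """Load and run the suite programmatically, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DedupeSortedSuite)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The same shape carries over to Spark code: keep the transformation logic in plain functions, and let the suite create a local context only for the tests that need one.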
Testing in Spark (Data Structures)
Working with “hand-written” DataFrames:
Testing in Spark (Hive)
Testing with Hive:
● Spin-up a docker-hive container for Apache Hive (Big Data Europe)
● Enables real interaction, allowing you to:
○ create, delete, write, ...
Testing in Spark (Hive)
Putting Hive + Spark together:
● Create a custom hive-site.xml
● Start Spark with the provided hive-site.xml
○ spark-shell --files /PATH/hive-site.xml
Testing in Spark (Hive)
Start Spark with the provided hive-site.xml:
Testing in Spark (Mini-Clusters)
Mini-Clusters
● Hadoop-mini-cluster
● Spark-unit-testing-with-hdfs
● Support for:
○ HBase & Hive
○ Kafka & Storm
○ Zookeeper
○ HDFS
○ ... access HDFS files & test code
copy files from localFS to HDFS
Conclusions
Apache Spark 2.0 Improvements (HDP 2.5 in tech preview)
● Scalability and performance on commodity HW
● Spark SQL useful for offloading queries from traditional RDBMS
● Code generation gives speedups of up to one order of magnitude on CPU-bound workloads
Diagnostics
● Profiling tools are important in the MPP world
● Execution plans analyzed with flame graphs
● Cons: the solutions are still very immature
Testing
● Testing locally saves time and money and takes advantage of IDE features
● Elegant ways to test code by using a local SparkContext
● Easy ways to recreate environments for testing real interactions (such as Hadoop)
Profiling & Testing with Spark
THANK YOU!
References
● Deep-dive-into-catalyst-apache-spark-2.0
● http://es.slideshare.net/databricks/spark-performance-whats-next
● https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
● http://www.brendangregg.com/flamegraphs.html
● http://db-blog.web.cern.ch/
● http://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-ted-malaska
Q & A



Editor's Notes

  • #47: Could be provided in the Maven POM as: <scope>test</scope>
  • #49: MPP = Massively Parallel Processing