Beyond Snuffaluffagus
Streaming & Scaling your Dreams
Beyond Shuffling - Scaling Spark
And a preview of Structured Streaming!
Apache Spark London Meetup
Now mostly “works”*
*See developer for details - excluding structured streaming. Does not imply warranty. :p Does not apply to libraries
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming out this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
What is the Spark Technology Center?
● An IBM technology center focused around Spark
● We work on open source Apache Spark to make it more awesome
○ Python, SQL, ML, and more! :)
● Related components as well:
○ Apache Toree [Incubating] (Notebook solution for Spark with Jupyter)
○ spark-testing-base (testing utilities on top of Spark)
○ Apache Bahir
○ System ML - Machine Learning
● Partner with the Scala Foundation and other important players
● Multiple Spark Committers (Nick Pentreath, Xiao (Sean) Li, Prashant Sharma)
● Lots of contributions in Spark 2.0
Streaming & Scaling Spark - London Spark Meetup 2016
What is going to be covered:
● What I think I might know about you
● RDD re-use (caching, persistence levels, and checkpointing)
● Working with key/value data
○ Why groupByKey is evil and what we can do about it
● When Spark SQL can be amazing and wonderful
● A brief introduction to Datasets (new in Spark 1.6)
● Iterator-to-Iterator transformations (or yet another way to go OOM in the night)
● How to test your Spark code :)
● Structured Streaming! :D (new in Spark 2.0)
Torsten Reuschling
Or….
Huang Yun Chung
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
○ May or may not like pictures of David Hasselhoff?
● Know some Apache Spark
○ If you don’t - it’s cool, I’ll cover some of the basic topics
● Want to scale your Apache Spark jobs
● Don’t overly mind a grab-bag of topics
● Likely no longer distracted with Pokemon GO :(
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
Spark specific terms in this talk
● RDD
○ Resilient Distributed Dataset - Like a distributed collection. Supports many of the same operations as Seqs in Scala, but automatically distributed and fault tolerant. Lazily evaluated, and handles faults by recompute. Any* Java or Kryo serializable object.
● DataFrame
○ Spark DataFrame - not a Pandas or R DataFrame. Distributed, supports a limited set of operations. Columnar structured, runtime schema information only. Limited* data types.
● Dataset
○ Compile time typed version of DataFrame (templated)
skdevitt
Spark specific terms in this talk (part 2)
● SparkContext
○ Our “window to the world of Spark” - can load data from different sources
○ Start and stop our job
● SQLContext (Spark 2.0+ = SparkSession)
○ Also for loading data - except to DataFrames & Datasets instead of RDDs. Can register UDFs for SQL queries and be used to start a ThriftJDBC server. (A minimal sketch of both follows.)
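A minimal sketch of how these entry points fit together - the app name and paths are placeholders, and in 2.0+ you would normally go through SparkSession instead:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Pre-2.0 style entry points; SparkSession wraps both in 2.0+
val conf = new SparkConf().setAppName("beyond-shuffling-demo")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// SparkContext: load data as RDDs
val lines = sc.textFile("python/pyspark/*.py")

// SQLContext: load data as DataFrames / Datasets instead
val df = sqlContext.read.json("some/pandas.json")  // placeholder path

// Stop the job when we're done
sc.stop()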
The different pieces of Spark
Apache Spark
SQL &
DataFrames
Streaming
Language
APIs
Scala,
Java,
Python, &
R
Graph
Tools
Spark ML
bagel &
Graph X
MLLib
Community
Packages
The different pieces of Spark: 2.0+
Apache Spark
SQL &
DataFrames
Streaming
Language
APIs
Scala,
Java,
Python, &
R
Graph
Tools
Spark
ML
bagel &
Graph X
MLLib
Community
Packages
Structured
Streaming
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
Let’s look at some old standbys:
val rdd = sc.textFile("python/pyspark/*.py", 20)
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
val warnings = rdd.filter(_.toLower.contains("error")).count()
Tomomi
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory, cache it in memory
○ Persist at another level
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
○ Checkpointing
● Noisy clusters
○ _2 (replicated) levels & checkpointing can help
● Persist first for checkpointing (see the sketch below)
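A minimal sketch of the options above, assuming an existing SparkContext `sc`; the paths are placeholders:

// Assume `sc` is an existing SparkContext
val rdd = sc.textFile("python/pyspark/*.py")

// Fits nicely in memory? Just cache it (MEMORY_ONLY)
rdd.cache()
// Otherwise pick another level, e.g.:
//   rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
// or a replicated (_2) variant for noisy clusters:
//   rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

// Checkpointing truncates the lineage; persisting first keeps the
// checkpoint pass from recomputing everything.
sc.setCheckpointDir("/tmp/checkpoints")  // placeholder directory
rdd.checkpoint()
rdd.count()  // an action materializes both the cache and the checkpoint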
Richard Gillin
Considerations for Key/Value Data
● What does the distribution of keys look like?
● What type of aggregations do we need to do?
● Do we want our data in any particular order?
● Are we joining with another RDD?
● What’s our partitioner?
○ If we don’t have an explicit one: what is the partition structure? (A quick sketch of checking and setting it follows.)
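A quick sketch of checking and setting the partitioner, assuming `wordPairs` is the RDD[(String, Int)] from the wordcount example:

import org.apache.spark.HashPartitioner

wordPairs.partitioner  // Option[Partitioner] - often None after a map

// Give it an explicit partitioner so later joins/aggregations on the
// same key can avoid an extra shuffle
val partitioned = wordPairs.partitionBy(new HashPartitioner(20))
partitioned.partitioner  // Some(HashPartitioner with 20 partitions)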
eleda 1
What is key skew and why do we care?
● Keys aren’t evenly distributed
○ Sales by zip code, or records by city, etc.
● groupByKey will explode (but it's pretty easy to break)
● We can have really unbalanced partitions
○ If we have enough key skew sortByKey could even fail
○ Stragglers (uneven sharding can make some tasks take much longer)
Mitchell Joyce
groupByKey - just how evil is it?
● Pretty evil
● Groups all of the records with the same key into a single record
○ Even if we immediately reduce it (e.g. sum it or similar)
○ This can be too big to fit in memory, then our job fails
● Unless we are in SQL then happy pandas
PROgeckoam
So what does that look like?
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(67843, T, R) (10003, A, R)
(94110, [(A, B), (A, C), (E, F), (A, R), (D, R), (E, R), (E, R), (T, R), (T, R)])
Tomomi
Let’s revisit wordcount with groupByKey
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val grouped = wordPairs.groupByKey()
grouped.mapValues(_.sum)
Tomomi
And now back to the “normal” version
val words = rdd.flatMap(_.split(" "))
val wordPairs = words.map((_, 1))
val wordCounts = wordPairs.reduceByKey(_ + _)
wordCounts
GroupByKey
reduceByKey
So what did we do instead?
● reduceByKey
○ Works when the types are the same (e.g. in our summing version)
● aggregateByKey
○ Doesn’t require the types to be the same (e.g. computing stats model or similar) - see the sketch below
Allows Spark to pipeline the reduction & skip making the list
We also got a map-side reduction (note the difference in shuffled read)
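A hedged sketch of aggregateByKey, again using the wordcount `wordPairs` RDD[(String, Int)]: the accumulator type (count, sum) differs from the value type, which reduceByKey can’t express.

val zero = (0L, 0L)  // (count, sum)
val countAndSum = wordPairs.aggregateByKey(zero)(
  (acc, value) => (acc._1 + 1, acc._2 + value),  // fold one value into the accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)           // merge two accumulators (map-side + after shuffle)
)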
Can just the shuffle cause problems?
● Sorting by key can put all of the records in the same partition
● We can run into partition size limits (around 2GB)
● Or just get bad performance
● So that we can handle data like the above, we can add some “junk” to our key
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
PROTodd Klassy
Shuffle explosions :(
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110, A, B)
(94110, A, C)
(94110, E, F)
(94110, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(94110, T, R)
(94110, T, R)
(67843, T, R)(10003, A, R)
(10003, D, E)
javier_artiles
100% less explosions
(94110, A, B)
(94110, A, C)
(10003, D, E)
(94110, E, F)
(94110, A, R)
(10003, A, R)
(94110, D, R)
(94110, E, R)
(94110, E, R)
(67843, T, R)
(94110, T, R)
(94110, T, R)
(94110_A, A, B)
(94110_A, A, C)
(94110_A, A, R)
(94110_D, D, R)
(94110_T, T, R)
(10003_A, A, R)
(10003_D, D, E)
(67843_T, T, R)
(94110_E, E, R)
(94110_E, E, R)
(94110_E, E, F)
(94110_T, T, R)
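A rough sketch of the “add some junk to the key” trick above, assuming the records live in an RDD[(Int, (String, String))] named `records` (the name and shape are illustrative): extending the key with part of the value spreads a hot zip code like 94110 across several partitions.

val saltedKeys = records.map { case (zip, (a, b)) =>
  (s"${zip}_$a", (a, b))  // 94110 becomes 94110_A, 94110_E, 94110_T, ...
}
// Sorting or aggregating on the extended key no longer piles every
// 94110 record into a single partition
val sorted = saltedKeys.sortByKey()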
Jennifer Williams
Well there is a bit of magic in the shuffle….
● We can reuse shuffle files
● But it can (and does) explode*
Sculpture by Flaming Lotus Girls
Photo by Zaskoda
Photo by Christian Heilmann
Iterator to Iterator transformations
● Iterator to Iterator transformations are super useful
○ They allow Spark to spill to disk if reading an entire partition is too much
○ Not to mention better pipelining when we put multiple transformations together
● Most of the default transformations are already set up for this
○ map, filter, flatMap, etc.
● But when we start working directly with the iterators
○ Sometimes to save setup time on expensive objects
○ e.g. mapPartitions, mapPartitionsWithIndex etc.
● Reading into a list or other structure (implicitly or otherwise) can even cause OOMs :( (see the sketch below)
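A sketch of the difference - Parser and Record are hypothetical stand-ins for an expensive-to-build object and its output:

import org.apache.spark.rdd.RDD

case class Record(tokens: Array[String])
class Parser {  // stand-in for something expensive to construct
  def parse(line: String): Record = Record(line.split(" "))
}

// Iterator-to-iterator: the parser is built once per partition and the
// iterator is never materialized, so Spark can still spill and pipeline.
def parsePartitioned(rdd: RDD[String]): RDD[Record] =
  rdd.mapPartitions { iter =>
    val parser = new Parser()
    iter.map(parser.parse)  // stays an iterator - good
  }

// The dangerous version: forcing the whole partition into memory.
def parseEagerly(rdd: RDD[String]): RDD[Record] =
  rdd.mapPartitions { iter =>
    val parser = new Parser()
    iter.toList.map(parser.parse).iterator  // materializes the partition - can OOM
  }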
Christian Heilmann
tl;dr: be careful with mapPartitions
Christian Heilmann
Introducing Datasets
● New in Spark 1.6
● Awesome optimizer and storage
● Provide a templated, compile-time strongly typed version of DataFrames
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFs? So do I!
● Still an experimental component (API will change in future versions)
○ Although the next major version seems likely to be 2.0 anyway, so lots of things may change regardless
Houser Wolf
Using Datasets to mix functional & relational style
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
So what was that?
ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
A typed query (specifies the return type). Without the as[] it will return a DataFrame (Dataset[Row]).
Traditional functional reduction: arbitrary Scala code :)
Robert Couse-Baker
And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
How much faster can it be?
Andrew Skudder
How much faster can it be? (Python)
Andrew Skudder
*Note: do not compare absolute #s with previous graph -
different dataset sizes because I forgot to write it down when I
made the first one.
But where will it explode?
● Iterative algorithms - large plans
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data (200 partitions)
● Default partition size when reading in is also sad (see the sketch below)
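A sketch of the usual knobs - the values and path are illustrative, tune for your data: raise the shuffle partition count above the 200 default, and repartition after reading if the input came in with too few partitions.

sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

val df = sqlContext.read.parquet("some/big/table")  // placeholder path
val wellPartitioned = df.repartition(2000)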
How to avoid lineage explosions:
/**
 * Cut the lineage of a DataFrame which has too long a query plan.
 */
def cutLineage(df: DataFrame): DataFrame = {
  val sqlCtx = df.sqlContext
  //tag::cutLineage[]
  val rdd = df.rdd
  rdd.cache()
  sqlCtx.createDataFrame(rdd, df.schema)
  //end::cutLineage[]
}
karmablue
And now we can use it for streaming too!
● StructuredStreaming - new to Spark 2.0
○ Emphasis on new - be cautious when using
● Extends the Dataset & DataFrame APIs to represent continuous tables
● Still very early stages - but lots of really cool optimizations possible now
● We can build a machine learning pipeline with it together :)
○ Well we have to use some hacks - but ssssssh don’t tell TD
https://github.com/holdenk/spark-structured-streaming-ml
ALPHA =~ Please don’t use this
Built with experimental APIs :)
Get a streaming dataframe
// Read a streaming dataframe
val schema = new StructType()
  .add("happiness", "double")
  .add("coffees", "integer")
val streamingDS = spark
  .readStream
  .schema(schema)
  .format("parquet")
  .load(path)
[Diagram: Dataset, isStreaming = true; logical plan: source → scan]
Build the recipe for each query
val happinessByCoffee = streamingDS
  .groupBy($"coffees")
  .agg(avg($"happiness"))
[Diagram: Dataset, isStreaming = true; logical plan: source → scan → groupBy → avg]
Start a continuous query
val query = happinessByCoffee
  .writeStream
  .format("parquet")
  .outputMode("complete")
  .trigger(ProcessingTime(5.seconds))
  .start()
[Diagram: StreamingQuery, logicalPlan = source → scan → groupBy → avg]
Cool - lets build some ML with it!
Lauren Coolman
Getting a micro-batch view with distributed collection*
case class ForeachDatasetSink(func: DataFrame => Unit) extends Sink {
override def addBatch(batchId: Long, data: DataFrame): Unit = {
func(data)
}
}
https://github.com/holdenk/spark-structured-streaming-ml
And doing some ML with it:
def evilTrain(df: DataFrame): StreamingQuery = {
val sink = new ForeachDatasetSink({df: DataFrame => update(df)})
val sparkSession = df.sparkSession
val evilStreamingQueryManager =
EvilStreamingQueryManager(sparkSession.streams)
evilStreamingQueryManager.startQuery(
Some("snb-train"),
None,
df,
sink,
OutputMode.Append())
}
And doing some ML with it:
def update(batch: Dataset[_]): Unit = {
  val newCountsByClass = add(batch)   // Aggregate new batch
  model.update(newCountsByClass)      // Merge with previous aggregates
}
And doing some ML with it*
(Algorithm specific)
def update(updates: Array[(Double, (Long, DenseVector))]): Unit = {
updates.foreach { case (label, (numDocs, termCounts)) =>
countsByClass.get(label) match {
case Some((n, c)) =>
axpy(1.0, termCounts, c)
countsByClass(label) = (n + numDocs, c)
case None =>
// new label encountered
countsByClass += (label -> (numDocs, termCounts))
}
}
}
Non-Evil alternatives to our Evil:
● ForeachWriter exists
● Since everything runs on the executors it's difficult to update the model
● You could:
○ Use accumulators
○ Write the updates to Kafka
○ Send the updates to a param server of some type with RPC
○ Or do the evil things we did instead :)
● Wait for the “future?”: https://github.com/apache/spark/pull/15178
_torne
Working with the results - foreach (1 of 2)
val foreachWriter: ForeachWriter[T] =
new ForeachWriter[T] {
def open(partitionId: Long, version: Long): Boolean = {
true // always open
}
def close(errorOrNull: Throwable): Unit = {
// No close logic - if we wanted to copy updates per-batch
}
def process(record: T): Unit = {
db.update(record)
}
}
Working with the results - foreach (2 of 2)
// Apply foreach
happinessByCoffee.writeStream.outputMode(OutputMode.Complete())
  .foreach(foreachWriter).start()
Structured Streaming in Review:
● Structured Streaming still uses Spark’s Microbatch approach
● JIRA discussion indicates an interest in swapping out the execution engine
(but no public design document has emerged yet)
● One of the areas that Matei is researching
○ Researching ==~ future , research !~ today
Windell Oskay
Ok but where can we not use it?
● A lot of random methods on DataFrames & Datasets won’t work
● They will fail at runtime rather than compile time - so have tests!
● Anything which roundtrips through an rdd() is going to be pretty sad (aka fail)
● Need to run a query inside of a sink? That is not going to work
● Need a complex receiver type? Most receivers are not ported yet
● Also you will need distinct query names - even if you stop the previous query.
● Aggregations don’t work with Append output mode (and the only file sink requires Append)
● DataFrame/Dataset transformations inside of a sink
Additional Spark Testing Resources
● Libraries
○ Scala: spark-testing-base (scalacheck & unit), sscheck (scalacheck), example-spark (unit) - a minimal sketch follows below
○ Java: spark-testing-base (unit)
○ Python: spark-testing-base (unittest2), pyspark.test (pytest)
● Strata San Jose Talk (up on YouTube)
● Blog posts
○ Unit Testing Spark with Java by Jesse Anderson
○ Making Apache Spark Testing Easy with Spark Testing Base
○ Unit testing Apache Spark with py.test
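A minimal spark-testing-base sketch (assumes spark-testing-base and scalatest are on the test classpath; names are illustrative):

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountTest extends FunSuite with SharedSparkContext {
  test("reduceByKey based wordcount sums correctly") {
    val input = sc.parallelize(Seq("holden likes coffee", "panda likes coffee"))
    val counts = input.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    assert(counts.collect().toMap.apply("coffee") === 2)
  }
}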
raider of gin
Additional Spark Resources
● Programming guide (along with JavaDoc, PyDoc,
ScalaDoc, etc.)
○ http://spark.apache.org/docs/latest/
● Books
● Videos
● Spark Office Hours
○ Normally in the bay area - will do Google Hangouts ones soon
○ follow me on twitter for future ones - https://twitter.com/holdenkarau
Structured Streaming Resources
● Programming guide (RC2 +) (along with JavaDoc,
PyDoc, ScalaDoc, etc.)
○ http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/structured-streaming-programming-guide.html
● https://github.com/holdenk/spark-structured-streaming-ml
● TD’s deep dive: https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
Coming soon: High Performance Spark
And the next book…..
First six chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
Spark Videos
● Apache Spark Youtube Channel
● My Spark videos on YouTube -
○ http://bit.ly/holdenSparkVideos
● Spark Summit 2014 training
● Paco’s Introduction to Apache Spark
Paul Anderson
k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)