Spark and Cassandra
Anti-Patterns
© DataStax, All Rights Reserved.
Russell Spitzer

Russell (left) and Cara (right)
• Software Engineer
• Spark-Cassandra Integration since Spark 0.9
• Cassandra since Cassandra 1.2
• 3 Year Scala Convert
• Still not comfortable talking about Monads in public
Avoiding the Sharp Edges
•Out of Memory Errors
•RPC Failures
•"It is Slow"
•Serialization
•Understanding what Catalyst does
After working with customers for several years,
most problems boil down to a few common scenarios.
Most Common Performance Pitfall
val there = rdd.map(doStuff).collect()
val backAgain = there.map(otherStuff)
val thereAgain = sc.parallelize(backAgain)
The Hobbit (1977)
OOM, Slow, RPC Failures
There and Back Again:

Don't Collect and Parallelize
Don't Do
val there = rdd.map(doStuff).collect()
val backAgain = there.map(otherStuff)
val thereAgain = sc.parallelize(backAgain)
Instead
val there = rdd
  .map(doStuff)
  .map(otherStuff)
The Hobbit (1977)
Why Not?
1. You are using Spark for a Reason
Lord of the Rings, 2001-2003
Driver JVM: dependable, easy to work with, easy to understand.
But it is not very big, and there is only one.
Your Cluster: the entire reason behind using Spark.
Parallelize
Collect
OOM
Why Not?
2. Moving data between machines is slow
Jim Gray, 

http://loci.cs.utk.edu/dsi/netstore99/docs/presentations/keynote/sld023.htm
The Lord of the Rings, 1978
Why Not?
3. Parallelize sends data in task metadata
parallelize(): List[Dwarves] -> RDD[Dwarves]
ENIAC Programmers, 1946, University of Pennsylvania
Minimum of one Dwarf per Partition
RPC warns on task metadata over 100 KB
scala> val treasure = 1 to 100 map (_ => "x" * 1024)
scala> sc.parallelize(Seq(treasure)).count
WARN 2018-05-21 14:13:08,035 org.apache.spark.scheduler.TaskSetManager:
Stage 0 contains a task of very large size (105 KB). The maximum recommended task size is 100 KB.
res0: Long = 1
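The warning above is easy to reproduce in plain Scala, no Spark required. A rough sketch: the 105 KB the TaskSetManager reports is the raw payload below plus serialization overhead.

```scala
// Recreate the "treasure" payload from the REPL session above:
// 100 strings of 1024 characters each.
val treasure = (1 to 100).map(_ => "x" * 1024)

// The raw character payload alone is already 100 KB. parallelize ships
// the data inside the task description, so with serialization overhead
// the task crosses the 100 KB recommended size and triggers the warning.
val payloadBytes = treasure.map(_.length).sum
println(s"raw payload: ${payloadBytes / 1024} KB")
```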
State accumulated into a single object keeps growing until you run into heap problems.
J.R.R. Tolkien, “Conversation with Smaug” (The Hobbit, 1937)
Keep the work Distributed
Don't Do
val there = rdd.map(doStuff).collect()
val backAgain = there.map(otherStuff)
val thereAgain = sc.parallelize(backAgain)



1. We won't be doing distributed work
2. We end up sending things over the wire
3. Parallelize doesn't handle large objects well
4. We don't need to!
Everyday
val there = rdd
  .map(doStuff)
  .map(otherStuff)
The Hobbit (1977)
Start Distributed if Possible
Other alternatives to Parallelize
Start Data out Distributed (Cassandra, HDFS, S3, …)
The Hobbit (1977)
Predicate Pushdowns Failing!
SELECT * FROM escapes WHERE time = 11-27-1977
No
Pushdown
Slow
What have I got in my pocket?

Make your literals' types explicit!
SELECT * FROM escapes WHERE time = 11-27-1977
Catalyst
No precious
predicate pushdowns
Catalyst Transforms SQL
into Distributed Work
Distributed Work
?
?
?
SO MYSTERY
MUCH MAGIC
Catalyst
SELECT * FROM escapes WHERE time = 11-27-1977
'Project [*]
'Filter ('time = 1977-11-27)
'UnresolvedRelation `test`.`escapes`
Logical Plan Describes
What Needs to Happen
It is transformed
Into a Physical Plan which defines
How it will be accomplished
SELECT * FROM escapes WHERE time = 11-27-1977
*Filter (cast(time#5 as string) = 1977-11-27)
*Scan CassandraSourceRelation test.escapes[time#5,method#6]
ReadSchema: struct<time:date,method:string>
What happened to predicate pushdown?
Catalyst Needs to make
Types Match
'1977-11-27': this is a string?
time#5: this is a date?
Cast(time#5 as String)
MAKE THEM BOTH STRINGS
Functions Cannot be Pushed to Datasources
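A toy model of that decision in plain Scala (these are not Catalyst's real classes, just a sketch of the rule): a datasource can serve "bare column = literal", but once a function such as a cast wraps the column, the filter must stay Spark-side.

```scala
// Toy expression tree modeling the pushdown rule described above.
sealed trait Expr
case class Col(name: String) extends Expr
case class Lit(value: String) extends Expr
case class Cast(child: Expr, to: String) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

def canPushDown(filter: Expr): Boolean = filter match {
  case EqualTo(Col(_), Lit(_)) => true   // bare column vs literal: pushable
  case _                       => false  // anything wrapping the column: Spark-side filter
}

// Implicit compare: Catalyst casts the *column* to match the string literal.
val implicitCompare = EqualTo(Cast(Col("time"), "string"), Lit("1977-11-27"))
// Explicit compare: the literal was cast up front, so the column stays bare.
val explicitCompare = EqualTo(Col("time"), Lit("1977-11-27"))

println(canPushDown(implicitCompare))  // the cast blocks the pushdown
println(canPushDown(explicitCompare))  // the bare column can be pushed
```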
Let's try again with
explicitly typed literals
SELECT * FROM test.escapes
WHERE time = cast('1977-11-27' as date)
Catalyst
Hmmmm….
Distributed Work
?
?
?
SO MYSTERY
MUCH MAGIC
Catalyst
SELECT * FROM test.escapes WHERE time = cast('1977-11-27' as date)
Catalyst Transforms SQL
into Distributed Work
Same Logical Plan
SELECT * FROM escapes WHERE time = 11-27-1977
'Project [*]
'Filter ('time = 1977-11-27)
'UnresolvedRelation `test`.`escapes`
Transform
Different Physical Plan
SELECT * FROM escapes WHERE time = 11-27-1977
*Scan CassandraSourceRelation test.escapes[time#5,method#6]
PushedFilters: [*EqualTo(time,1977-11-27)],
ReadSchema: struct<time:date,method:string>
1. PushedFilters is populated
2. There is no Spark Side Filter at all
*Means that the Filter is Handled By the Datasource and not Catalyst
Successful Pushdown
SELECT * FROM test.escapes
WHERE time = cast('1977-11-27' as date)
Catalyst
+----------+-----------------------------+
|time      |method                       |
+----------+-----------------------------+
|1977-11-27|Ask a totally not fair riddle|
+----------+-----------------------------+
Writing to X is Slow
Slow
Bad Resource
Utilization
RDD.foreach(x => SlowIO(x))
You shall not pass!

Concurrency in Spark
Functions are applied to iterators
Iterator[Balrog]
.map( balrog => moveAcrossBridge(balrog))
No other elements will have a
function applied to them until
the current element is done
One Item is Processed at a Time
Native Spark Parallelism is
Based on Cores
Core 1: Iterator[Balrog].map(balrog => moveAcrossBridge(balrog))
Core 2: Iterator[Balrog].map(balrog => moveAcrossBridge(balrog))
Core 3: Iterator[Balrog].map(balrog => moveAcrossBridge(balrog))
Max Number of Balrogs
crossing in parallel
is limited by the number
of cores
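A quick stdlib check of that claim (no Spark needed): Iterator stages pull one element at a time, so work on element n+1 never starts before element n has finished.

```scala
// Record the processing order to show that Iterator.map is sequential:
// each element is fully processed before the next one is even touched.
val order = scala.collection.mutable.ArrayBuffer.empty[String]

Iterator("balrog1", "balrog2", "balrog3")
  .map { b => order += s"start-$b"; order += s"end-$b"; b }
  .foreach(_ => ())  // drain the iterator

println(order.mkString(", "))
```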
Increase Parallelism
without Increasing Cores
Iterator[Balrog]
  .map(balrog => moveAcrossBridge(balrog))      // Slow, Bottleneck
  .foreach(balrog => balrog.eat(nearestHobbit)) // Fast
Grouping or Futures
Iterator[Balrog]
Process in groups:
grouped.map(balrogGroup => moveGroup(balrogGroup))
Slow elements will slow down the group
Return Futures:
map(balrog => asyncMove(balrog))
Iterator[Future[MovedBalrog]]
Still need to draw multiple elements at a time (if not foreach)
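A minimal stdlib sketch of the Futures variant, with a hypothetical `slowIO` standing in for a blocking datastore call: start the calls concurrently, then block once for the whole batch instead of once per element.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for a slow I/O call (e.g. a datastore write).
def slowIO(x: Int): Int = { Thread.sleep(50); x * 2 }

// Map every element to a Future so many calls are in flight at once,
// then gather the group with Future.sequence and wait a single time.
val results = Await.result(
  Future.sequence((1 to 8).map(x => Future(slowIO(x)))),
  30.seconds)

println(results.mkString(", "))
```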
DSE Spark Connector's Sliding Iterator
Buffer Futures
  /** Prefetches a batchSize of elements at a time **/
  protected def slidingPrefetchIterator[T](it: Iterator[Future[T]], batchSize: Int): Iterator[T] = {
    val (firstElements, lastElement) = it
      .grouped(batchSize)     // Group
      .sliding(2)             // Sliding(2): keep one buffered group ahead
      .span(_ => it.hasNext)
    (firstElements.map(_.head) ++ lastElement.flatten).flatten.map(_.get)  // Flatten, get
  }
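A stdlib adaptation of the same prefetch pattern, as a sketch: the connector's version calls `.get` on its own future type, so here we `Await` on Scala Futures instead. Because the futures are created lazily from the source iterator, the consumer stays roughly two batches ahead of itself.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Prefetches a batchSize of elements at a time (stdlib-Future variant).
def slidingPrefetchIterator[T](it: Iterator[Future[T]], batchSize: Int): Iterator[T] = {
  val (firstElements, lastElement) = it
    .grouped(batchSize)     // group futures into batches
    .sliding(2)             // keep a window of two groups buffered
    .span(_ => it.hasNext)  // split off the final window
  (firstElements.map(_.head) ++ lastElement.flatten)
    .flatten
    .map(f => Await.result(f, Duration.Inf))
}

// Usage: five "async requests", prefetched two at a time, results in order.
val requests = (1 to 5).iterator.map(x => Future { x * 10 })
val results = slidingPrefetchIterator(requests, batchSize = 2).toSeq

println(results.mkString(", "))
```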
Slow Transformations
Slow
Bad Resource
Utilization
rdd.cache.map.cache.map.cache.map
My Precious!

Don't Cache without Reuse
The Hobbit, 1966
Cache is not Free
scala> time(sc.cassandraTable("ks", "test").map(r => r).count)
Elapsed time: 35.773478836
res54: Long = 15436998
scala> time(sc.cassandraTable("ks", "test").map(r => r).cache.count)
Elapsed time: 58.657585144
res55: Long = 15436998
When does Caching for Resilience Make Sense?
Let's MATH
Let's assume our Shuffle/Read partially fails 1/10 times

Cache costs c seconds
Normal run costs r seconds
Failures happen at a rate of f

If (c + r < r + r * f)
If (c < r * f)
If (c / r < f)

Caching helps us out
For our example, caching is worth it only
if (f > 0.6),
since (c / r < f) is when caching helps us out
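The slide's 0.6 can be checked against the earlier benchmark timings: c is the extra cost the `.cache` run added, r is the plain run.

```scala
// Break-even failure rate for caching, from the timings shown earlier:
// without a cache, a failure costs a re-run (expected cost r + f * r);
// with a cache we pay c up front. Caching wins when c + r < r + f * r,
// i.e. when f > c / r.
val plainRun  = 35.77 // seconds: count without cache
val cachedRun = 58.66 // seconds: count with .cache
val r = plainRun
val c = cachedRun - plainRun

val breakEven = c / r
println(f"caching pays off only when the failure rate f > $breakEven%.2f")
```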
My Precious!

Why is Caching so Expensive?
1. Serialize Everything
2. Hold all the data at once
3. Expensive disk access
But it's so pretty
Cache only when:
1. Your pre-cache computation is very, very, very expensive
2. You are re-using the data
What have we learned?
Don't Do
Parallelize and Collect
Rely on Spark to infer the types of our literals
Do slow blocking actions in map/foreach
Cache all the time
Instead
Keep work in distributed actions
Specify our types
Do concurrent actions when it makes sense
Cache only when we re-use data
The Hobbit (1977)
DSE 6
Thank you
© DataStax, All Rights Reserved.
http://www.russellspitzer.com/

@RussSpitzer
Come chat with us at DataStax Academy: 

https://academy.datastax.com/slack

Apply Hammer Directly to Thumb; Avoiding Apache Spark and Cassandra AntiPatterns with Russell Spitzer