A look under the hood at Apache
Spark's API and engine evolutions
Reynold Xin @rxin
2017-02-08, Amsterdam Meetup
About Databricks
Founded by creators of Spark
Cloud data platform
- Spark
- Interactive analysis
- Cluster management
- Production pipelines
- Data governance & security
Databricks Amsterdam R&D Center
Started in January
Hiring distributed systems &
database engineers!
Email me: rxin@databricks.com
Spark stack diagram
[Diagram: SQL, Streaming, MLlib, and GraphX on top of Spark Core (RDD)]
Spark stack diagram (a different take)
Frontend (user-facing APIs)
Backend (execution)
Spark stack diagram (a different take)
Frontend (RDD, DataFrame, ML pipelines, …)
Backend (scheduler, shuffle, operators, …)
Today’s Talk
Some archaeology
- IMS, relational databases
- MapReduce
- data frames
Last 6 years of Spark evolution
Databases
IBM IMS hierarchical database (1966)
Image from https://stratechery.com/2016/oracles-cloudy-future/
Hierarchical Database
- Improvement over file system: query language & catalog
- Lack of flexibility
- Difficult to query items in different parts of the hierarchy
- Relationships are pre-determined and difficult to change
“Future users of large data banks must be protected from having to
know how the data is organized in the machine. …
most application programs should remain unaffected when the
internal representation of data is changed and even when some
aspects of the external representation are changed.”
- E. F. Codd, “A Relational Model of Data for Large Shared Data Banks” (1970)
Era of relational databases (late 60s)
Two “new” important ideas
Physical Data Independence: The ability to change the physical data
layout without having to change the logical schema.
Declarative Query Language: Programmer specifies “what” rather than
“how”.
Why?
Business applications outlive the environments they were created in:
- New requirements might surface
- Underlying hardware might change
- Require physical layout changes (indexing, different storage medium, etc)
Enabled tremendous amount of innovation:
- Indexes, compression, column stores, etc
Relational Database Pros vs Cons
- Declarative and data independent
- SQL is the universal interface everybody knows
- Difficult to compose & build complex applications
- Too opinionated and inflexible
- Require data modeling before putting any data in
- SQL is the only programming language
Big Data, MapReduce,
Hadoop
Challenges Google faced
Data size growing (volume & velocity)
- Processing has to scale out over large clusters
Complexity of analysis increasing (variety)
- Massive ETL (web crawling)
- Machine learning, graph processing
The Big Data Problem
Semi-/Un-structured data doesn’t fit well with databases
Single machine can no longer process or even store all the data!
Only solution is to distribute general storage & processing over
clusters.
Google Datacenter
How do we program this thing?
Data-Parallel Models
Restrict the programming interface so that the system can do more
automatically
“Here’s an operation, run it on all of the data”
- I don’t care where it runs (you schedule that)
- In fact, feel free to run it twice on different nodes
- Similar to “declarative programming” in databases
MapReduce Pros vs Cons
- Massively parallel
- Very flexible programming model & schema-on-read
- Extremely verbose & difficult to learn
- Most real applications require multiple MR steps
- 21 MR steps -> 21 mapper and reducer classes
- Lots of boilerplate code per step
- Bad performance
R, Python, data frame
Data frames in R / Python
> head(filter(df, df$waiting < 50)) # an example in R
## eruptions waiting
##1 1.750 47
##2 1.750 47
##3 1.867 48
Developed by the stats community; concise syntax for ad-hoc analysis
Procedural (not declarative)
R data frames Pros and Cons
- Easy to learn
- Pretty fast on a laptop (or one server)
- No parallelism & doesn’t work well on big data
- Lack sophisticated query optimization
“Are you going to talk
about Spark at all
tonight!?”
Which one is better?
Databases, R, MapReduce?
Declarative, procedural, data independence?
Spark’s initial focus: a better MapReduce
Language-integrated API (RDD): similar to Scala’s collection library
using functional programming; incredibly powerful and composable
lines = spark.textFile("hdfs://...") // RDD[String]
points = lines.map(line => parsePoint(line)) // RDD[Point]
points.filter(p => p.x > 100).count()
Better performance: through a more general DAG abstraction, faster
scheduling, and in-memory caching
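As a sketch of what in-memory caching buys (hypothetical path; parsePoint as above):
val points = spark.textFile("hdfs://...").map(line => parsePoint(line)).cache()
points.filter(p => p.x > 100).count() // first action reads from disk and populates the cache
points.filter(p => p.x < 0).count() // subsequent actions over points read from memory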
Programmability
WordCount in 50+ lines of Java MR
WordCount in 3 lines of Spark
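For reference, the Spark version is roughly this (a sketch; the path is a placeholder):
val counts = spark.textFile("hdfs://...").flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)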
Challenge with Functional API
Looks high-level, but hides many semantics of computation
• Functions are arbitrary blocks of Java bytecode
• Data stored is arbitrary Java objects
Users can mix APIs in suboptimal ways
Which Operator Causes Most Tickets?
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
Answer: groupByKey
Example Problem
pairs = data.map(word => (word, 1))
groups = pairs.groupByKey()
groups.map { case (k, vs) => (k, vs.sum) }
Physical API: materializes all groups as Seq[Int] objects, then promptly aggregates them.
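The same logical intent, written so the engine can aggregate map-side before shuffling (a sketch):
counts = data.map(word => (word, 1)).reduceByKey(_ + _) // no groups materialized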
Challenge: Data Representation
Java objects are often many times larger than their underlying fields
class User(name: String, friends: Array[Int])
new User("Bobby", Array(1, 2))
[Diagram: JVM object graph for new User("Bobby", Array(1, 2)): a User object pointing to a String (which wraps a char[] holding "Bobby") and an int[] of length 2, each with its own object header]
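A rough accounting of that overhead (assuming a 64-bit JVM with compressed oops: ~12-byte object headers and 8-byte alignment; exact numbers vary by JVM and flags):
// raw data: 5 chars ("Bobby") + 2 ints = 18 bytes
// User object: header + 2 references ≈ 24 bytes
// String object: header + hash + reference to char[] ≈ 24 bytes
// char[5]: header + length + 10 bytes of chars ≈ 32 bytes
// int[2]: header + length + 8 bytes of ints ≈ 24 bytes
// total ≈ 104 bytes on the heap for 18 bytes of actual data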
Recap: two primary issues
1. Many APIs specify the “physical” behavior rather than the “logical”
intent, i.e. they are not declarative enough.
2. Closures (user-defined functions and types) are opaque to the
engine, and as a result there is little room for optimization.
Sort Benchmark
Originally sponsored by Jim Gray in 1987 to measure advancements in
software and hardware
Participants often used purpose-built hardware/software to compete
• Large companies: IBM, Microsoft, Yahoo, …
• Academia: UC Berkeley, UCSD, MIT, …
Sort Benchmark
• Past winners: Microsoft, Yahoo, Samsung, UCSD, …
1MB -> 100MB -> 1TB (1998) -> 100TB (2009)
Winning Attempt
Built on the low-level Spark API:
- Put all data in off-heap memory using sun.misc.Unsafe
- Use tight, low-level while loops rather than iterators
~3000 lines of low-level code on Spark, written by Reynold
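A sketch of the flavor of that code: off-heap allocation via sun.misc.Unsafe, an internal and unsupported JVM API that must be obtained reflectively:
import sun.misc.Unsafe
val f = classOf[Unsafe].getDeclaredField("theUnsafe")
f.setAccessible(true)
val unsafe = f.get(null).asInstanceOf[Unsafe]
val addr = unsafe.allocateMemory(1 << 20) // 1 MB off-heap, no Java objects
unsafe.putLong(addr, 42L) // read/write raw bytes directly
assert(unsafe.getLong(addr) == 42L)
unsafe.freeMemory(addr)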
On-Disk Sort Record: Time to sort 100TB
2013 record (Hadoop): 2100 machines, 72 minutes
2014 record (Spark): 207 machines, 23 minutes
Also sorted 1PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
How do we enable the average user to win a
world record, using a few lines of code?
Goals of the last two years’ API evolution
1. Simpler APIs bridging the gap between big data engineering and
data science.
2. Higher level, declarative APIs that are future proof (engine can
optimize programs automatically).
Taking the best ideas from databases, big data, and data science
Structured APIs:
DataFrames + Spark SQL
DataFrames and Spark SQL
Efficient library for structured data (data with a known schema)
• Two interfaces: SQL for analysts + apps, DataFrames for programmers
Optimized computation and storage, similar to RDBMS
(“Spark SQL: Relational Data Processing in Spark”, SIGMOD 2015)
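A sketch of the two interfaces expressing the same query (assumes a table named "events" is registered):
val bySQL = spark.sql("SELECT loc, avg(duration) FROM events GROUP BY loc")
val byDF = spark.table("events").groupBy("loc").avg("duration")
Both produce the same logical plan, so both are optimized the same way.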
Execution Steps
[Diagram: SQL and DataFrames feed a Logical Plan; the Optimizer, consulting the Catalog, produces a Physical Plan; the Code Generator compiles it into RDDs, reading input through the Data Source API]
DataFrame API
DataFrames hold rows with a known schema and offer relational
operations on them through a DSL
val users = spark.sql("select * from users")
val massUsers = users(users("country") === "NL")
massUsers.count()
massUsers.groupBy("name").avg("age")
An expression like users("country") === "NL" builds an expression AST rather than evaluating eagerly.
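One way to inspect that intermediate representation (output elided; plan text varies by Spark version):
massUsers.groupBy("name").avg("age").explain(true)
// prints the parsed, analyzed, and optimized logical plans, then the physical plan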
Spark RDD Execution
[Diagram: the Java/Scala frontend runs on a JVM backend; the Python frontend runs on a separate Python backend; both hand the engine opaque closures (user-defined functions)]
Spark DataFrame Execution
[Diagram: DataFrame frontend -> Logical Plan -> Catalyst optimizer -> physical execution; the logical plan is the intermediate representation for computation]
Spark DataFrame Execution
[Diagram: the Python, Java/Scala, and R DataFrame frontends are simple wrappers that each create a logical plan, the shared intermediate representation, which the Catalyst optimizer turns into physical execution]
Structured API Example
DataFrame API:
events = sc.read.json("/logs")
stats = events.join(users)
  .groupBy("loc", "status")
  .avg("duration")
errors = stats.where(stats.status == "ERR")
Optimized Plan:
SCAN logs, SCAN users -> JOIN -> AGG -> FILTER
Specialized Code:
while(logs.hasNext) {
  e = logs.next
  if(e.status == "ERR") {
    u = users.get(e.uid)
    key = (u.loc, e.status)
    sum(key) += e.duration
    count(key) += 1
  }
}
...
Benefit of Logical Plan: Simpler Frontend
Python: ~2000 lines of code (built over a weekend)
R: ~1000 lines of code
i.e. much easier to add new language bindings (Julia, Clojure, …)
Performance
[Bar chart: runtime for an example aggregation workload on the RDD API, comparing Java/Scala and Python]
Benefit of Logical Plan:
Performance Parity Across Languages
[Bar chart: runtime (secs) for the same aggregation workload: Java/Scala and Python on RDDs vs Java/Scala, Python, R, and SQL on DataFrames]
What are Spark’s structured APIs?
Combination of:
- data frame from R as the “interface” – easy to learn
- declarativity & data independence from databases -- easy to optimize &
future-proof
- flexibility & parallelism from MapReduce -- massively scalable & flexible
Future possibilities
Spark as a fast, multi-core data collection library
Spark as a performant streaming engine
Spark as a GPU computation framework
All using the same API
[Diagram: language frontends (Python, Java/Scala, R, SQL, …) -> DataFrame -> Logical Plan -> Tungsten backend targeting JVM, LLVM, SIMD, GPUs, …
Unified API, One Engine, Automatically Optimized]
Recap
We learn from previous-generation systems what works and what can
be improved, and evolve Spark accordingly
Latest APIs take the best ideas out of earlier systems
- data frame from R as the “interface” – easy to learn
- declarativity & data independence from databases -- easy to optimize &
future-proof
- flexibility & parallelism from MapReduce -- massively scalable & flexible
Dank je wel (thank you)
@rxin