2016 Spark Summit East Keynote: Matei Zaharia

Spark 2.0
Matei Zaharia
February 17, 2016

2015: A Great Year for Spark
Summit Attendees: 1100 (2014), 3900 (2015)
Meetup Members: 12K (2014), 66K (2015)
Total Contributors: 500 (2014), 1000 (2015)

Meetup Groups: January 2015
source: meetup.com

Meetup Groups: January 2016
source: meetup.com

New Components
DataFrames
SparkR
Data Sources
Project Tungsten
Streaming ML
Kafka Connector
ML Pipelines
Debug UI
Dataset API

Spark 2.0
Next major release, coming in April / May
Builds on all we learned in past 2 years

Versioning in Spark
1.6.0
• 1: Major version (may change APIs)
• 6: Minor version (adds APIs / features)
• 0: Patch version (only bug fixes)
In reality, we hate breaking APIs!
Will not do so except for some dependency conflicts (e.g. Guava)

Major Features in 2.0
Tungsten Phase 2: speedups of 5-10x
Structured Streaming: real-time engine on SQL/DataFrames
Unifying Datasets and DataFrames

Background on Project Tungsten
CPU speeds have not kept up with I/O in past 5 years
Bring Spark performance closer to bare metal, through:
• Native memory management
• Runtime code generation

Tungsten So Far
Spark 1.4–1.6 added binary storage and basic code gen
DataFrame + Dataset APIs enable Tungsten in user programs
• Also used under Spark SQL + parts of MLlib
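
The idea behind binary storage can be illustrated with a minimal sketch in plain Scala. This is not Tungsten's actual row format: the fixed-width layout, the object name BinaryRowsSketch, and its helper methods are assumptions made up for illustration. It uses an on-heap java.nio.ByteBuffer for simplicity; ByteBuffer.allocateDirect would place the same bytes outside the JVM heap.

```scala
import java.nio.ByteBuffer

object BinaryRowsSketch {
  // Hypothetical fixed-width layout: 4-byte Int id followed by 8-byte Double score.
  val RowSize: Int = 4 + 8

  // Pack rows into one contiguous buffer instead of one JVM object per row.
  def pack(rows: Seq[(Int, Double)]): ByteBuffer = {
    val buf = ByteBuffer.allocate(rows.length * RowSize)
    rows.foreach { case (id, score) =>
      buf.putInt(id)
      buf.putDouble(score)
    }
    buf.flip() // switch the buffer from writing to reading
    buf
  }

  // Sum the score column by reading at computed offsets; no row objects are created.
  def sumScores(buf: ByteBuffer): Double = {
    var total = 0.0
    var offset = 0
    while (offset < buf.limit()) {
      total += buf.getDouble(offset + 4) // skip the 4-byte id
      offset += RowSize
    }
    total
  }
}
```

Reading fields at known offsets avoids per-row object allocation and garbage collection pressure, which is the general motivation for a binary layout.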

New in 2.0
Whole-stage code generation
• Remove expensive iterator calls
• Fuse across multiple operators
• Spark 1.6: 14M rows/s, Spark 2.0: 125M rows/s
Optimized input / output
• Parquet + built-in cache
• Parquet in 1.6: 11M rows/s, Parquet in 2.0: 90M rows/s
Automatically applies to SQL, DataFrames, Datasets
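
A rough sketch of what "remove iterator calls" and "fuse across operators" mean, written by hand in plain Scala. Spark generates comparable code at runtime; the object FusionSketch and the toy query (keep even values, multiply by 3, sum) are assumptions for illustration only, not Spark's generated code.

```scala
object FusionSketch {
  // Iterator-style execution: each operator pulls rows from its child
  // through hasNext/next calls, one row at a time.
  def iteratorStyle(data: Array[Long]): Long = {
    val scan: Iterator[Long] = data.iterator
    val filtered: Iterator[Long] = scan.filter(_ % 2 == 0) // filter operator
    val projected: Iterator[Long] = filtered.map(_ * 3)    // project operator
    projected.sum                                          // aggregate operator
  }

  // Fused execution: the same three operators collapsed into one loop,
  // roughly what hand-written code for this query would look like.
  def fused(data: Array[Long]): Long = {
    var sum = 0L
    var i = 0
    while (i < data.length) {
      val v = data(i)
      if (v % 2 == 0) sum += v * 3
      i += 1
    }
    sum
  }
}
```

Both functions return the same result; the fused version keeps intermediate values in local variables instead of passing them through a chain of iterator calls.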

Background
Real-time processing is increasingly important
Most apps need to combine it with batch & interactive queries
• Track state using a stream, then run SQL queries
• Train an ML model offline, then update it
Spark is very well-suited to do this

Structured Streaming
High-level streaming API built on Spark SQL engine
• Declarative API that extends DataFrames / Datasets
• Event time, windowing, sessions, sources & sinks
Also supports interactive & batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
Not just streaming, but “continuous applications”
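
The "same query on batch and stream" idea can be sketched without Spark. The code below is a conceptual illustration in plain Scala, not the Structured Streaming API: the object ContinuousCountSketch, the word-count query, and the micro-batch representation are all assumptions.

```scala
object ContinuousCountSketch {
  // Running result of the query: word -> number of occurrences seen so far.
  type Counts = Map[String, Long]

  // Batch query: count words over a complete, static collection.
  def batchCounts(words: Seq[String]): Counts =
    words.groupBy(identity).map { case (w, ws) => w -> ws.length.toLong }

  // Incremental step: fold one newly arrived micro-batch into the running result.
  def update(state: Counts, newWords: Seq[String]): Counts =
    newWords.foldLeft(state) { (acc, w) => acc.updated(w, acc.getOrElse(w, 0L) + 1L) }

  // Run the same logical query continuously over a sequence of micro-batches.
  def streamCounts(batches: Seq[Seq[String]]): Counts =
    batches.foldLeft(Map.empty[String, Long])(update)
}
```

After any number of micro-batches, the streaming result equals the batch result over all data seen so far, which is what lets one declarative query serve both modes.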

Goal: end-to-end continuous applications
[Diagram: example pipeline with Kafka, ETL, Database, Reporting Applications, ML Model, Ad-hoc Queries; regions labeled "Traditional streaming" and "Other processing types"]

Details on Structured Streaming
Spark 2.0 will have a first version focused on ETL [SPARK-8360]
Later versions will add more operators & libraries
See Reynold’s keynote tomorrow for a deep dive!

Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]

Example
import sqlContext.implicits._ // encoders needed by .as[...] and .map
case class User(name: String, id: Int)
case class Message(user: User, text: String)
val dataframe = sqlContext.read.json("log.json") // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message] // Dataset[Message]
val users = messages.filter(m => m.text.contains("Spark"))
  .map(m => m.user) // Dataset[User]
pipeline.train(users) // MLlib takes either DataFrames or Datasets

Benefits
Simpler to understand
• Only kept Dataset separate to keep binary compatibility in 1.x
Libraries can take data of both forms
With Streaming, same API will also work on streams
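
Why a single type helps libraries can be shown with toy stand-ins. The Dataset, Row, and DataFrame definitions below are NOT Spark's classes; they are hypothetical miniature versions, assumed here only to show how a type alias lets one function accept both forms.

```scala
object UnifiedApiSketch {
  // Toy stand-ins, not Spark's classes: a typed collection and an untyped row.
  final case class Dataset[T](rows: Seq[T]) {
    def filter(p: T => Boolean): Dataset[T] = Dataset(rows.filter(p))
    def map[U](f: T => U): Dataset[U] = Dataset(rows.map(f))
  }
  type Row = Map[String, Any]

  // The unification from the talk: a DataFrame is just a Dataset of rows.
  type DataFrame = Dataset[Row]

  final case class User(name: String, id: Int)

  // A "library" function written once against Dataset[T] accepts both forms.
  def count[T](data: Dataset[T]): Int = data.rows.length

  val untyped: DataFrame = Dataset[Row](Seq(Map[String, Any]("name" -> "ann", "id" -> 1)))
  val typed: Dataset[User] = Dataset(Seq(User("ann", 1), User("bob", 2)))
}
```

Here count(untyped) and count(typed) both compile against the same signature, so a library needs no separate DataFrame overload.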

Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
