Spark 2.0
Matei Zaharia
February 17, 2016
2015: A Great Year for Spark
                      2014    2015
Summit Attendees      1100    3900
Meetup Members        12K     66K
Total Contributors    500     1000
Meetup Groups: January 2015
source: meetup.com
Meetup Groups: January 2016
source: meetup.com
New Components
DataFrames
SparkR
Data Sources
Project Tungsten
Streaming ML
Kafka Connector
ML Pipelines
Debug UI
Dataset API
Spark 2.0
Next major release, coming in April / May
Builds on all we learned in past 2 years
Versioning in Spark
1.6.0
• 1: Major version (may change APIs)
• 6: Minor version (adds APIs / features)
• 0: Patch version (only bug fixes)
In reality, we hate breaking APIs!
Will not do so except for some dependency conflicts (e.g. Guava)
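To make the major.minor.patch scheme concrete, here is a minimal sketch in plain Scala. The names `Version`, `parse`, and `mayBreakApis` are hypothetical and not part of Spark; the only rule encoded is the one stated on this slide, that a major version bump is the only kind of release that may change APIs.

```scala
object VersioningSketch {
  // major.minor.patch, e.g. 1.6.0
  final case class Version(major: Int, minor: Int, patch: Int)

  /** Parses "1.6.0" into Version(1, 6, 0); returns None for anything else. */
  def parse(s: String): Option[Version] = s.split('.') match {
    case Array(a, b, c) if Seq(a, b, c).forall(p => p.nonEmpty && p.forall(_.isDigit)) =>
      Some(Version(a.toInt, b.toInt, c.toInt))
    case _ => None
  }

  /** Per the slide, only a major version bump may change existing APIs. */
  def mayBreakApis(from: Version, to: Version): Boolean = to.major > from.major

  def main(args: Array[String]): Unit = {
    val v16 = parse("1.6.0").get
    println(mayBreakApis(v16, parse("1.6.1").get)) // false: patch release
    println(mayBreakApis(v16, parse("2.0.0").get)) // true: major release
  }
}
```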
Major Features in 2.0
• Tungsten Phase 2: speedups of 5-10x
• Structured Streaming: real-time engine on SQL/DataFrames
• Unifying Datasets and DataFrames
Tungsten Phase 2
Background on Project Tungsten
CPU speeds have not kept up with I/O in past 5 years
Bring Spark performance closer to bare metal, through:
• Native memory management
• Runtime code generation
Tungsten So Far
Spark 1.4–1.6 added binary storage and basic code gen
DataFrame + Dataset APIs enable Tungsten in user programs
• Also used under Spark SQL + parts of MLlib
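As a rough illustration of what "binary storage" buys, the sketch below packs rows into one contiguous off-heap buffer and scans a column by offset, instead of allocating one JVM object per row. It is not Tungsten's actual row format: the 16-byte layout and the names `pack` and `sumScores` are assumptions made up for this example.

```scala
import java.nio.ByteBuffer

object BinaryRowsSketch {
  // Hypothetical fixed-width layout for one row: id (8 bytes) followed by score (8 bytes).
  val RowSize = 16

  /** Packs (id, score) pairs into one contiguous buffer instead of one JVM object per row. */
  def pack(rows: Seq[(Long, Double)]): ByteBuffer = {
    val buf = ByteBuffer.allocateDirect(rows.size * RowSize) // direct buffer, allocated off the Java heap
    rows.foreach { case (id, score) => buf.putLong(id).putDouble(score) }
    buf.flip() // switch from writing to reading; limit is now the number of bytes written
    buf
  }

  /** Sums the score column by reading at computed offsets, without materializing row objects. */
  def sumScores(buf: ByteBuffer): Double = {
    var total = 0.0
    var offset = 0
    while (offset < buf.limit()) {
      total += buf.getDouble(offset + 8) // absolute read: skip the 8-byte id
      offset += RowSize
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val buf = pack(Seq((1L, 0.5), (2L, 1.5), (3L, 2.0)))
    println(sumScores(buf)) // 4.0
  }
}
```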
New in 2.0
Whole-stage code generation
• Remove expensive iterator calls
• Fuse across multiple operators
• Spark 1.6: 14M rows/s; Spark 2.0: 125M rows/s
Optimized input / output
• Parquet + built-in cache
• Parquet in 1.6: 11M rows/s; Parquet in 2.0: 90M rows/s
Automatically applies to SQL, DataFrames, Datasets
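The two whole-stage code generation bullets can be pictured with a hand-written sketch in plain Scala. The first function chains one iterator per operator; the second is the same filter, map, and sum collapsed into a single loop, which is roughly the shape of code the engine aims to generate at runtime. This is only an illustration under that assumption, not Spark's generated code, and the function names are hypothetical.

```scala
object FusionSketch {
  /** Operator-at-a-time: filter and map are separate iterators, so every row
    * passes through a chain of hasNext/next calls. */
  def sumWithIterators(rows: Array[Long]): Long =
    rows.iterator.filter(_ % 2 == 0).map(_ * 3).sum

  /** Fused: the same filter + map + sum collapsed into one loop over the input. */
  def sumFused(rows: Array[Long]): Long = {
    var total = 0L
    var i = 0
    while (i < rows.length) {
      val r = rows(i)
      if (r % 2 == 0) total += r * 3 // filter and map applied inline
      i += 1
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val rows = Array.tabulate(1000)(_.toLong)
    println(sumWithIterators(rows) == sumFused(rows)) // true
  }
}
```

Both functions return the same result; the fused version simply avoids the per-row virtual calls between operators.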
Structured Streaming
Background
Real-time processing is increasingly important
Most apps need to combine it with batch & interactive queries
• Track state using a stream, then run SQL queries
• Train an ML model offline, then update it
Spark is very well-suited to do this
Structured Streaming
High-level streaming API built on Spark SQL engine
• Declarative API that extends DataFrames / Datasets
• Event time, windowing, sessions, sources & sinks
Also supports interactive & batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
Not just streaming, but “continuous applications”
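A minimal sketch of "aggregate data in a stream, then query it with SQL", written against the Structured Streaming API as it later shipped in Spark 2.x (this talk predates the release, so the exact calls are an assumption relative to the slides). The event schema, the `user_counts` table name, and the sample file are hypothetical, and the in-memory sink stands in for a real serving layer such as JDBC.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object StreamingSketch {
  // Hypothetical event schema: one JSON object per line, e.g. {"user":"ann","text":"hi"}.
  val schema: StructType = StructType(Seq(
    StructField("user", StringType),
    StructField("text", StringType)))

  /** Counts messages per user from JSON files arriving in `dir`, then reads the result back with SQL. */
  def countsByUser(spark: SparkSession, dir: String): Map[String, Long] = {
    val events = spark.readStream.schema(schema).json(dir) // streaming DataFrame
    val counts = events.groupBy("user").count()            // same operators as a batch query
    val query = counts.writeStream
      .outputMode("complete")   // re-emit the full aggregate on every trigger
      .format("memory")         // in-memory table, intended for testing and debugging
      .queryName("user_counts") // name of the table the results are exposed under
      .start()
    try {
      query.processAllAvailable() // block until the files present so far are processed
      spark.sql("SELECT user, `count` FROM user_counts")
        .collect()
        .map(r => r.getString(0) -> r.getLong(1))
        .toMap
    } finally query.stop()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("streaming-sketch").getOrCreate()
    val dir = Files.createTempDirectory("events")
    Files.write(dir.resolve("batch1.json"),
      """{"user":"ann","text":"hello"}
        |{"user":"bob","text":"Spark"}
        |{"user":"ann","text":"bye"}""".stripMargin.getBytes("UTF-8"))
    println(countsByUser(spark, dir.toString))
    spark.stop()
  }
}
```

The aggregation is expressed exactly as it would be on a static DataFrame; only `readStream` and `writeStream` mark the query as continuous.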
Goal: end-to-end continuous applications
[Diagram: example pipeline. Components: Kafka, ETL, Database, Reporting Applications, ML Model, Ad-hoc Queries. Legend: Traditional streaming; Other processing types]
Details on Structured Streaming
Spark 2.0 will have a first version focused on ETL [SPARK-8360]
Later versions will add more operators & libraries
See Reynold’s keynote tomorrow for a deep dive!
Datasets & DataFrames
Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
Example
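import sqlContext.implicits._  // encoders required by as[...] and map

// id is Long because integers in JSON are inferred as bigint (Long)
case class User(name: String, id: Long)
case class Message(user: User, text: String)

val dataframe = sqlContext.read.json("log.json")  // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message]              // Dataset[Message]
val users = messages.filter(m => m.text.contains("Spark"))
                    .map(m => m.user)             // Dataset[User]
// pipeline: an MLlib Pipeline defined elsewhere; MLlib takes either DataFrames or Datasets
pipeline.fit(users)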
Benefits
Simpler to understand
• Only kept Dataset separate to keep binary compatibility in 1.x
Libraries can take data of both forms
With Streaming, same API will also work on streams
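One way to picture "libraries can take data of both forms": once DataFrame is an alias for Dataset[Row], a library function declared over `Dataset[_]` accepts either. The sketch below uses the Spark 2.x `SparkSession` entry point; `describe` and the `User` class are hypothetical names for this example, not part of any Spark library.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object BothFormsSketch {
  final case class User(name: String, id: Long)

  /** Hypothetical library function: accepts a DataFrame or any typed Dataset,
    * because in Spark 2.x DataFrame is an alias for Dataset[Row]. */
  def describe(data: Dataset[_]): String =
    s"${data.count()} rows, columns: ${data.columns.mkString(", ")}"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("both-forms").getOrCreate()
    import spark.implicits._

    val users = Seq(User("ann", 1L), User("bob", 2L)).toDS() // Dataset[User]
    val df = users.toDF()                                     // DataFrame = Dataset[Row]

    println(describe(users)) // typed Dataset
    println(describe(df))    // untyped DataFrame, same function
    spark.stop()
  }
}
```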
Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
Thank you!
Enjoy Spark Summit

2016 Spark Summit East Keynote: Matei Zaharia
