From DataFrames to Tungsten:
A Peek into Spark’s Future
Reynold Xin @rxin
Spark Summit, San Francisco
June 16th, 2015
DataFrame
noun
Making Spark accessible to everyone (data
scientists, engineers, statisticians, …)
Tungsten
noun
Making Spark faster & preparing it for the next
five years.
How do DataFrames and
Tungsten relate to each other?
Google Trends for “dataframe”
Single-node tabular data structure, with API for
relational algebra (filter, join, …)
math and stats
input/output (CSV, JSON, …)
ad infinitum
Data frame: lingua franca for “small data”
	
  
head(flights)
#> Source: local data frame [6 x 16]
#>
#>    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
#> 1  2013     1   1      517         2      830        11      UA  N14228
#> 2  2013     1   1      533         4      850        20      UA  N24211
#> 3  2013     1   1      542         2      923        33      AA  N619AA
#> 4  2013     1   1      544        -1     1004       -18      B6  N804JB
#> ..  ...   ... ...      ...       ...      ...       ...     ...     ...
  
	
  
Spark DataFrame
> head(filter(df, df$waiting < 50))  # an example in R
##   eruptions waiting
##1     1.750      47
##2     1.750      47
##3     1.867      48
  
	
  
Distributed data frame for Java, Python, R, Scala
APIs similar to those of single-node tools (pandas, dplyr), i.e. easy to learn
[Figure: data size scale (KB, MB, GB, TB, PB), comparing the range covered by existing single-node data frames with that of the Spark DataFrame]
It is not Spark vs Python/R,
but Spark and Python/R.
Spark and Python/R
Spark DF: scalability (multi-core, multi-machines)
Python/R DF: wealth of libraries (Viz, Machine Learning, Stats)
Spark RDD Execution
Java/Scala API -> JVM Execution
Python API -> Python Execution
Opaque closures (user-defined functions)
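To make "opaque closures" concrete, here is a minimal Python sketch, not Spark code. The `Col` and `LessThan` classes are hypothetical helpers invented for illustration, and the sample rows are made up. It contrasts a predicate passed as a closure, which an engine can only call, with a predicate expressed as data, which an engine could inspect before running it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    """A named column reference (hypothetical helper, not a Spark class)."""
    name: str

    def __lt__(self, value):
        # Build a description of the comparison instead of evaluating it.
        return LessThan(self, value)


@dataclass(frozen=True)
class LessThan:
    """A declarative 'column < literal' predicate that an engine can inspect."""
    col: Col
    value: object

    def evaluate(self, row):
        return row[self.col.name] < self.value


# Made-up sample rows for illustration.
rows = [
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 3.600, "waiting": 79},
    {"eruptions": 1.867, "waiting": 48},
]

# RDD style: the predicate is an opaque closure. The engine can call it,
# but cannot tell which columns it reads or what comparison it performs.
opaque = lambda row: row["waiting"] < 50
rdd_result = [r for r in rows if opaque(r)]

# DataFrame style: the predicate is data. The column name and the literal
# are visible before any row is processed.
expr = Col("waiting") < 50
df_result = [r for r in rows if expr.evaluate(r)]

print(rdd_result == df_result)    # both styles select the same rows
print(expr.col.name, expr.value)  # the expression can be inspected
```

Both styles return the same rows; the difference is only in how much the engine can learn about the computation ahead of time.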
Spark DataFrame Execution
DataFrame -> Logical Plan -> (Catalyst optimizer) -> Physical Execution
Logical Plan: intermediate representation for computation
Spark DataFrame Execution
Python DF, Java/Scala DF, R DF -> Logical Plan -> (Catalyst optimizer) -> Physical Execution
Language DataFrames: simple wrappers to create logical plan
Logical Plan: intermediate representation for computation
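The flow from a thin frontend to a logical plan, an optimizer, and physical execution can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not Catalyst: the node classes (`Scan`, `Filter`, `Project`), the `filter_lt` method, and the single rewrite rule are all hypothetical, and the sample rows are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Leaf node: read rows from an in-memory table."""
    rows: tuple


@dataclass(frozen=True)
class Filter:
    """Keep rows where column `col` is strictly less than `bound`."""
    child: object
    col: str
    bound: float


@dataclass(frozen=True)
class Project:
    """Keep only the named columns."""
    child: object
    cols: tuple


def optimize(plan):
    """Apply one toy rule bottom-up: merge stacked filters on the same column."""
    if isinstance(plan, Scan):
        return plan
    child = optimize(plan.child)
    if isinstance(plan, Filter):
        if isinstance(child, Filter) and child.col == plan.col:
            # x < a AND x < b is equivalent to x < min(a, b)
            return Filter(child.child, plan.col, min(plan.bound, child.bound))
        return Filter(child, plan.col, plan.bound)
    return Project(child, plan.cols)


def execute(plan):
    """Naive physical execution: interpret the plan over Python dicts."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    rows = execute(plan.child)
    if isinstance(plan, Filter):
        return [r for r in rows if r[plan.col] < plan.bound]
    return [{c: r[c] for c in plan.cols} for r in rows]


class DataFrame:
    """Thin frontend: each method only wraps the current plan in a new node."""

    def __init__(self, plan):
        self.plan = plan

    def filter_lt(self, col, bound):
        return DataFrame(Filter(self.plan, col, bound))

    def select(self, *cols):
        return DataFrame(Project(self.plan, cols))

    def collect(self):
        # Nothing runs until here: optimize the plan, then execute it.
        return execute(optimize(self.plan))


# Made-up sample rows for illustration.
table = (
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 3.600, "waiting": 79},
    {"eruptions": 1.867, "waiting": 48},
)

df = (DataFrame(Scan(table))
      .filter_lt("waiting", 60)
      .filter_lt("waiting", 50)
      .select("eruptions"))

optimized = optimize(df.plan)
print(optimized.child.bound)  # the two filters were merged into one
print(df.collect())
```

Because the frontend only builds plan nodes, a binding for another language would need to reproduce just the small `DataFrame` wrapper; the optimizer and executor would be shared.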
Benefit of Logical Plan: Simpler Frontend
Python: ~2000 lines of code (built over a weekend)
R: ~1000 lines of code
i.e. much easier to add new language bindings (Julia, Clojure, …)
Performance
[Chart: runtime for an example aggregation workload using the RDD API, Java/Scala vs. Python; axis 0 to 10]
Benefit of Logical Plan:
Performance Parity Across Languages
[Chart: runtime for an example aggregation workload (secs), axis 0 to 10; RDD: Java/Scala, Python; DataFrame: Java/Scala, Python, R, SQL]
What about Tungsten?
Hardware Trends
            2010             2015              Change
Storage     50+MB/s (HDD)    500+MB/s (SSD)    10X
Network     1Gbps            10Gbps            10X
CPU         ~3GHz            ~3GHz             ☹
Tungsten: Preparing Spark for the Next 5 Years
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
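As an illustration of item (1), the sketch below compares interpreting an expression tree row by row with generating source code for it once and compiling that at runtime. It is a toy analogue in Python using the built-in `compile` and `eval`. Tungsten itself targets the JVM, and the tuple-based tree format, function names, and sample rows here are assumptions made for the example.

```python
# Expression tree as nested tuples (hypothetical format):
#   ("col", name), ("lit", value), (op, left, right)
EXPR = ("and",
        ("<", ("col", "waiting"), ("lit", 50)),
        (">", ("col", "eruptions"), ("lit", 1.8)))


def interpret(expr, row):
    """Tree-walking evaluation: one dispatch per node, for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if kind == "<":
        return left < right
    if kind == ">":
        return left > right
    if kind == "and":
        return left and right
    raise ValueError(f"unknown node: {kind}")


def to_source(expr):
    """Turn the tree into a Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    if kind not in ("<", ">", "and"):
        raise ValueError(f"unknown node: {kind}")
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate source once, compile it, and return (function, source)."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval")), source


# Made-up sample rows for illustration.
rows = [
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 3.600, "waiting": 79},
    {"eruptions": 1.867, "waiting": 48},
]

predicate, source = compile_expr(EXPR)
print(source)
print([predicate(r) for r in rows] == [interpret(EXPR, r) for r in rows])
```

The generated function does the same work as the interpreter but without walking the tree for each row, which is the kind of per-row overhead that runtime code generation aims to remove.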
From DataFrame to Tungsten
Python DF, Java/Scala DF, R DF -> Logical Plan -> Tungsten Execution
5PM
Deep Dive into Project Tungsten
Developer Track by Josh Rosen
Initial Performance Results
[Chart: runtime (seconds, axis 0 to 1200) vs. relative data set size (1x, 2x, 4x, 8x), Tungsten-off vs. Tungsten-on]
Unified API, One Engine, Automatically Optimized
Language frontend: Python, Java/Scala, R, SQL, …
DataFrame
Logical Plan
Tungsten
Backend: LLVM, JVM, GPU, NVRAM, …
Python, SQL, R, Streaming, Advanced Analytics
DataFrame
Tungsten Execution
Spark Office Hours Today
Databricks booth A1
Time Topic Area
1:00-1:45 Core, YARN, Ops
1:45-2:30 Core/SQL/Data Science
3:00-3:40 Streaming
3:40-4:15 Core, Python, R
4:30-5:15 Machine Learning
5:15-6:00 Matei Zaharia
