Spark DataFrames:
Simple and Fast Analytics
on Structured Data
Michael Armbrust
Spark Summit 2015 - June 15th
About Me and Spark SQL

• Spark SQL
•  Part of the core distribution since Spark 1.0 (April 2014); graduated from Alpha in 1.3
•  Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments (improved multi-version Hive support in 1.4)
•  Connect existing BI tools to Spark through JDBC
•  Bindings in Python, Scala, Java, and R
• @michaelarmbrust
•  Lead developer of Spark SQL @databricks

SELECT COUNT(*)
FROM hiveTable
WHERE hive_udf(data)

[Charts: # of commits per month and # of contributors to Spark SQL]
The not-so-secret truth...
Spark SQL is about more than SQL.
Spark SQL: The whole story
Creating and Running Spark Programs Faster:
•  Write less code
•  Read less data
•  Let the optimizer do the hard work
DataFrame
noun – [dey-tuh-freym]
1.  A distributed collection of rows organized into named columns.
2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3.  Archaic: Previously SchemaRDD (cf. Spark < 1.3).
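To make the definition concrete, here is a minimal PySpark sketch that builds a DataFrame from in-memory rows; the column names and values are made up for illustration:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="dataframe-example")
sqlContext = SQLContext(sc)

# Each Row becomes one record; column names come from the Row fields.
people = sqlContext.createDataFrame([
    Row(name="Alice", age=34, city="SF"),
    Row(name="Bob",   age=45, city="NYC"),
])

people.printSchema()   # inferred schema: name string, age long, city string
people.show()          # prints the rows as a small table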
Write Less Code: Input & Output
Unified interface to reading/writing data in a variety of formats:

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")

•  read and write functions create new builders for doing I/O
•  Builder methods specify: format, partitioning, and handling of existing data
•  load(…), save(…) or saveAsTable(…) finish the I/O specification
Write Less Code: Input & Output
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats, both built-in ({ JSON }, JDBC, and more…) and external.
Find more sources at http://spark-packages.org/
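Reading from one of these sources looks the same as reading JSON above. A hedged sketch of a JDBC read; the connection URL, table name, and credentials are placeholders, and the JDBC driver jar must be on the classpath:

# Hypothetical JDBC source; swap in a real database URL and table.
jdbcDF = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://dbhost:5432/mydb") \
    .option("dbtable", "public.users") \
    .option("user", "spark") \
    .option("password", "*******") \
    .load()

# Once loaded it is an ordinary DataFrame and can be written back out.
jdbcDF.write.format("parquet").mode("overwrite").save("/tmp/users_parquet")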
ETL Using Custom Data Sources

sqlContext.read
  .format("com.databricks.spark.jira")
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()
  .repartition(1)
  .write
  .format("parquet")
  .saveAsTable("sparkSqlJira")
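Once the JIRA issues are saved as a table, downstream jobs can query them like any other DataFrame. A minimal sketch in PySpark, assuming the sparkSqlJira table created above; the status column is an assumption about the schema the JIRA source exposes:

# Hypothetical follow-up query against the saved table.
issues = sqlContext.table("sparkSqlJira")

open_by_status = issues \
    .groupBy("status") \
    .count() \
    .orderBy("count", ascending=False)

open_by_status.show()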
Write Less Code: High-Level Operations
Solve common problems concisely using DataFrame functions:
•  Selecting columns and filtering
•  Joining different data sources
•  Aggregation (count, sum, average, etc.)
•  Plotting results with Pandas
Each of these is a one- or two-line DataFrame call; see the sketch below.
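A minimal PySpark sketch tying the four bullets together; the table names (events, users) and their columns are hypothetical:

from pyspark.sql.functions import avg, count

# Hypothetical registered tables.
events = sqlContext.table("events")
users = sqlContext.table("users")

result = events \
    .select("user_id", "duration", "city") \
    .where(events.city == "San Francisco") \
    .join(users, events.user_id == users.user_id) \
    .groupBy(users.country) \
    .agg(count("*").alias("num_events"), avg("duration").alias("avg_duration"))

# Collect the small aggregated result locally for plotting with Pandas.
pdf = result.toPandas()
pdf.plot(x="country", y="avg_duration", kind="bar")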
Write Less Code: Compute an Average

Hadoop MapReduce (Java):

private IntWritable one =
  new IntWritable(1)
private IntWritable output =
  new IntWritable()
protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t")
  output.set(Integer.parseInt(fields[1]))
  context.write(one, output)
}

IntWritable one = new IntWritable(1)
DoubleWritable average = new DoubleWritable()

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0
  int count = 0
  for(IntWritable value : values) {
    sum += value.get()
    count++
  }
  average.set(sum / (double) count)
  context.write(key, average)
}

Spark RDDs (Python):

data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [x[1], 1])) \
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
   .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
   .collect()
Write Less Code: Compute an Average

Using RDDs:

data = sc.textFile(...).split("\t")
data.map(lambda x: (x[0], [int(x[1]), 1])) \
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
   .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
   .collect()

Using DataFrames:

sqlCtx.table("people") \
   .groupBy("name") \
   .agg("name", avg("age")) \
   .collect()

Full API Docs
•  Python
•  Scala
•  Java
•  R
Not Just Less Code: Faster Implementations

[Chart: Time to Aggregate 10 million int pairs (secs) for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, DataFrame R, and DataFrame SQL]
Demo
Combine data from JIRA with data from GitHub pull requests.
Running in Databricks:
•  Hosted Spark in the cloud
•  Notebooks with integrated visualization
•  Scheduled production jobs
https://accounts.cloud.databricks.com/
Demo notebook (https://demo.cloud.databricks.com/#notebook/43587):

%run /home/michael/ss.2015.demo/spark.sql.lib ...

%sql SELECT * FROM sparkSqlJira

val rawPRs = sqlContext.read
  .format("com.databricks.spark.rest")
  .option("url", "https://spark-prs.appspot.com/search-open-prs")
  .load()
display(rawPRs)

rawPRs: org.apache.spark.sql.DataFrame = [commenters: array<struct<data:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,username:string>>, components: array<string>, is_mergeable: boolean, jira_issuetype_icon_url: string, jira_issuetype_name: string, jira_priority_icon_url: string, jira_priority_name: string, last_jenkins_comment: struct<body:string,html_url:string,user:struct<login:string>>, last_jenkins_outcome: string, lines_added: bigint, lines_changed: bigint, lines_deleted: bigint, number: bigint, parsed_title: struct<jiras:array<bigint>,metadata:string,title:string>, state: string, updated_at: string, user: string]

import org.apache.spark.sql.functions._
val sparkPRs = rawPRs
  .select(
    // "Explode" nested array to create one row per item.
    explode($"components").as("component"),

    // Use a built-in function to construct the full 'SPARK-XXXX' key.
    concat("SPARK-", $"parsed_title.jiras"(0)).as("pr_jira"),

    // Other required columns.
    $"parsed_title.title",
    $"jira_issuetype_icon_url",
    $"jira_priority_icon_url",
    $"number",
    $"commenters",
    $"user",
    $"last_jenkins_outcome",
    $"is_mergeable")
  .where($"component" === "SQL") // Select only SQL PRs

sparkPRs: org.apache.spark.sql.DataFrame = [component: string, pr_jira: string, title: string, jira_issuetype_icon_url: string, jira_priority_icon_url: string, number: bigint, commenters: array<struct<data:struct<asked_to_close:boolean,avatar:string,body:string,date:array<string>,diff_hunk:string,said_lgtm:boolean,url:string>,username:string>>, user: string, last_jenkins_outcome: string, is_mergeable: boolean]

table("sparkSqlJira")
  .join(sparkPRs, $"key" === $"pr_jira")
  .jiraTable
Plan Optimization & Execution

[Diagram: a SQL AST or DataFrame becomes an Unresolved Logical Plan; Analysis (using the Catalog) yields a Logical Plan; Logical Optimization yields an Optimized Logical Plan; Physical Planning yields candidate Physical Plans; a Cost Model picks the Selected Physical Plan; Code Generation turns it into RDDs.]

DataFrames and SQL share the same optimization/execution pipeline.
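One way to watch this pipeline work on a concrete query is DataFrame.explain(). A minimal sketch, assuming a hypothetical people table; the exact plan text varies by Spark version:

from pyspark.sql.functions import avg

df = sqlContext.table("people") \
    .where("age > 21") \
    .groupBy("name") \
    .agg(avg("age"))

# extended=True prints the parsed, analyzed, and optimized logical plans
# plus the selected physical plan produced by the stages above.
df.explain(True)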
Seamlessly Integrated
Intermix DataFrame operations with custom Python, Java, R, or Scala code:

zipToCity = udf(lambda zipCode: <custom logic here>)

def add_demographics(events):
   u = sqlCtx.table("users")
   return (events
     .join(u, events.user_id == u.user_id)
     .withColumn("city", zipToCity(u.zip)))

Augments any DataFrame that contains user_id.
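A self-contained sketch of the UDF piece: in PySpark the return type can be declared explicitly (StringType here), and the resulting column composes with built-in operations like any other expression. The zip-to-city mapping is made up for illustration:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical lookup standing in for <custom logic here>.
ZIP_TO_CITY = {"94105": "San Francisco", "10001": "New York"}
zipToCity = udf(lambda zipCode: ZIP_TO_CITY.get(zipCode, "unknown"), StringType())

events = sqlCtx.table("events")        # assumed to contain user_id
with_city = add_demographics(events)   # function from the slide above

# The UDF output is an ordinary column, usable in filters and aggregations.
with_city.where(with_city.city == "San Francisco").count()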
Optimize Entire Pipelines
Optimization happens as late as possible, therefore Spark SQL can optimize even across functions.

events = add_demographics(sqlCtx.load("/data/events", "json"))

training_data = events \
  .where(events.city == "San Francisco") \
  .select(events.timestamp) \
  .collect()
def add_demographics(events):
   u = sqlCtx.table("users")                        # Load Hive table
   return (events
     .join(u, events.user_id == u.user_id)          # Join on user_id
     .withColumn("city", zipToCity(u.zip)))         # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

[Diagram — Logical Plan: filter over a join of the events file and the users table; the join is expensive, and ideally we would only join the relevant users. Physical Plan: join of scan (events) with filter over scan (users).]
def add_demographics(events):
   u = sqlCtx.table("users")                        # Load partitioned Hive table
   return (events
     .join(u, events.user_id == u.user_id)          # Join on user_id
     .withColumn("city", zipToCity(u.zip)))         # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "San Francisco").select(events.timestamp).collect()

[Diagram — Logical Plan: filter over a join of the events file and the users table. Physical Plan: join of scan (events) with filter over scan (users). Optimized Physical Plan with Predicate Pushdown and Column Pruning: join of optimized scan (events) with optimized scan (users).]
Machine Learning Pipelines

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: tokenizer, hashingTF, and lr transform ds0 → ds1 → ds2 → ds3; fitting yields a PipelineModel]

Find out more during Joseph’s Talk: 3pm Today
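A fitted PipelineModel is itself a transformer, so scoring new data is one more DataFrame-to-DataFrame step. A hedged sketch, assuming a held-out DataFrame with the same text column:

# Hypothetical held-out data with a 'text' column.
test_df = sqlCtx.table("test_documents")

# transform() runs tokenizer -> hashingTF -> the fitted model and
# appends prediction (and probability) columns to the DataFrame.
predictions = model.transform(test_df)
predictions.select("text", "prediction").show()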
Project Tungsten: Initial Results

[Chart: average GC time per node (seconds) vs. data set size (relative, 1x–16x) for Default, Code Gen, Tungsten on-heap, and Tungsten off-heap]

Find out more during Josh’s Talk: 5pm Tomorrow
Questions?
Spark SQL Office Hours Today
-  Michael Armbrust 1:45-2:30
-  Yin Huai 3:40-4:15
Spark SQL Office Hours Tomorrow
-  Reynold 1:45-2:30
