DataFrames for Large-scale Data Science
Reynold Xin @rxin
Feb 17, 2015 (Spark User Meetup)
Year of the lamb, goat, sheep, and ram …?
A slide from 2013 …
From MapReduce to Spark
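// Hadoop MapReduce (Java)
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Both classes are static nested classes of the enclosing job class (not shown).

// Mapper: emit (word, 1) for every token of the input line.
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// Reducer: sum the counts collected for each word.
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}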
// Spark (Scala)
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Spark’s Growth
Google Trends for “Apache Spark”
Beyond Hadoop Users
• Early adopters: users who understand MapReduce & functional APIs
• Beyond them: data scientists, statisticians, R users, PyData …
RDD API
• Most data is structured (JSON, CSV, Avro, Parquet, Hive …)
–  Programming RDDs inevitably ends up with a lot of tuples (_1, _2, …)
• Functional transformations (e.g. map/reduce) are not as intuitive
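To make the tuple problem concrete, here is a minimal plain-Python sketch. It is not Spark code, and the record layout and the names `rows` and `Event` are hypothetical. It contrasts positional access, as with `_1`, `_2` on RDD tuples, with access by named field:

```python
from collections import namedtuple

# Hypothetical records: (user, country, amount). Not taken from the talk.
rows = [("ann", "US", 10), ("bob", "DE", 5), ("cat", "US", 7)]

# Positional style, analogous to _1, _2, ... on RDD tuples:
# the reader must remember that r[1] is the country and r[2] the amount.
total_us_positional = sum(r[2] for r in rows if r[1] == "US")

# Named style, analogous to columns in a DataFrame schema.
Event = namedtuple("Event", ["user", "country", "amount"])
events = [Event(*r) for r in rows]
total_us_named = sum(e.amount for e in events if e.country == "US")

print(total_us_positional, total_us_named)
```

Both expressions compute the same total; only the second one states which fields it reads.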
DataFrames in Spark
• Distributed collection of data grouped into named columns (i.e. RDD with schema)
• Domain-specific functions designed for common tasks
–  Metadata
–  Sampling
–  Project, filter, aggregation, join, …
–  UDFs
• Available in Python, Scala, Java, and R (via SparkR)
[Bar chart: runtime performance (secs, axis 0 to 10) of aggregating 10 million int pairs, comparing RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF]
Agenda
• Introduction
• Learn by demo
• Design & internals
–  API design
–  Plan optimization
–  Integration with data sources
Learn by Demo (in a Databricks Cloud Notebook)
• Creation
• Project
• Filter
• Aggregations
• Join
• SQL
• UDFs
• Pandas
For the purpose of distributing the slides online, I’m attaching screenshots of the notebooks.
[Slides 13 to 26: notebook screenshots, not recoverable from this text]
Machine Learning Integration
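from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Stages: split text into words, hash words into feature vectors, fit a classifier.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
df = context.load("/path/to/data")
model = pipeline.fit(df)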
Design Philosophy
Simple tasks easy
-  DSL for common operations
-  Infer schema automatically (CSV, Parquet, JSON, …)
-  MLlib pipeline integration
Performance
-  Catalyst optimizer
-  Code generation
Complex tasks possible
-  RDD API
-  Full expression library
Interoperability
-  Various data sources and formats
-  Pandas, R, Hive …
DataFrame Internals
• Represented internally as a “logical plan”
• Execution is lazy, allowing it to be optimized by Catalyst
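A rough way to picture "plan first, execute later" is the plain-Python toy below. It is only a sketch, not how Spark or Catalyst is implemented, and the class `LazyFrame` and its methods are hypothetical names. Each call records a step in a plan, and nothing runs until `collect()` is called:

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: it stores a plan, not results."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = plan  # tuple of (operation name, function) steps

    def filter(self, predicate):
        # Record the step; nothing is computed yet.
        return LazyFrame(self._rows, self.plan + (("filter", predicate),))

    def select(self, *names):
        # Record a projection onto the named fields.
        project = lambda row: {n: row[n] for n in names}
        return LazyFrame(self._rows, self.plan + (("project", project),))

    def collect(self):
        # Only now is the recorded plan executed, step by step.
        rows = list(self._rows)
        for op, fn in self.plan:
            if op == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows


users = LazyFrame([{"id": 1, "age": 31}, {"id": 2, "age": 17}])
adults = users.filter(lambda r: r["age"] >= 18).select("id")

print([op for op, _ in adults.plan])  # the plan exists before any execution
print(adults.collect())
```

Because the whole plan is available before execution, an optimizer has the chance to rewrite it first; this toy simply runs the steps in order.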
Plan Optimization & Execution
[Diagram: Catalyst pipeline]
SQL AST / DataFrame → Unresolved Logical Plan
Analysis (uses Catalog) → Logical Plan
Logical Optimization → Optimized Logical Plan
Physical Planning → Physical Plans
Cost Model → Selected Physical Plan
Code Generation → RDDs
DataFrames and SQL share the same optimization/execution pipeline
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")

logical plan
  filter
    join  (this join is expensive)
      scan (users)
      scan (events)
physical plan
  join
    scan (users)
    filter
      scan (events)
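The rewrite from the logical plan to the physical plan above can be sketched as a single rule on a toy plan tree. This is an illustration in plain Python under simplifying assumptions, not Catalyst's actual rule: the node classes `Scan`, `Join`, `Filter` and the function `push_down` are hypothetical, and the predicate is kept as text and never evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class Filter:
    table: str      # table whose column the predicate refers to
    predicate: str  # kept as text; this toy never evaluates it
    child: object


def tables(plan):
    """Return the set of table names scanned beneath a plan node."""
    if isinstance(plan, Scan):
        return {plan.table}
    if isinstance(plan, Join):
        return tables(plan.left) | tables(plan.right)
    return tables(plan.child)


def push_down(plan):
    """Move a Filter below a Join when it touches only one join input."""
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        if plan.table in tables(join.left):
            moved = Filter(plan.table, plan.predicate, join.left)
            return Join(push_down(moved), join.right)
        if plan.table in tables(join.right):
            moved = Filter(plan.table, plan.predicate, join.right)
            return Join(join.left, push_down(moved))
    return plan


logical = Filter("events", "date >= '2015-01-01'",
                 Join(Scan("users"), Scan("events")))
optimized = push_down(logical)
print(optimized)
```

After the rewrite the filter sits directly above `scan (events)`, so the join sees fewer rows.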
Data Sources supported by DataFrames
[Logo grid of built-in and external data sources, including JSON and JDBC, and more …]
More Than Naïve Scans
• Data Sources API can automatically prune columns and push filters to the source
–  Parquet: skip irrelevant columns and blocks of data; turn string comparisons into integer comparisons for dictionary-encoded data
–  JDBC: rewrite queries to push predicates down
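The two Parquet points can be illustrated with a small plain-Python toy. It is a sketch of the idea only, not Parquet's or Spark's implementation; the data and the names `dictionary`, `columns`, and `scan` are hypothetical. The scan reads only the requested columns, and the string literal is translated to its dictionary code once so that rows are compared as integers:

```python
# Toy columnar table: each column is stored separately.
# The 'country' column is dictionary encoded: integer codes plus a dictionary.
dictionary = ["DE", "FR", "US"]        # code -> string
columns = {
    "country": [2, 0, 2, 1],           # codes into `dictionary`
    "amount": [10, 5, 7, 3],
    "comment": ["a", "b", "c", "d"],   # never read by the query below
}


def scan(needed, column, equals):
    """Read only `needed` columns, keeping rows where the
    dictionary-encoded `column` equals the string `equals`."""
    if equals not in dictionary:
        return []                      # value absent: skip the data entirely
    # Translate the string literal to its integer code once ...
    code = dictionary.index(equals)
    # ... then compare integers row by row instead of strings.
    keep = [i for i, c in enumerate(columns[column]) if c == code]
    return [{name: columns[name][i] for name in needed} for i in keep]


result = scan(needed=["amount"], column="country", equals="US")
print(result)
```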
joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")

logical plan
  filter
    join
      scan (users)
      scan (events)
optimized plan
  join
    scan (users)
    filter
      scan (events)
optimized plan with intelligent data sources
  join
    scan (users)
    filter scan (events)  (filter evaluated inside the scan)
DataFrames in Spark
• APIs in Python, Java, Scala, and R (via SparkR)
• For new users: make it easier to program Big Data
• For existing users: make Spark programs simpler & easier to understand, while improving performance
• Experimental API in Spark 1.3 (early March)
Our Vision
Thank you! Questions?
More Information
Blog post introducing DataFrames:
http://tinyurl.com/spark-dataframes
Build from source:
http://github.com/apache/spark (branch-1.3)
