Introducing DataFrames in Spark for Large Scale Data Science

DataFrames for Large-scale Data Science
Reynold Xin @rxin
Feb 17, 2015 (Spark User Meetup)

Year of the lamb, goat, sheep, and ram …?

From MapReduce to Spark
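// Word count in Hadoop MapReduce (Java, old mapred API)
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
    // Emits (word, 1) for every token in the input line.
    public static class WordCountMapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class WordCountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}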

// The same word count in Spark (Scala)
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
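
The three stages of the Spark version (flatMap, map, reduceByKey) can be mimicked on a local list to see what each one contributes. This is a minimal sketch in plain Python, not Spark code; the function name `word_count` and the sample lines are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words using the same three stages as the Spark snippet."""
    # flatMap: split every line into words and flatten into one stream.
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with the count 1.
    pairs = ((word, 1) for word in words)
    # reduceByKey: sum the counts that share a key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


counts = word_count(["a b a", "b c"])
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```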

Spark’s Growth
Google Trends for “Apache Spark”

Beyond Hadoop Users
• Early adopters: users who understand MapReduce & functional APIs
• Beyond early adopters: data scientists, statisticians, R users, PyData …

RDD API
• Most data is structured (JSON, CSV, Avro, Parquet, Hive …)
–  Programming RDDs inevitably ends up with a lot of tuples (_1, _2, …)
• Functional transformations (e.g. map/reduce) are not as intuitive
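
The tuple problem is easy to reproduce outside Spark. A minimal sketch in plain Python, assuming hypothetical (name, age, city) records: the positional version forces the reader to remember what each index means, while the named version carries its schema with it.

```python
from collections import namedtuple

# Hypothetical records: (name, age, city).
rows = [("alice", 34, "NYC"), ("bob", 19, "SF"), ("carol", 52, "NYC")]

# Positional access: the reader must remember what [1] and [2] mean.
adults_by_position = [r[0] for r in rows if r[1] >= 21 and r[2] == "NYC"]

# Named access: the schema travels with the data.
Person = namedtuple("Person", ["name", "age", "city"])
people = [Person(*r) for r in rows]
adults_by_name = [p.name for p in people if p.age >= 21 and p.city == "NYC"]

print(adults_by_name)  # ['alice', 'carol']
```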

DataFrames in Spark
• Distributed collection of data grouped into named columns (i.e. RDD with schema)
• Domain-specific functions designed for common tasks
–  Metadata
–  Sampling
–  Project, filter, aggregation, join, …
–  UDFs
• Available in Python, Scala, Java, and R (via SparkR)
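
Project, filter, and aggregation over named columns can be illustrated without Spark. A minimal sketch in plain Python, assuming rows are dictionaries keyed by column name; the helpers `project`, `where`, and `sum_by` and the sample data are hypothetical and are not part of the DataFrame API.

```python
from collections import defaultdict

# Hypothetical rows; each dict key acts as a column name.
events = [
    {"uid": 1, "kind": "click", "ms": 120},
    {"uid": 2, "kind": "view", "ms": 300},
    {"uid": 1, "kind": "view", "ms": 80},
]


def project(rows, columns):
    """Keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]


def where(rows, predicate):
    """Keep only rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def sum_by(rows, key, value):
    """Group rows by one column and sum another."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


views = where(events, lambda r: r["kind"] == "view")
print(project(views, ["uid", "ms"]))  # [{'uid': 2, 'ms': 300}, {'uid': 1, 'ms': 80}]
print(sum_by(events, "uid", "ms"))    # {1: 200, 2: 300}
```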

[Figure: bar chart of runtime (secs, axis 0 to 10) for aggregating 10 million int pairs, comparing RDD Scala, RDD Python, Spark Scala DF, and Spark Python DF]

Agenda
• Introduction
• Learn by demo
• Design & internals
–  API design
–  Plan optimization
–  Integration with data sources

Learn by Demo (in a Databricks Cloud Notebook)
• Creation
• Project
• Filter
• Aggregations
• Join
• SQL
• UDFs
• Pandas
For the purpose of distributing the slides online,
I’m attaching screenshots of the notebooks.

Machine Learning Integration
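from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Three stages: split text into words, hash words into features, fit a classifier.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
df = context.load("/path/to/data")  # context is assumed to be an existing SQLContext
model = pipeline.fit(df)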
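
What the first two stages add to each row can be approximated in plain Python. This is a rough sketch of the idea (tokenize, then apply the hashing trick), not Spark's implementation: the functions `tokenize` and `hashing_tf`, the use of `zlib.crc32` as the hash, and the vector size of 16 are all assumptions chosen for illustration.

```python
import zlib


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashing_tf(words, num_features=16):
    """Map words to a fixed-length term-frequency vector via the hashing trick."""
    vector = [0] * num_features
    for word in words:
        # crc32 is a stand-in hash chosen for determinism (an assumption).
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


# Each stage reads one column and writes another, as in the pipeline above.
row = {"text": "Spark makes big data simple and Spark is fast"}
row["words"] = tokenize(row["text"])
row["features"] = hashing_tf(row["words"])
print(row["features"])
```

Every word lands in exactly one bucket, so the vector entries always sum to the number of words, whatever the hash function.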

Design Philosophy
Simple tasks easy
-  DSL for common operations
-  Infer schema automatically (CSV, Parquet, JSON, …)
-  MLlib pipeline integration
Performance
-  Catalyst optimizer
-  Code generation
Complex tasks possible
-  RDD API
-  Full expression library
Interoperability
-  Various data sources and formats
-  Pandas, R, Hive …

DataFrame Internals
• Represented internally as a “logical plan”
• Execution is lazy, allowing it to be optimized by Catalyst
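
Lazy execution over a recorded plan can be sketched in a few lines. This toy model is plain Python and is not how Spark represents plans; the class `LazyFrame` and its methods are hypothetical. Each call only appends a step to the plan, and nothing touches the data until `collect()` runs, which is the window in which an optimizer could rewrite the plan.

```python
class LazyFrame:
    """Toy frame that records operations and runs them only on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (operation name, argument) steps

    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._plan + (("select", columns),))

    def explain(self):
        """Return the recorded operation names without executing anything."""
        return [name for name, _ in self._plan]

    def collect(self):
        """Execute the recorded plan against the source rows."""
        rows = self._rows
        for name, arg in self._plan:
            if name == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return list(rows)


users = LazyFrame([{"id": 1, "age": 34}, {"id": 2, "age": 19}])
adults = users.filter(lambda r: r["age"] >= 21).select("id")
print(adults.explain())  # ['filter', 'select']
print(adults.collect())  # [{'id': 1}]
```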

Plan Optimization & Execution
[Diagram: SQL AST / DataFrame → Unresolved Logical Plan → Analysis (using the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Model → Selected Physical Plan → Code Generation → RDDs]

DataFrames and SQL share the same optimization/execution pipeline

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date >= "2015-01-01")
logical plan
    filter
        join    <- this join is expensive
            scan (users)
            scan (events)
physical plan
    join
        scan (users)
        filter
            scan (events)
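
The rewrite from the logical plan to the physical plan is an instance of predicate pushdown. A minimal sketch of such a rule on a toy plan tree, in plain Python; the node classes `Scan`, `Filter`, and `Join` and the function `push_down` are hypothetical and far simpler than Catalyst's actual rules. The filter is tagged with the one table it refers to, and the rule moves it beneath the join onto that side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    table: str      # table whose column the predicate refers to
    condition: str  # kept as text; this toy never evaluates it
    child: object


@dataclass(frozen=True)
class Join:
    left: object
    right: object


def tables(plan):
    """Return the set of table names scanned beneath a plan node."""
    if isinstance(plan, Scan):
        return {plan.table}
    if isinstance(plan, Filter):
        return tables(plan.child)
    return tables(plan.left) | tables(plan.right)


def push_down(plan):
    """Move a filter below a join when it refers to only one join input."""
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join = plan.child
        if plan.table in tables(join.left):
            pushed = Filter(plan.table, plan.condition, join.left)
            return Join(push_down(pushed), push_down(join.right))
        if plan.table in tables(join.right):
            pushed = Filter(plan.table, plan.condition, join.right)
            return Join(push_down(join.left), push_down(pushed))
    if isinstance(plan, Filter):
        return Filter(plan.table, plan.condition, push_down(plan.child))
    if isinstance(plan, Join):
        return Join(push_down(plan.left), push_down(plan.right))
    return plan


logical = Filter("events", "date >= '2015-01-01'",
                 Join(Scan("users"), Scan("events")))
optimized = push_down(logical)
print(optimized)
```

After the rewrite the join sees only the events that pass the date filter, which is why the second plan is cheaper.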

Data Sources supported by DataFrames
Built-in and external data sources: JSON, JDBC, and more …

More Than Naïve Scans
• Data Sources API can automatically prune columns and push filters to the source
–  Parquet: skip irrelevant columns and blocks of data; turn string comparisons into integer comparisons for dictionary-encoded data
–  JDBC: rewrite queries to push predicates down
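
Column pruning and filter pushdown can be pictured as a scan call that receives the required columns and the filters, so the source can skip everything else. A toy sketch in plain Python; the class `ToySource`, its `scan` signature, and the sample data are assumptions for illustration and do not mirror the real Data Sources API.

```python
class ToySource:
    """Toy columnar source that accepts required columns and simple filters."""

    def __init__(self, columns):
        self._columns = columns  # column name -> list of values
        self.columns_read = []   # bookkeeping so the effect is observable

    def scan(self, required_columns, filters=()):
        """Return rows holding only required_columns that pass all filters.

        Each filter is a (column, minimum) pair meaning column >= minimum.
        """
        needed = set(required_columns) | {col for col, _ in filters}
        self.columns_read = sorted(needed)
        size = len(next(iter(self._columns.values())))
        rows = []
        for i in range(size):
            if all(self._columns[col][i] >= low for col, low in filters):
                rows.append({c: self._columns[c][i] for c in required_columns})
        return rows


events = ToySource({
    "uid": [1, 2, 1],
    "date": ["2014-12-30", "2015-01-02", "2015-02-10"],
    "payload": ["a", "b", "c"],
})
rows = events.scan(["uid"], filters=[("date", "2015-01-01")])
print(rows)                 # [{'uid': 2}, {'uid': 1}]
print(events.columns_read)  # ['date', 'uid']
```

The `payload` column is never touched, and rows before the cutoff date never leave the source.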

joined = users.join(events, users.id == events.uid)
filtered = joined.filter(events.date > "2015-01-01")
logical plan
    filter
        join
            scan (users)
            scan (events)
optimized plan
    join
        scan (users)
        filter
            scan (events)
optimized plan with intelligent data sources
    join
        scan (users)
        filter scan (events)

DataFrames in Spark
• APIs in Python, Java, Scala, and R (via SparkR)
• For new users: make it easier to program Big Data
• For existing users: make Spark programs simpler & easier to understand, while improving performance
• Experimental API in Spark 1.3 (early March)

More Information
Blog post introducing DataFrames:
http://tinyurl.com/spark-dataframes
Build from source:
http://github.com/apache/spark (branch-1.3)
