Apache Spark 
Easy and Fast Big Data Analytics 
Pat McDonough
Founded by the creators of Apache Spark 
out of UC Berkeley’s AMPLab 
Fully committed to 100% open source 
Apache Spark 
Support and Grow the 
Spark Community and Ecosystem 
Building Databricks Cloud
Databricks & Datastax 
Apache Spark is packaged as part of Datastax 
Enterprise Analytics 4.5 
Databricks & Datastax Have Partnered for 
Apache Spark Engineering and Support
Big Data Analytics 
Where We’ve Been 
• 2003 & 2004 - Google 
GFS & MapReduce Papers 
are Precursors to Hadoop 
• 2006 & 2007 - Google 
BigTable and Amazon 
Dynamo Papers are 
Precursors to Cassandra, 
HBase, Others
Big Data Analytics 
A Zoo of Innovation
What's Working? 
Many Excellent Innovations Have Come From Big Data Analytics: 
• Distributed & Data Parallel is disruptive ... because we needed it 
• We Now Have Massive throughput… Solved the ETL Problem 
• The Data Hub/Lake Is Possible
What Needs to Improve? 
Go Beyond MapReduce 
MapReduce is a Very Powerful 
and Flexible Engine 
Processing Throughput 
Previously Unobtainable on 
Commodity Equipment 
But MapReduce Isn’t Enough: 
• Essentially Batch-only 
• Inefficient with respect to 
memory use, latency 
• Too Hard to Program
What Needs to Improve? 
Go Beyond (S)QL 
SQL Support Has Been A 
Welcome Interface on Many 
Platforms 
And in many cases, a faster 
alternative 
But SQL Is Often Not Enough: 
• Sometimes you want to write real programs 
(Loops, variables, functions, existing 
libraries) but don’t want to build UDFs. 
• Machine Learning (see above, plus iterative) 
• Multi-step pipelines 
• Often an Additional System
What Needs to Improve? 
Ease of Use 
Big Data Distributions Provide a 
number of Useful Tools and 
Systems 
Choices are Good to Have 
But This Is Often Unsatisfactory: 
• Each new system has its own configs, 
APIs, and management; coordinating 
multiple systems is challenging 
• A typical solution requires stringing 
together disparate systems - we need 
unification 
• Developers want the full power of their 
programming language
What Needs to Improve? 
Latency 
Big Data systems are 
throughput-oriented 
Some new SQL Systems 
provide interactivity 
But We Need More: 
• Interactivity beyond SQL 
interfaces 
• Repeated access of the same 
datasets (i.e. caching)
Can Spark Solve These 
Problems?
Apache Spark 
Originally developed in 2009 in UC Berkeley’s 
AMPLab 
Fully open sourced in 2010 – now at Apache 
Software Foundation 
http://spark.apache.org
Project Activity 
                         June 2013    June 2014 
total contributors            68          255 
companies contributing        17           50 
total lines of code       63,000      175,000
Compared to Other Projects 
[Chart: commits and lines of code changed over the past 6 months, Spark vs. other projects] 
Spark is now the most active project in the 
Hadoop ecosystem
Spark on GitHub 
So active on GitHub, sometimes we break it 
Over 1200 Forks (can’t display Network Graphs) 
~80 commits to master each week 
So many PRs that we built our own PR UI
Apache Spark - Easy to 
Use And Very Fast 
Fast and general cluster computing system interoperable with Big Data 
Systems Like Hadoop and Cassandra 
Improved Efficiency: 
• In-memory computing primitives 
• General computation graphs 
Improved Usability: 
• Rich APIs 
• Interactive shell
Apache Spark - Easy to 
Use And Very Fast 
Fast and general cluster computing system interoperable with Big Data 
Systems Like Hadoop and Cassandra 
Improved Efficiency (up to 100× faster in memory, 2-10× on disk): 
• In-memory computing primitives 
• General computation graphs 
Improved Usability (2-5× less code): 
• Rich APIs 
• Interactive shell
Apache Spark - A 
Robust SDK for Big 
Data Applications 
[Diagram: Spark stack with SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core] 
Unified System With Libraries to 
Build a Complete Solution 
Full-featured Programming 
Environment in Scala, Java, Python… 
Very developer-friendly, Functional 
API for working with Data 
Runtimes available on several 
platforms
Spark Is A Part Of Most 
Big Data Platforms 
• All Major Hadoop Distributions Include 
Spark 
• Spark Is Also Integrated With Non-Hadoop 
Big Data Platforms like DSE 
• Spark Applications Can Be Written Once 
and Deployed Anywhere 
[Diagram: Spark stack (SQL, Machine Learning, Streaming, Graph on top of Core), deployable anywhere] 
Deploy Spark Apps Anywhere
Easy: Get Started 
Immediately 
Interactive Shell Multi-language support 
Python 
lines = sc.textFile(...) 
lines.filter(lambda s: "ERROR" in s).count() 
Scala 
val lines = sc.textFile(...) 
lines.filter(x => x.contains("ERROR")).count() 
Java 
JavaRDD<String> lines = sc.textFile(...); 
lines.filter(new Function<String, Boolean>() { 
Boolean call(String s) { 
return s.contains("error"); 
} 
}).count();
Easy: Clean API 
Write programs in terms of transformations on 
distributed datasets 
Resilient Distributed Datasets 
• Collections of objects spread 
across a cluster, stored in RAM 
or on Disk 
• Built through parallel 
transformations 
• Automatically rebuilt on failure 
Operations 
• Transformations 
(e.g. map, filter, groupBy) 
• Actions 
(e.g. count, collect, save)
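A minimal Scala sketch of this transformation/action split, assuming a SparkContext sc as in the spark-shell and a hypothetical input file; transformations are lazy, and only the actions at the end trigger work: 
// Build an RDD through parallel transformations (nothing executes yet) 
val lines  = sc.textFile("events.log")                     // hypothetical file 
val errors = lines.filter(line => line.contains("ERROR"))  // transformation 
val pairs  = errors.map(line => (line.split(" ")(0), 1))   // transformation 
errors.cache()   // keep in RAM for reuse; rebuilt from lineage on failure 
// Actions trigger execution and return results to the driver 
println(errors.count())                                    // action 
pairs.reduceByKey(_ + _).collect().foreach(println)        // action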
Easy: Expressive API 
map reduce
Easy: Expressive API 
map 
filter 
groupBy 
sort 
union 
join 
leftOuterJoin 
rightOuterJoin 
reduce 
count 
fold 
reduceByKey 
groupByKey 
cogroup 
cross 
zip 
sample 
take 
first 
partitionBy 
mapWith 
pipe 
save ...
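As a small illustration (hypothetical data, assuming a SparkContext sc in the spark-shell), several of the operators listed above compose naturally: 
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))                  // (id, name) 
val clicks = sc.parallelize(Seq((1, "home"), (1, "search"), (3, "home")))   // (id, page) 
val clicksPerUser = clicks.map { case (id, _) => (id, 1) }.reduceByKey(_ + _) 
val joined        = users.leftOuterJoin(clicksPerUser)   // users with no clicks are kept 
val sampled       = joined.sample(withReplacement = false, fraction = 0.5) 
joined.sortBy { case (id, _) => id }.take(10).foreach(println)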
Easy: Example – Word Count 
Hadoop MapReduce 
public static class WordCountMapClass extends MapReduceBase 
implements Mapper<LongWritable, Text, Text, IntWritable> { 
private final static IntWritable one = new IntWritable(1); 
private Text word = new Text(); 
public void map(LongWritable key, Text value, 
OutputCollector<Text, IntWritable> output, 
Reporter reporter) throws IOException { 
String line = value.toString(); 
StringTokenizer itr = new StringTokenizer(line); 
while (itr.hasMoreTokens()) { 
word.set(itr.nextToken()); 
output.collect(word, one); 
} 
} 
} 
public static class WordCountReduce extends MapReduceBase 
implements Reducer<Text, IntWritable, Text, IntWritable> { 
public void reduce(Text key, Iterator<IntWritable> values, 
OutputCollector<Text, IntWritable> output, 
Reporter reporter) throws IOException { 
int sum = 0; 
while (values.hasNext()) { 
sum += values.next().get(); 
} 
output.collect(key, new IntWritable(sum)); 
} 
} 
Spark 
val spark = new SparkContext(master, appName, [sparkHome], [jars]) 
val file = spark.textFile("hdfs://...") 
val counts = file.flatMap(line => line.split(" ")) 
.map(word => (word, 1)) 
.reduceByKey(_ + _) 
counts.saveAsTextFile("hdfs://...")
Easy: Works Well With 
Hadoop 
Data Compatibility 
• Access your existing Hadoop 
Data 
• Use the same data formats 
• Adheres to data locality for 
efficient processing 
Deployment Models 
• “Standalone” deployment 
• YARN-based deployment 
• Mesos-based deployment 
• Deploy on existing Hadoop 
cluster or side-by-side
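Deployment choice mostly comes down to the master URL the application is configured with (or passed to spark-submit); a rough Scala sketch with placeholder host names: 
import org.apache.spark.SparkConf 
val standalone = new SparkConf().setMaster("spark://master-host:7077")  // Spark standalone cluster 
val yarn       = new SparkConf().setMaster("yarn-client")               // Hadoop YARN (client mode) 
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")   // Apache Mesos 
val local      = new SparkConf().setMaster("local[*]")                  // local development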
Example: Logistic Regression 
from math import exp 
import numpy 

data = spark.textFile(...).map(readPoint).cache() 

w = numpy.random.rand(D) 

for i in range(iterations): 
    gradient = (data 
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) 
                       * p.y * p.x) 
        .reduce(lambda x, y: x + y)) 
    w -= gradient 

print "Final w: %s" % w
Fast: Using RAM, Operator 
Graphs 
In-memory Caching 
• Data Partitions read from RAM 
instead of disk 
Operator Graphs 
• Scheduling Optimizations 
• Fault Tolerance 
[Diagram: operator DAG split into stages 1-3, with RDDs A-F linked by map, filter, groupBy, and join; cached partitions marked]
Fast: Logistic Regression 
Performance 
[Chart: running time (s) vs. number of iterations (1-30), Hadoop vs. Spark: Hadoop takes ~110 s per iteration; Spark takes ~80 s for the first iteration and ~1 s for further iterations]
Fast: Scales Down Seamlessly 
[Chart: execution time (s) vs. % of working set in cache: 68.8 s with cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached]
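Caching is requested per RDD; a short Scala sketch (assuming a SparkContext sc in the spark-shell) of the calls involved, where the chosen storage level decides how much of the working set stays in RAM: 
import org.apache.spark.storage.StorageLevel 
val messages = sc.textFile("hdfs://...").filter(line => line.contains("ERROR")) 
messages.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY) 
// messages.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: spill partitions that don't fit 
messages.count()      // first action materializes and caches the partitions 
messages.count()      // later actions read from the cache 
messages.unpersist()  // drop cached copies when done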
Easy: Fault Recovery 
RDDs track lineage information that can be used to 
efficiently recompute lost data 
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2]) 
[Diagram: lineage graph: HDFS File → filter (func = startswith(...)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
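The lineage an RDD tracks can also be inspected directly; a small Scala sketch (assuming a SparkContext sc in the spark-shell) of the same pipeline: 
val msgs = sc.textFile("hdfs://...") 
             .filter(line => line.startsWith("ERROR")) 
             .map(line => line.split("\t")(2)) 
// Prints the chain of parent RDDs Spark would replay to recompute a lost partition 
println(msgs.toDebugString)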
How Spark Works
Working With RDDs
Working With RDDs 
RDD 
textFile = sc.textFile("SomeFile.txt")
Working With RDDs 
RDD RDD 
Transformations 
textFile = sc.textFile("SomeFile.txt") 
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
Working With RDDs 
RDD RDD 
Transformations 
textFile = sc.textFile("SomeFile.txt") 
Action Value 
linesWithSpark = textFile.filter(lambda line: "Spark" in line) 
linesWithSpark.count() 
74 

linesWithSpark.first() 
# Apache Spark
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
Load error messages from a log into memory, then interactively search for 
various patterns 
Worker 
Example: Log Mining 
Worker 
Worker 
Driver
Load error messages from a log into memory, then interactively search for 
various patterns 
Worker 
Example: Log Mining 
Worker 
Worker 
Driver 
lines = spark.textFile("hdfs://...")
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
Worker 
Worker 
Worker 
Driver
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
Driver 
messages.filter(lambda s: "mysql" in s).count()
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
Driver 
messages.filter(lambda s: "mysql" in s).count() Action
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
Driver 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
tasks 
tasks 
tasks
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Read 
HDFS 
Block 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Read 
HDFS 
Block 
Read 
HDFS 
Block 
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Process 
& Cache 
Data 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
Process 
& Cache 
Data 
Process 
& Cache 
Data 
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
results 
results 
results
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count()
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count() 
tasks 
tasks 
tasks
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count() 
Driver 
Process 
from 
Cache 
Process 
from 
Cache 
Process 
from 
Cache 
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count() 
Driver 
results 
results 
results
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count() 
Driver 
Cache your data ➔ Faster Results 
Full-text search of Wikipedia 
• 60GB on 20 EC2 machines 
• 0.5 sec from cache vs. 20s for on-disk
Cassandra + Spark: 
A Great Combination 
Both are Easy to Use 
Spark Can Help You Bridge Your Hadoop and 
Cassandra Systems 
Use Spark Libraries, Caching on top of Cassandra-stored Data 
Combine Spark Streaming with Cassandra Storage 
Datastax spark-cassandra-connector: 
https://github.com/datastax/spark-cassandra-connector
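A minimal Scala sketch against the spark-cassandra-connector linked above; the keyspace, table, and column names are hypothetical, and the connection host is a placeholder: 
import com.datastax.spark.connector._ 
import org.apache.spark.SparkContext._ 
import org.apache.spark.{SparkConf, SparkContext} 
val conf = new SparkConf() 
  .setAppName("cassandra-spark-demo") 
  .set("spark.cassandra.connection.host", "127.0.0.1")   // your Cassandra contact point 
val sc = new SparkContext(conf) 
// Read a Cassandra table as an RDD and cache it for repeated analysis 
val words = sc.cassandraTable("test", "words").cache() 
println(words.count()) 
// Aggregate and write results back to another table (columns: word, count) 
words.map(row => (row.getString("word"), 1)) 
     .reduceByKey(_ + _) 
     .saveToCassandra("test", "word_counts", SomeColumns("word", "count"))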
Schema RDDs (Spark SQL) 
• Built-in Mechanism for recognizing Structured data in Spark 
• Allow for systems to apply several data access and relational 
optimizations (e.g. predicate push-down, partition pruning, broadcast 
joins) 
• Columnar in-memory representation when cached 
• Native Support for structured formats like Parquet and JSON 
• Great Compatibility with the Rest of the Stack (python, libraries, etc.)
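A brief Scala sketch of the Spark 1.x-era SchemaRDD API this refers to (assuming a SparkContext sc; the JSON file is hypothetical): 
import org.apache.spark.sql.SQLContext 
val sqlContext = new SQLContext(sc) 
// Native support for structured formats: infer a schema from JSON 
val people = sqlContext.jsonFile("people.json") 
people.registerTempTable("people") 
// Relational optimizations (e.g. predicate push-down) apply to queries like this 
sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect().foreach(println) 
// Columnar in-memory representation when the table is cached 
sqlContext.cacheTable("people")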
Thank You! 
Visit http://databricks.com: 
Blogs, Tutorials and more 
Questions?
