Apache Spark 
Easy and Fast Big Data Analytics 
Pat McDonough
Founded by the creators of Apache Spark 
out of UC Berkeley’s AMPLab 
Fully committed to 100% open source 
Apache Spark 
Support and Grow the 
Spark Community and Ecosystem 
Building Databricks Cloud
Databricks & Datastax 
Apache Spark is packaged as part of Datastax 
Enterprise Analytics 4.5 
Databricks & Datastax Have Partnered for 
Apache Spark Engineering and Support
Big Data Analytics 
Where We’ve Been 
• 2003 & 2004 - Google 
GFS & MapReduce Papers 
are Precursors to Hadoop 
• 2006 & 2007 - Google 
BigTable and Amazon 
Dynamo Papers are 
Precursors to Cassandra, 
HBase, Others
Big Data Analytics 
A Zoo of Innovation
What's Working? 
Many Excellent Innovations Have Come From Big Data Analytics: 
• Distributed & Data Parallel is disruptive ... because we needed it 
• We Now Have Massive throughput… Solved the ETL Problem 
• The Data Hub/Lake Is Possible
What Needs to Improve? 
Go Beyond MapReduce 
MapReduce is a Very Powerful 
and Flexible Engine 
Processing Throughput 
Previously Unobtainable on 
Commodity Equipment 
But MapReduce Isn’t Enough: 
• Essentially Batch-only 
• Inefficient with respect to 
memory use, latency 
• Too Hard to Program
What Needs to Improve? 
Go Beyond (S)QL 
SQL Support Has Been A 
Welcome Interface on Many 
Platforms 
And in many cases, a faster 
alternative 
But SQL Is Often Not Enough: 
• Sometimes you want to write real programs 
(Loops, variables, functions, existing 
libraries) but don’t want to build UDFs. 
• Machine Learning (see above, plus iterative) 
• Multi-step pipelines 
• Often an Additional System
What Needs to Improve? 
Ease of Use 
Big Data Distributions Provide a 
number of Useful Tools and 
Systems 
Choices are Good to Have 
But This Is Often Unsatisfactory: 
• Each new system has its own configs, 
APIs, and management; coordinating 
multiple systems is challenging 
• A typical solution requires stringing 
together disparate systems - we need 
unification 
• Developers want the full power of their 
programming language
What Needs to Improve? 
Latency 
Big Data systems are 
throughput-oriented 
Some new SQL Systems 
provide interactivity 
But We Need More: 
• Interactivity beyond SQL 
interfaces 
• Repeated access of the same 
datasets (i.e. caching)
Can Spark Solve These 
Problems?
Apache Spark 
Originally developed in 2009 in UC Berkeley’s 
AMPLab 
Fully open sourced in 2010 – now at Apache 
Software Foundation 
http://spark.apache.org
Project Activity 
                         June 2013    June 2014 
total contributors            68          255 
companies contributing        17           50 
total lines of code       63,000      175,000
Compared to Other Projects 
[Chart: commits and lines of code changed over the past 6 months, Spark vs. other projects] 
Spark is now the most active project in the 
Hadoop ecosystem
Spark on GitHub 
So active on GitHub, sometimes we break it 
Over 1200 Forks (can’t display Network Graphs) 
~80 commits to master each week 
So many PRs that we built our own PR UI
Apache Spark - Easy to 
Use And Very Fast 
Fast and general cluster computing system interoperable with Big Data 
Systems Like Hadoop and Cassandra 
Improved Efficiency: 
• In-memory computing primitives 
• General computation graphs 
Improved Usability: 
• Rich APIs 
• Interactive shell
Apache Spark - Easy to 
Use And Very Fast 
Fast and general cluster computing system interoperable with Big Data 
Systems Like Hadoop and Cassandra 
Improved Efficiency (up to 100× faster in memory, 2-10× on disk): 
• In-memory computing primitives 
• General computation graphs 
Improved Usability (2-5× less code): 
• Rich APIs 
• Interactive shell
Apache Spark - A 
Robust SDK for Big 
Data Applications 
[Diagram: Spark stack with SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core] 
Unified System With Libraries to 
Build a Complete Solution 
Full-featured Programming 
Environment in Scala, Java, Python… 
Very developer-friendly, Functional 
API for working with Data 
Runtimes available on several 
platforms
Spark Is A Part Of Most 
Big Data Platforms 
• All Major Hadoop Distributions Include 
Spark 
• Spark Is Also Integrated With Non-Hadoop 
Big Data Platforms like DSE 
• Spark Applications Can Be Written Once 
and Deployed Anywhere 
[Diagram: Spark stack (SQL, Machine Learning, Streaming, Graph on top of Core), deployable anywhere] 
Deploy Spark Apps Anywhere
Easy: Get Started 
Immediately 
Interactive Shell Multi-language support 
Python 
lines = sc.textFile(...) 
lines.filter(lambda s: "ERROR" in s).count() 
Scala 
val lines = sc.textFile(...) 
lines.filter(x => x.contains("ERROR")).count() 
Java 
JavaRDD<String> lines = sc.textFile(...); 
lines.filter(new Function<String, Boolean>() { 
Boolean call(String s) { 
return s.contains("error"); 
} 
}).count();
Easy: Clean API 
Write programs in terms of transformations on 
distributed datasets 
Resilient Distributed Datasets 
• Collections of objects spread 
across a cluster, stored in RAM 
or on Disk 
• Built through parallel 
transformations 
• Automatically rebuilt on failure 
Operations 
• Transformations 
(e.g. map, filter, groupBy) 
• Actions 
(e.g. count, collect, save)
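A minimal Scala sketch of this transformation/action split, assuming a SparkContext sc as in the spark-shell and a hypothetical input file; transformations are lazy, and only the actions at the end trigger work: 
// Build an RDD through parallel transformations (nothing executes yet) 
val lines  = sc.textFile("events.log")                     // hypothetical file 
val errors = lines.filter(line => line.contains("ERROR"))  // transformation 
val pairs  = errors.map(line => (line.split(" ")(0), 1))   // transformation 
errors.cache()   // keep in RAM for reuse; rebuilt from lineage on failure 
// Actions trigger execution and return results to the driver 
println(errors.count())                                    // action 
pairs.reduceByKey(_ + _).collect().foreach(println)        // action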
Easy: Expressive API 
map reduce
Easy: Expressive API 
map 
filter 
groupBy 
sort 
union 
join 
leftOuterJoin 
rightOuterJoin 
reduce 
count 
fold 
reduceByKey 
groupByKey 
cogroup 
cross 
zip 
sample 
take 
first 
partitionBy 
mapWith 
pipe 
save ...
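As a small illustration (hypothetical data, assuming a SparkContext sc in the spark-shell), several of the operators listed above compose naturally: 
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))                  // (id, name) 
val clicks = sc.parallelize(Seq((1, "home"), (1, "search"), (3, "home")))   // (id, page) 
val clicksPerUser = clicks.map { case (id, _) => (id, 1) }.reduceByKey(_ + _) 
val joined        = users.leftOuterJoin(clicksPerUser)   // users with no clicks are kept 
val sampled       = joined.sample(withReplacement = false, fraction = 0.5) 
joined.sortBy { case (id, _) => id }.take(10).foreach(println)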
Easy: Example – Word Count 
Hadoop MapReduce 
public static class WordCountMapClass extends MapReduceBase 
implements Mapper<LongWritable, Text, Text, IntWritable> { 
private final static IntWritable one = new IntWritable(1); 
private Text word = new Text(); 
public void map(LongWritable key, Text value, 
OutputCollector<Text, IntWritable> output, 
Reporter reporter) throws IOException { 
String line = value.toString(); 
StringTokenizer itr = new StringTokenizer(line); 
while (itr.hasMoreTokens()) { 
word.set(itr.nextToken()); 
output.collect(word, one); 
} 
} 
} 
public static class WordCountReduce extends MapReduceBase 
implements Reducer<Text, IntWritable, Text, IntWritable> { 
public void reduce(Text key, Iterator<IntWritable> values, 
OutputCollector<Text, IntWritable> output, 
Reporter reporter) throws IOException { 
int sum = 0; 
while (values.hasNext()) { 
sum += values.next().get(); 
} 
output.collect(key, new IntWritable(sum)); 
} 
} 
Spark 
val spark = new SparkContext(master, appName, [sparkHome], [jars]) 
val file = spark.textFile("hdfs://...") 
val counts = file.flatMap(line => line.split(" ")) 
.map(word => (word, 1)) 
.reduceByKey(_ + _) 
counts.saveAsTextFile("hdfs://...")
Easy: Works Well With 
Hadoop 
Data Compatibility 
• Access your existing Hadoop 
Data 
• Use the same data formats 
• Adheres to data locality for 
efficient processing 
Deployment Models 
• “Standalone” deployment 
• YARN-based deployment 
• Mesos-based deployment 
• Deploy on existing Hadoop 
cluster or side-by-side
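Deployment choice mostly comes down to the master URL the application is configured with (or passed to spark-submit); a rough Scala sketch with placeholder host names: 
import org.apache.spark.SparkConf 
val standalone = new SparkConf().setMaster("spark://master-host:7077")  // Spark standalone cluster 
val yarn       = new SparkConf().setMaster("yarn-client")               // Hadoop YARN (client mode) 
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")   // Apache Mesos 
val local      = new SparkConf().setMaster("local[*]")                  // local development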
Example: Logistic Regression 
from math import exp 
import numpy 

data = spark.textFile(...).map(readPoint).cache() 

w = numpy.random.rand(D) 

for i in range(iterations): 
    gradient = (data 
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) 
                       * p.y * p.x) 
        .reduce(lambda x, y: x + y)) 
    w -= gradient 

print "Final w: %s" % w
Fast: Using RAM, Operator 
Graphs 
In-memory Caching 
• Data Partitions read from RAM 
instead of disk 
Operator Graphs 
• Scheduling Optimizations 
• Fault Tolerance 
[Diagram: operator DAG split into stages 1-3, with RDDs A-F linked by map, filter, groupBy, and join; cached partitions marked]
Fast: Logistic Regression 
Performance 
[Chart: running time (s) vs. number of iterations (1-30), Hadoop vs. Spark: Hadoop takes ~110 s per iteration; Spark takes ~80 s for the first iteration and ~1 s for further iterations]
Fast: Scales Down Seamlessly 
[Chart: execution time (s) vs. % of working set in cache: 68.8 s with cache disabled, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s fully cached]
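Caching is requested per RDD; a short Scala sketch (assuming a SparkContext sc in the spark-shell) of the calls involved, where the chosen storage level decides how much of the working set stays in RAM: 
import org.apache.spark.storage.StorageLevel 
val messages = sc.textFile("hdfs://...").filter(line => line.contains("ERROR")) 
messages.cache()                                   // shorthand for persist(StorageLevel.MEMORY_ONLY) 
// messages.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: spill partitions that don't fit 
messages.count()      // first action materializes and caches the partitions 
messages.count()      // later actions read from the cache 
messages.unpersist()  // drop cached copies when done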
Easy: Fault Recovery 
RDDs track lineage information that can be used to 
efficiently recompute lost data 
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2]) 
[Diagram: lineage graph: HDFS File → filter (func = startswith(...)) → Filtered RDD → map (func = split(...)) → Mapped RDD]
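The lineage an RDD tracks can also be inspected directly; a small Scala sketch (assuming a SparkContext sc in the spark-shell) of the same pipeline: 
val msgs = sc.textFile("hdfs://...") 
             .filter(line => line.startsWith("ERROR")) 
             .map(line => line.split("\t")(2)) 
// Prints the chain of parent RDDs Spark would replay to recompute a lost partition 
println(msgs.toDebugString)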
How Spark Works
Working With RDDs
Working With RDDs 
RDD 
textFile = sc.textFile("SomeFile.txt")
Working With RDDs 
RDD RDD 
Transformations 
textFile = sc.textFile("SomeFile.txt") 
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
Working With RDDs 
RDD RDD 
Transformations 
textFile = sc.textFile("SomeFile.txt") 
Action Value 
linesWithSpark = textFile.filter(lambda line: "Spark" in line) 
linesWithSpark.count() 
74 

linesWithSpark.first() 
# Apache Spark
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
Load error messages from a log into memory, then interactively search for 
various patterns 
Worker 
Example: Log Mining 
Worker 
Worker 
Driver
Load error messages from a log into memory, then interactively search for 
various patterns 
Worker 
Example: Log Mining 
Worker 
Worker 
Driver 
lines = spark.textFile("hdfs://...")
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
Worker 
Worker 
Worker 
Driver
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
Driver 
messages.filter(lambda s: "mysql" in s).count()
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
Driver 
messages.filter(lambda s: "mysql" in s).count() Action
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
Driver 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
tasks 
tasks 
tasks
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Read 
HDFS 
Block 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Read 
HDFS 
Block 
Read 
HDFS 
Block 
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Process 
& Cache 
Data 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
Process 
& Cache 
Data 
Process 
& Cache 
Data 
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
results 
results 
results
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count()
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Driver 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count() 
tasks 
tasks 
tasks
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count() 
Driver 
Process 
from 
Cache 
Process 
from 
Cache 
Process 
from 
Cache 
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count() 
Driver 
results 
results 
results
Example: Log Mining 
Load error messages from a log into memory, then interactively search for 
various patterns 
lines = spark.textFile("hdfs://...") 
errors = lines.filter(lambda s: s.startswith("ERROR")) 
messages = errors.map(lambda s: s.split("\t")[2]) 
messages.cache() 
Worker 
Worker 
Worker 
messages.filter(lambda s: "mysql" in s).count() 
Block 1 
Block 2 
Block 3 
Cache 1 
Cache 2 
Cache 3 
messages.filter(lambda s: "php" in s).count() 
Driver 
Cache your data ➔ Faster Results 
Full-text search of Wikipedia 
• 60GB on 20 EC2 machines 
• 0.5 sec from cache vs. 20s for on-disk
Cassandra + Spark: 
A Great Combination 
Both are Easy to Use 
Spark Can Help You Bridge Your Hadoop and 
Cassandra Systems 
Use Spark Libraries, Caching on top of Cassandra-stored Data 
Combine Spark Streaming with Cassandra Storage 
Datastax spark-cassandra-connector: 
https://github.com/datastax/spark-cassandra-connector
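A minimal Scala sketch against the spark-cassandra-connector linked above; the keyspace, table, and column names are hypothetical, and the connection host is a placeholder: 
import com.datastax.spark.connector._ 
import org.apache.spark.SparkContext._ 
import org.apache.spark.{SparkConf, SparkContext} 
val conf = new SparkConf() 
  .setAppName("cassandra-spark-demo") 
  .set("spark.cassandra.connection.host", "127.0.0.1")   // your Cassandra contact point 
val sc = new SparkContext(conf) 
// Read a Cassandra table as an RDD and cache it for repeated analysis 
val words = sc.cassandraTable("test", "words").cache() 
println(words.count()) 
// Aggregate and write results back to another table (columns: word, count) 
words.map(row => (row.getString("word"), 1)) 
     .reduceByKey(_ + _) 
     .saveToCassandra("test", "word_counts", SomeColumns("word", "count"))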
Schema RDDs (Spark SQL) 
• Built-in Mechanism for recognizing Structured data in Spark 
• Allow for systems to apply several data access and relational 
optimizations (e.g. predicate push-down, partition pruning, broadcast 
joins) 
• Columnar in-memory representation when cached 
• Native Support for structured formats like Parquet and JSON 
• Great Compatibility with the Rest of the Stack (python, libraries, etc.)
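A brief Scala sketch of the Spark 1.x-era SchemaRDD API this refers to (assuming a SparkContext sc; the JSON file is hypothetical): 
import org.apache.spark.sql.SQLContext 
val sqlContext = new SQLContext(sc) 
// Native support for structured formats: infer a schema from JSON 
val people = sqlContext.jsonFile("people.json") 
people.registerTempTable("people") 
// Relational optimizations (e.g. predicate push-down) apply to queries like this 
sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect().foreach(println) 
// Columnar in-memory representation when the table is cached 
sqlContext.cacheTable("people")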
Thank You! 
Visit http://databricks.com: 
Blogs, Tutorials and more 
Questions?
