UC BERKELEY
It’s All Happening On-line

User Generated (Web, Social & Mobile). Every:
• Click
• Ad impression
• Billing event
• Fast forward, pause, …
• Friend request
• Transaction
• Network message
• Fault
• …

Internet of Things / M2M
Scientific Computing
Volume     Petabytes+
Variety    Unstructured
Velocity   Real-Time

Our view: More data should mean better answers
• Must balance Cost, Time, and Answer Quality
UC BERKELEY



[Diagram: three resources arranged around “Massive and Diverse Data”:]
• Algorithms: Machine Learning and Analytics
• People: CrowdSourcing & Human Computation
• Machines: Cloud Computing
…throughout the entire analytics lifecycle
Organized for Collaboration:
• Alex Bayen (Mobile Sensing)
• Ken Goldberg (Crowdsourcing)
• *Michael Franklin (Databases)
• Armando Fox (Systems)
• *Mike Jordan (Machine Learning)
• Anthony Joseph (Sec./Privacy)
• Randy Katz (Systems)
• Dave Patterson (Systems)
• *Ion Stoica (Systems)
• Scott Shenker (Networking)
> 450,000 downloads
• Sequencing costs (150X)
  [Chart: cost in $K per genome, 2001 - 2014, falling from ~$100,000 to ~$0.1; “Big Data” era annotated.]
• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster
• @TCGA: 5 PB = 20 cancers x 1000 genomes
• See Dave Patterson’s Talk: Thursday 3-4, BDT205
  (David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011)
Software stack (top to bottom):

    MLBase (Declarative Machine Learning)
    BlinkDB (approx QP)
    Shark (SQL) + Streaming
    Spark / Streaming   (alongside 3rd-party frameworks: Hadoop MR, MPI, GraphLab, etc.)
    Shared RDDs (distributed memory)
    Mesos (cluster resource manager)
    HDFS

Legend: 3rd party / AMPLab (released) / AMPLab (in progress)
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Lightning-Fast Cluster Computing
    lines = spark.textFile("hdfs://...")            // Base RDD
    errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count      // Action
    cachedMsgs.filter(_.contains("bar")).count

[Diagram: the driver ships tasks to workers; each worker scans its input block (Blocks 1-3), keeps the filtered results in memory (Caches 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
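For reference, a minimal self-contained sketch of the same pattern as a standalone program. It assumes the Scala API of the era (spark.SparkContext, a local master) and a hypothetical log path; treat it as illustrative rather than exact:

    import spark.SparkContext

    object LogMining {
      def main(args: Array[String]) {
        // Local mode with 4 threads; on a cluster this would be a Mesos master URL.
        val sc = new SparkContext("local[4]", "LogMining")

        val lines = sc.textFile("hdfs://namenode/logs")      // hypothetical path
        val errors = lines.filter(_.startsWith("ERROR"))
        val cachedMsgs = errors.map(_.split('\t')(2)).cache()

        // The first count scans the input; later counts hit the in-memory cache.
        println(cachedMsgs.filter(_.contains("foo")).count)
        println(cachedMsgs.filter(_.contains("bar")).count)
      }
    }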
The same pipeline viewed as its lineage chain:

    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

    HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
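Worth making explicit (a sketch, not from the slide): transformations such as filter and map only record this lineage; no cluster work happens until an action runs, and a lost partition is rebuilt by reapplying the recorded functions rather than by restoring a replica:

    // Nothing is computed here: these calls just build the lineage graph
    // (HadoopRDD -> FilteredRDD -> MappedRDD).
    val messages = sc.textFile("hdfs://...")          // hypothetical path
                     .filter(_.contains("error"))
                     .map(_.split('\t')(2))

    // The action triggers a job; if a worker is lost mid-job, Spark re-runs
    // the recorded filter/map functions over the lost input block only.
    val sample = messages.take(10)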
[Diagram: 2-D data points with a random initial line converging toward the target separator.]

Logistic regression:

    // Load data in memory once
    val points = spark.textFile(...).map(readPoint).cache()

    // Initial parameter vector
    var w = Vector.random(D)

    // Repeated MapReduce steps to do gradient descent
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
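The snippet leaves readPoint, Vector, D, and ITERATIONS undefined. A self-contained sketch that fills those gaps with plain Scala arrays; the input format (tab-separated, label first) and all helper names are assumptions for illustration:

    // Assumed input: one point per line, "label\tx1\tx2\t...".
    case class Point(x: Array[Double], y: Double)

    def readPoint(line: String): Point = {
      val nums = line.split('\t').map(_.toDouble)
      Point(nums.tail, nums.head)
    }

    def dot(a: Array[Double], b: Array[Double]): Double =
      (a, b).zipped.map(_ * _).sum

    val ITERATIONS = 10
    val points = sc.textFile("hdfs://...").map(readPoint).cache()
    var w = Array.fill(points.first.x.length)(math.random)   // random initial weights

    for (i <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        // Per-point gradient of the logistic loss.
        val scale = (1 / (1 + math.exp(-p.y * dot(w, p.x))) - 1) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => (a, b).zipped.map(_ + _))           // sum the gradients
      w = (w, gradient).zipped.map(_ - _)                    // take a gradient step
    }

Because points is cached, only the first iteration reads from HDFS; every later iteration reuses the in-memory dataset, which is what produces the speedup on the next slide.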
[Chart: running time (min) vs. number of iterations (1-30). Hadoop: 110 s per iteration. Spark: first iteration 80 s, further iterations 1 s.]
Java API (out now):

    JavaRDD<String> lines = sc.textFile(...);

    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("error");
      }
    }).count();

PySpark (coming soon):

    lines = sc.textFile(...)
    lines.filter(lambda x: 'error' in x).count()
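For comparison, the same filter-and-count in the Scala API shown earlier is a one-liner; the closure literal replaces the anonymous inner class that the Java version needs:

    sc.textFile("hdfs://...").filter(_.contains("error")).count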
[Chart: time (hours): Hive 20; Spark 0.5.]
Hive architecture:

    Client: CLI, JDBC
    Driver: SQL Parser → Query Optimizer → Physical Plan Execution
    Meta store
    Execution: MapReduce
    Storage: HDFS
Shark keeps the same structure, adds a Cache Mgr. to the driver, and swaps MapReduce for Spark:

    Client: CLI, JDBC
    Driver: SQL Parser → Query Optimizer → Physical Plan Execution; Cache Mgr.
    Meta store
    Execution: Spark
    Storage: HDFS
    Row Storage          Column Storage

    1  john   4.1        1     2     3
    2  mike   3.5        john  mike  sally
    3  sally  6.4        4.1   3.5   6.4
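To make the layout difference concrete, a small sketch in plain Scala (the record type and field names are invented for illustration). Shark's in-memory store keeps cached tables column-oriented, one array per column:

    // Row storage: one object per record. Flexible, but every row is a
    // separate JVM object (header overhead, pointer chasing, GC pressure).
    case class Row(id: Int, name: String, gpa: Double)
    val rows = Array(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

    // Column storage: one array per column. Dense primitive arrays give
    // better cache locality and far fewer objects for the GC to track.
    val ids   = Array(1, 2, 3)
    val names = Array("john", "mike", "sally")
    val gpas  = Array(4.1, 3.5, 6.4)

    // A query that touches only one column reads just that array:
    val avgGpa = gpas.sum / gpas.length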
[Chart: Selection query runtime (s), y-axis 0-100, comparing Shark, Shark (disk), and Hive; the fastest bar is labeled 1.1. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Chart: Group By runtime (s), y-axis 0-600, comparing Shark, Shark (disk), and Hive; the fastest bar is labeled 32. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Chart: Join runtime (s), y-axis 0-1800, comparing Shark (copartitioned), Shark, Shark (disk), and Hive; the fastest bar is labeled 105. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Chart: three Conviva queries, runtime (s), comparing Shark, Shark (disk), and Hive; the fastest bars are labeled 0.8 (Query 1), 0.7 (Query 2), and 1.0 (Query 3). 100 m2.4xlarge nodes, 1.7 TB Conviva dataset.]
spark-project.org
amplab.cs.berkeley.edu

                         UC BERKELEY
We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.

Editor's Notes

  • #20 (RDD lineage slide): Add “variables” to the “functions” in functional programming
  • #22 (logistic regression slide): Note that the dataset is reused on each gradient computation
  • #23: Key idea: add “variables” to the “functions” in functional programming
  • #24: This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  • #30 (row vs. column storage slide): Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join