UC BERKELEY
It’s All Happening On-line

User Generated (Web, Social & Mobile). Every:
• Click
• Ad impression
• Billing event
• Fast forward, pause, …
• Friend request
• Transaction
• Network message
• Fault
• …

Internet of Things / M2M
Scientific Computing
Volume     Petabytes+
Variety    Unstructured
Velocity   Real-Time

Our view: More data should mean better answers
• Must balance Cost, Time, and Answer Quality
UC BERKELEY



[Diagram: three resources arranged around “Massive and Diverse Data”:]
• Algorithms: Machine Learning and Analytics
• People: CrowdSourcing & Human Computation
• Machines: Cloud Computing
…throughout the entire analytics lifecycle
Organized for Collaboration:
• Alex Bayen (Mobile Sensing)
• Ken Goldberg (Crowdsourcing)
• *Michael Franklin (Databases)
• Armando Fox (Systems)
• *Mike Jordan (Machine Learning)
• Anthony Joseph (Sec./Privacy)
• Randy Katz (Systems)
• Dave Patterson (Systems)
• *Ion Stoica (Systems)
• Scott Shenker (Networking)
> 450,000 downloads
• Sequencing costs (150X)
  [Chart: cost in $K per genome, 2001 - 2014, falling from ~$100,000 to ~$0.1; “Big Data” era annotated.]
• UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster
• @TCGA: 5 PB = 20 cancers x 1000 genomes
• See Dave Patterson’s Talk: Thursday 3-4, BDT205
  (David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011)
Software stack (top to bottom):

    MLBase (Declarative Machine Learning)
    BlinkDB (approx QP)
    Shark (SQL) + Streaming
    Spark / Streaming   (alongside 3rd-party frameworks: Hadoop MR, MPI, GraphLab, etc.)
    Shared RDDs (distributed memory)
    Mesos (cluster resource manager)
    HDFS

Legend: 3rd party / AMPLab (released) / AMPLab (in progress)
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Lightning-Fast Cluster Computing
    lines = spark.textFile("hdfs://...")            // Base RDD
    errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count      // Action
    cachedMsgs.filter(_.contains("bar")).count

[Diagram: the driver ships tasks to workers; each worker scans its input block (Blocks 1-3), keeps the filtered results in memory (Caches 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
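For reference, a minimal self-contained sketch of the same pattern as a standalone program. It assumes the Scala API of the era (spark.SparkContext, a local master) and a hypothetical log path; treat it as illustrative rather than exact:

    import spark.SparkContext

    object LogMining {
      def main(args: Array[String]) {
        // Local mode with 4 threads; on a cluster this would be a Mesos master URL.
        val sc = new SparkContext("local[4]", "LogMining")

        val lines = sc.textFile("hdfs://namenode/logs")      // hypothetical path
        val errors = lines.filter(_.startsWith("ERROR"))
        val cachedMsgs = errors.map(_.split('\t')(2)).cache()

        // The first count scans the input; later counts hit the in-memory cache.
        println(cachedMsgs.filter(_.contains("foo")).count)
        println(cachedMsgs.filter(_.contains("bar")).count)
      }
    }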
The same pipeline viewed as its lineage chain:

    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

    HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
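Worth making explicit (a sketch, not from the slide): transformations such as filter and map only record this lineage; no cluster work happens until an action runs, and a lost partition is rebuilt by reapplying the recorded functions rather than by restoring a replica:

    // Nothing is computed here: these calls just build the lineage graph
    // (HadoopRDD -> FilteredRDD -> MappedRDD).
    val messages = sc.textFile("hdfs://...")          // hypothetical path
                     .filter(_.contains("error"))
                     .map(_.split('\t')(2))

    // The action triggers a job; if a worker is lost mid-job, Spark re-runs
    // the recorded filter/map functions over the lost input block only.
    val sample = messages.take(10)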
[Diagram: 2-D data points with a random initial line converging toward the target separator.]

Logistic regression:

    // Load data in memory once
    val points = spark.textFile(...).map(readPoint).cache()

    // Initial parameter vector
    var w = Vector.random(D)

    // Repeated MapReduce steps to do gradient descent
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
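The snippet leaves readPoint, Vector, D, and ITERATIONS undefined. A self-contained sketch that fills those gaps with plain Scala arrays; the input format (tab-separated, label first) and all helper names are assumptions for illustration:

    // Assumed input: one point per line, "label\tx1\tx2\t...".
    case class Point(x: Array[Double], y: Double)

    def readPoint(line: String): Point = {
      val nums = line.split('\t').map(_.toDouble)
      Point(nums.tail, nums.head)
    }

    def dot(a: Array[Double], b: Array[Double]): Double =
      (a, b).zipped.map(_ * _).sum

    val ITERATIONS = 10
    val points = sc.textFile("hdfs://...").map(readPoint).cache()
    var w = Array.fill(points.first.x.length)(math.random)   // random initial weights

    for (i <- 1 to ITERATIONS) {
      val gradient = points.map { p =>
        // Per-point gradient of the logistic loss.
        val scale = (1 / (1 + math.exp(-p.y * dot(w, p.x))) - 1) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => (a, b).zipped.map(_ + _))           // sum the gradients
      w = (w, gradient).zipped.map(_ - _)                    // take a gradient step
    }

Because points is cached, only the first iteration reads from HDFS; every later iteration reuses the in-memory dataset, which is what produces the speedup on the next slide.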
[Chart: running time (min) vs. number of iterations (1-30). Hadoop: 110 s per iteration. Spark: first iteration 80 s, further iterations 1 s.]
Java API (out now):

    JavaRDD<String> lines = sc.textFile(...);

    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("error");
      }
    }).count();

PySpark (coming soon):

    lines = sc.textFile(...)
    lines.filter(lambda x: 'error' in x).count()
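For comparison, the same filter-and-count in the Scala API shown earlier is a one-liner; the closure literal replaces the anonymous inner class that the Java version needs:

    sc.textFile("hdfs://...").filter(_.contains("error")).count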
[Chart: time (hours): Hive 20; Spark 0.5.]
Hive architecture:

    Client: CLI, JDBC
    Driver: SQL Parser → Query Optimizer → Physical Plan Execution
    Meta store
    Execution: MapReduce
    Storage: HDFS
Shark keeps the same structure, adds a Cache Mgr. to the driver, and swaps MapReduce for Spark:

    Client: CLI, JDBC
    Driver: SQL Parser → Query Optimizer → Physical Plan Execution; Cache Mgr.
    Meta store
    Execution: Spark
    Storage: HDFS
    Row Storage          Column Storage

    1  john   4.1        1     2     3
    2  mike   3.5        john  mike  sally
    3  sally  6.4        4.1   3.5   6.4
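To make the layout difference concrete, a small sketch in plain Scala (the record type and field names are invented for illustration). Shark's in-memory store keeps cached tables column-oriented, one array per column:

    // Row storage: one object per record. Flexible, but every row is a
    // separate JVM object (header overhead, pointer chasing, GC pressure).
    case class Row(id: Int, name: String, gpa: Double)
    val rows = Array(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

    // Column storage: one array per column. Dense primitive arrays give
    // better cache locality and far fewer objects for the GC to track.
    val ids   = Array(1, 2, 3)
    val names = Array("john", "mike", "sally")
    val gpas  = Array(4.1, 3.5, 6.4)

    // A query that touches only one column reads just that array:
    val avgGpa = gpas.sum / gpas.length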
[Chart: Selection query runtime (s), y-axis 0-100, comparing Shark, Shark (disk), and Hive; the fastest bar is labeled 1.1. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Chart: Group By runtime (s), y-axis 0-600, comparing Shark, Shark (disk), and Hive; the fastest bar is labeled 32. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Chart: Join runtime (s), y-axis 0-1800, comparing Shark (copartitioned), Shark, Shark (disk), and Hive; the fastest bar is labeled 105. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).]
[Chart: three Conviva queries, runtime (s), comparing Shark, Shark (disk), and Hive; the fastest bars are labeled 0.8 (Query 1), 0.7 (Query 2), and 1.0 (Query 3). 100 m2.4xlarge nodes, 1.7 TB Conviva dataset.]
spark-project.org
amplab.cs.berkeley.edu

                         UC BERKELEY
We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.

Editor's Notes

  • #20 (RDD lineage slide): Add “variables” to the “functions” in functional programming
  • #22 (logistic regression slide): Note that the dataset is reused on each gradient computation
  • #23: Key idea: add “variables” to the “functions” in functional programming
  • #24: This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  • #30 (row vs. column storage slide): Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join