Scoobi - Scala for Startups

Ben Lever
@bmlever
Me

• Machine learning, software systems, computer vision, optimisation,
  networks, control and signal processing
• Haskell DSL for development of computer vision algorithms targeting GPUs
• Predictive analytics for the enterprise
Hadoop app development – wish list

• Quick dev cycles
• Expressive
• Reusability
• Type safety
• Reliability
Bridging the “tooling” gap

[Diagram: Scoobi sits on top of Hadoop MapReduce’s Java APIs.
 Implementation side: the DList and DObject abstractions for building
 MapReduce pipelines. Testing side: ScalaCheck integration.]
At a glance
•   Scoobi = Scala for Hadoop
•   Inspired by Google’s FlumeJava
•   Developed at NICTA
•   Open-sourced Oct 2011
•   Apache V2
Hadoop MapReduce – word count

[Diagram: four input splits (323–326) containing words such as “hello”,
 “cat”, and “fire” flow through the stages below.]

(k1, v1) → [(k2, v2)]       Mappers emit a (word, 1) pair per word,
                            e.g. (cat, 1), (hello, 1), (fire, 1)

[(k2, v2)] → [(k2, [v2])]   Sort and shuffle: aggregate values by key,
                            e.g. (hello, [1, 1]), (cat, [1, 1, 1]), (fire, [1, 1])

(k2, [v2]) → [(k3, v3)]     Reducers sum the counts per key:
                            (hello, 2), (cat, 3), (fire, 2)
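To make the three stages concrete, here is a minimal sketch that models the
same flow with plain Scala collections. It is illustrative only – nothing in
it is a Hadoop or Scoobi API.

// Plain-Scala model of map / shuffle / reduce; all names are illustrative.
object WordCountModel {

  // (k1, v1) -> [(k2, v2)]: a mapper emits a (word, 1) pair per word
  def mapper(line: String): Seq[(String, Int)] =
    line.split(" ").toSeq.map(word => (word, 1))

  // [(k2, v2)] -> [(k2, [v2])]: sort and shuffle aggregates values by key
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // (k2, [v2]) -> [(k3, v3)]: a reducer sums the counts for one key
  def reducer(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val lines = Seq("hello cat", "cat", "hello fire", "fire cat")
    val result = shuffle(lines.flatMap(mapper)).map((reducer _).tupled)
    println(result) // Map(hello -> 2, cat -> 3, fire -> 2)
  }
}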
Java style
public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Source: http://wiki.apache.org/hadoop/WordCount
DList abstraction

[Diagram: a Distributed List (DList) abstracts data on HDFS; each
 transformation produces a new DList.]

DList type                          Abstraction for
DList[String]                       Lines of text files
DList[(Int, String, Boolean)]       CSV files of the form “37,Joe,M”
DList[(Float, Map[String, Int])]    Avro files with schema: {record {float, map}}
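As an illustration of the table, here is a sketch of loading each shape.
`fromTextFile` appears later in this deck; `fromDelimitedTextFile` and
`fromAvroFile` are assumed loader names and may differ between Scoobi versions.

import com.nicta.scoobi.Scoobi._

// Sketch only: fromDelimitedTextFile and fromAvroFile are assumed names.
object Loaders extends ScoobiApp {
  def run() {
    // Lines of text files
    val lines: DList[String] = fromTextFile("in/lines.txt")

    // CSV files of the form "37,Joe,M"
    val people: DList[(Int, String, Boolean)] =
      fromDelimitedTextFile("in/people.csv", ",")

    // Avro files whose records pair a float with a string-to-int map
    val stats: DList[(Float, Map[String, Int])] =
      fromAvroFile("in/stats.avro")

    persist(toTextFile(lines, "out"))
  }
}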
Scoobi style
importcom.nicta.scoobi.Scoobi._

// Count the frequency of words from corpus of documents
objectWordCountextendsScoobiApp {
def run() {
vallines: DList[String] = fromTextFile(args(0))

valfreqs: DList[(String, Int)] =
lines.flatMap(_.split(" ")) // DList[String]
             .map(w=> (w, 1))      // DList[(String, Int)]
             .groupByKey// DList[(String, Iterable[Int])]
             .combine(_+_)          // DList[(String, Int)]

persist(toTextFile(freqs, args(1)))
  }
}
DList trait
traitDList[A] {
/* Abstract methods */
def parallelDo[B](dofn: DoFn[A, B]): DList[B]

def ++(that: DList[A]): DList[A]

def groupByKey[K, V]
    (implicit A <:< (K, V)): DList[(K, Iterable[V])]

def combine[K, V]
    (f: (V, V) => V)
    (implicit A <:< (K, Iterable[V])): DList[(K, V)]

/* All other methods are derived, e.g. „map‟ */
}
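As a sketch of how the derived methods can be built from `parallelDo` alone,
suppose DoFn and Emitter have the shapes below (an assumption – the real
definitions may differ between Scoobi versions):

// Assumed shapes for Emitter and DoFn -- a sketch, not Scoobi's actual API.
trait Emitter[B] {
  def emit(value: B): Unit
}

trait DoFn[A, B] {
  def setup(): Unit
  def process(input: A, emitter: Emitter[B]): Unit
  def cleanup(emitter: Emitter[B]): Unit
}

object DerivedMethods {
  // 'map' in terms of parallelDo: emit exactly one output per input
  def map[A, B](list: DList[A])(f: A => B): DList[B] =
    list.parallelDo(new DoFn[A, B] {
      def setup(): Unit = {}
      def process(input: A, emitter: Emitter[B]): Unit = emitter.emit(f(input))
      def cleanup(emitter: Emitter[B]): Unit = {}
    })
}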
Under the hood
fromTextFile               LD

                   lines           HDFS
    flatMap                PD

                  words

        map                PD    MapReduce Job


                  word1

 groupByKey                GBK

                  wordG            HDFS

    combine                CV

                   freq
    persist
Removing less than the average
importcom.nicta.scoobi.Scoobi._

// Remove all integers that are less than the average integer
objectBetterThanAverageextendsScoobiApp {
def run() {
valints: DList[Int] =
fromTextFile(args(0)) collect { case AnInt(i) =>i }

valtotal: DObject[Int] = ints.sum
valcount: DObject[Int] = ints.size

valaverage: DObject[Int] =
      (total, count) map { case (t, c) =>t / c }

valbigger: DList[Int] =
      (average join ints) filter { case (a, i) =>i> a }

persist(toTextFile(bigger, args(1)))
  }
}
Under the hood

[Diagram: the pipeline compiles into two MapReduce jobs with a client
 computation in between. Job 1 computes `total` and `count` from `ints`
 (an LD node, then parallel PD → GBK → CV → M branches), writing to HDFS
 and the distributed cache (DCache). A client-side operation (OP) combines
 them into `average`. Job 2 (PD nodes) joins `average` with `ints` to
 produce `bigger`, written back to HDFS.]
DObject abstraction

[Diagram: mapping over DObjects is a client-side computation (backed by
 HDFS and the distributed cache); joining a DObject[A] with a DList[B]
 yields a DList[(A, B)], backed by HDFS plus the distributed cache.]

trait DObject[A] {
  def map[B](f: A => B): DObject[B]
  def join[B](list: DList[B]): DList[(A, B)]
}
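A small usage sketch combining the two operations, built only from DObject
operations already shown in this deck; `AnInt` is the extractor used in the
earlier example, and the names are illustrative.

import com.nicta.scoobi.Scoobi._

// Illustrative sketch; AnInt is the extractor used earlier in the deck.
object MeanDeltas extends ScoobiApp {
  def run() {
    val xs: DList[Int] = fromTextFile(args(0)) collect { case AnInt(i) => i }

    // map: a client-side computation over DObjects
    val mean: DObject[Int] = (xs.sum, xs.size) map { case (s, n) => s / n }

    // join: ships the DObject (via the distributed cache) alongside the DList
    val deltas: DList[Int] = (mean join xs) map { case (m, x) => x - m }

    persist(toTextFile(deltas, args(1)))
  }
}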
Mirroring the Scala Collection API

DList => DList     DList => DObject
flatMap            reduce
map                product
filter             sum
filterNot          length
groupBy            size
partition          count
flatten            max
distinct           maxBy
++                 min
keys, values       minBy
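For example (illustrative only, with `lines` being the DList[String] from the
“Scoobi style” slide), a few of the mirrored combinators chain together just
as they would on an ordinary Scala collection:

// `lines` is assumed to be the DList[String] from the word-count example.
val words: DList[String] = lines.flatMap(_.split(" "))

val longWords: DList[String] = words.filter(_.length > 5).distinct

val longest: DObject[String] = words.maxBy(_.length)
val howMany: DObject[Int]    = words.size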
Building abstractions

Functional programming

• Functions as procedures
• Functions as parameters

=> Composability + Reusability
Composing
// Compute the average of a DList of “numbers”
def average[A : Numeric](in: DList[A]): DObject[A] =
  (in.sum, in.size) map { case (sum, size) => sum / size }

// Compute a histogram
def histogram[A](in: DList[A]): DList[(A, Int)] =
  in.map(x => (x, 1)).groupByKey.combine(_+_)

// Throw away words with less-than-average frequency
def betterThanAvgWords(lines: DList[String]): DList[String] = {
  val words = lines.flatMap(_.split(" "))
  val wordCnts = histogram(words)
  val avgFreq = average(wordCnts.values)
  (avgFreq join wordCnts) collect { case (avg, (w, f)) if f > avg => w }
}
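A hypothetical driver wiring the composed functions together, following the
ScoobiApp pattern from the earlier slides:

// Hypothetical driver; paths come from the command line as in earlier slides.
object BetterWords extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))
    persist(toTextFile(betterThanAvgWords(lines), args(1)))
  }
}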
Unit-testing ‘histogram’
// Specification for the histogram function
class HistogramSpec extends HadoopSpecification {

  "Histogram from DList" >> {

    // ScalaCheck property
    "Sum of bins must equal size of DList" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val hist = histogram(list.toDList)
        val binSum = persist(hist.values.sum)
        binSum == list.size
      }
    }

    "Number of bins must equal number of unique values" >> { implicit c: SC =>
      Prop.forAll { list: List[Int] =>
        val input = list.toDList
        val bins = histogram(input).keys.size
        val uniques = input.distinct.size
        val (b, u) = persist(bins, uniques)
        b == u
      }
    }
  }
}
sbt integration
> test-only *Histogram* -- exclude cluster
[info] HistogramSpec
[info]
[info] Histogram from DList
[info] + Sum of bins must equal size of DList
[info] No cluster execution time
[info] + Number of bins must equal number of unique values
[info] No cluster execution time
[info]
[info] Total for specification BoundedFilterSpec
[info] Finished in 12 seconds, 600 ms
[info] 2 examples, 4 expectations, 0 failure, 0 error
[info]
[info] Passed: : Total 2, Failed 0, Errors 0, Passed 2, Skipped 0
>
> test-only *Histogram*
> test-only *Histogram* -- scoobi verbose
> test-only *Histogram* -- scoobi verbose.warning

Note: dependent JARs are copied (once) to a directory on the cluster
(~/libjars by default).
Other features
• Grouping:
  – API for controlling Hadoop’s sort-and-shuffle
  – Useful for implementing secondary sorting
• Join and Co-group helper methods
• Matrix multiplication utilities
• I/O:
  – Text, sequence, Avro
  – Roll your own
Want to know more?
• http://nicta.github.com/scoobi
• Mailing lists:
  – http://groups.google.com/group/scoobi-users
  – http://groups.google.com/group/scoobi-dev
• Twitter:
  – @bmlever
  – @etorreborre
• Meet me:
  – Will also be at Hadoop Summit (June 13-14)
  – Keen to get feedback
