Spark RDD Operations
Spark and Scala
Spark RDD Operations
 Spark Transformations
  Transform/filter element operations
  Reduction-type operations
  Set-type operations
 Spark Actions
 Q & A
 Different Types of RDD
 Starting the Spark Shell
  in Local Mode
  in Standalone Mode
  in YARN Cluster Mode
 Writing a Spark Application in Scala and running it with spark-submit
  in Local Mode
  in Standalone Mode
  in YARN Cluster Mode
 Q & A
RDD Operations: Transformations
 Some common transformations:
 map: creates a new RDD by applying a function, passed as an argument, to each element of the collection.
 flatMap: like map, but the output is flattened. The function passed to flatMap returns a list of zero or more elements per input element, and the resulting RDD contains all of those elements in a single flattened collection.
 filter: creates a new RDD containing only the elements that satisfy the Boolean expression passed as an argument.
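These transformations have the same shape as the corresponding Scala collection methods, so their behaviour can be sketched on a plain List without a Spark cluster (on an RDD the calls look identical; the sample data here is made up for illustration):

```scala
val lines = List("spark makes", "rdds easy")

// map: exactly one output element per input element
val lengths = lines.map(_.length)
assert(lengths == List(11, 9))

// flatMap: each element yields a list, and the results are flattened
val words = lines.flatMap(_.split(" ").toList)
assert(words == List("spark", "makes", "rdds", "easy"))

// filter: keep only the elements satisfying the predicate
val short = words.filter(_.length <= 4)
assert(short == List("rdds", "easy"))
```

Note that map over two lines produced two lengths, while flatMap over the same two lines produced four words: that flattening step is the whole difference between the two.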
Spark Transformation
- map/flatMap
Spark Transformations
- filter/coalesce/repartition
Difference between repartition and coalesce
- coalesce:
 Used to reduce the number of partitions
 Tries to minimize data movement by avoiding a full network shuffle
 Can create unequal-sized partitions
- repartition:
 Used to increase or decrease the number of partitions
 Always triggers a network shuffle, which increases data movement
 Creates roughly equal-sized partitions
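The partition-size behaviour can be illustrated with a toy model that treats partitions as nested lists; this mimics only the resulting sizes, not Spark's actual shuffle machinery, and the values are made up for illustration:

```scala
// four input "partitions" of unequal size
val parts = List(List(1, 2, 3), List(4), List(5, 6), List(7, 8, 9, 10))

// coalesce(2): merges existing partitions in place without a full shuffle,
// so the result can still be unequal-sized
val coalesced = List(parts(0) ++ parts(1), parts(2) ++ parts(3))
assert(coalesced.map(_.size) == List(4, 6))

// repartition(2): redistributes all elements via a shuffle,
// producing roughly equal-sized partitions
val all = parts.flatten
val repartitioned = all.grouped((all.size + 1) / 2).toList
assert(repartitioned.map(_.size) == List(5, 5))
```

This is why coalesce is the cheaper choice when only shrinking the partition count, and repartition is needed when balance or a larger count matters.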
Spark Reduce Transformations
- reduce/fold/aggregate
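reduce and fold behave on an RDD as they do on a plain Scala collection, so a List shows the semantics; aggregate is sketched as a comment because it needs an RDD's partition structure to be interesting:

```scala
val nums = List(1, 2, 3, 4)

// reduce: combine elements pairwise with a binary function; no zero value
val total = nums.reduce(_ + _)
assert(total == 10)

// fold: like reduce, but starts from an explicit zero value
val totalFromZero = nums.fold(0)(_ + _)
assert(totalFromZero == 10)

// aggregate on an RDD folds within each partition with a seqOp, then merges
// the per-partition results with a combOp, e.g. computing (sum, count):
// rdd.aggregate((0, 0))((acc, n) => (acc._1 + n, acc._2 + 1),
//                       (a, b) => (a._1 + b._1, a._2 + b._2))
```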
Spark Actions
 - count
 - collect
 - first
 - take
 - foreach
 - saveAsTextFile
 - saveAsObjectFile
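Most actions have direct analogues on plain Scala collections, which is a convenient way to remember what each returns; the RDD method names are given in the comments (saveAsTextFile and saveAsObjectFile have no collection analogue, since they write out to storage):

```scala
val fruit = List("Mango", "Banana", "Apple", "Guava")

// count (RDD: fruit.count()) -- number of elements
assert(fruit.size == 4)

// collect (RDD: fruit.collect()) -- materialise the data as a local Array
assert(fruit.toArray.length == 4)

// first / take(n) -- the first element / the first n elements
assert(fruit.head == "Mango")
assert(fruit.take(2) == List("Mango", "Banana"))

// foreach -- run a side effect for each element (on an RDD this runs on
// the executors, not on the driver)
fruit.foreach(f => println(f))
```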
Spark Actions
- count/collect
Spark Actions
- first/take
Spark Actions
- saveAsTextFile/foreach
Spark Set Transformations
-union/intersection
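A detail worth remembering: RDD union keeps duplicates, while RDD intersection removes them. Scala's Seq methods show the same keep-duplicates behaviour for union (the sample values here are illustrative):

```scala
val a = List(1, 2, 3)
val b = List(3, 4)

// union concatenates and keeps duplicates (so does RDD.union;
// follow it with .distinct if set semantics are wanted)
assert(a.union(b) == List(1, 2, 3, 3, 4))

// intersect keeps only the elements common to both
assert(a.intersect(b) == List(3))
```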
Chaining Spark Transformations
 Transformations can be chained together
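Because every transformation returns a new RDD, calls compose into one expression; the same chain works on a plain List (sample data made up for illustration):

```scala
val lines = List("to be", "or not to be")

// flatMap -> filter -> map chained in a single expression
val result = lines
  .flatMap(_.split(" ").toList)  // split into words
  .filter(_ != "to")             // drop one word
  .map(_.toUpperCase)            // transform the rest
assert(result == List("BE", "OR", "NOT", "BE"))
```

On an RDD, nothing is computed while this chain is being built; the work happens only when an action is finally called.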
Chaining Spark Transformations (Python)
 Transformations on an RDD can be chained together in Python as well
Quiz
 What is the difference between map and flatMap?
 map: creates a new RDD by applying a function, passed as an argument, to each element of the collection.
 flatMap: like map, but the output is flattened. The function passed to flatMap returns a list of zero or more elements per input element, and the resulting RDD contains all of those elements in a single flattened collection.
Demo Hadoop & Big Data
Quiz
 What is the difference between coalesce and repartition?
 coalesce:
  Used to reduce the number of partitions
  Tries to minimize data movement by avoiding a full network shuffle
  Can create unequal-sized partitions
 repartition:
  Used to increase or decrease the number of partitions
  Always triggers a network shuffle, which increases data movement
  Creates roughly equal-sized partitions
Quiz
 Which of the options below are reduce operations?
 A. aggregate
 B. map
 C. flatMap
 D. fold
Spark RDD Operations
 Spark Transformations
  Transform/filter element operations
  Reduction-type operations
  Set-type operations
 Spark Actions
 Q & A
 Different Types of RDD
 Starting the Spark Shell
  in Local Mode
  in Standalone Mode
  in YARN Cluster Mode
 Writing a Spark Application in Scala and running it with spark-submit
  in Local Mode
  in Standalone Mode
  in YARN Cluster Mode
 Q & A
Different Types of RDDs
 HadoopRDD
 FilteredRDD
 MappedRDD
 PairRDD
 ShuffledRDD
 SchemaRDD
 CassandraRDD
 EsSpark (Elasticsearch)
Source Code of RDD
Run Spark App in Local Mode
 ./spark-shell --master local
 Spark starts in local mode; running jobs can be viewed in the web UI at localhost:4040
Local Mode HTTP Port
Run Spark App in YARN
 ./spark-shell --master yarn --deploy-mode client
 Starts the Spark Shell on YARN in client mode
Run Spark App in YARN Client Mode
Run Spark Shell in Yarn Mode(Cluster)
 The Spark Shell cannot be started in YARN cluster mode
Run Spark Shell in Standalone Mode
 Start the Spark Shell against a standalone master:
 ./spark-shell --master spark://quickstart.cloudera:7077
Write Simple Spark Program
 Create a Maven project for a simple Spark application
 Add the Spark dependencies in pom.xml
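A minimal sketch of the dependency block, assuming a Spark 1.x / Scala 2.10 build matching the Cloudera quickstart VM used later in the deck; the exact version numbers are illustrative, not taken from the original slides:

```xml
<dependencies>
  <!-- spark-core; match the artifact suffix to your Scala version -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
    <!-- provided: the cluster supplies Spark at runtime for spark-submit -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```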
Write Simple Spark Program
 package com.zoomskills.spark
 import org.apache.spark._
 object FirstSpark {
 def main(arg:Array[String]){
 val sconfig= new SparkConf().setAppName("First Spark")
 val sc=new SparkContext(sconfig)
 val rdd=sc.parallelize(List("Mango","Banana","Apple","Guava"))
 if (arg.length !=1)
 {
 return
 }

 //Collect will return Array of String in this case
 rdd.saveAsTextFile(arg(0))
 }
 }
Spark and Scala
Spark-Submit in local mode
 /usr/lib/spark/bin/spark-submit --master local \
   --class com.zoomskills.spark.FirstSpark \
   /home/cloudera/applications/spark/firstscalaprogram/spark-0.0.1-SNAPSHOT.jar \
   /application/spark/firstprogram/output
Spark-Submit in Yarn mode(Client)
 /usr/lib/spark/bin/spark-submit --master yarn-client \
   --class com.zoomskills.spark.FirstSpark \
   /home/cloudera/applications/spark/firstscalaprogram/spark-0.0.1-SNAPSHOT.jar \
   /application/spark/firstprogram/output
Spark-Submit in Yarn mode(Cluster)
 /usr/lib/spark/bin/spark-submit --master yarn-cluster \
   --class com.zoomskills.spark.FirstSpark \
   /home/cloudera/applications/spark/firstscalaprogram/spark-0.0.1-SNAPSHOT.jar \
   /application/spark/firstprogram/output
Spark-Submit in Standalone mode
 /usr/lib/spark/bin/spark-submit --master spark://quickstart.cloudera:7077 \
   --class com.zoomskills.spark.FirstSpark \
   /home/cloudera/applications/spark/firstscalaprogram/spark-0.0.1-SNAPSHOT.jar \
   /application/spark/firstprogram/output
Quiz
 Can we start the Spark Shell in YARN cluster mode?
 TRUE
 FALSE
Quiz
 How can a Spark application be run?
 Use Spark Shell
 Use spark-submit
Quiz
 In which of the following modes can a Spark job run?
 Local Mode
 Server Mode
 Cluster
 Standalone mode
Quiz
 Spark clusters support running jobs in which two deployment modes?
 Client Mode
 Cluster Mode
 Standalone Mode
 Server Mode
Q & A
