Spark RDD Operations
Spark and Scala
Spark RDD Operations
 Spark Transformations
  Transform/filter element operations
  Reduction-type operations
  Set-type operations
 Spark Actions
 Q & A
 Different Types of RDD
 Starting the Spark Shell
  in Local Mode
  in Standalone Mode
  in YARN Cluster Mode
 Writing a Spark Application in Scala and running it with spark-submit
  in Local Mode
  in Standalone Mode
  in YARN Cluster Mode
 Q & A
RDD Operations: Transformations
 Some common transformations:
 map: creates a new RDD by applying a function, passed as an argument, to each element of the collection.
 flatMap: like map, but the output is flattened. The function passed to flatMap returns a list of zero or more elements per input element, and the resulting RDD contains all of those elements in a single flattened collection.
 filter: creates a new RDD containing only the elements that satisfy the Boolean expression passed as an argument.
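These transformations have the same shape as the corresponding Scala collection methods, so their behaviour can be sketched on a plain List without a Spark cluster (on an RDD the calls look identical; the sample data here is made up for illustration):

```scala
val lines = List("spark makes", "rdds easy")

// map: exactly one output element per input element
val lengths = lines.map(_.length)
assert(lengths == List(11, 9))

// flatMap: each element yields a list, and the results are flattened
val words = lines.flatMap(_.split(" ").toList)
assert(words == List("spark", "makes", "rdds", "easy"))

// filter: keep only the elements satisfying the predicate
val short = words.filter(_.length <= 4)
assert(short == List("rdds", "easy"))
```

Note that map over two lines produced two lengths, while flatMap over the same two lines produced four words: that flattening step is the whole difference between the two.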
Spark Transformation
- map/flatMap
Spark Transformations
- filter/coalesce/repartition
Difference between repartition and coalesce
- coalesce:
 Used to reduce the number of partitions
 Tries to minimize data movement by avoiding a full network shuffle
 Can create unequal-sized partitions
- repartition:
 Used to increase or decrease the number of partitions
 Always triggers a network shuffle, which increases data movement
 Creates roughly equal-sized partitions
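The partition-size behaviour can be illustrated with a toy model that treats partitions as nested lists; this mimics only the resulting sizes, not Spark's actual shuffle machinery, and the values are made up for illustration:

```scala
// four input "partitions" of unequal size
val parts = List(List(1, 2, 3), List(4), List(5, 6), List(7, 8, 9, 10))

// coalesce(2): merges existing partitions in place without a full shuffle,
// so the result can still be unequal-sized
val coalesced = List(parts(0) ++ parts(1), parts(2) ++ parts(3))
assert(coalesced.map(_.size) == List(4, 6))

// repartition(2): redistributes all elements via a shuffle,
// producing roughly equal-sized partitions
val all = parts.flatten
val repartitioned = all.grouped((all.size + 1) / 2).toList
assert(repartitioned.map(_.size) == List(5, 5))
```

This is why coalesce is the cheaper choice when only shrinking the partition count, and repartition is needed when balance or a larger count matters.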
Spark Reduce Transformations
- reduce/fold/aggregate
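reduce and fold behave on an RDD as they do on a plain Scala collection, so a List shows the semantics; aggregate is sketched as a comment because it needs an RDD's partition structure to be interesting:

```scala
val nums = List(1, 2, 3, 4)

// reduce: combine elements pairwise with a binary function; no zero value
val total = nums.reduce(_ + _)
assert(total == 10)

// fold: like reduce, but starts from an explicit zero value
val totalFromZero = nums.fold(0)(_ + _)
assert(totalFromZero == 10)

// aggregate on an RDD folds within each partition with a seqOp, then merges
// the per-partition results with a combOp, e.g. computing (sum, count):
// rdd.aggregate((0, 0))((acc, n) => (acc._1 + n, acc._2 + 1),
//                       (a, b) => (a._1 + b._1, a._2 + b._2))
```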
Spark Actions
 - count
 - collect
 - first
 - take
 - foreach
 - saveAsTextFile
 - saveAsObjectFile
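Most actions have direct analogues on plain Scala collections, which is a convenient way to remember what each returns; the RDD method names are given in the comments (saveAsTextFile and saveAsObjectFile have no collection analogue, since they write out to storage):

```scala
val fruit = List("Mango", "Banana", "Apple", "Guava")

// count (RDD: fruit.count()) -- number of elements
assert(fruit.size == 4)

// collect (RDD: fruit.collect()) -- materialise the data as a local Array
assert(fruit.toArray.length == 4)

// first / take(n) -- the first element / the first n elements
assert(fruit.head == "Mango")
assert(fruit.take(2) == List("Mango", "Banana"))

// foreach -- run a side effect for each element (on an RDD this runs on
// the executors, not on the driver)
fruit.foreach(f => println(f))
```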
Spark Actions
- count/collect
Spark Actions
- first/take
Spark Actions
- saveAsTextFile/foreach
Spark Set Transformations
-union/intersection
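A detail worth remembering: RDD union keeps duplicates, while RDD intersection removes them. Scala's Seq methods show the same keep-duplicates behaviour for union (the sample values here are illustrative):

```scala
val a = List(1, 2, 3)
val b = List(3, 4)

// union concatenates and keeps duplicates (so does RDD.union;
// follow it with .distinct if set semantics are wanted)
assert(a.union(b) == List(1, 2, 3, 3, 4))

// intersect keeps only the elements common to both
assert(a.intersect(b) == List(3))
```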
Chaining Spark Transformations
 Transformations can be chained together
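Because every transformation returns a new RDD, calls compose into one expression; the same chain works on a plain List (sample data made up for illustration):

```scala
val lines = List("to be", "or not to be")

// flatMap -> filter -> map chained in a single expression
val result = lines
  .flatMap(_.split(" ").toList)  // split into words
  .filter(_ != "to")             // drop one word
  .map(_.toUpperCase)            // transform the rest
assert(result == List("BE", "OR", "NOT", "BE"))
```

On an RDD, nothing is computed while this chain is being built; the work happens only when an action is finally called.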
Chaining Spark Transformations (Python)
 Transformations on an RDD can be chained together in Python as well
Quiz
 What is the difference between map and flatMap?
 map: creates a new RDD by applying a function, passed as an argument, to each element of the collection.
 flatMap: like map, but the output is flattened. The function passed to flatMap returns a list of zero or more elements per input element, and the resulting RDD contains all of those elements in a single flattened collection.
Demo Hadoop & Big Data
Quiz
 What is the difference between coalesce and repartition?
 coalesce:
  Used to reduce the number of partitions
  Tries to minimize data movement by avoiding a full network shuffle
  Can create unequal-sized partitions
 repartition:
  Used to increase or decrease the number of partitions
  Always triggers a network shuffle, which increases data movement
  Creates roughly equal-sized partitions
Quiz
 Which of the options below are reduce operations?
 A. aggregate
 B. map
 C. flatMap
 D. fold
Spark RDD Operations
 Spark Transformations
  Transform/filter element operations
  Reduction-type operations
  Set-type operations
 Spark Actions
 Q & A
 Different Types of RDD
 Starting the Spark Shell
  in Local Mode
  in Standalone Mode
  in YARN Cluster Mode
 Writing a Spark Application in Scala and running it with spark-submit
  in Local Mode
  in Standalone Mode
  in YARN Cluster Mode
 Q & A
Different Types of RDDs
 HadoopRDD
 FilteredRDD
 MappedRDD
 PairRDD
 ShuffledRDD
 SchemaRDD
 CassandraRDD
 EsSpark (Elasticsearch)
Source Code of RDD
Run Spark App in Local Mode
 ./spark-shell --master local
 Spark starts in local mode; running jobs can be viewed in the web UI at localhost:4040
Local Mode HTTP Port
Run Spark App in YARN
 ./spark-shell --master yarn --deploy-mode client
 Starts the Spark Shell on YARN in client mode
Run Spark App in YARN Client Mode
Run Spark Shell in Yarn Mode(Cluster)
 The Spark Shell cannot be started in YARN cluster mode
Run Spark Shell in Standalone Mode
 Start the Spark Shell against a standalone master:
 ./spark-shell --master spark://quickstart.cloudera:7077
Write Simple Spark Program
 Create a Maven project for a simple Spark application
 Add the Spark dependencies in pom.xml
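A minimal sketch of the dependency block, assuming a Spark 1.x / Scala 2.10 build matching the Cloudera quickstart VM used later in the deck; the exact version numbers are illustrative, not taken from the original slides:

```xml
<dependencies>
  <!-- spark-core; match the artifact suffix to your Scala version -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
    <!-- provided: the cluster supplies Spark at runtime for spark-submit -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```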
Write Simple Spark Program
 package com.zoomskills.spark
 import org.apache.spark._
 object FirstSpark {
 def main(arg:Array[String]){
 val sconfig= new SparkConf().setAppName("First Spark")
 val sc=new SparkContext(sconfig)
 val rdd=sc.parallelize(List("Mango","Banana","Apple","Guava"))
 if (arg.length !=1)
 {
 return
 }

 //Collect will return Array of String in this case
 rdd.saveAsTextFile(arg(0))
 }
 }
Spark and Scala
Spark-Submit in local mode
 /usr/lib/spark/bin/spark-submit --master local \
   --class com.zoomskills.spark.FirstSpark \
   /home/cloudera/applications/spark/firstscalaprogram/spark-0.0.1-SNAPSHOT.jar \
   /application/spark/firstprogram/output
Spark-Submit in Yarn mode(Client)
 /usr/lib/spark/bin/spark-submit --master yarn-client \
   --class com.zoomskills.spark.FirstSpark \
   /home/cloudera/applications/spark/firstscalaprogram/spark-0.0.1-SNAPSHOT.jar \
   /application/spark/firstprogram/output
Spark-Submit in Yarn mode(Cluster)
 /usr/lib/spark/bin/spark-submit --master yarn-cluster \
   --class com.zoomskills.spark.FirstSpark \
   /home/cloudera/applications/spark/firstscalaprogram/spark-0.0.1-SNAPSHOT.jar \
   /application/spark/firstprogram/output
Spark-Submit in Standalone mode
 /usr/lib/spark/bin/spark-submit --master spark://quickstart.cloudera:7077 \
   --class com.zoomskills.spark.FirstSpark \
   /home/cloudera/applications/spark/firstscalaprogram/spark-0.0.1-SNAPSHOT.jar \
   /application/spark/firstprogram/output
Quiz
 Can we start the Spark Shell in YARN cluster mode?
 TRUE
 FALSE
Quiz
 How can a Spark application be run?
 Use Spark Shell
 Use spark-submit
Quiz
 In which of the following modes can a Spark job run?
 Local Mode
 Server Mode
 Cluster
 Standalone mode
Quiz
 Spark clusters support running jobs in which two deployment modes?
 Client Mode
 Cluster Mode
 Standalone Mode
 Server Mode
Q & A
