Basics of RDD
What is RDD?
RDDs - Resilient Distributed Datasets
Resilient: Recovers on failure
Distributed: Spread across multiple machines
Dataset: A collection of data elements, e.g. arrays, tables, data frames (R), collections in MongoDB
Basics of RDD
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
A collection of elements partitioned across the cluster
[Diagram: the partitions are spread across Machine 1, Machine 2, Machine 3 and Machine 4.]
Basics of RDD
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
Resilient Distributed Dataset (RDD)
[Diagram: a driver application coordinates Spark applications on Nodes 1-4; the RDD's four partitions hold the elements 1-4, 5-8, 9-12 and 13-15.]
Basics of RDD
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
A collection of elements partitioned across the cluster
• An immutable distributed collection of objects
• Split into partitions, which may live on multiple nodes
• Can contain any data type:
○ Python,
○ Java,
○ Scala objects,
○ including user-defined classes
Basics of RDD
SPARK - CONCEPTS - RESILIENT DISTRIBUTED DATASET
• An RDD can be persisted in memory (see the caching sketch below)
• RDDs auto-recover from node failures
• Can hold any data type; there is also a special dataset type for key-value pairs
• Supports two types of operations:
○ Transformation
○ Action
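A hedged sketch of in-memory persistence (our own example, not from the slides): cache() marks an RDD to be kept in memory once it has been computed.

var nums = sc.parallelize(1 to 10000)
nums.cache()   // lazy, like a transformation - nothing is stored yet
nums.count()   // the first action computes the partitions and caches them
nums.count()   // later actions reuse the cached partitions instead of recomputing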
Basics of RDD
Creating RDD - Scala
Method 1: By directly loading a file from remote storage
>> var lines = sc.textFile("/data/mr/wordcount/input/big.txt")
Method 2: By distributing an existing object
>> val arr = 1 to 10000
>> var nums = sc.parallelize(arr)
Basics of RDD
WordCount - Scala
var linesRdd = sc.textFile("/data/mr/wordcount/input/big.txt")
var words = linesRdd.flatMap(x => x.split(" "))
var wordsKv = words.map(x => (x, 1))
//def myfunc(x:Int, y:Int): Int = x + y
var output = wordsKv.reduceByKey(_ + _)
output.take(10)
or
output.saveAsTextFile("my_result")
Basics of RDD
RDD Operations
Two Kinds of Operations
Transformation Action
Basics of RDD
RDD - Operations : Transformation
[Diagram: transformations turn Resilient Distributed Dataset 1 (RDD) into Resilient Distributed Dataset 2 (RDD).]
• Transformations are operations on RDDs
• that return a new RDD
• such as map() and filter()
Basics of RDD
Map Transformation
➢ Map is a transformation
➢ that runs the provided function against each element of the RDD
➢ and creates a new RDD from the results of executing the function
Basics of RDD
Map Transformation - Scala
➢ val arr = 1 to 10000
➢ val nums = sc.parallelize(arr)
➢ def multiplyByTwo(x:Int):Int = x*2
➢ multiplyByTwo(5)
10
➢ var dbls = nums.map(multiplyByTwo);
➢ dbls.take(5)
[2, 4, 6, 8, 10]
Basics of RDD
Transformations - filter() - Scala
[Diagram: isEven() is applied to every element of nums (1 2 3 4 5 6 7 ...); evens keeps only 2, 4, 6, ...]
➢ var arr = 1 to 1000
➢ var nums = sc.parallelize(arr)
➢ def isEven(x:Int):Boolean = x%2 == 0
➢ var evens = nums.filter(isEven)
➢ evens.take(3)
➢ [2, 4, 6]
Basics of RDD
RDD - Operations : Actions
• Cause the full execution of transformations
• Involve both the Spark driver and the nodes
• Example - take(): brings the data back to the driver
Basics of RDD
Action Example - take()
➢ val arr = 1 to 1000000
➢ val nums = sc.parallelize(arr)
➢ def multiplyByTwo(x:Int):Int = x*2
➢ var dbls = nums.map(multiplyByTwo);
➢ dbls.take(5)
➢ [2, 4, 6, 8, 10]
Basics of RDD
Action Example - saveAsTextFile()
To save the results in HDFS or any other file system,
call saveAsTextFile(directoryName).
It creates the directory
and saves the results inside it.
If the directory already exists, it throws an error.
Basics of RDD
Action Example - saveAsTextFile()
val arr = 1 to 1000
val nums = sc.parallelize(arr)
def multiplyByTwo(x:Int):Int = x*2
var dbls = nums.map(multiplyByTwo);
dbls.saveAsTextFile("mydirectory")
Check the HDFS home directory.
Basics of RDD
RDD Operations

          | Transformation | Action
Examples  | map()          | take()
Returns   | Another RDD    | A local value
Executes  | Lazily         | Immediately; executes the pending transformations
Basics of RDD
Lazy Evaluation Example - The waiter takes orders patiently
Customer 1: "Cheese burger, soup and a plate of noodles please."
Customer 2: "Soup and a plate of noodles for me."
Waiter: "Ok. One cheese burger, two soups, two plates of noodles. Anything else, sir?"
The chef is able to optimize because multiple orders are clubbed together.
Basics of RDD
Instant Evaluation
Customer: "Cheese burger..."
Waiter: "Let me get a cheese burger for you. I'll be right back!"
Customer: "And soup?"
The soup order will be taken only once the waiter is back.
Basics of RDD
Instant Evaluation
The usual programming languages have instant evaluation. As soon as you type:
var x = 2 + 10
it doesn't wait. It immediately evaluates.
Basics of RDD
Actions: Lazy Evaluation
1. Every time we call an action, the entire RDD must be computed from scratch.
2. Every time d gets executed, a, b and c would be run:
a. lines = sc.textFile("myfile");
b. fewlines = lines.filter(...)
c. uppercaselines = fewlines.map(...)
d. uppercaselines.count()
3. When we call a transformation, it is not evaluated immediately.
4. This helps Spark optimize the performance.
5. Similar to Pig, TensorFlow etc.
6. Instead of thinking of an RDD as a dataset, think of it as instructions on how to compute the data.
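A minimal sketch of this laziness (the path below is hypothetical): a transformation such as textFile() returns instantly, and even a nonexistent input fails only when an action forces evaluation.

var missing = sc.textFile("/no/such/path")   // returns immediately - nothing is read yet
// missing.count()   // the "path does not exist" error surfaces only here, at the action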
Basics of RDD
Actions: Lazy Evaluation - Optimization - Scala
Two separate map() calls:
def Map1(x:String):String = x.trim();
def Map2(x:String):String = x.toUpperCase();
var lines = sc.textFile(...)
var lines1 = lines.map(Map1);
var lines2 = lines1.map(Map2);
lines2.collect()
The equivalent single map():
def Map3(x:String):String = {
  var y = x.trim();
  return y.toUpperCase();
}
lines = sc.textFile(...)
lines2 = lines.map(Map3);
lines2.collect()
Basics of RDD
Lineage Graph
Spark Code:
lines = sc.textFile("myfile");
fewlines = lines.filter(...)
uppercaselines = fewlines.map(...)
uppercaselines.count()
lowercaselines = fewlines.map(...)
Lineage Graph:
[Diagram: HDFS input split → sc.textFile → lines → filter → fewlines → map → uppercaselines; a second map branches from fewlines to lowercaselines.]
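Not on the slide, but a handy way to inspect this lineage from the shell: every RDD exposes toDebugString, which prints the chain of parent RDDs (the exact output shape varies by Spark version).

println(uppercaselines.toDebugString)
// (2) MapPartitionsRDD[2] at map ...
//  |  MapPartitionsRDD[1] at filter ...
//  |  myfile ... HadoopRDD[0] at textFile ...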
Basics of RDD
Transformations:: flatMap() - Scala
To convert one record of an RDD into multiple records.
Basics of RDD
Transformations:: flatMap() - Scala
➢ var linesRDD = sc.parallelize( Array("this is a dog", "named jerry"))
➢ def toWords(line:String):Array[String]= line.split(" ")
➢ var wordsRDD = linesRDD.flatMap(toWords)
➢ wordsRDD.collect()
➢ ['this', 'is', 'a', 'dog', 'named', 'jerry']
[Diagram: linesRDD holds "this is a dog" and "named jerry"; toWords() splits each line, so wordsRDD holds the six individual words.]
Basics of RDD
How is it different from map()?
● In the case of map(), the resulting RDD and the input RDD have the same number of elements.
● map() can only convert one to one, while flatMap() can convert one to many.
Basics of RDD
What would happen if map() is used
➢ var linesRDD = sc.parallelize( Array("this is a dog", "named jerry"))
➢ def toWords(line:String):Array[String]= line.split(" ")
➢ var wordsRDD1 = linesRDD.map(toWords)
➢ wordsRDD1.collect()
➢ [['this', 'is', 'a', 'dog'], ['named', 'jerry']]
[Diagram: linesRDD holds "this is a dog" and "named jerry"; toWords() turns each line into an array, so wordsRDD1 holds the two arrays ['this', 'is', 'a', 'dog'] and ['named', 'jerry'].]
Basics of RDD
FlatMap
● Very similar to Hadoop's Map()
● Can give out 0 or more records
Basics of RDD
FlatMap
● Can emulate map as well as filter
● Can produce many values as well as no value (an empty array as output)
○ If it gives out a single value, it behaves like map().
○ If it gives out an empty array, it behaves like filter().
Basics of RDD
flatMap as map
➢ val arr = 1 to 10000
➢ val nums = sc.parallelize(arr)
➢ def multiplyByTwo(x:Int) = Array(x*2)
➢ multiplyByTwo(5)
Array(10)
➢ var dbls = nums.flatMap(multiplyByTwo);
➢ dbls.take(5)
[2, 4, 6, 8, 10]
Basics of RDD
flatMap as filter
➢ var arr = 1 to 1000
➢ var nums = sc.parallelize(arr)
➢ def isEven(x:Int):Array[Int] = {
➢   if(x%2 == 0) Array(x)
➢   else Array()
➢ }
➢ var evens = nums.flatMap(isEven)
➢ evens.take(3)
➢ [2, 4, 6]
Basics of RDD
Transformations:: union()
➢ var a = sc.parallelize(Array('1','2','3'));
➢ var b = sc.parallelize(Array('A','B','C'));
➢ var c = a.union(b)
➢ Note: union() doesn't remove duplicates
➢ c.collect();
['1', '2', '3', 'A', 'B', 'C']
[Diagram: ['1', '2', '3'] union ['A', 'B', 'C'] → ['1', '2', '3', 'A', 'B', 'C']]
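Since union() keeps duplicates, a small hedged aside (our own, not from the slides): when duplicates are unwanted, RDDs also provide distinct(), though it shuffles data and is therefore more expensive.

➢ var dup = a.union(a)        // '1','2','3','1','2','3'
➢ dup.distinct().collect()    // ['1', '2', '3'] - order not guaranteed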
Basics of RDD
Transformations:: union()
RDD lineage graph created during log analysis
[Diagram: InputRDD is filtered twice - into errorsRDD and warningsRDD - and their union forms badlinesRDD.]
Basics of RDD
Actions: saveAsTextFile() - Scala
Saves all the elements into HDFS as text files.
➢ var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7));
➢ a.saveAsTextFile("myresult");
➢ // Check the HDFS.
➢ // There should be a myresult folder in your home directory.
Basics of RDD
Actions: collect() - Scala
➢ var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7));
➢ a
org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:21
➢ var localarray = a.collect();
➢ localarray
[1, 2, 3, 4, 5, 6, 7]
Brings all the elements back to the driver. The data must fit into its memory, which makes this mostly impractical.
Basics of RDD
Actions: take() - Scala
➢ var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7));
➢ var localarray = a.take(4);
➢ localarray
[1, 2, 3, 4]
Brings only a few elements back to the driver. This is more practical than collect().
Basics of RDD
Actions: count() - Scala
➢ var a = sc.parallelize(Array(1,2,3, 4, 5 , 6, 7), 3);
➢ var mycount = a.count();
➢ mycount
7
[Diagram: the three partitions (1, 2, 3), (4, 5), (6, 7) are counted in parallel: 3 + 2 + 2 = 7.]
Finds out how many elements there are in an RDD. Works in a distributed fashion.
Basics of RDD
More Actions - Reduce()
➢ var seq = sc.parallelize(1 to 100)
➢ def sum(x: Int, y:Int):Int = {return x+y}
➢ var total = seq.reduce(sum);
total: Int = 5050
Aggregates the elements of the dataset using a function that:
• Takes 2 arguments and returns one
• Is commutative and associative, enabling parallelism
• Has a return type the same as its argument type
Basics of RDD
More Actions - Reduce()
To confirm, you could use the formula for summation of natural numbers
= n*(n+1)/2
= 100*101/2
= 5050
Basics of RDD
How does reduce work?
[Diagram: the RDD's elements 3, 7, 13, 16, 9 sit in two partitions. Each Spark application reduces its own partition - Partition 1: 3+7=10, 10+13=23; Partition 2: 16+9=25 - and the Spark driver combines the partial results: 23+25=48.]
Basics of RDD
For avg(), can we use reduce?
The way we had computed summation using reduce,
can we compute the average in the same way?
≫ var seq = sc.parallelize(Array(3.0, 7, 13, 16, 19))
≫ def avg(x: Double, y:Double):Double = {return (x+y)/2}
≫ var total = seq.reduce(avg);
total: Double = 9.875
Which is wrong. The correct average of 3.0, 7, 13, 16, 19 is 11.6.
Basics of RDD
Why average with reduce is wrong?
[Diagram: with the elements in two partitions, Partition 1 computes avg(3,7)=5 and then avg(5,13)=9; Partition 2 computes avg(16,9)=12.5; the driver then computes avg(9, 12.5)=10.75 - not the true average.]
Basics of RDD
Why average with reduce is wrong?
avg(avg(x, y), z) != avg(x, y, z): pairwise averaging weights the elements unequally, so the grouping changes the result.
Basics of RDD
But sum is ok
(x + y) + z = x + (y + z) = z + (x + y): addition is associative and commutative, so every grouping and order gives the same total.
Basics of RDD
Reduce
A reduce function must be
commutative and associative
otherwise
the results could be unpredictable and wrong.
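As a hedged illustration of our own (not from the slides), subtraction violates both properties, so reduce over it can give partition-dependent results:

var xs = sc.parallelize(1 to 4, 2)   // two partitions, e.g. (1, 2) and (3, 4)
xs.reduce(_ - _)   // (1-2) - (3-4) = 0 here, while sequentially ((1-2)-3)-4 = -8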
Basics of RDD
Commutative
If changing the order of inputs does not make any difference to the output, the function is commutative.
Examples:
Addition: 2 + 3 = 3 + 2
Multiplication: 2 * 3 = 3 * 2
Average: (3+4+5)/3 = (4+3+5)/3
Euclidean Distance: d(p, q) = d(q, p)
Non Commutative
Division: 2 / 3 != 3 / 2
Subtraction: 2 - 3 != 3 - 2
Exponent / power: 4 ^ 2 != 2 ^ 4
Basics of RDD
Associative
Associative property: you can add or multiply regardless of how
the numbers are grouped. By 'grouped' we mean 'how you use
parentheses'.
Examples:
Multiplication: (3 * 4) * 2 = 3 * (4 * 2)
Min: Min(Min(3,4), 30) = Min(3, Min(4, 30)) = 3
Max: Max(Max(3,4), 30) = Max(3, Max(4, 30)) = 30
Non Associative
Division: (2/3) / 4 != 2 / (3/4)
Subtraction: (2 - 3) - 1 != 2 - (3 - 1)
Exponent / power: (4 ^ 2) ^ 3 != 4 ^ (2 ^ 3)
Average: avg(avg(2, 3), 4) != avg(avg(2, 4), 3)
Solving Some Problems with Spark
Basics of RDD
Approach 1 - So, how to compute average?
➢ var rdd = sc.parallelize(Array(1.0,2,3, 4, 5 , 6, 7), 3);
➢ var avg = rdd.reduce(_ + _) / rdd.count();
What's wrong with this approach?
We are computing the RDD twice - once during reduce and once during count.
Can we compute sum and count in a single reduce?
Basics of RDD
Approach 2 - So, how to compute average?
[Diagram: each element x is mapped to the pair (x, 1); pairs are then reduced as (Total1, Count1) + (Total2, Count2) = (Total1 + Total2, Count1 + Count2). For 4, 5, 6: (4,1) + (5,1) = (9,2); (9,2) + (6,1) = (15,3); 15/3 = 5.]
Basics of RDD
Approach 2 - So, how to compute average?
➢ var rdd = sc.parallelize(Array(1.0,2,3, 4, 5 , 6, 7), 3);
➢ var rdd_count = rdd.map((_, 1))
➢ var (sum, count) = rdd_count.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
➢ var avg = sum / count
avg: Double = 4.0
Basics of RDD
Comparison of the two approaches
Approach 1:
0.023900 + 0.065180
= 0.08908 seconds ~ 89 ms
Approach 2:
0.058654 seconds ~ 59 ms
Approximately a 1.5X difference.
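A rough way to reproduce such timings in the Scala shell, using a small helper of our own (not part of Spark):

def timeIt[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println((System.nanoTime() - start) / 1e9 + " seconds")
  result
}
timeIt { rdd.reduce(_ + _) / rdd.count() }   // approach 1: two jobs
timeIt {                                     // approach 2: one job
  var (sum, count) = rdd.map((_, 1)).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
  sum / count
}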
Basics of RDD
How to compute Standard deviation?
Basics of RDD
So, how to compute Standard deviation?
The Standard Deviation is a measure of how spread out numbers are.
1. Work out the Mean (the simple average of the numbers)
2. Then for each number: subtract the Mean and square the result
3. Then work out the mean of those squared differences.
4. Take the square root of that and we are done!
Basics of RDD
So, how to compute Standard deviation?
Let's calculate the SD of 2 3 5 6
1. Mean of the numbers is μ = (2 + 3 + 5 + 6) / 4 = 4 (already computed in the previous problem)
2. xᵢ − μ = (−2, −1, 1, 2) (can be done using map())
3. (xᵢ − μ)² = (4, 1, 1, 4) (can be done using map())
4. ∑(xᵢ − μ)² = 10 (requires reduce)
5. √(1/N · ∑(xᵢ − μ)²) = √(10/4) = √2.5 = 1.5811 (can be performed locally)
Basics of RDD
So, how to compute Standard deviation?
➢ var rdd = sc.parallelize(Array(2, 3, 5, 6))
// Mean or average of the numbers is μ
➢ var rdd_count = rdd.map((_, 1))
➢ var (sum, count) = rdd_count.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
➢ var avg = sum / count
// (xᵢ − μ)²
➢ var sqdiff = rdd.map( _ - avg).map(x => x*x)
// ∑(xᵢ − μ)²
➢ var sum_sqdiff = sqdiff.reduce(_ + _)
// √(1/N · ∑(xᵢ − μ)²)
➢ import math._;
➢ var sd = sqrt(sum_sqdiff*1.0/count)
sd: Double = 1.5811388300841898
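Worth noting as a hedged aside: for RDDs of Doubles, Spark already ships these statistics through DoubleRDDFunctions, and its stdev() divides by N exactly as the slides do.

➢ var drdd = sc.parallelize(Array(2.0, 3, 5, 6))
➢ drdd.mean()    // 4.0
➢ drdd.stdev()   // 1.5811388300841898 (population standard deviation)
➢ drdd.stats()   // count, mean, stdev, max and min in a single pass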
Basics of RDD
Computing random sample from a dataset
The objective of the exercise is to pick a random sample from huge data.
Though there is a method provided on RDD, we are creating our own.
1. Let's try to understand it for, say, picking 50% of the records.
2. The approach is very simple: we pick a record from the RDD and do a coin toss. If it's heads, we keep the element; otherwise we discard it. This can be achieved using filter.
3. For picking any other fraction, we might use a coin having 100s of faces - in other words, a random number generator.
4. Please note that it would not give a sample of the exact size.
Basics of RDD
Computing random sample from a dataset
➢ var rdd = sc.parallelize(1 to 1000);
➢ var fraction = 0.1
➢ def cointoss(x:Int): Boolean = scala.util.Random.nextFloat() <= fraction
➢ var myrdd = rdd.filter(cointoss)
➢ var localsample = myrdd.collect()
➢ localsample.length
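The built-in method alluded to at the start of this exercise is sample(); a short hedged sketch:

➢ // RDD.sample(withReplacement, fraction, [seed]) - also approximate in size
➢ var builtin = rdd.sample(false, 0.1)
➢ builtin.count()   // ≈ 100, not exactly 100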
Thank you!