Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutorial | CloudxLab

Basics of RDD - More Operations

More Transformations

sample(withReplacement, fraction, [seed])
Sample an RDD, with or without replacement.

val seq = sc.parallelize(1 to 100, 5)

seq.sample(false, 0.1).collect()
[8, 19, 34, 37, 43, 51, 70, 83]

seq.sample(true, 0.1).collect()
[14, 26, 40, 47, 55, 67, 69, 69]

Please note that the result will be different on every run unless a seed is supplied. With replacement, the same element can be drawn more than once (as with 69 above).
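The two sampling modes can be sketched in plain Python (no Spark, standard library only). This is a simplification of Spark's actual samplers: without replacement each element is kept independently with probability `fraction`; with replacement we simply draw `len(data) * fraction` elements uniformly, so duplicates can appear.

```python
import random

def sample(data, with_replacement, fraction, seed=None):
    """Plain-Python sketch of RDD.sample semantics (not Spark itself)."""
    rng = random.Random(seed)
    if with_replacement:
        # Draw round(len(data) * fraction) elements uniformly; duplicates allowed.
        k = round(len(data) * fraction)
        return sorted(rng.choices(data, k=k))
    # Without replacement: keep each element independently with probability `fraction`.
    return [x for x in data if rng.random() < fraction]
```

Note that in both modes the sample size is only approximately `fraction` times the input size, which is also true of Spark's `sample`.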
Common Transformations (continued)

mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD.

val rdd = sc.parallelize(1 to 50, 3)

def f(l: Iterator[Int]): Iterator[Int] = {
  var sum = 0
  while (l.hasNext) {
    sum = sum + l.next
  }
  List(sum).iterator
}

rdd.mapPartitions(f).collect()
Array(136, 425, 714)
(one sum per partition; the three partitions hold 16, 17 and 17 elements)
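The per-partition summing can be mimicked in plain Python (hypothetical helper names, no Spark): split the data into contiguous chunks the way `sc.parallelize` does, then feed each chunk's iterator to the function once.

```python
def split_into_partitions(data, n):
    """Split data into n contiguous chunks, mimicking sc.parallelize(data, n).

    Spark assigns elements i*size//n .. (i+1)*size//n - 1 to partition i.
    """
    size = len(data)
    return [data[i * size // n:(i + 1) * size // n] for i in range(n)]

def map_partitions(partitions, f):
    """Apply f once per partition iterator and concatenate the results."""
    out = []
    for part in partitions:
        out.extend(f(iter(part)))
    return out

def f(it):
    # Same logic as the Scala f above: one sum per partition.
    return [sum(it)]
```

Running this on 1..50 split into 3 partitions reproduces the `Array(136, 425, 714)` from the slide.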
Common Transformations (continued)

sortBy(func, ascending=True, numPartitions=None)
Sorts this RDD by the given func.

func: A function used to compute the sort key for each element.
ascending: A flag to indicate whether the sorting is ascending or descending.
numPartitions: Number of partitions to create.

var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
var rdd = sc.parallelize(tmp)

rdd.sortBy(x => x._1).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]

rdd.sortBy(x => x._2).collect()
[('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]

var rdd = sc.parallelize(Array(10, 2, 3, 21, 4, 5))
var sortedrdd = rdd.sortBy(x => x)
sortedrdd.collect()
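Ignoring partitioning, sortBy behaves like an ordinary key-based sort, which a one-line Python sketch makes concrete (plain Python, not Spark):

```python
def sort_by(data, func, ascending=True):
    """Plain-Python sketch of RDD.sortBy: sort by the key computed by func."""
    return sorted(data, key=func, reverse=not ascending)

pairs = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
```

Sorting `pairs` by the first tuple element reproduces the output above: digit characters sort before letters in string order.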
Common Transformations (continued)

Set operations (Pseudo)

Though an RDD is not really a set, the set operations still provide useful set-like functions.

distinct()
+ Gives your RDD set semantics by removing duplicates
+ Expensive, as shuffling is required

union()
+ Simply appends one RDD to another
+ Not the same as the mathematical union: the result may contain duplicates

subtract()
+ Returns the values present in the first RDD but not in the second
+ Requires shuffling, like intersection()

intersection()
+ Finds the values common to both RDDs
+ Also removes duplicates
+ Requires shuffling

cartesian()
+ Returns all possible pairs (a, b)
+ a is from the source RDD and b is from the other RDD
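The pseudo-set semantics are easy to pin down with plain-Python list versions (sketches only, not how Spark implements them): union keeps duplicates, subtract and intersection deduplicate via sets, and cartesian pairs every element of one dataset with every element of the other.

```python
def union(a, b):
    # Appends b to a; duplicates are kept, unlike a mathematical union.
    return a + b

def subtract(a, b):
    # Values present in a but not in b.
    bs = set(b)
    return [x for x in a if x not in bs]

def intersection(a, b):
    # Common values, duplicates removed (sorted here for a stable result).
    return sorted(set(a) & set(b))

def cartesian(a, b):
    # All possible pairs (x, y) with x from a and y from b.
    return [(x, y) for x in a for y in b]
```

Note that `cartesian` on RDDs of sizes m and n yields m * n pairs, which grows quickly.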
More Actions - fold()

fold(initial value)(func)
+ Very similar to reduce
+ Provides a little extra control over the initialisation
+ Lets us specify an initial value

Aggregates the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value".

[Diagram: the elements 1, 7, 2 in Partition 1 and 4, 7, 6 in Partition 2 are each folded starting from the initial value; the per-partition results are then combined, again starting from the initial value, to produce the final result.]

Example: concatenating to "_"

var myrdd = sc.parallelize(1 to 10, 2)
var myrdd1 = myrdd.map(_.toString)
def concat(s: String, n: String): String = s + n
var s = "_"
myrdd1.fold(s)(concat)
res1: String = __12345_678910

Note that the initial value is used once per partition and once more when the per-partition results are combined, which is why "_" appears three times in the result.
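The two-level folding can be sketched in a few lines of plain Python (not Spark): fold each partition starting from the zero value, then fold the per-partition results, again starting from the zero value.

```python
from functools import reduce

def fold(partitions, zero, op):
    """Plain-Python sketch of RDD.fold: fold each partition from `zero`,
    then fold the per-partition results, again starting from `zero`."""
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)
```

With two partitions of "1".."5" and "6".."10" and zero "_", this reproduces the "__12345_678910" result above, and with a zero of 0 and addition it behaves like reduce.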
More Actions - aggregate()

aggregate(initial value)(seqOp, combOp)
1. First, the values of each partition are merged with the initial value using seqOp()
2. Second, the per-partition results are combined together using combOp()
3. Especially useful when the output is of a different data type than the input

[Diagram: the partitions (1, 2, 3), (4, 5) and (6, 7) are each reduced with seqOp(), and the per-partition results are merged with combOp() to produce the output.]

var rdd = sc.parallelize(1 to 100)
var init = (0, 0) // sum, count
def seq(t: (Int, Int), i: Int): (Int, Int) = (t._1 + i, t._2 + 1)
def comb(t1: (Int, Int), t2: (Int, Int)): (Int, Int) = (t1._1 + t2._1, t1._2 + t2._2)
var d = rdd.aggregate(init)(seq, comb)
res6: (Int, Int) = (5050,100)
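The same sum-and-count aggregation can be sketched in plain Python (not Spark), which makes the seqOp/combOp split explicit: seqOp folds elements into the accumulator within a partition, and combOp merges two accumulators across partitions.

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Plain-Python sketch of RDD.aggregate: seq_op folds each partition
    from `zero`; comb_op then merges the per-partition results."""
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, per_partition, zero)

def seq(t, i):       # fold one element into a (sum, count) accumulator
    return (t[0] + i, t[1] + 1)

def comb(t1, t2):    # merge two (sum, count) accumulators
    return (t1[0] + t2[0], t1[1] + t2[1])
```

Note the input elements are Ints while the accumulator is a (sum, count) pair: this is the "output of a different data type" case that aggregate exists for.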
More Actions: countByValue()

Returns the number of times each element occurs in the RDD.

var rdd = sc.parallelize(List(1, 2, 3, 3, 5, 5, 5))
var dict = rdd.countByValue()
dict
Map(1 -> 1, 5 -> 3, 2 -> 1, 3 -> 2)
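Locally, this is just an occurrence count, which `collections.Counter` expresses directly (plain Python, not Spark):

```python
from collections import Counter

def count_by_value(data):
    """Plain-Python sketch of RDD.countByValue: element -> occurrence count."""
    return dict(Counter(data))
```

Like Spark's countByValue, this returns the whole map to the caller, so it is only appropriate when the number of distinct values is small.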
More Actions: top(n)

Sorts the RDD and gets the maximum n values.

var a = sc.parallelize(List(4, 4, 8, 1, 2, 3, 10, 9))
a.top(6)
Array(10, 9, 8, 4, 4, 3)
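In plain Python the equivalent of top(n) is a bounded selection of the n largest values, returned in descending order (a sketch, not Spark's implementation):

```python
import heapq

def top(data, n):
    """Plain-Python sketch of RDD.top: the n largest values, descending."""
    return heapq.nlargest(n, data)
```

Using a heap rather than a full sort mirrors why top(n) is cheap: only n values per partition ever need to be retained.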
More Actions: takeOrdered(n)

Gets the n elements from an RDD ordered in ascending order, or as specified by the optional key function.

sc.parallelize(List(10, 1, 2, 9, 3, 4, 5, 6, 7)).takeOrdered(6)

var l = List((10, "SG"), (1, "AS"), (2, "AB"), (9, "AA"), (3, "SS"), (4, "RG"), (5, "AU"), (6, "DD"), (7, "ZZ"))
var r = sc.parallelize(l)

r.takeOrdered(6)(Ordering[Int].reverse.on(x => x._1))
(10,SG), (9,AA), (7,ZZ), (6,DD), (5,AU), (4,RG)

r.takeOrdered(6)(Ordering[String].reverse.on(x => x._2))
(7,ZZ), (3,SS), (10,SG), (4,RG), (6,DD), (5,AU)

r.takeOrdered(6)(Ordering[String].on(x => x._2))
(9,AA), (2,AB), (1,AS), (5,AU), (6,DD), (4,RG)
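takeOrdered is the ascending counterpart of top: the n smallest elements under the default or a supplied ordering. A plain-Python sketch (not Spark) uses a key function where the Scala API takes an Ordering:

```python
import heapq

def take_ordered(data, n, key=None):
    """Plain-Python sketch of RDD.takeOrdered: the n smallest elements in
    ascending order, or ordered by the optional key function."""
    return heapq.nsmallest(n, data, key=key)
```

A negated numeric key plays the role of `Ordering[Int].reverse` in the Scala example above.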
More Actions: foreach()

Applies a function to all elements of this RDD.

def f(x: Int) = println(s"Save $x to DB")
sc.parallelize(1 to 5).foreach(f)
Save 2 to DB
Save 1 to DB
Save 4 to DB
Save 3 to DB
Save 5 to DB
(the order of the output differs from run to run, since the partitions are processed in parallel)

Differences from map()
1. Use foreach when you don't expect any result back, for example when saving to a database.
2. foreach is an action; map is a transformation.
More Actions: foreachPartition(f)

Applies a function to each partition of this RDD.

def partitionSum(itr: Iterator[Int]) =
  println("The sum of the partition is " + itr.sum.toString)

sc.parallelize(1 to 40, 4).foreachPartition(partitionSum)
The sum of the partition is 155
The sum of the partition is 55
The sum of the partition is 355
The sum of the partition is 255
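The per-partition side-effect pattern can be sketched in plain Python (not Spark): the function is called once per partition iterator and returns nothing, which is why foreachPartition is the idiomatic place to open one database connection per partition.

```python
def foreach_partition(partitions, f):
    """Plain-Python sketch of RDD.foreachPartition: call f once per
    partition iterator, for side effects only (returns nothing)."""
    for part in partitions:
        f(iter(part))

sums = []  # stands in for the side effect (the println in the Scala example)
def partition_sum(it):
    sums.append(sum(it))
```

With 1..40 split into four partitions of ten, the recorded sums are 55, 155, 255 and 355, in whatever order the partitions happen to be processed.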
Thank you!