Big data distributed processing
Spark introduction
INDEX
1. History of Big Data
2. Apache Spark: basic concepts
3. Spark SQL
4. Spark deployment
1. History of Big Data
Definitions of Big Data:
1. Lots of data! (estimated 44 zettabytes of information in 2020). @me
2. The term that describes the huge amounts of data (structured and
unstructured) that flood businesses every day. @SAS
3. 3, 4, 5, 7, 10 Big Data V’s. @ORACLE @IBM
Volume, Velocity, Variety, Veracity, Value, Validity, Variability, Venue, …
History of Big Data
The most accurate definition:
Big Data refers to the systems and technologies needed to obtain, process,
analyze, visualize or extract value from data, which cannot be done with previous
technologies or systems due to its high volume, traffic and volatility.
History of Big Data
Big data distributed processing: Spark introduction
Google and the liberation of their technologies:
2003: The Google File System
● Problems with storage start to arise.
● We have designed and implemented the Google File System (GFS) to meet
the rapidly growing demands of Google’s data processing needs…
2004: MapReduce: Simplified Data Processing on Large Clusters
● Problems with data processing start to arise.
● Over the past five years, the authors and many others at Google have
implemented hundreds of special-purpose computations that process large
amounts of raw data…
History of Big Data
(2003) The Google File System
Definition of three (four) problems to solve:
1. Server hardware.
2. File sizes.
3. File usage.
4. Flexible applications.
History of Big Data
(2003) The Google File System
1. Server hardware:
History of Big Data
Problem → Solution
Hardware & software problems, component failures, human errors → Fault tolerance: store multiple copies of the same file on different servers.
High cost of high-performance servers → Horizontal scalability (+1k servers) with commodity servers (cheap and small).
(2003) The Google File System
2. File sizes:
History of Big Data
Problem → Solution
Multiple-GB files (cost of 1 GB ≈ $10 in the year 2000) → Optimize reads at the expense of writes.
Block size of a few KBs (thousands of millions of reads per file) → Increase the block size to 64-128 MB to reduce the number of reads per file.
(2003) The Google File System
3. File usage:
History of Big Data
Problem → Solution
Files are only appended (historic files, audit files, intermediate results, ...); sequential reads → Immutable and incremental files: read-only, append-only.
(2004) MapReduce: Simplified Data Processing on Large Clusters
Distributed computing paradigm:
● Distributed computing on top of a distributed file system.
● Functional computing is parallelizable by design.
● The system must provide:
○ Data movement administration.
○ Distributed execution management.
○ Fault tolerance.
● Clusters of commodity machines processing TBs of data.
History of Big Data
(2004) MapReduce: Simplified Data Processing on Large Clusters
Definition of two simple operations:
● Map: a given function is applied to each key-value pair to generate
intermediate key-value pairs.
● Reduce: a combine function groups the intermediate pairs by key and
aggregates the multiple values associated with each key.
History of Big Data
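To make the two operations concrete, here is a minimal word-count sketch written in plain Scala, with no MapReduce framework involved; the function names and the driver logic are illustrative only.
object WordCountSketch {
  // Map: emit an intermediate (word, 1) pair for every word in the input line.
  def map(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq
  // Reduce: aggregate all the values associated with one key.
  def reduce(word: String, counts: Seq[Int]): (String, Int) = (word, counts.sum)
  def main(args: Array[String]): Unit = {
    val lines = Seq("tomato apple apple", "pear tomato")
    val intermediate = lines.flatMap(map)                               // map phase
    val grouped = intermediate.groupBy(_._1)                            // group by key (the "shuffle")
    val result = grouped.map { case (w, kvs) => reduce(w, kvs.map(_._2)) }
    result.foreach(println)                                             // (tomato,2), (apple,2), (pear,1)
  }
}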
(2004) MapReduce: Simplified Data Processing on Large Clusters
History of Big Data
Hadoop Distributed File System (HDFS)
● ~2003: started with project Nutch: search engine and crawler.
● They had problems processing data with dozens of servers.
● Google papers gave them a new approach.
● 2006: Yahoo! became interested in the project and took the distributed
computing part of the software, renaming it Hadoop.
History of Big Data
Hadoop Distributed File System (HDFS)
History of Big Data
2. Apache Spark:
Basic concepts
● Framework for distributed data computing.
● Designed to be executed in large scale clusters with lots of data!
● Runs faster than MapReduce (thanks to in-memory processing).
● More functions than just Map and Reduce.
● Multiple APIs, multiple programming languages:
○ Core, SQL, Streaming, GraphX, ML, MLlib, Structured Streaming, …
○ Scala (native), Java, Python, R.
● Runs everywhere:
○ Standalone, YARN, Mesos, Kubernetes, AWS, ...
Apache Spark: basic concepts
● Fault tolerance (RDD).
● Easier resource management.
● Reusable data: caching.
● Code control and analysis (DAG).
● Generic programming patterns: the same code can run in local mode or
on hundreds of executors.
● Lazy evaluation: transformations and actions.
Apache Spark: basic concepts
Installation
● GOTO: https://spark.apache.org/downloads.html
● Select version 2.3.2, for Hadoop 2.7 or later.
● Extract it, set SPARK_HOME and add it to PATH.
○ tar xvf spark-2.3.2-bin-hadoop2.7.tgz
○ Add the following lines in .bashrc:
■ export SPARK_HOME=/your/spark/folder
■ export PATH=$PATH:$SPARK_HOME/bin
Apache Spark: basic concepts
Apache Spark: basic concepts
RDD
● The basic abstraction: RDD (Resilient Distributed Datasets)
RDD[Integer] with data [1, 2, 3, 4, 5] split into Partition0 = [1, 3], Partition1 = [2, 5], Partition2 = [4]
Apache Spark: basic concepts
RDD
● Distributed: split into multiple partitions.
● Resilient: fault tolerance. Data can be reprocessed or duplicated in case of
failure.
● Immutable typed elements: data cannot be modified.
● A data abstraction that allows Spark to natively parallelize the data across
different executors.
● Likely to be deprecated soon.
Apache Spark: basic concepts
RDD
val numbers = Seq(1, 2, 3, 4, 5)
val numbersRDD = spark.sparkContext.parallelize(numbers)
numbersRDD: org.apache.spark.rdd.RDD[Int]
Apache Spark: basic concepts
Dataframe
● Distributed collection of Row objects, with the data organized into named
columns.
Example: CSV data (name, age) = (hektor, 26), (pepe, 22), (jose, 40) becomes an RDD[Row] with Partition0 = {(hektor, 26), (pepe, 22)} and Partition1 = {(jose, 40)}
Apache Spark: basic concepts
Dataframe
● Catalyst: powers the Dataframe and SQL APIs.
1. Analyzing a logical plan to resolve references
2. Logical plan optimization
3. Physical planning
4. Code generation to compile parts of the query to Java bytecode.
● Tungsten: provides a physical execution backend which explicitly manages
memory and dynamically generates bytecode for expression evaluation.
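To peek at what Catalyst and Tungsten produce, explain(true) prints the parsed, analyzed and optimized logical plans plus the physical plan. A minimal sketch, assuming an existing DataFrame df with an integer column id:
import org.apache.spark.sql.functions.col
// Prints the logical plans built by Catalyst and the physical plan
// executed by the Tungsten-backed operators.
df.filter(col("id") > 2).explain(true)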
Apache Spark: basic concepts
Dataframe
val rowNumbersRDD = numbersRDD.map(v => Row(v))
val schema = StructType(Seq(
StructField("id", IntegerType, false)
))
val numbersDF = spark.createDataFrame(rowNumbersRDD, schema)
Apache Spark: basic concepts
Dataset
● Extension of the DataFrame API: combines the strong typing and lambda
functions of RDDs with the optimized execution engine of the DataFrame API.
● Most used and powerful API.
● Encoders: allow working with structured and unstructured data.
● Encoders provide type-safe, object-oriented access to the data.
Apache Spark: basic concepts
Dataset
import spark.implicits._ // provides .toDS and the implicit encoders
val numbersDS = Seq(1, 2, 3).toDS
[...]
case class Person(name: String, age: Integer)
val personList = Seq(Person("Hektor", 26), Person("Pepe", 22))
val personDS = personList.toDS
val personDS = spark.createDataset(personList) // equivalent alternative (replaces the previous definition in the REPL)
[...]
Apache Spark: basic concepts
Spark operation types
● Transformations:
○ Operations that create a new RDD, usually based on a previous one.
○ The expression is not evaluated until an action is called.
○ Spark is able to infer the output type.
○ You can chain multiple transformations before an action.
● Actions:
○ Operations that evaluate all the transformations defined so far.
○ They force the evaluation in order to save or use the resulting data.
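A minimal sketch of lazy evaluation, reusing the numbersRDD defined a few slides ahead in the RDD example: the transformation only records a step in the DAG, and nothing runs until the action is called.
val incremented = numbersRDD.map(_ + 1)   // transformation: nothing is executed yet
val total = incremented.count()           // action: triggers the whole computation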
Apache Spark: basic concepts
Transformation types
Apache Spark: basic concepts
● map map[U](f: (T) => U)
Applies the given function to every single element. It can modify the values or
return a different type.
val myF = (v:Int) => v + 1
val newRDD : RDD[Int] = numbersRDD.map(myF)
val strRDD : RDD[String] = numbersRDD.map(v => v.toString)
val toLong = numbersRDD.map(_.toLong)
Narrow transformations
3 → “3”
Apache Spark: basic concepts
● flatMap flatMap[U](f: (T) => TraversableOnce[U])
Applies the given function to every single element and then flattens the results. It
can modify the values, return a different type, return multiple values or none.
val myF = (v:Int) => Seq(v + 1)
val newRDD = numbersRDD.flatMap(myF)
val strRDD = numbersRDD.flatMap(v => Seq(v.toString))
val toLong = numbersRDD.flatMap(_.toLong::Nil)
Narrow transformations
Apache Spark: basic concepts
● flatMap flatMap[U](f: (T) => TraversableOnce[U])
numbersRDD.flatMap(v => 1 to v)
Narrow transformations
Input: 1, 2, 3, 4, 5 → per-element results: [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5] → flattened output: 1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5
Apache Spark: basic concepts
● mapPartitions mapPartitions[U](f: (Iterator[T]) ⇒ Iterator[U])
Similar to map, but applying the function to the whole partition.
numbersRDD.mapPartitions(v => v.map(_ + 1))
● filter filter(f: (T) ⇒ Boolean)
Obtains a new RDD with the elements that satisfy the predicate.
val myF = (v:Int) => v%2==0
val evenNumbers = numbersRDD.filter(myF)
Narrow transformations
Apache Spark: basic concepts
● groupBy groupBy[K](f: (T) ⇒ K, p: Partitioner): RDD[(K, Iterable[T])]
Obtains a new RDD with the elements grouped by key.
val myF = (v:Int) => v % 2
val groupedRDD = numbersRDD.groupBy(myF)
Wide transformations
Input: 1, 2, 3, 4, 5 → output: (0, [2, 4]), (1, [1, 3, 5])
Apache Spark: basic concepts
● repartition(n)
Redistributes the RDD into n partitions of roughly equal size (always shuffles the
data).
● coalesce(n, shuffle: Boolean = false)
Changes the number of partitions. If shuffle is false, you can only reduce the
number of partitions and the transformation will be narrow.
Wide transformations
coalesce(2, false): [1,2], [3,4], [5,6] → [1,2,3,4], [5,6]
Apache Spark: basic concepts
● coalesce(n, shuffle : Boolean = false)
Wide transformations
coalesce(2, true) / repartition(2) (shuffle = wide): RDD[Int] with 3 partitions [1,2], [3,4], [5,6] → RDD[Int] with 2 partitions [1,2,3], [4,5,6]
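A minimal sketch of the difference (the RDD name is illustrative); getNumPartitions shows the resulting number of partitions.
val sixRDD = spark.sparkContext.parallelize(1 to 6, 3)   // 3 partitions
sixRDD.coalesce(2).getNumPartitions       // 2, narrow: partitions are merged without a shuffle
sixRDD.repartition(2).getNumPartitions    // 2, wide: data is shuffled into equally sized partitions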
Apache Spark: basic concepts
An RDD is automatically enriched into a PairRDD when its elements are (K, V) pairs.
New transformations become available:
val strings = "tomato apple apple pear tomato"
val stringsRDD = spark.sparkContext.parallelize(strings.split(" "))
val pairs = stringsRDD.map(v => (v,1))
val countPairs = pairs.reduceByKey(_ + _)
● mapValues, flatMapValues, sortByKey, countByKey, foldByKey, ...
Key-value transformations
Apache Spark: basic concepts
● foldByKey foldByKey(zero: V)(f: (V, V) ⇒ V): RDD[(K, V)]
Aggregates the values of each key, using the zero value as the initial value within each partition.
val sumFunc = (v1:Int, v2:Int) => v1 + v2
val countPairs = pairs.foldByKey(0)(_ + _) // or sumFunc
● aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒
U): RDD[(K, U)]
Like foldByKey, but seqOp and combOp can be different operations and the result
type U can differ from the value type V (see the sketch below).
* These functions have their RDD counterparts (without ByKey)
Key-value transformations
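A minimal sketch of aggregateByKey, using a hypothetical salesRDD of (country, amount) pairs to compute the average amount per country; foldByKey could not express this because the accumulator type differs from the value type.
val salesRDD = spark.sparkContext.parallelize(Seq(("es", 10), ("es", 30), ("fr", 5)))
val sumCount = salesRDD.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // seqOp: fold one value into the (sum, count) accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)     // combOp: merge two partial accumulators
)
val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
avgByKey.collect()   // (es,20.0), (fr,5.0)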
Apache Spark: basic concepts
foldByKey and aggregateByKey
Apache Spark: basic concepts
An action executes the current operation and all the previous transformations defined in the DAG.
Spark Driver data collection
These functions get the data from executors into the driver.
● collect: the driver obtains all the information. This is a really dangerous
operation that could kill the driver if you handle huge amounts of data.
● take(n), takeOrdered(n), first: bring the first n results (or the first element) to the driver.
● count: counts the number of elements.
Actions
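A minimal sketch of these actions, reusing numbersRDD (which holds 1 to 5):
numbersRDD.count()          // 5
numbersRDD.take(2)          // Array(1, 2): only two elements travel to the driver
numbersRDD.takeOrdered(2)   // Array(1, 2): the two smallest elements
numbersRDD.first()          // 1
numbersRDD.collect()        // Array(1, 2, 3, 4, 5): dangerous on large datasets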
Apache Spark: basic concepts
Data usage or movement
● foreach(f: (T) => Unit): applies the function to each element. It cannot
modify the values (use map instead).
Use this function for side effects instead of map.
rdd.foreach { v =>
doSomethingWithIt(v) // good: side effect in an action
v + 1 // no effect: the result of foreach is discarded
}
rdd.map { v =>
doSomethingWithIt(v) // bad: side effect inside a transformation
v + 1 // the resulting RDD contains the new values
}
Actions
Apache Spark: basic concepts
Data usage or movement
● saveAs*
Saves the RDD to different sinks, such as HDFS or the local file system
(saveAsTextFile, saveAsObjectFile, saveAsHadoopFile, ...).
Spark’s RDD API is not commonly used nowadays; it should only be used
when the Dataset API is not enough:
dataset.rdd().makeRDDThings.toDS()
Actions
Apache Spark: basic concepts
One of the most important features of Spark: persistence allows us to reuse
intermediate results instead of recomputing them.
● cache():
Persist this RDD with the default storage level (MEMORY_ONLY).
● persist(newLevel: StorageLevel)
Set this RDD's storage level to persist its values across operations after the
first time it is computed.
Persistence and caching
Apache Spark: basic concepts
Storage Levels:
● MEMORY_ONLY
● MEMORY_AND_DISK
● MEMORY_ONLY_SER
● MEMORY_AND_DISK_SER
● DISK_ONLY
● All of the above with _2
● OFF_HEAP
Persistence and caching
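A minimal sketch, reusing numbersRDD: cache() uses MEMORY_ONLY, while persist() accepts an explicit storage level.
import org.apache.spark.storage.StorageLevel
val doubled = numbersRDD.map(_ * 2).cache()                               // MEMORY_ONLY
val tripled = numbersRDD.map(_ * 3).persist(StorageLevel.MEMORY_AND_DISK)
doubled.count()      // the first action computes and stores the partitions
doubled.count()      // the second action reuses the cached partitions
tripled.unpersist()  // release the storage when it is no longer needed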
Apache Spark: basic concepts
1. Read a log file.
2. Print the total number of lines and characters, and show the first few lines.
3. Split each line into a tuple and persist the result.
4. Calculate the number of logs per country. Show the top 3.
5. Calculate the number of logs per type.
6. Check when the errors start happening.
7. Do the same with Datasets.
Painful RDD example
Apache Spark: basic concepts
Complete the following exercise
Exercise
Apache Spark: basic concepts
Persistence and caching
DAG of the exercise: read file → print lines / count lines / count chars;
read file → obtain tuples → logs per country / logs per type / min of date.
For every action, Spark asks the file system where to get the data and obtains it.
Apache Spark: basic concepts
Wordcount example
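A minimal word-count sketch with the RDD API (the input path is illustrative):
val lines = spark.sparkContext.textFile("file:/tmp/words.txt")
val counts = lines
  .flatMap(_.split("\\s+"))   // split every line into words
  .filter(_.nonEmpty)
  .map(word => (word, 1))     // emit (word, 1) pairs
  .reduceByKey(_ + _)         // sum the counts of each word
counts.take(10).foreach(println)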
3. Spark SQL
Spark SQL
Cleaner API than the RDD:
● Schema inference and strong typing: type safety at compile time!!!
● Business intelligence easier than ever!
● Throw some SQL if you are stuck!!
spark.read.text("./myfile.txt").createOrReplaceTempView("data")
val data = spark.sql("SELECT * FROM data")
data.show()
Datasets
Spark SQL
● read: reads the data from the selected datasource.
If the data is partitioned, Spark will handle it for you!
spark.read.options(MapWithOptions).csv(...) / .parquet(...) / .json(...) /
.text(...) / .jdbc(...) / .orc(...)
[...]
Data usage or movement (Dataset)
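A minimal sketch of reading a CSV file with options (the path and option values are illustrative):
val logsDF = spark.read
  .option("header", "true")        // the first line contains the column names
  .option("inferSchema", "true")   // let Spark infer the column types
  .option("delimiter", "|")
  .csv("file:/tmp/logs.csv")
logsDF.printSchema()
logsDF.show(5)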
Spark SQL
● write: saves the data into the selected datasource.
Each partition is saved separately, unless you coalesce or repartition first.
The output will create part files inside the designated folder, and they can be
read natively by Spark by reading the parent folder.
ds.write.options(
Map("header" -> "true", "delimiter" -> "|")
).csv("file:/tmp/test")
Data usage or movement (Dataset)
Spark SQL
● select(“col1”, “col2”, ...): obtains the selected columns from the dataset.
● drop(“col1”, “col2”, ...): drops the selected columns from the dataset.
● filter(expr): obtains the rows that satisfy the expression.
● union(ds): combines with another dataset with the same schema.
● dropDuplicates(“col1”, “col2”, ...): drops the duplicated rows considering the
columns given.
● except(ds): obtains the rows not present in the other dataset.
● join(ds, joinExprs, joinType): the good ol’ join.
● groupBy(“cols”): groups the dataset by the given columns. You can then
use aggregation functions with agg(f1, f2, f3, ...); a chained example follows below.
[...]
Dataset transformations
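A minimal sketch chaining some of these transformations on the personDS dataset defined earlier (the aggregation column name is illustrative):
import org.apache.spark.sql.functions._
personDS
  .select("name", "age")
  .filter(col("age") > 21)
  .groupBy(col("age"))
  .agg(count("name").as("people"))
  .show()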
Spark SQL
Complete the following exercise
Exercise
4. Spark deployment
● Driver: in charge of managing the job sent to the executors.
● Executors: in charge of running the tasks from the job.
● Cluster Manager: provides connectivity and the resources that the driver needs.
○ Standalone
○ Mesos
○ Yarn
○ Kubernetes (k8s)
● SparkContext / SparkSession: must be created to run Spark jobs. It configures the
context where the jobs will run (number and location of executors, resources, …).
Only one SparkContext can exist per JVM.
Spark deployment
Components
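A minimal sketch of creating a SparkSession (the app name, master and memory setting are illustrative; on a real cluster the master is usually supplied by spark-submit rather than hard-coded):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("spark-introduction")
  .master("local[*]")                       // run locally using all available cores
  .config("spark.executor.memory", "2g")
  .getOrCreate()
val sc = spark.sparkContext                 // the underlying SparkContext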
Spark deployment
Components
Spark deployment
Components
Application lifecycle
● The Spark job is packaged into a JAR, usually an “uber JAR” without the Spark or Hadoop
dependencies (they are provided at runtime).
● How to launch the job varies between Cluster Managers. Usually, you have an
endpoint where you can submit your jobs:
./bin/spark-submit --master spark://host:port [...]
● The JAR is received by the Cluster Manager, which proceeds to launch a Driver
program containing the JAR.
● The driver starts the job and asks the Cluster Manager for the resources needed for
the executors.
● The driver connects to the executors, sends the JAR, and the executors start
receiving tasks until the job is finished.
Spark deployment
Components
Job components
● Task: a unit of work that will be sent to one executor.
● Job: a parallel computation consisting of multiple tasks that gets spawned in
response to a Spark action (e.g. save, collect); you'll see this term used in the driver's
logs.
● Stage: each job gets divided into smaller sets of tasks called stages that depend on
each other (similar to the map and reduce stages in MapReduce).
Books...
Big data distributed processing: Spark introduction
people@stratio.com
WE ARE HIRING
@StratioBD
Thanks!
hjacynycz@stratio.com