Big Data Analytics with Apache Spark
Marco Yuri Fujii Melo
Agenda

Apache Spark definition

RDD (Resilient Distributed Datasets)

Important aspects about RDDs

RDD operations

Pair RDD

Lazy evaluation

Apache Spark architecture

Spark SQL

Dataset: 2021 Stack Overflow Developer Survey
(the dataset used in the demo)

Demo
Definition
“Apache Spark is an open-source, distributed processing system used
for big data workloads. It utilizes in-memory caching, and optimized
query execution for fast analytic queries against data of any size. It
provides development APIs in Java, Scala, Python and R, and supports
code reuse across multiple workloads—batch processing, interactive
queries, real-time analytics, machine learning, and graph processing.”
(Amazon Web Services)
RDD (Resilient Distributed Datasets)

Most important concept in Apache Spark

An RDD is an encapsulation around a very large
dataset

All work in Spark involves either creating new
RDDs, transforming existing RDDs, or calling
operations on RDDs to compute a result

Behind the scenes, Spark will distribute the
data contained in the RDD across the cluster
and parallelize the operations being performed
on it
RDDs are Distributed

Spark automatically breaks RDDs into multiple
pieces called partitions, which are distributed
across the nodes of the cluster
For instance, if your cluster has 6 nodes, then the RDD
could be split into 6 partitions

The partitions are processed in parallel and
independently
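The partitioning idea above can be modelled on a single machine. The following is a minimal plain-Java sketch, not Spark code: `PartitionSketch`, `partition` and `parallelSum` are hypothetical names, and a thread pool stands in for the cluster nodes. It splits a list into slices, sums each slice independently, and then combines the partial results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java model of partitioned, parallel processing (hypothetical, not the Spark API).
class PartitionSketch {

    // Split the data into at most numPartitions contiguous slices.
    static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        int size = (data.size() + numPartitions - 1) / numPartitions; // ceiling division
        for (int start = 0; start < data.size(); start += size) {
            int end = Math.min(start + size, data.size());
            partitions.add(new ArrayList<>(data.subList(start, end)));
        }
        return partitions;
    }

    // Sum each partition in its own task, then combine the partial sums.
    static long parallelSum(List<Integer> data, int numPartitions) {
        ExecutorService pool = Executors.newFixedThreadPool(numPartitions);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> part : partition(data, numPartitions)) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int x : part) sum += x;
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : partials) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
        System.out.println(partition(data, 6)); // six slices of two elements each
        System.out.println(parallelSum(data, 6)); // 78
    }
}
```

In Spark the slices would live on different machines and the scheduler would decide where each task runs; the sketch only shows why independent partitions make parallel work straightforward.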
RDDs are Immutable

RDDs cannot be changed after they are created
This avoids many potential problems caused by
concurrent updates from multiple threads
RDDs are Resilient

RDD partitions can be recreated at any time

Fault tolerance: if any node in the
cluster goes down, Spark can recompute the lost parts
of the RDD from the input data and pick up
where it left off
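The recovery idea can be sketched in plain Java. This is an illustrative model only, not how Spark is implemented: `LineageSketch` and its methods are hypothetical names. The object keeps the input partitions and the function that derives each result, so a result partition that disappears from memory can be rebuilt on demand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of rebuilding a lost partition from its input (hypothetical, not the Spark API).
class LineageSketch {
    private final List<List<Integer>> inputPartitions;        // the durable input data
    private final Function<Integer, Integer> transformation;  // how each element is derived
    private final Map<Integer, List<Integer>> computed = new HashMap<>(); // results held in memory

    LineageSketch(List<List<Integer>> inputPartitions, Function<Integer, Integer> transformation) {
        this.inputPartitions = inputPartitions;
        this.transformation = transformation;
    }

    // Return partition i, recomputing it from the input if it is not in memory.
    List<Integer> getPartition(int i) {
        return computed.computeIfAbsent(i, idx -> {
            List<Integer> out = new ArrayList<>();
            for (Integer x : inputPartitions.get(idx)) out.add(transformation.apply(x));
            return out;
        });
    }

    // Simulate a node failure that wipes one computed partition.
    void losePartition(int i) {
        computed.remove(i);
    }

    boolean isInMemory(int i) {
        return computed.containsKey(i);
    }

    public static void main(String[] args) {
        LineageSketch rdd = new LineageSketch(
                Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4)), x -> x * 10);
        System.out.println(rdd.getPartition(1)); // [30, 40]
        rdd.losePartition(1);                    // the "node" holding partition 1 fails
        System.out.println(rdd.getPartition(1)); // [30, 40], rebuilt from the input
    }
}
```

Only the lost partition is recomputed; partition 0 is never touched, which is the property that makes this kind of recovery cheap.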
RDD operations

Transformations: apply functions to the data in
an existing RDD to create a new RDD
Example: the filter transformation, which returns a new
RDD with a subset of the data in the original RDD
Other popular transformations: map, flatMap, sample, distinct, union, intersection,
subtract, cartesian

Actions: compute a result based on an RDD or
persist data to external storage
Example: the first action, which returns the first element in
an RDD
Other popular actions: collect, count, countByValue, take, saveAsTextFile, reduce
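The split between transformations and actions can be illustrated with a small stand-in class. This is a hedged, single-machine sketch: `MiniRdd` is a hypothetical class, not part of Spark, and unlike Spark it evaluates eagerly. Transformations return a new `MiniRdd` and leave the original untouched, while actions return a plain value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical single-machine stand-in for an RDD (not the Spark API, and eager rather than lazy).
class MiniRdd<T> {
    private final List<T> data; // never modified after construction

    MiniRdd(List<T> data) {
        this.data = Collections.unmodifiableList(new ArrayList<>(data));
    }

    // Transformation: keep only the matching elements, in a new MiniRdd.
    MiniRdd<T> filter(Predicate<T> keep) {
        List<T> out = new ArrayList<>();
        for (T x : data) {
            if (keep.test(x)) out.add(x);
        }
        return new MiniRdd<>(out);
    }

    // Transformation: apply f to every element, in a new MiniRdd.
    <R> MiniRdd<R> map(Function<T, R> f) {
        List<R> out = new ArrayList<>();
        for (T x : data) out.add(f.apply(x));
        return new MiniRdd<>(out);
    }

    // Action: return the first element (throws if there is none).
    T first() {
        return data.get(0);
    }

    // Action: return the number of elements.
    long count() {
        return data.size();
    }

    // Action: return all elements as a read-only list.
    List<T> collect() {
        return data;
    }

    public static void main(String[] args) {
        MiniRdd<String> months = new MiniRdd<>(Arrays.asList("March", "April", "May"));
        MiniRdd<String> startingWithM = months.filter(m -> m.startsWith("M"));
        System.out.println(startingWithM.first()); // March
        System.out.println(startingWithM.count()); // 2
        System.out.println(months.count());        // 3, the original is unchanged
    }
}
```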
Pair RDD

A Pair RDD is a type of RDD that stores
key-value pairs

In this type of dataset, each row has a key
that maps to one or more values
// a Tuple2 holds one key-value pair
Tuple2<Integer, String> tuple = new Tuple2<>(123, "value");
Integer key = tuple._1();
String value = tuple._2();
...
// tupleList is a List<Tuple2<Integer, String>>
JavaPairRDD<Integer, String> pairRDD = sparkContext.parallelizePairs(tupleList);
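A common reason to key the data is to combine all values that share a key, which is what `reduceByKey` does on a `JavaPairRDD`. The following plain-Java sketch models that idea on one machine: `PairSketch`, `pair` and this local `reduceByKey` helper are hypothetical, and `Map.Entry` stands in for `Tuple2`.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java model of combining values per key (hypothetical, not the Spark API).
class PairSketch {

    // Build one key-value pair; Map.Entry plays the role of Tuple2 here.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Merge all values that share a key using the given combine function.
    static <K extends Comparable<K>, V> Map<K, V> reduceByKey(
            List<Map.Entry<K, V>> pairs, BinaryOperator<V> combine) {
        Map<K, V> result = new TreeMap<>(); // sorted by key for stable output
        for (Map.Entry<K, V> p : pairs) {
            result.merge(p.getKey(), p.getValue(), combine);
        }
        return result;
    }

    public static void main(String[] args) {
        // One (language, 1) pair per made-up survey answer.
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                pair("Java", 1), pair("Scala", 1), pair("Java", 1));
        System.out.println(reduceByKey(pairs, Integer::sum)); // {Java=2, Scala=1}
    }
}
```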
Lazy evaluation
// nothing happens when Spark sees the textFile() method call
JavaRDD<String> lines = sc.textFile("months.txt");
// nothing happens when Spark sees the filter() transformation
JavaRDD<String> linesWithApril = lines.filter(line -> line.startsWith("April"));
// Spark only starts loading months.txt when first() action is called
// Spark is smart enough to know it doesn’t need to go through the entire file
// it will scan the file only until it finds the first line starting with “April”
String firstLineWithApril = linesWithApril.first();
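The same behaviour can be observed without a cluster, because Java's own streams are also lazy. The sketch below is an analogy rather than Spark code: `LazySketch` and `linesStartingWith` are hypothetical names, and the counter exists only to show how many lines the filter actually inspects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Lazy evaluation shown with java.util.stream (an analogy, not the Spark API).
class LazySketch {

    // Describe a filter over the lines; nothing is inspected until a terminal operation runs.
    static Stream<String> linesStartingWith(List<String> lines, String prefix, AtomicInteger tested) {
        return lines.stream().filter(line -> {
            tested.incrementAndGet(); // count every line the filter actually inspects
            return line.startsWith(prefix);
        });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("January", "February", "March", "April", "May");
        AtomicInteger tested = new AtomicInteger();

        Stream<String> withApril = linesStartingWith(lines, "April", tested);
        System.out.println(tested.get()); // 0: building the pipeline inspected nothing

        String first = withApril.findFirst().orElse(null);
        System.out.println(first + " " + tested.get()); // April 4: "May" was never inspected
    }
}
```

Here `filter` plays the role of a transformation and `findFirst` the role of the `first()` action: work starts only when a result is requested, and it stops as soon as the result is known.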
Spark architecture in the cluster
Spark SQL

Apache Spark's module for working with
structured data

Structured data is any data that has a schema

Provides a dataset abstraction that simplifies
working with structured data

A Dataset is similar to a table in a relational
database
2021 Stack Overflow Developer Survey

Full dataset available at:
https://insights.stackoverflow.com/survey

Nearly 80,000 responses fielded from over 180
countries

There are six sections in this survey:
1. Basic Information
2. Education, Work, and Career
3. Technology and Tech Culture
4. Stack Overflow Usage + Community
5. Demographic Information
6. Final Questions
2021 Stack Overflow Developer Survey
Demo
Source code available at:
http://www.marco.tec.br
References

https://spark.apache.org/

https://aws.amazon.com/big-data/what-is-spark/

https://stackoverflow.blog/2021/08/30/the-full-data-set-for-the-2021-developer-survey-now-available/
