Big Data Analytics with Apache Spark

Marco Yuri Fujii Melo

Agenda

Apache Spark definition

RDD (Resilient Distributed Datasets)

Important aspects about RDDs

RDD operations

Pair RDD

Lazy evaluation

Apache Spark architecture

Spark SQL

Dataset: 2021 Stack Overflow Developer Survey
(the dataset used in the demo)

Demo

Definition
“Apache Spark is an open-source, distributed processing system used
for big data workloads. It utilizes in-memory caching, and optimized
query execution for fast analytic queries against data of any size. It
provides development APIs in Java, Scala, Python and R, and supports
code reuse across multiple workloads—batch processing, interactive
queries, real-time analytics, machine learning, and graph processing.”
(Amazon Web Services)

RDD (Resilient Distributed Datasets)

Most important concept in Apache Spark

An RDD is an encapsulation of a very large
dataset

All work in Spark involves either creating new
RDDs, transforming existing RDDs, or calling
operations on RDDs to compute a result

Behind the scenes, Spark will distribute the
data contained in the RDD across the cluster
and parallelize the operations being performed
on it

RDDs are Distributed

Spark automatically breaks RDDs into multiple
pieces called partitions, which are distributed
across the nodes of the cluster
For instance, if your cluster has 6 nodes, the RDD
could be split into 6 partitions

The partitions are processed in parallel and
independently
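
The partitioning idea above can be sketched in plain Java. This is only a single-machine analogy, not Spark's API or scheduler: the class PartitionSketch and its methods are hypothetical names, and a parallel stream stands in for the cluster nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Plain-Java analogy only: this mimics partitioning on one machine, not Spark itself.
class PartitionSketch {

    // Splits data into numPartitions contiguous slices of near-equal size.
    static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < numPartitions; i++) {
            int from = (int) ((long) i * size / numPartitions);
            int to = (int) ((long) (i + 1) * size / numPartitions);
            partitions.add(data.subList(from, to));
        }
        return partitions;
    }

    // Sums each partition independently and in parallel, then combines the partial sums.
    static long parallelSum(List<List<Integer>> partitions) {
        return partitions.parallelStream()
                .mapToLong(p -> p.stream().mapToLong(Integer::longValue).sum())
                .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.rangeClosed(1, 12).boxed().collect(Collectors.toList());
        // Mirrors the slide's example: 6 nodes, 6 partitions.
        List<List<Integer>> partitions = partition(data, 6);
        System.out.println("partitions: " + partitions);
        System.out.println("sum: " + parallelSum(partitions));
    }
}
```

Each partition is summed without looking at any other partition, which is what allows the work to run in parallel. In Spark the partitions would live on different nodes rather than in one JVM.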

RDDs are Immutable

RDDs cannot be changed after they are created
This avoids a significant number of potential problems
caused by concurrent updates from multiple threads

RDDs are Resilient

Parts of an RDD can be recreated at any time

Fault tolerance: if any node in the
cluster goes down, Spark can recover the lost parts
of the RDD from the input data and pick up
where it left off

RDD operations

Transformations: apply functions to the data in
an existing RDD to create a new RDD
Example: the filter transformation, which returns a new
RDD with a subset of the data in the original RDD
Other popular transformations: map, flatMap, sample, distinct, union, intersection,
subtract, cartesian

Actions: compute a result based on an RDD or
persist data to external storage
Example: the first action, which returns the first element in
an RDD
Other popular actions: collect, count, countByValue, take, saveAsTextFile, reduce
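
The difference between the two kinds of operations can be illustrated with a small plain-Java analogy. It does not use Spark: an immutable List stands in for an RDD, and the class RddOperationsSketch, its methods, and the sample words are all hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java analogy only: List and Stream stand in for an RDD; none of this is Spark's API.
class RddOperationsSketch {

    // Hypothetical sample data standing in for the contents of an RDD.
    static final List<String> WORDS = List.of("spark", "rdd", "spark", "sql", "dataset");

    // Transformation-like (filter): returns a new list and leaves the input untouched.
    static List<String> filterStartsWith(List<String> input, String prefix) {
        return input.stream()
                .filter(word -> word.startsWith(prefix))
                .collect(Collectors.toUnmodifiableList());
    }

    // Transformation-like (distinct): returns a new list without duplicates.
    static List<String> distinct(List<String> input) {
        return input.stream().distinct().collect(Collectors.toUnmodifiableList());
    }

    // Action-like (count): computes a plain result instead of another dataset.
    static long count(List<String> input) {
        return input.size();
    }

    // Action-like (first): returns the first element.
    static String first(List<String> input) {
        return input.get(0);
    }

    public static void main(String[] args) {
        List<String> filtered = filterStartsWith(WORDS, "s");
        List<String> unique = distinct(filtered);
        System.out.println(unique + " count=" + count(unique) + " first=" + first(unique));
        System.out.println("original still has " + WORDS.size() + " elements");
    }
}
```

The transformation-like methods always hand back a new collection, so the original data is never changed. The action-like methods return an ordinary value.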

Pair RDD

A Pair RDD is a type of RDD that stores
key-value pairs

In this type of dataset, each row has a key
mapping to one value or multiple values
// a Tuple2 (scala.Tuple2) holds one key-value pair
Tuple2<Integer, String> tuple = new Tuple2<>(123, "value");
Integer key = tuple._1();
String value = tuple._2();
...
// tupleList is a List<Tuple2<Integer, String>>
JavaPairRDD<Integer, String> pairRDD = sparkContext.parallelizePairs(tupleList);
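
To show how one key can map to multiple values, here is a plain-Java sketch that groups pairs by key. It is an analogy, not Spark code: Map.Entry stands in for Tuple2, and the class PairSketch, the method groupByKey, and the sample pairs are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java analogy only: Map.Entry stands in for Tuple2; this is not Spark's API.
class PairSketch {

    // Hypothetical sample pairs; key 123 appears twice on purpose.
    static final List<Map.Entry<Integer, String>> PAIRS = List.of(
            Map.entry(123, "value"),
            Map.entry(456, "other"),
            Map.entry(123, "second"));

    // Collects all values that share a key into one list per key.
    static Map<Integer, List<String>> groupByKey(List<Map.Entry<Integer, String>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> grouped = groupByKey(PAIRS);
        System.out.println("values for key 123: " + grouped.get(123));
        System.out.println("values for key 456: " + grouped.get(456));
    }
}
```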

Lazy evaluation
// nothing happens when Spark sees the textFile() method call
JavaRDD<String> lines = sc.textFile("months.txt");
// nothing happens when Spark sees the filter() transformation
JavaRDD<String> linesWithApril = lines.filter(line -> line.startsWith("April"));
// Spark only starts loading months.txt when first() action is called
// Spark is smart enough to know it doesn’t need to go through the entire file
// it will scan the file only until it finds the first line starting with “April”
String firstLineWithApril = linesWithApril.first();
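
The same behaviour can be observed without a cluster. The sketch below is a plain-Java analogy using java.util.stream, which is also lazy. It is not Spark: the class LazyEvaluationSketch, the counter, and the in-memory list standing in for months.txt are assumptions made for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

// Plain-Java analogy only: java.util.stream is NOT Spark, but it is also lazy.
class LazyEvaluationSketch {

    // Counts how many lines the filter predicate actually inspects.
    static final AtomicInteger inspected = new AtomicInteger();

    // Hypothetical in-memory stand-in for the contents of months.txt.
    static final List<String> MONTHS = List.of(
            "January 31", "February 28", "March 31", "April 30", "May 31", "June 30");

    // Builds the pipeline without running it (comparable to a transformation).
    static Stream<String> linesWithApril() {
        return MONTHS.stream().filter(line -> {
            inspected.incrementAndGet();
            return line.startsWith("April");
        });
    }

    public static void main(String[] args) {
        Stream<String> pipeline = linesWithApril();
        System.out.println("inspected before terminal operation: " + inspected.get());

        // findFirst() is the terminal operation (comparable to an action).
        Optional<String> first = pipeline.findFirst();
        System.out.println("first match: " + first.orElse("none"));
        System.out.println("inspected after terminal operation: " + inspected.get());
    }
}
```

The counter is 0 until findFirst() runs, and 4 afterwards: the lines for May and June are never inspected because the search stops at the first match.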

Spark architecture in the cluster

Spark SQL

Apache Spark's module for working with
structured data

Structured data is any data that has a schema

Provides a dataset abstraction that simplifies
working with structured data

A Dataset is similar to a table in a relational
database

2021 Stack Overflow Developer Survey

Full dataset available at:
https://insights.stackoverflow.com/survey

Nearly 80,000 responses fielded from over 180
countries

There are six sections in this survey:
1. Basic Information
2. Education, Work, and Career
3. Technology and Tech Culture
4. Stack Overflow Usage + Community
5. Demographic Information
6. Final Questions

2021 Stack Overflow Developer Survey
Demo
Source code available at:
http://www.marco.tec.br

References

https://spark.apache.org/

https://aws.amazon.com/big-data/what-is-spark/

https://stackoverflow.blog/2021/08/30/the-full-data-set-for-the-2021-developer-survey-now-available/
