Big Data
transformations
powered by Apache Spark
Mohika Rastogi
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
 Punctuality
Join the session 5 minutes before the start time. We start on time and conclude on time!
 Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
 Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
 Avoid Disturbance
Avoid unwanted chit-chat during the session.
 Introduction to BIG DATA
 What is Big Data
 Challenges and Benefits Involving Big Data
 Apache Spark in the Big Data Industry
 Data Transformations
 What is Data Transformation and why do we need
to transform data?
 Why Spark?
 Spark Transformations
 RDD, DataFrame, Wide vs Narrow Transformations
 Examples of transformations
 Aggregate Functions
 Array Functions
 Spark Joins
 DEMO
Big Data Transformations Powered By Spark
What is Big Data?
 Big data is data that contains greater variety, arriving in increasing volumes and with more velocity. These three dimensions are known as the three Vs.
 Put simply, big data is larger, more complex data
sets, especially from new data sources. These data
sets are so voluminous that traditional data
processing software just can’t manage them. But
these massive volumes of data can be used to
address business problems you wouldn’t have
been able to tackle before.
Challenges With Big Data
 Scalability and storage bottlenecks
 Noise accumulation
 Fault tolerance
 Incidental endogeneity and measurement errors
 Data quality
 Storage
 Lack of data science professionals
 Validating data
 Big data technology is changing at a rapid pace. A few years ago, Apache Hadoop was the popular technology used to handle big data; then Apache Spark was introduced in 2014. Today, a combination of the two frameworks appears to be the best approach, and keeping up with big data technology is an ongoing challenge.
Big Data Benefits
 Big data makes it possible to gain more complete answers because you have more information. More complete answers mean more confidence in the data, which enables a completely different approach to tackling problems.
 Cost Savings: Optimizing processes based on big data insights can result in cost savings. This includes improvements in supply chain management, resource utilization, and more efficient business operations.
 Risk Management: Big data analytics helps organizations identify and mitigate risks by analyzing patterns and anomalies in data. This is particularly valuable in financial services, insurance, and other industries where risk management is crucial.
HOW DOES APACHE
SPARK IMPROVE
BUSINESS IN THE BIG
DATA INDUSTRY?
Powerful Data Processing
 Apache Spark is an ideal tool for companies that work on the Internet of Things. With its low-latency, in-memory data processing capability, it can efficiently handle a wide range of analytics problems. It contains well-designed libraries for graph analytics algorithms and machine learning.
Better Analytics
 Big data scientists use Apache Spark libraries to improve their analyses, querying, and data transformation, and to build complex workflows in a smooth and seamless way. Apache Spark is used for tasks such as analysis, interactive queries across large data sets, and more.
Real-Time Processing
 Apache Spark enables organizations to analyze data coming from IoT sensors by making it easy to process continuous, low-latency data streams. In this way, organizations can use real-time dashboards and data exploration to monitor and optimize their business.
Flexibility
 Apache Spark is highly compatible with a variety of programming languages and allows you to write applications in Python, Scala, Java, and more.
What is Data Transformation?
 Data transformation is defined as the technical
process of converting data from one format,
standard, or structure to another – without
changing the content of the datasets – typically to
prepare it for consumption by an app or a user or
to improve the data quality.
Why do we need to transform data?
 Data transformation is crucial for any organization that seeks to leverage its data for timely business insights. As the amount of data has grown, organizations need a reliable method for putting that data to good use in their operations.
 Data transformation is a key part of using this data: when performed effectively, it ensures that the information is accessible, consistent, secure, and ultimately accepted by the targeted business users.
Why Spark?
 In-memory computation
o Spark allows applications on Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster on disk.
 Lazy evaluation
 Support for SQL queries (see the sketch below)
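As a quick illustration of lazy evaluation and SQL support, here is a minimal Scala sketch; the SparkSession setup, the people.json input file, and the column names are illustrative assumptions rather than anything from the slides:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("WhySparkSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical input file and columns, used only for illustration.
val df = spark.read.json("people.json")

// Transformations are lazy: nothing executes until an action is called.
val adults = df.filter(df("age") >= 18).select("name", "age")

// Spark also supports SQL queries over registered temporary views.
adults.createOrReplaceTempView("adults")
spark.sql("SELECT name FROM adults ORDER BY age DESC").show()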
SPARK
TRANSFORMATIONS
 RDD: A Resilient Distributed Dataset (RDD) is the fundamental data structure and the primary data abstraction of Apache Spark and Spark Core.
 DataFrame: A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
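A minimal sketch of the difference, assuming a local SparkSession; the sample names and ages are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddVsDataFrame").master("local[*]").getOrCreate()
import spark.implicits._

// RDD: a low-level distributed collection of arbitrary objects.
val rdd = spark.sparkContext.parallelize(Seq(("Alice", 30), ("Bob", 25)))

// DataFrame: the same data organized into named columns, with Catalyst
// optimizations applied under the hood.
val df = rdd.toDF("name", "age")
df.printSchema()
df.show()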
Spark Transformations
Main Features:
 In-Memory Processing
 Immutability
 Fault Tolerance
 Lazy Evaluation
 Partitioning
 Parallelize
Wide vs Narrow Transformations
Spark transformations create a new Resilient Distributed Dataset (RDD) from an existing one, e.g. map, flatMap, groupByKey, filter, union. Narrow transformations (map, flatMap, filter, union) compute each output partition from a single input partition, so no shuffle is needed; wide transformations (groupByKey, reduceByKey, join) pull data from multiple input partitions and therefore shuffle data across the network, as the sketch below illustrates.
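A minimal RDD sketch of narrow vs wide transformations; the word data here is an illustrative assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NarrowVsWide").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("spark", "big", "data", "spark", "data"))

// Narrow transformations: each output partition depends on one input partition, no shuffle.
val pairs   = words.map(w => (w, 1))
val nonStop = pairs.filter { case (w, _) => w != "big" }

// Wide transformation: groupByKey shuffles data so all values for a key land in one partition.
val grouped = nonStop.groupByKey()

// Nothing has run yet (lazy evaluation); collect() is the action that triggers execution.
grouped.mapValues(_.sum).collect().foreach(println)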
Spark Transformations Examples
 GroupBy: groupBy() is a transformation operation in Spark that groups the data in a DataFrame or RDD based on one or more specified columns. It returns a grouped dataset (a RelationalGroupedDataset in Scala, GroupedData in PySpark) on which aggregation operations such as count(), sum(), avg(), etc. can then be performed.
 Map: map() is a transformation operation that applies a function to every element of an RDD, DataFrame, or Dataset and returns a new RDD/Dataset.
 Union: This DataFrame method combines two DataFrames of the same structure/schema. If the schemas are not the same, it returns an error.
 Intersect: Returns a new Dataset containing only the rows present in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.
 Where: where() filters rows from a DataFrame or Dataset based on the given condition or SQL expression.
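A minimal DataFrame sketch of where, groupBy, and union; the department/salary data is an illustrative assumption:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("TransformationExamples").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("Sales", 3000), ("IT", 4000), ("Sales", 4600)).toDF("dept", "salary")
val df2 = Seq(("HR", 3500)).toDF("dept", "salary")

// where: filter rows by a condition or SQL expression.
val highPaid = df1.where(col("salary") > 3500)

// groupBy: returns a grouped dataset on which aggregations such as avg() run.
val avgByDept = df1.groupBy("dept").avg("salary")

// union: combine two DataFrames that share the same schema.
val combined = df1.union(df2)

highPaid.show()
avgByDept.show()
combined.show()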
Aggregate Functions
 Spark provides built-in standard aggregate functions defined in the DataFrame API; these come in handy when we need to perform aggregate operations on DataFrame columns. Aggregate functions operate on a group of rows and calculate a single return value for each group. All of these functions accept a Column type or a column name as a string, plus several other arguments depending on the function, and return a Column type.
 For example:
val avg_df = df.select(avg("salary"))
 df.select(first("salary").as("Top Salary")).show(false)
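Expanding the slide's example into a runnable sketch, with made-up employee data and an added per-group aggregation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("AggregateFunctions").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data; the names and salaries are assumptions.
val df = Seq(
  ("James",  "Sales", 3000),
  ("Anna",   "Sales", 4600),
  ("Robert", "IT",    4100)
).toDF("name", "dept", "salary")

// Whole-DataFrame aggregates, as on the slide.
val avg_df = df.select(avg("salary"))
df.select(first("salary").as("Top Salary")).show(false)

// Per-group aggregates: one result row per department.
df.groupBy("dept")
  .agg(count("*").as("employees"),
       sum("salary").as("total_salary"),
       max("salary").as("max_salary"))
  .show(false)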
Array Functions
 Spark SQL provides built-in standard array functions defined in the DataFrame API; these come in handy when we need to operate on array (ArrayType) columns.
 All of these accept an array column as input, plus several other arguments depending on the function.
 For example:
inputdf.withColumn("result", array_contains(col("array_col2"), 3))
 inputdf.withColumn("result", array_max(col("array_col2"))) // array_max takes only the array column
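A runnable version of these array-function examples, with a made-up ArrayType column named after the slide's array_col2:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("ArrayFunctions").master("local[*]").getOrCreate()
import spark.implicits._

val inputdf = Seq(
  ("a", Seq(1, 2, 3)),
  ("b", Seq(4, 5, 6))
).toDF("id", "array_col2")

inputdf
  .withColumn("has_three", array_contains(col("array_col2"), 3))  // does the array contain 3?
  .withColumn("max_value", array_max(col("array_col2")))          // maximum element of the array
  .show(false)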
Spark Joins
Inner Join
The inner join is Spark's default join and the most commonly used. It joins two DataFrames/Datasets on key columns; rows whose keys don't match are dropped from both datasets.
Left Join
A left (a.k.a. left outer) join returns all rows from the left DataFrame/Dataset regardless of whether a match is found on the right dataset. Where the join expression doesn't match, it assigns null for the right-side columns and drops records from the right that have no match.
Full Outer Join
An outer (a.k.a. full or fullouter) join returns all rows from both DataFrames/Datasets; where the join expression doesn't match, it returns null in the respective record's columns.
Right Outer Join
A right (a.k.a. right outer) join is the opposite of the left join: it returns all rows from the right DataFrame/Dataset regardless of whether a match is found on the left dataset. Where the join expression doesn't match, it assigns null for the left-side columns and drops records from the left that have no match.
Spark DataFrames support all the basic SQL join types: INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN. Spark SQL joins are wide transformations that shuffle data over the network, so they can cause serious performance issues when not designed with care. A minimal sketch of the four basic join types follows.
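The empDF/deptDF contents below are illustrative assumptions; only the join key columns mirror the slides:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkJoins").master("local[*]").getOrCreate()
import spark.implicits._

val empDF  = Seq((1, "Smith", 10), (2, "Rose", 20), (3, "Jones", 50)).toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq((10, "Finance"), (20, "Marketing"), (30, "Sales")).toDF("dept_id", "dept_name")

val joinExpr = empDF("emp_dept_id") === deptDF("dept_id")

empDF.join(deptDF, joinExpr, "inner").show(false)  // non-matching rows dropped from both sides
empDF.join(deptDF, joinExpr, "left").show(false)   // all left rows; nulls where the right has no match
empDF.join(deptDF, joinExpr, "right").show(false)  // all right rows; nulls where the left has no match
empDF.join(deptDF, joinExpr, "full").show(false)   // all rows from both sides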
Joins Continued ...
 Left Semi Join :
A left semi join is similar to an inner join, except that it returns only the columns from the left DataFrame/Dataset and ignores all columns from the right dataset. In other words, this join returns left-dataset columns only for records that match the right dataset on the join expression; records that don't match are ignored from both datasets.
** The same result can be achieved by selecting from the result of an inner join; however, using this join is more efficient.
 Left Anti Join :
A left anti join does the exact opposite of the left semi join: it returns only the columns from the left DataFrame/Dataset for non-matched records.
 Self Join :
Spark joins are not complete without a self join. Although there is no dedicated self-join type, we can use any of the above join types to join a DataFrame to itself; the sketch after this list uses an inner self join.
 Cross Join :
Returns the Cartesian product of both DataFrames, resulting in all possible combinations of rows: crossJoin(right: Dataset[_]). If you don't specify any condition/joinExpr, it performs a cross join by default, or you can use the crossJoin method explicitly.
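A sketch of the semi, anti, self, and cross joins on the same hypothetical data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("MoreJoins").master("local[*]").getOrCreate()
import spark.implicits._

val empDF  = Seq((1, "Smith", 10), (2, "Rose", 20), (3, "Jones", 50), (4, "Brown", 10)).toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq((10, "Finance"), (20, "Marketing")).toDF("dept_id", "dept_name")
val joinExpr = empDF("emp_dept_id") === deptDF("dept_id")

// Left semi: left-side columns only, only rows that have a match on the right.
empDF.join(deptDF, joinExpr, "leftsemi").show(false)

// Left anti: left-side columns only, only rows with NO match on the right.
empDF.join(deptDF, joinExpr, "leftanti").show(false)

// Self join: join the DataFrame to itself, here pairing employees in the same department.
val e1 = empDF.alias("e1")
val e2 = empDF.alias("e2")
e1.join(e2, col("e1.emp_dept_id") === col("e2.emp_dept_id") && col("e1.emp_id") =!= col("e2.emp_id"), "inner")
  .select(col("e1.name"), col("e2.name").as("colleague"))
  .show(false)

// Cross join: Cartesian product of both DataFrames.
empDF.crossJoin(deptDF).show(false)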
Joins Continued ...
 Join With :
The joinWith method in Spark also performs a join between two DataFrames based on a specified join condition. However, it returns a Dataset of tuples representing the joined rows from both DataFrames: each tuple pairs the row from the left DataFrame with the row from the right DataFrame that satisfies the join condition.
-> The key distinction is that joinWith produces a more structured output as a Dataset of tuples, while join returns a DataFrame with merged columns from both DataFrames.
-> In most cases, join is used when you want to combine rows from two DataFrames based on a join condition and you are interested in the merged columns in the output DataFrame. joinWith, on the other hand, is useful when you want to keep the original structure of the data and work with tuples representing the joined rows.
For eg :- empDF.joinWith(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner").show(false)
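A runnable sketch contrasting join and joinWith, reusing the slide's join expression with hypothetical data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinWith").master("local[*]").getOrCreate()
import spark.implicits._

val empDF  = Seq((1, "Smith", 10), (2, "Rose", 20)).toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq((10, "Finance"), (20, "Marketing")).toDF("dept_id", "dept_name")

// join: a single DataFrame with merged columns from both sides.
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner").show(false)

// joinWith: a Dataset of (left row, right row) tuples, preserving each side's structure.
val resultDS = empDF.joinWith(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
resultDS.show(false)
resultDS.printSchema()  // two struct columns, _1 and _2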