Big Data
transformations
powered by Apache Spark
Mohika Rastogi
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
 Punctuality
Join the session 5 minutes before the start time. We start on time and conclude on time!
 Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
 Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
 Avoid Disturbance
Avoid unwanted chit-chat during the session.
 Introduction to BIG DATA
 What is Big Data
 Challenges and Benefits Involving Big Data
 Apache Spark in the Big Data Industry
 Data Transformations
 What is Data Transformation and why do we need
to transform data?
 Why Spark?
 Spark Transformations
 RDD, DataFrame, Wide vs Narrow Transformations
 Examples of transformations
 Aggregate Functions
 Array Functions
 Spark Joins
 DEMO
Big Data Transformations Powered By Spark
What is Big Data?
 Big data is data that contains greater variety, arriving in increasing volumes and with more velocity. These three dimensions are known as the three Vs.
 Put simply, big data is larger, more complex data
sets, especially from new data sources. These data
sets are so voluminous that traditional data
processing software just can’t manage them. But
these massive volumes of data can be used to
address business problems you wouldn’t have
been able to tackle before.
Challenges With Big Data
 Scalability and storage bottlenecks
 Noise accumulation
 Fault tolerance
 Incidental endogeneity and measurement errors
 Data quality
 Storage
 Lack of data science professionals
 Validating data
 Big data technology is changing at a rapid pace. A few years ago, Apache Hadoop was the popular technology used to handle big data; then Apache Spark was introduced in 2014. Today, a combination of the two frameworks appears to be the best approach, and keeping up with big data technology is an ongoing challenge.
Big Data Benefits
 Big data makes it possible to gain more complete answers because you have more information. More complete answers mean more confidence in the data, which enables a completely different approach to tackling problems.
 Cost Savings: Optimizing processes based on big data insights can result in cost savings. This includes improvements in supply chain management, resource utilization, and more efficient business operations.
 Risk Management: Big data analytics helps organizations identify and mitigate risks by analyzing patterns and anomalies in data. This is particularly valuable in financial services, insurance, and other industries where risk management is crucial.
HOW DOES APACHE
SPARK IMPROVE
BUSINESS IN THE BIG
DATA INDUSTRY?
Powerful Data Processing
 Apache Spark is an ideal tool for companies that work on the Internet of Things. With its low-latency, in-memory data processing capability, it can efficiently handle a wide range of analytics problems. It contains well-designed libraries for graph analytics algorithms and machine learning.
Better Analytics
 Big data scientists use Apache Spark libraries to improve their analyses, querying, and data transformation, and to build complex workflows in a smooth and seamless way. Apache Spark is used for tasks such as analysis, interactive queries across large data sets, and more.
Real-Time Processing
 Apache Spark enables organizations to analyze data coming from IoT sensors by making it easy to process continuous, low-latency data streams. In this way, organizations can use real-time dashboards and data exploration to monitor and optimize their business.
Flexibility
 Apache Spark is highly compatible with a variety of programming languages and allows you to write applications in Python, Scala, Java, and more.
What is Data Transformation?
 Data transformation is defined as the technical
process of converting data from one format,
standard, or structure to another – without
changing the content of the datasets – typically to
prepare it for consumption by an app or a user or
to improve the data quality.
Why do we need to transform data?
 Data transformation is crucial for any organization that seeks to leverage its data for timely business insights. As the amount of data has grown, organizations need a reliable method for putting that data to good use in their operations.
 Data transformation is a key part of using this data: when performed effectively, it ensures that the information is accessible, consistent, secure, and ultimately accepted by the targeted business users.
Why Spark?
 In-memory computation
o Spark allows applications on Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster on disk.
 Lazy evaluation
 Support for SQL queries (see the sketch below)
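As a quick illustration of lazy evaluation and SQL support, here is a minimal Scala sketch; the SparkSession setup, the people.json input file, and the column names are illustrative assumptions rather than anything from the slides:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("WhySparkSketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical input file and columns, used only for illustration.
val df = spark.read.json("people.json")

// Transformations are lazy: nothing executes until an action is called.
val adults = df.filter(df("age") >= 18).select("name", "age")

// Spark also supports SQL queries over registered temporary views.
adults.createOrReplaceTempView("adults")
spark.sql("SELECT name FROM adults ORDER BY age DESC").show()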
SPARK
TRANSFORMATIONS
 RDD: A Resilient Distributed Dataset (RDD) is the fundamental data structure and the primary data abstraction of Apache Spark and Spark Core.
 DataFrame: A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
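A minimal sketch of the difference, assuming a local SparkSession; the sample names and ages are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RddVsDataFrame").master("local[*]").getOrCreate()
import spark.implicits._

// RDD: a low-level distributed collection of arbitrary objects.
val rdd = spark.sparkContext.parallelize(Seq(("Alice", 30), ("Bob", 25)))

// DataFrame: the same data organized into named columns, with Catalyst
// optimizations applied under the hood.
val df = rdd.toDF("name", "age")
df.printSchema()
df.show()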
Spark Transformations
Main Features:
 In-Memory Processing
 Immutability
 Fault Tolerance
 Lazy Evaluation
 Partitioning
 Parallelize
Wide vs Narrow Transformations
Spark transformations create a new Resilient Distributed Dataset (RDD) from an existing one, e.g. map, flatMap, groupByKey, filter, union. Narrow transformations (map, flatMap, filter, union) compute each output partition from a single input partition, so no shuffle is needed; wide transformations (groupByKey, reduceByKey, join) pull data from multiple input partitions and therefore shuffle data across the network, as the sketch below illustrates.
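A minimal RDD sketch of narrow vs wide transformations; the word data here is an illustrative assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NarrowVsWide").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("spark", "big", "data", "spark", "data"))

// Narrow transformations: each output partition depends on one input partition, no shuffle.
val pairs   = words.map(w => (w, 1))
val nonStop = pairs.filter { case (w, _) => w != "big" }

// Wide transformation: groupByKey shuffles data so all values for a key land in one partition.
val grouped = nonStop.groupByKey()

// Nothing has run yet (lazy evaluation); collect() is the action that triggers execution.
grouped.mapValues(_.sum).collect().foreach(println)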
Spark Transformations Examples
 GroupBy: groupBy() is a transformation operation in Spark that groups the data in a DataFrame or RDD based on one or more specified columns. It returns a grouped dataset (a RelationalGroupedDataset in Scala, GroupedData in PySpark) on which aggregation operations such as count(), sum(), avg(), etc. can then be performed.
 Map: map() is a transformation operation that applies a function to every element of an RDD, DataFrame, or Dataset and returns a new RDD/Dataset.
 Union: This DataFrame method combines two DataFrames of the same structure/schema. If the schemas are not the same, it returns an error.
 Intersect: Returns a new Dataset containing only the rows present in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.
 Where: where() filters rows from a DataFrame or Dataset based on the given condition or SQL expression.
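A minimal DataFrame sketch of where, groupBy, and union; the department/salary data is an illustrative assumption:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("TransformationExamples").master("local[*]").getOrCreate()
import spark.implicits._

val df1 = Seq(("Sales", 3000), ("IT", 4000), ("Sales", 4600)).toDF("dept", "salary")
val df2 = Seq(("HR", 3500)).toDF("dept", "salary")

// where: filter rows by a condition or SQL expression.
val highPaid = df1.where(col("salary") > 3500)

// groupBy: returns a grouped dataset on which aggregations such as avg() run.
val avgByDept = df1.groupBy("dept").avg("salary")

// union: combine two DataFrames that share the same schema.
val combined = df1.union(df2)

highPaid.show()
avgByDept.show()
combined.show()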
Aggregate Functions
 Spark provides built-in standard aggregate functions defined in the DataFrame API; these come in handy when we need to perform aggregate operations on DataFrame columns. Aggregate functions operate on a group of rows and calculate a single return value for each group. All of these functions accept a Column type or a column name as a string, plus several other arguments depending on the function, and return a Column type.
 For example:
val avg_df = df.select(avg("salary"))
 df.select(first("salary").as("Top Salary")).show(false)
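Expanding the slide's example into a runnable sketch, with made-up employee data and an added per-group aggregation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("AggregateFunctions").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative data; the names and salaries are assumptions.
val df = Seq(
  ("James",  "Sales", 3000),
  ("Anna",   "Sales", 4600),
  ("Robert", "IT",    4100)
).toDF("name", "dept", "salary")

// Whole-DataFrame aggregates, as on the slide.
val avg_df = df.select(avg("salary"))
df.select(first("salary").as("Top Salary")).show(false)

// Per-group aggregates: one result row per department.
df.groupBy("dept")
  .agg(count("*").as("employees"),
       sum("salary").as("total_salary"),
       max("salary").as("max_salary"))
  .show(false)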
Array Functions
 Spark SQL provides built-in standard array functions defined in the DataFrame API; these come in handy when we need to operate on array (ArrayType) columns.
 All of these accept an array column as input, plus several other arguments depending on the function.
 For example:
inputdf.withColumn("result", array_contains(col("array_col2"), 3))
 inputdf.withColumn("result", array_max(col("array_col2"))) // array_max takes only the array column
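A runnable version of these array-function examples, with a made-up ArrayType column named after the slide's array_col2:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("ArrayFunctions").master("local[*]").getOrCreate()
import spark.implicits._

val inputdf = Seq(
  ("a", Seq(1, 2, 3)),
  ("b", Seq(4, 5, 6))
).toDF("id", "array_col2")

inputdf
  .withColumn("has_three", array_contains(col("array_col2"), 3))  // does the array contain 3?
  .withColumn("max_value", array_max(col("array_col2")))          // maximum element of the array
  .show(false)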
Spark Joins
Inner Join
The inner join is Spark's default join and the most commonly used. It joins two DataFrames/Datasets on key columns; rows whose keys don't match are dropped from both datasets.
Left Join
A left (a.k.a. left outer) join returns all rows from the left DataFrame/Dataset regardless of whether a match is found on the right dataset. Where the join expression doesn't match, it assigns null for the right-side columns and drops records from the right that have no match.
Full Outer Join
An outer (a.k.a. full or fullouter) join returns all rows from both DataFrames/Datasets; where the join expression doesn't match, it returns null in the respective record's columns.
Right Outer Join
A right (a.k.a. right outer) join is the opposite of the left join: it returns all rows from the right DataFrame/Dataset regardless of whether a match is found on the left dataset. Where the join expression doesn't match, it assigns null for the left-side columns and drops records from the left that have no match.
Spark DataFrames support all the basic SQL join types: INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS, and SELF JOIN. Spark SQL joins are wide transformations that shuffle data over the network, so they can cause serious performance issues when not designed with care. A minimal sketch of the four basic join types follows.
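The empDF/deptDF contents below are illustrative assumptions; only the join key columns mirror the slides:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkJoins").master("local[*]").getOrCreate()
import spark.implicits._

val empDF  = Seq((1, "Smith", 10), (2, "Rose", 20), (3, "Jones", 50)).toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq((10, "Finance"), (20, "Marketing"), (30, "Sales")).toDF("dept_id", "dept_name")

val joinExpr = empDF("emp_dept_id") === deptDF("dept_id")

empDF.join(deptDF, joinExpr, "inner").show(false)  // non-matching rows dropped from both sides
empDF.join(deptDF, joinExpr, "left").show(false)   // all left rows; nulls where the right has no match
empDF.join(deptDF, joinExpr, "right").show(false)  // all right rows; nulls where the left has no match
empDF.join(deptDF, joinExpr, "full").show(false)   // all rows from both sides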
Joins Continued ...
 Left Semi Join :
A left semi join is similar to an inner join, except that it returns only the columns from the left DataFrame/Dataset and ignores all columns from the right dataset. In other words, this join returns left-dataset columns only for records that match the right dataset on the join expression; records that don't match are ignored from both datasets.
** The same result can be achieved by selecting from the result of an inner join; however, using this join is more efficient.
 Left Anti Join :
A left anti join does the exact opposite of the left semi join: it returns only the columns from the left DataFrame/Dataset for non-matched records.
 Self Join :
Spark joins are not complete without a self join. Although there is no dedicated self-join type, we can use any of the above join types to join a DataFrame to itself; the sketch after this list uses an inner self join.
 Cross Join :
Returns the Cartesian product of both DataFrames, resulting in all possible combinations of rows: crossJoin(right: Dataset[_]). If you don't specify any condition/joinExpr, it performs a cross join by default, or you can use the crossJoin method explicitly.
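A sketch of the semi, anti, self, and cross joins on the same hypothetical data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("MoreJoins").master("local[*]").getOrCreate()
import spark.implicits._

val empDF  = Seq((1, "Smith", 10), (2, "Rose", 20), (3, "Jones", 50), (4, "Brown", 10)).toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq((10, "Finance"), (20, "Marketing")).toDF("dept_id", "dept_name")
val joinExpr = empDF("emp_dept_id") === deptDF("dept_id")

// Left semi: left-side columns only, only rows that have a match on the right.
empDF.join(deptDF, joinExpr, "leftsemi").show(false)

// Left anti: left-side columns only, only rows with NO match on the right.
empDF.join(deptDF, joinExpr, "leftanti").show(false)

// Self join: join the DataFrame to itself, here pairing employees in the same department.
val e1 = empDF.alias("e1")
val e2 = empDF.alias("e2")
e1.join(e2, col("e1.emp_dept_id") === col("e2.emp_dept_id") && col("e1.emp_id") =!= col("e2.emp_id"), "inner")
  .select(col("e1.name"), col("e2.name").as("colleague"))
  .show(false)

// Cross join: Cartesian product of both DataFrames.
empDF.crossJoin(deptDF).show(false)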
Joins Continued ...
 Join With :
The joinWith method in Spark also performs a join between two DataFrames based on a specified join condition. However, it returns a Dataset of tuples representing the joined rows from both DataFrames: each tuple pairs the row from the left DataFrame with the row from the right DataFrame that satisfies the join condition.
-> The key distinction is that joinWith produces a more structured output as a Dataset of tuples, while join returns a DataFrame with merged columns from both DataFrames.
-> In most cases, join is used when you want to combine rows from two DataFrames based on a join condition and you are interested in the merged columns in the output DataFrame. joinWith, on the other hand, is useful when you want to keep the original structure of the data and work with tuples representing the joined rows.
For eg :- empDF.joinWith(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner").show(false)
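A runnable sketch contrasting join and joinWith, reusing the slide's join expression with hypothetical data:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JoinWith").master("local[*]").getOrCreate()
import spark.implicits._

val empDF  = Seq((1, "Smith", 10), (2, "Rose", 20)).toDF("emp_id", "name", "emp_dept_id")
val deptDF = Seq((10, "Finance"), (20, "Marketing")).toDF("dept_id", "dept_name")

// join: a single DataFrame with merged columns from both sides.
empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner").show(false)

// joinWith: a Dataset of (left row, right row) tuples, preserving each side's structure.
val resultDS = empDF.joinWith(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "inner")
resultDS.show(false)
resultDS.printSchema()  // two struct columns, _1 and _2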