Getting Started
with
Apache Spark
Presented By
Manish Mishra
Pradyuman Pratap Singh
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
• Punctuality
  Join the session 5 minutes before the start time. We start on time and conclude on time!
• Feedback
  Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
• Silent Mode
  Keep your mobile devices in silent mode. Feel free to step out of the session if you need to attend an urgent call.
• Avoid Disturbance
  Avoid unwanted chit-chat during the session.
1. Introduction to Big Data and Apache Spark
   • What is Big Data?
   • What is Apache Spark?
   • Features of Apache Spark
2. Overview of Spark Architecture
3. Spark Components
4. Spark Basics & Programming Model
   • Spark Context
   • Spark Session
   • RDD
   • DataFrame
   • RDD vs. DataFrame
5. Advantages of Apache Spark
6. Disadvantages of Apache Spark
7. Demo
Getting Started with Apache Spark (Scala)
What is Big Data?
Big Data refers to very large and complex data sets that grow too big and arrive too fast for traditional data-processing systems to handle. It includes a wide variety of data types from many sources.
It is characterized by the 5 Vs:
• Volume: the massive amount of data.
• Velocity: the speed at which data is generated and processed.
• Variety: the different types of data (structured, semi-structured, unstructured).
• Veracity: the quality and accuracy of the data.
• Value: the value the data provides.
What is Apache Spark?
• Apache Spark is an open-source analytics engine for large-scale distributed data processing and machine learning applications. It handles both batch and near-real-time (streaming) workloads.
• It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to support more types of computation efficiently, including interactive queries and stream processing. Spark can run on a Hadoop cluster but does not require Hadoop.
• The main feature of Spark is its in-memory computing, which increases the processing speed of an application by keeping intermediate data in memory instead of writing it to disk between steps.
Features of Apache Spark
• In-Memory Computation
• Speed
• Different Cluster Managers
• Distributed Processing
• Fault Tolerant
• Lazy Evaluation
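The lazy evaluation feature listed above can be illustrated with a small sketch. It assumes a Spark 3.x `spark-sql` dependency on the classpath and a local master; the object name `LazyEvaluationSketch` and the accumulator name `mapCalls` are made up for illustration and do not come from the slides. The accumulator counts how often the `map` function runs, which should stay at zero until an action is called.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the original slides.
object LazyEvaluationSketch {

  // Returns (map calls before the action, map calls after the action, sum of squares).
  def run(spark: SparkSession): (Long, Long, Long) = {
    val sc = spark.sparkContext
    val mapCalls = sc.longAccumulator("mapCalls")

    // Transformation: only recorded in the lineage, nothing runs yet.
    val squares = sc.parallelize(1 to 5).map { n =>
      mapCalls.add(1)
      n.toLong * n
    }
    val before: Long = mapCalls.value

    // Action: triggers the job that actually evaluates the map function.
    val total = squares.reduce(_ + _)
    val after: Long = mapCalls.value

    (before, after, total)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-evaluation-sketch")
      .master("local[*]") // assumption: run locally, without a cluster manager
      .getOrCreate()
    try println(run(spark))
    finally spark.stop()
  }
}
```

In a local run without task retries, the counter is expected to read 0 before the `reduce` action and 5 after it. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this is a demonstration device rather than a reliable counting technique.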
02
Apache Spark Architecture
03
Spark Components
• Spark Engine: Spark Core
• Libraries:
  o Spark SQL
  o Spark Streaming (real time)
  o MLlib (machine learning)
  o GraphX (graph processing)
• Supported Languages: Scala, Java, Python, R
04
Spark Basics
1. Spark Context: SparkContext is the primary entry point to Spark functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext is created there. The driver program then schedules the operations to run inside executors on the worker nodes.
2. Spark Session: SparkSession is the unified entry point for Spark applications, introduced in Spark 2.0. It gives access to all of Spark's underlying functionality, including RDDs (through its SparkContext), DataFrames, and Datasets, through a single interface for structured data processing.
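As a minimal sketch of the two entry points described above, the following creates a SparkSession and reads the SparkContext from it. It assumes a Spark 2.0+ `spark-sql` dependency on the classpath; the object name `EntryPointSketch`, the application name, and the `local[*]` master are illustrative choices, not something the slides prescribe.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the original slides.
object EntryPointSketch {

  // Builds (or reuses) the session; this code runs in the driver program.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .appName("entry-point-sketch")
      .master("local[*]") // assumption: local mode, executors run as threads in this JVM
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()

    // The SparkContext is still there; since Spark 2.0 it is reached through the session.
    val sc: SparkContext = spark.sparkContext
    println(s"app name: ${sc.appName}, master: ${sc.master}")

    spark.stop()
  }
}
```

On a real cluster the master would normally be supplied by `spark-submit` rather than hard-coded in the application.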
RDD
• A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
• There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDD Operations:
o Transformations: lazily define a new RDD from an existing one (for example, map or filter).
o Actions: trigger the computation and return a result to the driver (for example, count or collect).
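The RDD creation routes and the two kinds of operations can be sketched as follows. This is an illustration under assumptions: a Spark `spark-sql` dependency on the classpath, a local master, and made-up names (`RddSketch`, `evenSquares`). The external-storage route is shown only as a comment because the path in it is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the original slides.
object RddSketch {

  // Parallelizes a driver-side collection, then applies transformations and an action.
  def evenSquares(spark: SparkSession): Array[Int] = {
    val sc = spark.sparkContext

    // Way 1: parallelize an existing collection into 3 logical partitions.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), numSlices = 3)

    // Way 2 (not executed here): reference external storage.
    // The path is a placeholder, not a real dataset:
    // val lines = sc.textFile("hdfs://namenode/path/to/input.txt")

    // Transformations: build new RDDs lazily; the original RDD is never modified.
    val squaresOfEvens = numbers.filter(_ % 2 == 0).map(n => n * n)

    // Action: runs the job and brings the result back to the driver.
    squaresOfEvens.collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]") // assumption: local mode
      .getOrCreate()
    try println(evenSquares(spark).mkString(", "))
    finally spark.stop()
  }
}
```

`collect()` is convenient for small results like this one; for large datasets an action such as `count()` or a write to storage avoids pulling everything into the driver's memory.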
DataFrame
• In Spark, a DataFrame is a distributed collection of data organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames are like traditional database tables, which are structured and concise.
• Conceptually, a DataFrame is equivalent to a table in a relational database, but with richer optimization techniques applied under the hood.
• Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs. DataFrames allow the processing of huge amounts of data.
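A short sketch of the named, typed columns described above, assuming a Spark `spark-sql` dependency on the classpath and a local master. The `Person` case class, the sample rows, and the object name `DataFrameSketch` are invented for illustration. Here the DataFrame is built from a local collection; Hive tables or external databases would go through `spark.read` or `spark.table` instead.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of the original slides.
object DataFrameSketch {

  // The case class fields become the column names and types of the DataFrame.
  final case class Person(name: String, age: Int)

  // Returns the names of all people aged 18 or older.
  def adults(spark: SparkSession): Array[String] = {
    import spark.implicits._ // brings toDF, the $"col" syntax, and encoders into scope

    val people = Seq(Person("Asha", 34), Person("Ravi", 17), Person("Meera", 25)).toDF()

    // Prints the column names and types Spark derived from the case class.
    people.printSchema()

    people
      .filter($"age" >= 18)
      .select("name")
      .as[String]
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]") // assumption: local mode
      .getOrCreate()
    try println(adults(spark).mkString(", "))
    finally spark.stop()
  }
}
```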
RDD vs. DataFrame
Feature | RDD | DataFrame
Data format | Structured and unstructured | Structured and semi-structured
APIs | Provides a low-level API that requires more code to perform transformations and actions on data. | Provides a high-level API that makes it easier to perform transformations and actions on data.
Schema enforcement | Does not have an explicit schema and is often used for unstructured data. | Has an explicit schema that describes the data and its types; the schema is enforced at runtime.
Optimization | No built-in optimization engine is available. | Uses the Catalyst optimizer.
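To make the comparison concrete, here is the same per-category total computed once with the RDD API and once with the DataFrame API. This is a sketch under assumptions: a Spark `spark-sql` dependency on the classpath, a local master, and invented sample data and names (`RddVsDataFrameSketch`, `sales`). The `explain()` call prints the physical plan that Spark produced for the DataFrame query; the RDD version has no equivalent optimizer step.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Hypothetical example object; not part of the original slides.
object RddVsDataFrameSketch {

  // Made-up sample data: (category, amount).
  val sales: Seq[(String, Double)] =
    Seq(("books", 12.0), ("games", 30.0), ("books", 8.0))

  // Low-level API: we spell out how to combine values per key.
  def totalsWithRdd(spark: SparkSession): Map[String, Double] =
    spark.sparkContext
      .parallelize(sales)
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // High-level API: we declare what we want; Catalyst plans the execution.
  def totalsWithDataFrame(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._

    val totals = sales
      .toDF("category", "amount")
      .groupBy("category")
      .agg(sum("amount").as("total"))

    totals.explain() // prints the physical plan for this query

    totals.as[(String, Double)].collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataframe-sketch")
      .master("local[*]") // assumption: local mode
      .getOrCreate()
    try {
      println(totalsWithRdd(spark))
      println(totalsWithDataFrame(spark))
    } finally spark.stop()
  }
}
```

Both functions should return the same totals; the difference lies in how much of the execution strategy the caller has to write by hand.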
05
Advantages of Apache Spark
• In-Memory Computation
• Speed
• Ease of Use
• Advanced Analytics
• Fault Tolerant
• Multi-Language Support
06
Disadvantages of Apache Spark
• Small Files Issue
• No File Management System of Its Own
• No Automatic Optimization Process
• Fewer Algorithms
07
Demo