View Apache Spark and Scala
course details at www.edureka.co/apache-spark-scala-training
Apache Spark | Spark SQL
Objectives
At the end of this module, you will be able to:
• Explain what Spark is
• Describe the Spark architecture
• Explain what an RDD is
• Create an RDD and run a sample example (demo)
• Describe Spark SQL
What is Spark?
Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics.
• Developed at UC Berkeley
• Written in Scala, a functional programming language that runs on the JVM
• Generalizes the MapReduce framework
Why Spark?
• Speed: run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Ease of use: supports different languages for developing applications using Spark
• Generality: combine SQL, streaming, and complex analytics into one platform
• Runs everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud
MapReduce Limitations
• MapReduce is a great solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms (machine learning, etc.)
• To run complicated jobs, you have to string together a series of MapReduce jobs and execute them in sequence
• Each of those jobs is high-latency, and none can start until the previous job has finished completely
• The job output data between each step has to be stored in the distributed file system before the next step can begin
• Hadoop requires the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing)
Spark Features
• Spark takes MapReduce to the next level with less expensive shuffles during data processing and capabilities such as in-memory data storage
• Spark has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computing
• It is designed to be an execution engine that works both in memory and on disk
• Lazy evaluation of big data queries helps optimize the overall data processing workflow
• Provides concise and consistent APIs in Scala, Java and Python
• Offers an interactive shell for Scala and Python; this is not yet available for Java
• Supports high-level APIs for developing applications (Scala, Java, Python, Clojure, R)
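The lazy evaluation point in the feature list can be illustrated with a minimal sketch. This is plain Python using generators, not Spark code; the helper names `lazy_map`, `lazy_filter` and `take` are hypothetical and only mimic the idea that describing a pipeline does no work until a result is requested.

```python
# Toy illustration of lazy evaluation using plain Python generators.
# This is NOT Spark code; it only mimics the idea that work is deferred
# until a result is actually requested.

calls = []  # records every element the map step really processes


def lazy_map(func, items):
    """Describe a map step; nothing runs until the result is consumed."""
    for item in items:
        calls.append(item)
        yield func(item)


def lazy_filter(pred, items):
    """Describe a filter step; also deferred."""
    for item in items:
        if pred(item):
            yield item


def take(n, items):
    """Action-like step: pulls only as many elements as it needs."""
    out = []
    for item in items:
        out.append(item)
        if len(out) == n:
            break
    return out


numbers = range(1, 1_000_001)
pipeline = lazy_filter(lambda x: x % 3 == 0, lazy_map(lambda x: x * x, numbers))

work_before = len(calls)       # 0: building the pipeline processed nothing
first_two = take(2, pipeline)  # triggers just enough work to find two results
work_after = len(calls)        # only the first few inputs were touched
```

Although the input range has a million elements, only the handful needed to produce two results are ever squared, which is the kind of saving a lazily evaluated workflow makes possible.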
Spark Architecture
• Libraries: Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, SparkR
• Engine: Spark Core
• Cluster management: native Spark cluster, YARN, Mesos
• Distributed storage: HDFS, Cassandra, S3, HBase
Spark Advantages
• Ease of development: easier APIs; Python, Scala, Java
• In-memory performance: RDDs; DAGs unify processing
• Combine workflows: Shark, ML, Streaming, GraphX
Hadoop Advantages
• Unlimited scale: multiple data sources, multiple applications, multiple users
• Enterprise platform: reliability, multi-tenancy, security
• Wide range of applications: files, databases, semi-structured data
Spark + Hadoop
• From Hadoop: unlimited scale, wide range of applications, enterprise platform
• From Spark: ease of development, combined workflows, in-memory performance
• Result: operational applications augmented by in-memory performance
Resilient Distributed Datasets (RDDs)
• Resilient: if data in memory is lost, it can be recreated
• Distributed: stored in memory across the cluster
• Dataset: initial data can come from a file or be created programmatically
RDDs are the fundamental unit of data in Spark.
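To make "resilient" concrete, here is a minimal sketch in plain Python, assuming a dataset that remembers its source partitions and the function used to derive its data. `ToyDataset`, `partition` and `lose` are hypothetical names for illustration, not Spark APIs.

```python
# Toy model of "resilient": a lost in-memory partition is rebuilt from
# its source data and the recorded transformation.
# ToyDataset and its methods are hypothetical names, not Spark APIs.


class ToyDataset:
    def __init__(self, source_partitions, transform):
        self.source = source_partitions  # stable input, e.g. blocks of a file
        self.transform = transform       # how each element is derived
        self.cache = {}                  # partition index -> computed data

    def partition(self, index):
        """Return a partition, recomputing it if it is not in memory."""
        if index not in self.cache:
            self.cache[index] = [self.transform(x) for x in self.source[index]]
        return self.cache[index]

    def lose(self, index):
        """Simulate a failure that drops one cached partition."""
        self.cache.pop(index, None)


ds = ToyDataset([[1, 2], [3, 4], [5, 6]], lambda x: x * 10)
before = ds.partition(1)  # computed and cached
ds.lose(1)                # the in-memory copy is gone
after = ds.partition(1)   # rebuilt from source data and the transform
```

Only the lost partition is recomputed; the other partitions are untouched, which is why recreating data this way can be cheaper than keeping full replicas.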
Resilient Distributed Datasets
• RDDs are the core concept of the Spark framework
• RDDs can store any type of data
• Primitive types: integers, characters, booleans, etc.
• Files: text files, SequenceFiles, etc.
• RDDs are fault tolerant
• RDDs are immutable
Resilient Distributed Datasets
RDDs support two types of operations:
• Transformations: a transformation does not return a single value; it returns a new RDD. Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
• Actions: an action evaluates the RDD and returns a value. Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
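The split between transformations and actions can be sketched with a small stand-in class. This is a toy written in plain Python under the assumption that a dataset is just a recipe for producing an iterator; `MiniRDD` and `from_list` are hypothetical names and the class is not the Spark RDD API, although the method names echo the operations listed above.

```python
from functools import reduce as _reduce


class MiniRDD:
    """Immutable toy collection; hypothetical, not the Spark RDD class."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing an iterator

    @classmethod
    def from_list(cls, data):
        items = tuple(data)
        return cls(lambda: iter(items))

    # Transformations: each returns a new MiniRDD and runs nothing yet.
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, p):
        return MiniRDD(lambda: (x for x in self._compute() if p(x)))

    def flat_map(self, f):
        return MiniRDD(lambda: (y for x in self._compute() for y in f(x)))

    # Actions: each evaluates the chain and returns a plain value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return _reduce(f, self._compute())


lines = MiniRDD.from_list(["spark sql", "spark core"])
words = lines.flat_map(str.split)                      # transformation
long_words = words.filter(lambda w: len(w) > 3)        # transformation
word_count = words.count()                             # action
collected = long_words.collect()                       # action
total_len = words.map(len).reduce(lambda a, b: a + b)  # action
```

Each transformation leaves its input untouched and hands back a new object, so `lines` can be reused after `words` is derived from it.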
Spark SQL
Spark SQL runs on top of Spark Core.
• Spark SQL allows relational queries through Spark
• The backbone for all these operations is the SchemaRDD
• SchemaRDDs are made of Row objects along with metadata (schema) information
• SchemaRDDs are analogous to tables in an RDBMS
• They can be constructed from existing RDDs, JSON data sets, Parquet files, or HiveQL queries against data stored in Apache Hive
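The idea of rows plus schema metadata, built from a JSON data set and queried relationally, can be sketched as follows. This is a plain Python toy under stated assumptions: `ToyTable`, `from_json_lines` and `where` are hypothetical names, the schema is inferred from the first record only, and none of this is the Spark SQL API.

```python
import json

# Hypothetical stand-in for a SchemaRDD: row tuples plus schema metadata.
# ToyTable, from_json_lines and where are illustrative names, not Spark APIs.


class ToyTable:
    def __init__(self, schema, rows):
        self.schema = schema  # list of (column name, type name) pairs
        self.rows = rows      # list of tuples, one per row

    @classmethod
    def from_json_lines(cls, text):
        """Build a table from one JSON object per line."""
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
        names = sorted(records[0])  # assumption: first record has all columns
        schema = [(n, type(records[0][n]).__name__) for n in names]
        rows = [tuple(r[n] for n in names) for r in records]
        return cls(schema, rows)

    def where(self, column, predicate):
        """Relational-style filter; returns a new table with the same schema."""
        idx = [n for n, _ in self.schema].index(column)
        return ToyTable(self.schema, [r for r in self.rows if predicate(r[idx])])


data = "\n".join([
    '{"id": 1, "name": "john", "score": 4.1}',
    '{"id": 2, "name": "mike", "score": 3.5}',
    '{"id": 3, "name": "sally", "score": 6.4}',
])
people = ToyTable.from_json_lines(data)
high = people.where("score", lambda s: s > 4.0)
```

Because the schema travels with the rows, a filter can refer to a column by name rather than by position, which is what makes the table analogy work.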
Spark SQL
Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Scala and Java.
• The Shark project has been discontinued; Spark SQL is used instead
• Shark: development ending, transitioning to Spark SQL
• Spark SQL: a new SQL engine designed from the ground up for Spark
• Hive on Spark: helps existing Hive users migrate to Spark
Efficient In-Memory Storage
• Simply caching Hive records as Java objects is inefficient due to high per-object overhead
• Instead, Spark SQL employs column-oriented storage using arrays of primitive types
Row storage (one record per line):
1 john 4.1
2 mike 3.5
3 sally 6.4
Column storage (one column per line):
1 2 3
john mike sally
4.1 3.5 6.4
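The two layouts can be compared in a short sketch that reuses the three sample records. It is plain Python using the standard `array` module to stand in for arrays of primitive types; it illustrates the layout only and says nothing about how Spark SQL implements its cache internally.

```python
from array import array

# Row storage: one tuple (and one boxed object per value) per record.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one container per column; numeric columns use
# arrays of primitive values instead of one object per value.
columns = {
    "id": array("i", [r[0] for r in rows]),
    "name": [r[1] for r in rows],
    "score": array("d", [r[2] for r in rows]),
}

# A query that touches only one column scans only that array.
average_score = sum(columns["score"]) / len(columns["score"])

# Any row can still be reassembled from the columns when needed.
second_row = tuple(columns[c][1] for c in ("id", "name", "score"))
```

Storing each numeric column as a packed array avoids the per-object overhead that the slide attributes to caching records as individual objects.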
Demo On Spark RDDs
Course Features
• Live online class
• Class recording in LMS
• 24/7 post-class support
• Module-wise quiz
• Project work
• Verifiable certificate
Questions
Course Topics
• Module 1: Introduction to Scala
• Module 2: Scala Essentials
• Module 3: Traits and OOP in Scala
• Module 4: Functional Programming in Scala
• Module 5: Introduction to Big Data and Spark
• Module 6: Spark Baby Steps
• Module 7: Playing with RDDs
• Module 8: Spark with SQL: When Spark Meets Hive