www.edureka.co/apache-spark-scala-training
5 Things one must know about Spark!
What will you learn today?
 Spark In-Memory Processing
 Streaming Support
 Machine Learning and Graph
 Spark DataFrame API
 Spark's Integration with Hadoop
Spark In-Memory Processing
Spark Cuts Down Read/Write I/O to Disk
Spark tries to keep data in the memory of its distributed workers, which allows significantly
faster, lower-latency computation, whereas MapReduce keeps shuffling data to and from disk.
Spark is blazingly Fast
Isn’t Spark In-Memory Only?
But I have heard Spark is good only for in-memory processing?
Spark: Best of Both Worlds
It is a common misconception that Spark is only for in-memory processing. From its inception, Spark
was designed to be a general execution engine that works both in memory and on disk.
Almost all Spark operators perform external operations (spilling to disk) when the data does not fit in memory.
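To make "external operations" concrete, here is a minimal plain-Python sketch of an external sort: it holds only a few values in memory, spills sorted runs to temporary files, and merges them back. It illustrates the general technique only and is not Spark's implementation; the function name `external_sort` and the `max_in_memory` parameter are made up for this example.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory=4):
    """Yield ints in sorted order while buffering at most max_in_memory of them."""
    run_paths = []
    buffer = []

    def spill(chunk):
        # Write one sorted run to a temporary file, one value per line.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for v in sorted(chunk):
                f.write(f"{v}\n")
        run_paths.append(path)

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill(buffer)
            buffer = []
    if buffer:
        spill(buffer)

    def read_run(path):
        # Stream a run back from disk lazily.
        with open(path) as f:
            for line in f:
                yield int(line)

    try:
        # heapq.merge streams the sorted runs and keeps roughly one
        # pending value per run in memory.
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)


sorted_values = list(external_sort([5, 3, 9, 1, 7, 2, 8, 6, 4], max_in_memory=3))
```

With `max_in_memory=3`, the nine inputs are written as three sorted runs on disk before being merged, so the result is fully sorted even though no more than three values were ever buffered at once.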
Streaming Support
Spark Streaming
 Used for processing real-time streaming data.
 Uses the DStream, a series of RDDs, to process continuous real-time data.
 The Spark Streaming API closely matches that of Spark Core.
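The idea of a stream as a series of small batches can be sketched in plain Python. This is an analogy, not the Spark Streaming API: `micro_batches` and `word_count` are hypothetical names, and the sketch cuts batches by item count, whereas a DStream cuts them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a (possibly unbounded) iterator into consecutive lists of batch_size items."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def word_count(batch):
    """Ordinary batch logic, the same code one would run on a static dataset."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


lines = ["spark streaming", "spark core", "streaming api", "spark"]
# Apply the same batch function to every micro-batch in arrival order.
results = [word_count(b) for b in micro_batches(lines, batch_size=2)]
```

Because each micro-batch is handled by ordinary batch code, the streaming program looks almost the same as the batch one, which is the point the last bullet makes.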
Machine Learning and Graph
Implementation with DAG
Machine Learning
MLlib, a machine learning library: Classification, Regression, Clustering, Collaborative filtering
Some of the algorithms also work with streaming data, such as linear regression using
ordinary least squares or k-means clustering.
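As an illustration of fitting a model on data that arrives in batches, the sketch below keeps running sums and recomputes a simple (one-feature) ordinary-least-squares line after each batch. It is not MLlib code and does not claim to mirror how MLlib trains its streaming models; the class name `StreamingOLS` is an assumption of this example.

```python
class StreamingOLS:
    """Fit y = intercept + slope * x by ordinary least squares,
    updated one batch at a time from running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, batch):
        # Fold each (x, y) pair into the sufficient statistics.
        for x, y in batch:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        # Closed-form least-squares solution for a single feature.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


model = StreamingOLS()
model.update([(0, 1), (1, 3)])   # first batch
model.update([(2, 5), (3, 7)])   # second batch, same underlying line y = 1 + 2x
intercept, slope = model.coefficients()
```

Only five numbers are kept between batches, so the raw data never has to be stored, which is what makes this kind of update suitable for streams.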
Cyclic Data Flows
 Every job in Spark comprises a series of operators and runs on a set of data.
 All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
 The DAG is optimized by rearranging and combining operators where possible.
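The bullets above can be sketched with a toy lazy pipeline in plain Python: operators are only recorded when called, and at execution time the whole chain is combined into a single pass over the data. This is a simplified linear chain rather than a general DAG, and `LazyDataset` is a hypothetical class for illustration, not a Spark type.

```python
class LazyDataset:
    """Records operators instead of running them; nothing executes until collect()."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def map(self, fn):
        return LazyDataset(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self.source, self.ops + (("filter", pred),))

    def plan(self):
        """Names of the recorded operators, in order."""
        return [kind for kind, _ in self.ops]

    def collect(self):
        # Combine all recorded operators into one pass: each element flows
        # through the whole chain, so no intermediate lists are built.
        out = []
        for item in self.source:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


ds = (
    LazyDataset(range(10))
    .map(lambda x: x * x)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x + 1)
)
plan = ds.plan()
result = ds.collect()
```

Building `ds` does no work; only `collect()` touches the data, and it does so once rather than once per operator.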
GraphX
Graph algorithms: PageRank, Connected Components, Triangle Counting
 Component for graphs and graph-parallel computation
 Extends the Spark RDD by introducing a new Graph abstraction
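To give a feel for one of the listed algorithms, here is a small single-machine sketch of connected components by label propagation, where every vertex ends up labeled with the smallest vertex id in its component. It is plain Python for illustration, not GraphX code, and the function name and input shapes are assumptions of this example.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    label = {v: v for v in vertices}
    changed = True
    while changed:
        changed = False
        for a, b in edges:
            # Both endpoints of an edge adopt the lower of their two labels.
            low = min(label[a], label[b])
            if label[a] != low or label[b] != low:
                label[a] = label[b] = low
                changed = True
    return label


components = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (5, 4)],
)
```

Labels only ever decrease and the loop stops when no edge joins two different labels, so vertices 1 to 3 share one label, 4 and 5 share another, and the isolated vertex 6 keeps its own.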
Support for DataFrames
DataFrame
Inspired by DataFrames in R and Python (Pandas).
The DataFrame API is designed to make big data processing on tabular data easier.
A DataFrame is a distributed collection of data organized into named columns.
Provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.
Can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
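The filter, group, and aggregate operations mentioned above can be pictured with a plain-Python analogy over rows with named columns. This is deliberately not the Spark DataFrame API (it runs on one machine and has no optimizer); the helper names `filter_rows` and `group_avg` and the sample rows are invented for the example.

```python
from collections import defaultdict

# Each row is a record with named columns, like a row of a table.
rows = [
    {"dept": "eng", "name": "ana", "salary": 120},
    {"dept": "eng", "name": "bo", "salary": 100},
    {"dept": "ops", "name": "cy", "salary": 90},
    {"dept": "ops", "name": "di", "salary": 40},
]


def filter_rows(rows, pred):
    """Keep only the rows for which pred(row) is true."""
    return [r for r in rows if pred(r)]


def group_avg(rows, key, value):
    """Group rows by the `key` column and average the `value` column."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for r in rows:
        t = totals[r[key]]
        t[0] += r[value]
        t[1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


well_paid = filter_rows(rows, lambda r: r["salary"] >= 50)
avg_by_dept = group_avg(well_paid, "dept", "salary")
```

Referring to columns by name ("dept", "salary") rather than by position is what lets the same query be expressed in Spark SQL as well.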
DataFrame features
Ability to scale from KBs to PBs
Support for a wide array of data formats and storage systems
State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
Seamless integration with all big data tooling and infrastructure via Spark
APIs for Python, Java, Scala, and R
Spark’s Integration with Hadoop
Spark Execution Platforms
 Spark can use YARN, the resource negotiator of the Hadoop framework.
 Spark workloads can make use of Symphony scheduling policies and execute via YARN.
Spark execution modes: Standalone, Mesos, and YARN, typically with HDFS as the storage layer
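In practice, the execution platform is selected through the master URL passed to Spark. The helper below is a hypothetical convenience function, not part of Spark, that builds those URLs; the URL forms themselves (`local[*]`, `spark://host:port`, `mesos://host:port`, `yarn`) follow recent Spark documentation, and the default ports are the conventional ones, so treat them as assumptions for any particular cluster.

```python
# Master URL forms per execution mode; host and port are filled in on demand.
MASTER_URL_TEMPLATES = {
    "local": "local[*]",
    "standalone": "spark://{host}:{port}",
    "mesos": "mesos://{host}:{port}",
    "yarn": "yarn",
}

# Conventional default ports, assumed here for illustration.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(mode, host="localhost", port=None):
    """Return the master URL string for the given execution mode."""
    template = MASTER_URL_TEMPLATES[mode]
    return template.format(host=host, port=port or DEFAULT_PORTS.get(mode))


urls = {mode: master_url(mode, host="master1") for mode in MASTER_URL_TEMPLATES}
```

With YARN the URL carries no host, because the cluster location is read from the Hadoop configuration rather than from the master string.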
Spark in one Snapshot
Spark Use Cases
Different companies use Spark to solve various problems, e.g. recommendation systems,
business intelligence, and fraud detection.
Who is using Spark?
A complete list of companies using Spark can be found here: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
References
IBM backs Apache Spark for Big Data Analytics:
http://www.forbes.com/sites/paulmiller/2015/06/15/ibm-backs-apache-spark-for-big-data-analytics/
Why Cloudera is saying 'Goodbye, MapReduce' and 'Hello, Spark':
http://fortune.com/2015/09/09/cloudera-spark-mapreduce/
5 reasons to turn to Spark for Big Data Analytics:
http://www.infoworld.com/article/2897287/big-data/5-reasons-to-turn-to-spark-for-big-data-analytics.html
Spark new record for large scale sorting:
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
How eBay uses Spark to ignite Data Analytics:
http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite-data-analytics/
Spark is fast on disk too:
https://gigaom.com/2014/10/10/databricks-demolishes-big-data-benchmark-to-prove-spark-is-fast-on-disk-too/
Thank You …
Questions/Queries/Feedback
Recording and presentation will be made available to you within 24 hours
