HADOOP
VS.
APACHE SPARK
Hadoop and Spark are popular Apache projects in the big data
ecosystem.
Apache Spark is an open-source platform that was developed as a faster,
in-memory alternative to the original Hadoop MapReduce component of the
Hadoop ecosystem.
• The Apache Software Foundation develops the Hadoop project as
open-source software for reliable, scalable, distributed computing.
• It is a framework that allows distributed processing of large
datasets across clusters of computers using simple
programming models.
• Hadoop is designed to scale from a single server up to clusters of
many machines, each offering local storage and computation.
• The Hadoop libraries are designed to detect failed nodes at the
application layer and to handle those failures themselves, rather
than relying on hardware for high availability.
The project includes these modules:
• Hadoop Common: The Java libraries and utilities required by the
other Hadoop modules. These libraries provide OS-level and
filesystem abstractions and contain the Java files and scripts
needed to start and run Hadoop.
• Hadoop Distributed File System (HDFS): A distributed file
system that provides high-throughput access to application
data.
• Hadoop YARN: A framework for job scheduling and cluster
resource management.
• Hadoop MapReduce: A YARN-based system for parallel
processing of large datasets.
Hadoop MapReduce, HDFS and YARN together provide a scalable, fault-tolerant,
distributed platform for storing and processing very large datasets across clusters
of commodity computers. Hadoop uses the same set of nodes for data storage and
for computation. This lets Hadoop improve the performance of large-scale
computations by running each computation on the node that already holds the
data, instead of moving the data across the network.
Hadoop Distributed File System – HDFS
HDFS is a distributed filesystem designed to store large volumes of
data reliably.
HDFS splits a single large file into blocks and stores them on different
nodes across a cluster of commodity machines.
HDFS is layered on top of each node's existing local filesystem. Data is
stored in large blocks, with a default block size of 128 MB.
HDFS also stores redundant copies of these data blocks on multiple nodes
to ensure reliability and fault tolerance. The result is a distributed,
reliable and scalable file system.
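
The block splitting and replication described above can be sketched in a few lines of plain Python. This is only an illustration, not how HDFS is implemented: the function name `plan_blocks` and the node names are hypothetical, the replication factor of 3 is assumed as the usual HDFS default, and the round-robin placement stands in for the rack-aware policy a real NameNode applies.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size stated above (128 MB)
REPLICATION = 3                 # assumption: the usual default replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, replica_nodes) tuples.

    Placement is simple round-robin, purely for illustration; a real
    NameNode uses a rack-aware placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(num_blocks):
        # The last block only holds whatever is left of the file.
        length = min(block_size, file_size - i * block_size)
        # Pick `replication` distinct nodes, starting at a rotating offset.
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


if __name__ == "__main__":
    cluster = ["node1", "node2", "node3", "node4"]  # hypothetical node names
    for index, length, replicas in plan_blocks(300 * 1024 * 1024, cluster):
        print(index, length // (1024 * 1024), "MB", replicas)
```

With a 300 MB file, the sketch produces two full 128 MB blocks and one 44 MB block, each held by three different nodes, so the loss of any single node leaves every block readable.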
Hadoop YARN
YARN (Yet Another Resource Negotiator), a central component in the
Hadoop ecosystem, is a framework for job scheduling and cluster resource
management. The basic idea of YARN is to split up the functionalities of
resource management and job scheduling/monitoring into separate
daemons.
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and
generating large datasets with a parallel, distributed algorithm on a cluster. The Mapper maps
each input key/value pair to a set of intermediate pairs. The Reducer takes these intermediate
pairs and processes them to produce the required output values. Map tasks run in parallel
across the nodes of the cluster, and reduce tasks run on any available node, as scheduled by YARN.
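
The map, group and reduce steps described above can be illustrated with a word count in plain Python. This is a single-process sketch of the programming model only, not Hadoop's Java API: the names `mapper`, `shuffle`, `reducer` and `run_job` are hypothetical, and a real cluster would run the map and reduce calls as separate tasks on different nodes.

```python
from collections import defaultdict


def mapper(line):
    """Map one input record to intermediate (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce all values collected for one key to a single output pair."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence and return the final counts."""
    intermediate = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["spark and hadoop", "hadoop and yarn"]))
```

Because each `mapper` call depends only on its own input line and each `reducer` call only on one key's values, both phases can be spread across many machines without coordination beyond the shuffle.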
• Apache Spark is a framework for performing data analytics on a
distributed computing cluster.
• It provides in-memory computation, which makes data processing
faster than with MapReduce.
• It can use the Hadoop Distributed File System (HDFS) and run
on top of an existing Hadoop cluster.
• It can process both structured data in Hive and streaming
data from sources such as HDFS, Flume, Kafka, and
Twitter.
Spark Streaming
Spark Streaming is an extension of the core Spark API.
It enables scalable, high-throughput, fault-tolerant processing of live
data streams.
Input data can come from sources such as TCP sockets, Flume and Kafka,
and can be processed using complex algorithms expressed with high-level
functions like map, reduce and join. Finally, processed data can be pushed out
to filesystems (HDFS), databases, and live dashboards.
Spark's graph processing and machine learning algorithms can also be
applied to data streams.
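
To illustrate the flow described above (records arrive, are processed with map and reduce style functions, and results are pushed out), here is a minimal plain-Python sketch of micro-batch processing. It does not use the Spark Streaming API: the names `micro_batches` and `process_stream` are hypothetical, and batches are cut by record count here for simplicity, whereas Spark Streaming cuts them by a time interval.

```python
from collections import Counter


def micro_batches(records, batch_size):
    """Split an incoming record stream into fixed-size batches.

    Assumption: batches are cut by record count; Spark Streaming cuts
    them by a time interval instead.
    """
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter batch
        yield batch


def process_stream(records, batch_size=2):
    """Count words per batch and yield a running total after each batch."""
    totals = Counter()
    for batch in micro_batches(records, batch_size):
        words = (word for line in batch for word in line.split())  # map step
        totals.update(Counter(words))                              # reduce step
        yield dict(totals)  # snapshot that could be pushed to a dashboard


if __name__ == "__main__":
    for snapshot in process_stream(["a b", "b c", "a"]):
        print(snapshot)
```

Each yielded snapshot stands in for the output that a real job would write to HDFS, a database or a dashboard after every batch.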
Spark SQL
• Apache Spark provides a separate module, Spark SQL, for processing
structured data.
• The interfaces of Spark SQL give Spark detailed information about the
structure of both the data and the computation being performed.
• Internally, Spark SQL uses this additional information to perform extra
optimizations.
Datasets and DataFrames
• A Dataset is a distributed collection of data in Apache Spark.
It combines the benefits of RDDs with Spark SQL's optimized
execution engine.
• A Dataset can be constructed from objects and then manipulated
using functional transformations.
• A DataFrame is a Dataset organized into named columns. It is
conceptually equivalent to a table in a relational database or a data
frame in R/Python, but with richer optimizations under the hood.
• A DataFrame can be constructed from various data sources, such as
structured data files, Hive tables, external databases, or existing
RDDs.
Resilient Distributed Datasets (RDDs)
Spark is built around the resilient distributed dataset (RDD), a
fault-tolerant collection of elements that can be operated on in parallel. RDDs
can be created in two ways: by parallelizing an existing collection in the driver
program, or by referencing a dataset in an external storage system, such as a
shared filesystem, HDFS or HBase.
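
The first way of creating an RDD, parallelizing an existing collection, can be pictured with a small plain-Python sketch. It is not Spark code: the helper names `parallelize`, `map_partitions` and `collect` are hypothetical and only loosely echo Spark's own vocabulary, worker threads stand in for cluster nodes, and the sketch omits the fault tolerance that makes a real RDD resilient.

```python
from concurrent.futures import ThreadPoolExecutor


def parallelize(collection, num_partitions):
    """Split a local collection into contiguous, roughly equal partitions."""
    items = list(collection)
    size = max(1, -(-len(items) // num_partitions))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_partitions(partitions, func):
    """Apply func to every element, using one worker thread per partition."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as the input partitions.
        return list(pool.map(lambda part: [func(x) for x in part], partitions))


def collect(partitions):
    """Gather all partitions back into one local list."""
    return [x for part in partitions for x in part]


if __name__ == "__main__":
    parts = parallelize(range(1, 7), 3)
    squared = map_partitions(parts, lambda x: x * x)
    print(parts, collect(squared))
```

The key idea is that the operation is defined once and applied independently to each partition, which is what allows a cluster to work on all partitions at the same time.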
Here is a quick comparison guideline before concluding.
| Aspect | Hadoop | Apache Spark |
| --- | --- | --- |
| Difficulty | MapReduce is difficult to program directly and usually needs higher-level abstractions. | Spark is easier to program because its own API is already high-level. |
| Interactive mode | MapReduce has no built-in interactive mode; tools such as Pig and Hive add one on top. | Spark has an interactive mode. |
| Streaming | Hadoop MapReduce only processes batches of large stored data. | Spark can process data in near real time through Spark Streaming. |
| Performance | MapReduce does not make full use of the memory of the Hadoop cluster. | Spark is reported to execute batch processing jobs about 10 to 100 times faster than Hadoop MapReduce. |
| Latency | MapReduce is disk-oriented: intermediate results are written to disk between stages. | Spark achieves lower-latency computations by caching partial results in the memory of its distributed workers. |
| Ease of coding | Writing Hadoop MapReduce pipelines is a complex and lengthy process. | Spark code is usually more compact. |
CONTACT US
Write to us : business@altencalsoftlabs.com
Visit Our Website: https://www.altencalsoftlabs.com
USA | FRANCE | UK | INDIA | SINGAPORE
