Distributed Processing Frameworks
Author: Antonios Katsarakis
Literature
• MapReduce: Simplified Data Processing on Large Clusters
Jeff Dean et al. - OSDI’04.
• Spark: Cluster Computing with Working Sets
M. Zaharia et al. - HotCloud’10.
Why Big Data?
• More data to process: IoT, smart devices, web applications
- About 2.3 trillion GB of new data are generated every day
• Growth in CPU performance cannot keep up with the increasing
amount of data to process
• This leads us to the Big Data era
- Big data: data sets so large that the processing power of a
single machine is inadequate to deal with them
• We need to find ways to process these massive amounts of data
MapReduce
• Proposed by Jeff Dean et al. (Google) in 2004
- Cited more than 18k times
• A programming model that enables the parallel
and distributed processing of large data sets
• Typical MapReduce Program:
- Read Data
- Map: filter and transform the input records into intermediate key/value pairs
- Shuffle and sort
- Reduce: summary operation on data
- Write the Results
[Figure: MapReduce data flow. The input data is split into three parts (1/3 of the input each); three Map tasks produce intermediate data, and two Reduce tasks write the output data.]
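The steps of a typical MapReduce program can be sketched on a single machine. The following is a minimal illustration in plain Python, not Hadoop or Google's implementation; the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) and the three input splits are assumptions chosen for the example. It counts words, with one map task per split.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit an intermediate (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle_and_sort(mapped_outputs):
    """Shuffle and sort: group intermediate values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: summarize all values of one key (here, by summing the counts)."""
    return key, sum(values)


def word_count(splits):
    mapped = [map_phase(split) for split in splits]  # one map task per split
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped)


# Hypothetical input, already divided into three splits.
splits = [
    ["the quick brown fox"],
    ["the lazy dog"],
    ["the fox jumps"],
]
result = word_count(splits)
```

In a real cluster the map and reduce tasks run on different machines and the shuffle moves data over the network, which is where the overhead mentioned in the limitations comes from.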
Critical Reflection
• Outcome:
- Novel idea that led to a whole new era of distributed systems
- Big impact in industry (Hadoop MapReduce)
- Lowered the cost of computations
• Limitations:
- Restricted to batch processing
- Supports only map and reduce operations
- The shuffling phase introduces overheads
Spark
• Proposed by Matei Zaharia et al. in 2010
- Cited 1.5k times
• Another programming model based on
higher-order functions that execute
user-defined functions in parallel
• Aims to replace MapReduce in industry
• Main Ideas:
- Represent the computations as DAGs
- Cache datasets into memory
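Why caching matters is easiest to see with an iterative job that reuses the same dataset. The sketch below is a single-machine toy in plain Python, not the Spark API; the `Dataset` class, its `cache` and `collect` methods, and the `run` helper are hypothetical names introduced for illustration. It counts how often the source is loaded with and without caching.

```python
class Dataset:
    """Hypothetical lazy dataset: recomputed on every use unless cached."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing the elements
        self._use_cache = False
        self._cached = None

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if not self._use_cache:
            return list(self._compute())
        if self._cached is None:
            self._cached = list(self._compute())  # computed once, kept in memory
        return self._cached


def run(iterations, use_cache):
    loads = 0

    def load_and_parse():
        nonlocal loads
        loads += 1  # stands in for an expensive read from distributed storage
        return (float(x) for x in ["1.0", "2.0", "3.0", "4.0"])

    points = Dataset(load_and_parse)
    if use_cache:
        points.cache()

    estimate = 0.0
    for _ in range(iterations):
        data = points.collect()
        # One step of an iterative algorithm: move halfway toward the mean.
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate, loads


no_cache = run(iterations=5, use_cache=False)
with_cache = run(iterations=5, use_cache=True)
```

Both runs produce the same estimate, but the uncached run loads the source once per iteration while the cached run loads it once in total.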
Spark Model
• Resilient Distributed Datasets (RDDs): immutable collections of objects spread across a cluster
• Operations over RDDs:
1. Transformations: lazy operators that create new RDDs
2. Actions: launch a computation on an RDD
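The difference between lazy transformations and actions can be illustrated with a small stand-in class. This is a sketch in plain Python under stated assumptions, not the Spark API: `ToyRDD`, `from_list`, and the shared `log` list are hypothetical, and everything runs on one machine. Transformations only build a new object; work happens when an action is called.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute, log):
        self._compute = compute  # zero-argument function that yields the elements
        self._log = log          # shared list recording when work actually happens

    @staticmethod
    def from_list(items, log):
        def compute():
            log.append("read source")
            return iter(items)
        return ToyRDD(compute, log)

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()), self._log)

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)), self._log)

    # Actions: run the whole chain and return a plain value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


log = []
numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], log)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
work_before_action = list(log)   # still empty: nothing has been computed
total = even_squares.count()     # the action triggers the computation
values = even_squares.collect()  # a second action recomputes the chain
```

Because each transformation returns a new object and never modifies its parent, `numbers` can still be reused after `even_squares` is defined, which mirrors the immutability of RDDs.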
var count = readFile(…)
    .map(…)
    .filter(..)
    .reduceByKey()
    .count()
[Figure: Job (RDD) graph. The file is split into chunks (RDD0), followed by RDD1 to RDD4 and the result; the graph is divided into Stage 1 and Stage 2, with pipelined operations marked.]
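How such a chain could split into stages can be sketched in plain Python. This is an illustration, not Spark code, and it assumes the stage boundary falls at reduceByKey, the one operation in the chain that needs a shuffle; the function names, the filter predicate, and the sample partitions are hypothetical. Within the first stage, map and filter are pipelined with generators, so no intermediate collection is built between them.

```python
from collections import defaultdict


def stage1(partition):
    """Pipelined transformations: each record flows through map and filter
    one at a time, without materializing an intermediate collection."""
    pairs = ((word, 1) for word in partition)      # map
    return (kv for kv in pairs if len(kv[0]) > 2)  # filter


def shuffle(partition_outputs, num_reducers):
    """Stage boundary: route every key to exactly one reducer bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in partition_outputs:
        for key, value in output:
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def stage2(bucket):
    """reduceByKey within one bucket: combine the values of each key."""
    return {key: sum(values) for key, values in bucket.items()}


# Hypothetical input: a file already split into two chunks of words.
partitions = [
    ["spark", "makes", "a", "dag", "a", "dag", "of", "rdds"],
    ["spark", "caches", "rdds", "in", "memory"],
]
buckets = shuffle([stage1(p) for p in partitions], num_reducers=2)
reduced = [stage2(b) for b in buckets]
count = sum(len(r) for r in reduced)  # count: number of distinct keys kept

merged = {}
for r in reduced:
    merged.update(r)
```

Which bucket a key lands in depends on the hash function, but every key lands in exactly one bucket, so the per-key sums and the final count do not depend on the number of reducers.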
Critical Reflection
• Benefits:
- High level API
- Supports more application types
- Performance optimizations
• Limitations:
- Detailed performance analysis on the thread level is hard
- Multipurpose application support makes performance improvements and
tuning really challenging
- The shuffling phase introduces overheads
Conclusion
• Clusters provide the computational power to
process Big Data
• MapReduce allows developers to build programs for
clusters
• Spark tries to overcome limitations of MapReduce
• These systems introduce many challenges in terms
of measuring and improving their performance

Editor's Notes

  • #9: High-level API (in Scala, Java, Python): usable by non-computer scientists. Support for more application types (streaming, iterative, and interactive). Performance optimizations (memory caching, transformation pipelining, etc.)
  • #10: 3* (in terms of performance, application support and user friendliness)