Distributed Processing Frameworks

This document discusses distributed processing frameworks for big data. It introduces MapReduce as a programming model that enables parallel processing of large datasets across clusters. While MapReduce was novel, it was limited to batch processing and only supported map and reduce operations. Spark was then proposed as another framework to replace MapReduce, representing computations as directed acyclic graphs and caching datasets in memory for better performance. Both systems introduced challenges in measuring and improving performance at scale.

1. Distributed Processing Frameworks Author: Antonios Katsarakis

2. Literature
• MapReduce: Simplified Data Processing on Large Clusters. Jeff Dean et al., OSDI’04.
• Spark: Cluster Computing with Working Sets. M. Zaharia et al., HotCloud’10.

3. Why Big Data?
• More data to process: IoT, smart devices, web applications
  - About 2.3 trillion GB of new data are generated every day
• Growth of CPU performance cannot keep up with increasing amount of data to process
• This leads us to the Big Data era
  - Big data: Data sets are so large that the processing power of a single machine is inadequate to deal with them
• We need to find ways to process these massive amounts of data

4. MapReduce
• Proposed by Jeff Dean et al. (Google) 2004
  - Cited more than 18k times
• A programming model that enables the parallel and distributed processing of large data sets
• Typical MapReduce Program:
  - Read Data
  - Map: filtering of the data
  - Shuffle and sort
  - Reduce: summary operation on data
  - Write the Results
[Figure: the input data is split into three parts (1/3 input each); three Map tasks produce intermediate data, and two Reduce tasks produce the output data]
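The steps of a typical MapReduce program listed above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop or Google's implementation; the function names (map_phase, shuffle_and_sort, reduce_phase, word_count) and the sample input are assumptions made for illustration. On a real cluster each chunk would be mapped on a different machine and the shuffle would move data over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(pairs):
    """Shuffle and sort: group intermediate values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: summarize all values of one key (here, sum the counts)."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle/sort and reduce over a list of input chunks."""
    # Hypothetical stand-in for parallel map tasks: one call per chunk.
    intermediate = chain.from_iterable(map_phase(c) for c in chunks)
    return dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(intermediate))


if __name__ == "__main__":
    # Input data already split into three chunks, as in the figure.
    chunks = ["big data big clusters", "data on clusters", "big data"]
    print(word_count(chunks))
```

Only map_phase and reduce_phase carry application logic; the grouping step is generic, which is what lets a framework run it on the developer's behalf.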

5. Critical Reflection
• Outcome:
  - Novel idea that led to a whole new era of distributed systems
  - Big impact in industry (Hadoop MapReduce)
  - Lowered the cost of computations
• Limitations:
  - Restricted to batch processing
  - Supports only map and reduce operations
  - The shuffling phase introduces overheads

6. Spark
• Proposed by Matei Zaharia et al. 2010
  - Cited 1.5k times
• Another programming model, based on higher-order functions that execute user-defined functions in parallel
• Aims to replace MapReduce in industry
• Main Ideas:
  - Represent the computations as DAGs
  - Cache datasets in memory
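The benefit of caching datasets in memory shows up most clearly in iterative jobs that reread the same input. The following toy sketch assumes a hypothetical Dataset class with a loader function and a load counter; it is not Spark's API and only illustrates the idea on a single machine.

```python
class Dataset:
    """Hypothetical dataset whose contents are expensive to (re)compute."""

    def __init__(self, loader):
        self._loader = loader      # function that recomputes the data
        self._use_cache = False
        self._data = None
        self.loads = 0             # how many times the loader actually ran

    def cache(self):
        """Ask for the data to be kept in memory after the first computation."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the data, recomputing it unless a cached copy exists."""
        if self._use_cache and self._data is not None:
            return self._data
        self.loads += 1
        data = self._loader()
        if self._use_cache:
            self._data = data
        return data


def iterate(dataset, rounds):
    """A simple iterative job that rereads the dataset in every round."""
    total = 0
    for _ in range(rounds):
        total += sum(dataset.collect())
    return total


if __name__ == "__main__":
    plain = Dataset(lambda: [1, 2, 3])
    cached = Dataset(lambda: [1, 2, 3]).cache()
    iterate(plain, 5)
    iterate(cached, 5)
    print(plain.loads, cached.loads)
```

Without cache() the loader runs once per round; with it, only the first round pays the loading cost.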

7. Spark Model
• Resilient Distributed Datasets (RDDs): immutable collections of objects spread across a cluster
• Operations over RDDs:
  1. Transformations: lazy operators that create new RDDs
  2. Actions: launch a computation on an RDD
• Example: var count = readFile(…).map(…).filter(…).reduceByKey().count()
[Figure: job (RDD) graph. Labels: file split into chunks (RDD0), RDD1 to RDD4, Result, Pipelined, Stage 1, Stage 2]
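The difference between lazy transformations and actions can be sketched with a toy, single-machine collection. ToyRDD and its methods are hypothetical names chosen to mirror the slide's example; this is not Spark's implementation and it ignores partitioning, stages and fault tolerance.

```python
class ToyRDD:
    """Immutable, lazily evaluated collection (single-machine toy, not Spark)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning an iterator

    @staticmethod
    def from_list(items):
        items = tuple(items)      # freeze the input so the RDD stays immutable
        return ToyRDD(lambda: iter(items))

    # Transformations: each returns a new ToyRDD and does no work yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def reduce_by_key(self, f):
        def compute():
            acc = {}
            for key, value in self._compute():
                acc[key] = f(acc[key], value) if key in acc else value
            return iter(acc.items())
        return ToyRDD(compute)

    # Actions: these trigger evaluation of the whole chain.
    def count(self):
        return sum(1 for _ in self._compute())

    def collect(self):
        return list(self._compute())


if __name__ == "__main__":
    calls = []                    # records when the mapped function really runs

    def tag(word):
        calls.append(word)
        return (word, 1)

    words = ToyRDD.from_list(["spark", "mapreduce", "spark", "dag"])
    counts = (words.map(tag)
                   .filter(lambda kv: kv[0] != "dag")
                   .reduce_by_key(lambda a, b: a + b))
    print(len(calls))             # transformations alone have not run tag
    print(counts.count())         # the action runs the whole chain
    print(len(calls))
```

Because map and filter are chained generators, each element flows through both without an intermediate collection being materialized, which is a small-scale analogue of pipelining; reduce_by_key has to see all elements before it can emit results.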

8. Critical Reflection
• Benefits:
  - High-level API
  - Supports more application types
  - Performance optimizations
• Limitations:
  - Detailed performance analysis at the thread level is hard
  - Multipurpose application support makes performance improvements and tuning really challenging
  - The shuffling phase introduces overheads

9. Conclusion
• Clusters provide the computational power to process Big Data
• MapReduce allows developers to build programs for clusters
• Spark tries to overcome limitations of MapReduce
• These systems introduce many challenges in terms of measuring and improving their performance

Editor's Notes

#9: High-level API (in Scala, Java, Python): usable by non computer scientists. Support for more application types (streaming, iterative and interactive). Performance optimizations (memory caching, transformation pipelining, etc.).
#10: 3* (in terms of performance, application support and user friendliness)