Block Sampling: Efficient Accurate Online Aggregation in MapReduce

The paper discusses block sampling as a method for efficient online aggregation in MapReduce, making useful results available before job completion. It highlights challenges such as slow random disk access and the difficulty of choosing a sampling strategy for unstructured data. The authors propose a technique that uses in-memory shuffling to emulate random sampling without random disk access; a higher sampling rate improves accuracy but also increases communication costs among mapper tasks.

Problem and Motivation
Big data processing is usually very time-consuming...
… but many applications require results really fast or can only use results for a limited window of time
Luckily, in many cases results can be useful even before job completion
○ tolerate some inaccuracy
○ benefit from faster answers

MapReduce vs. MapReduce Online
[Diagram, original MapReduce: mapper (input record → map function → output record → local disk), reducer; transfer labeled "HTTP request"]
In original MR, a reducer task cannot fetch the output of a map task that hasn't committed its output to disk
[Diagram, MapReduce Online: mapper (input record → map function → output record), reducer; transfer labeled "TCP push/pull"]

Online Aggregation
● Apply the reduce function to the data seen so far
● Use the % of input processed to estimate accuracy
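The idea of applying the reduce function to the data seen so far can be sketched as a running average that emits periodic snapshots. This is a minimal illustration, not the MapReduce Online implementation: the function name `online_average`, the `snapshot_every` parameter, and the sample readings are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple


def online_average(values: Iterable[float], total: int,
                   snapshot_every: float = 0.1) -> Iterator[Tuple[float, float]]:
    """Yield (fraction_processed, running_average) at regular snapshots.

    `total` is the number of records in the full input, assumed known up front.
    """
    count = 0
    running_sum = 0.0
    next_snapshot = snapshot_every
    for value in values:
        count += 1
        running_sum += value
        fraction = count / total
        # Small tolerance guards against floating-point drift in next_snapshot.
        if fraction + 1e-12 >= next_snapshot:
            # The "reduce" (here: an average) is applied to the data seen so far.
            yield fraction, running_sum / count
            next_snapshot += snapshot_every


# Hypothetical temperature readings, for illustration only.
readings = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0]
snapshots = list(online_average(readings, total=len(readings), snapshot_every=0.5))
print(snapshots)
```

Each snapshot pairs an early answer with the fraction of input it is based on, so a consumer can judge how far to trust it. In this example the estimate at 50% is lower than the final value because the readings arrive in ascending order.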

Sampling Challenges
● Data in HDFS
○ Disk access is already terribly slow
○ Random disk access for sampling is even slower
● Unstructured Data
○ Sample based on what?
○ We don’t know the query, we don’t know the key or the value!

MapReduce Online vs. Block Sampling
Average Temperature Estimation on Weather Data
[Figure panels: Unsorted, Sorted]

Takeaway
● Useful results even before job completion
● Disk random access is prohibitively expensive → efficiently emulate sampling using in-memory shuffling
● Higher sampling rate improves accuracy but also increases communication costs among mapper tasks

Average Temperature Estimation on Sorted and Unsorted Weather Data
How do the block sampling rate and the % of processed input affect accuracy?
[Figure panels: Unsorted, Sorted]
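Why sorted input matters can be illustrated with a small simulation. The sketch below uses synthetic data, not the weather dataset from the experiments, and it simplifies "block sampling" to picking whole blocks uniformly at random. That simplification is an assumption for illustration and may differ from the scheme evaluated in the slides. All function names are hypothetical.

```python
import random
from statistics import mean
from typing import List

Block = List[float]


def split_into_blocks(values: List[float], block_size: int) -> List[Block]:
    """Cut the input into fixed-size blocks, mimicking fixed-size file blocks."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def sequential_estimate(blocks: List[Block], fraction: float) -> float:
    """Average over the first `fraction` of blocks, read in stored order."""
    n = max(1, round(len(blocks) * fraction))
    return mean(v for block in blocks[:n] for v in block)


def block_sample_estimate(blocks: List[Block], fraction: float,
                          rng: random.Random) -> float:
    """Average over the same number of blocks, chosen uniformly at random."""
    n = max(1, round(len(blocks) * fraction))
    return mean(v for block in rng.sample(blocks, n) for v in block)


rng = random.Random(7)
# Synthetic "temperatures", sorted to reproduce the worst case for sequential reads.
temperatures = sorted(rng.uniform(-10.0, 30.0) for _ in range(10_000))
blocks = split_into_blocks(temperatures, block_size=100)
true_mean = mean(temperatures)

seq_error = abs(sequential_estimate(blocks, 0.2) - true_mean)
sample_error = abs(block_sample_estimate(blocks, 0.2, rng) - true_mean)
print(f"sequential error: {seq_error:.2f}, block-sample error: {sample_error:.2f}")
```

On sorted input, reading the first 20% of blocks only sees the lowest values, so the early estimate is biased low. Choosing the same number of blocks at random spreads the reads over the whole value range and typically lands much closer to the true mean.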

Performance - Bias Reduction
snapshot frequency = 10%

Experimental Setup
● 8 large-instance OpenStack VMs
○ 4 vCPUs, 8 GB memory, 90 GB disk
● Linux Ubuntu 12.04.2 LTS OS, Java 1.7.0_14
● up to 17 map tasks and 5 reduce tasks per job, HDFS block size of 64 MB
● weather station data from the National Climatic Data Center FTP server (available years 1901 to 2013)
● the complete Project Gutenberg e-books catalog (30615 e-books in .txt format)

Bias Reduction
● Access Phase: Store the entire input split in the reader task’s local memory
● Shuffling Phase: Shuffle the records of the block in-place
● Processing Phase: Serve a record to the mapper task from local memory (avoids additional disk I/O)
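The three phases above can be sketched as a single generator. This is a minimal illustration under stated assumptions, not the authors' reader implementation: the name `shuffled_split_reader` is hypothetical, an input split is modeled as an iterator of text records, and the `seed` parameter exists only to make the example reproducible.

```python
import random
from typing import Iterator, List, Optional


def shuffled_split_reader(lines: Iterator[str],
                          seed: Optional[int] = None) -> Iterator[str]:
    """Serve the records of one input split in random order."""
    # Access phase: read the entire split into local memory in one pass.
    records: List[str] = list(lines)
    # Shuffling phase: permute the records in place.
    random.Random(seed).shuffle(records)
    # Processing phase: serve records from memory, with no further disk reads.
    for record in records:
        yield record


# Hypothetical records standing in for one input split.
split = [f"station-{i}" for i in range(8)]
served = list(shuffled_split_reader(iter(split), seed=42))
print(served)
```

The split is read once, in stored order, so the disk access stays sequential. The randomness comes entirely from the in-memory shuffle, which is the point of the technique: the mapper sees records in random order without paying for random disk access. The trade-off is that the whole split must fit in the reader's memory.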

Future Work
● Integrate statistical estimators
○ provide error bounds for users
● Automatically fine-tune sampling parameters based on system configuration
● Explore alternative sampling techniques and wavelet approximation

1. Block Sampling: Efficient Accurate Online Aggregation in MapReduce 5th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2013) Vasiliki Kalavri, Vaidas Brundza, Vladimir Vlassov {kalavri, vaidas, vladv}@kth.se 3 December 2013, Bristol, UK

6. The Block Sampling Technique

11. Performance - Sampling Rate

14. System Configuration Parameters
