Block Sampling:
Efficient Accurate Online
Aggregation in MapReduce
5th IEEE International Conference on Cloud Computing
Technology and Science (CloudCom 2013)
Vasiliki Kalavri, Vaidas Brundza, Vladimir Vlassov
{kalavri, vaidas, vladv}@kth.se
3 December 2013, Bristol, UK
Problem and Motivation
Big data processing is usually very time-consuming...
… but many applications require results
really fast or can only use results for a
limited window of time
Luckily, in many cases results can be
useful even before job completion
○ tolerate some inaccuracy
○ benefit from faster answers
MapReduce vs. MapReduce Online
[Diagram, original MapReduce: input record → map function → output record → mapper's local disk; the reducer fetches the map output with an HTTP request]
In original MR, a reducer task cannot
fetch the output of a map task that
hasn't committed its output to disk
[Diagram, MapReduce Online: input record → map function → output record; the map output reaches the reducer via TCP push/pull]
Online Aggregation
● Apply the reduce function to the data seen so far
● Use the % of input processed to estimate accuracy
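The two bullets above can be sketched in a few lines of Python. This is only a minimal illustration, not the MapReduce Online implementation: the function name `online_aggregate`, the `snapshot_freq` parameter, and the in-memory list of records are all assumptions made for this example.

```python
from typing import Callable, Iterator, List, Tuple


def online_aggregate(
    records: List[float],
    reduce_fn: Callable[[List[float]], float],
    snapshot_freq: float = 0.1,
) -> Iterator[Tuple[float, float]]:
    """Yield (fraction_processed, estimate) pairs as more input is seen.

    Hypothetical helper: applies reduce_fn to the records seen so far
    each time another snapshot_freq share of the input has been read.
    """
    total = len(records)
    step = max(1, int(total * snapshot_freq))
    for seen in range(step, total + 1, step):
        # Early estimate computed only from the data seen so far.
        yield seen / total, reduce_fn(records[:seen])
    if total % step != 0:
        # Final snapshot over the complete input.
        yield 1.0, reduce_fn(records)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


snapshots = list(online_aggregate([10.0, 20.0, 30.0, 40.0], mean, snapshot_freq=0.5))
for fraction, estimate in snapshots:
    print(f"{fraction:.0%} of input processed, estimate = {estimate}")
```

The fraction of input processed is reported next to each estimate, so a consumer can decide whether a snapshot is accurate enough to act on.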
Sampling Challenges
● Data in HDFS
○ Disk access is already terribly slow
○ Random disk access for sampling is even slower
● Unstructured Data
○ Sample based on what?
○ We don’t know the query, we don’t know the
key or the value!
The Block Sampling Technique
MapReduce Online vs. Block Sampling
Average Temperature Estimation on Weather Data
[Figure panels: Unsorted, Sorted]
Takeaway
● Useful results even before job completion
● Disk random access is prohibitively
expensive → efficiently emulate sampling
using in-memory shuffling
● Higher sampling rate improves accuracy but
also increases communication costs among
mapper tasks
Average Temperature Estimation on
Sorted and Unsorted Weather Data
[Figure panels: Unsorted, Sorted]
How do the block sampling rate and the % of processed input
affect accuracy?
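A toy experiment can show why input order matters for early estimates. The sketch below is not the authors' experiment: the synthetic "temperatures", the helper `prefix_mean_errors`, and the chosen fractions are all assumptions. It compares the relative error of an average computed over a prefix of sorted input with the same prefix of shuffled input.

```python
import random
from typing import List


def prefix_mean_errors(values: List[float], fractions: List[float]) -> List[float]:
    """Relative error of the mean over the first fraction of the values.

    Hypothetical helper for illustration only.
    """
    true_mean = sum(values) / len(values)
    errors = []
    for fraction in fractions:
        n = max(1, int(len(values) * fraction))
        estimate = sum(values[:n]) / n
        errors.append(abs(estimate - true_mean) / abs(true_mean))
    return errors


# Synthetic stand-in for sorted weather readings (assumption, not real data).
sorted_temps = [float(t) for t in range(1, 1001)]
shuffled_temps = sorted_temps[:]
random.Random(0).shuffle(shuffled_temps)

fractions = [0.1, 0.5, 1.0]
sorted_err = prefix_mean_errors(sorted_temps, fractions)
shuffled_err = prefix_mean_errors(shuffled_temps, fractions)

for f, s, u in zip(fractions, sorted_err, shuffled_err):
    print(f"{f:.0%} processed: sorted error = {s:.3f}, shuffled error = {u:.3f}")
```

On sorted input the first 10% contains only the smallest values, so the early estimate is far from the true mean; on shuffled input the same prefix behaves like a random sample.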
Performance - Sampling Rate
Performance - Bias Reduction
snapshot freq = 10%
Experimental Setup
● 8 large-instance OpenStack VMs
○ 4 vCPUs, 8 GB memory, 90 GB disk
● Linux Ubuntu 12.04.2 LTS OS, Java 1.7.0_14
● up to 17 map tasks and 5 reduce tasks per job, HDFS
block size of 64MB
● weather station data from the National Climatic
Data Center ftp server (available years 1901 to 2013)
● the complete Project Gutenberg e-books catalog
(30615 e-books in .txt format)
System Configuration Parameters
Bias Reduction
● Access Phase: Store the entire input split
in the reader task’s local memory
● Shuffling Phase: Shuffle the records of
the block in-place
● Processing Phase: Serve a record to the
mapper task from local memory (avoids
additional disk I/O)
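The three phases above can be sketched as a small in-memory reader. This is a minimal illustration under stated assumptions, not the authors' Hadoop code: the class name `ShufflingBlockReader`, the `seed` parameter, and the use of a plain list of strings in place of an HDFS input split are all hypothetical.

```python
import random
from typing import Iterator, List, Optional


class ShufflingBlockReader:
    """Hypothetical reader that serves a block's records in random order."""

    def __init__(self, block_records: List[str], seed: Optional[int] = None):
        # Access phase: keep the entire input split in local memory.
        self._records = list(block_records)
        # Shuffling phase: shuffle the records of the block in place.
        random.Random(seed).shuffle(self._records)

    def __iter__(self) -> Iterator[str]:
        # Processing phase: serve records to the mapper from memory,
        # so no additional disk I/O is needed per record.
        return iter(self._records)


block = [f"record-{i}" for i in range(100)]
served = list(ShufflingBlockReader(block, seed=42))
print(served[:3])
```

Every record is still served exactly once; only the order changes. The mapper therefore sees something close to a random sample of the block at any point in time, at the cost of holding the whole split in memory.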
Future Work
● Integrate statistical estimators
○ provide error bounds for users
● Automatically fine-tune sampling
parameters based on system
configuration
● Explore alternative sampling techniques
and wavelet-approximation
