On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
Marat Zhanikeev
maratishe@gmail.com
maratishe.github.io
Tokyo Univ. of Science
WebDB Forum 2017 @ Ochanomizu University
PDF → bit.do/170920
Background on Hadoop
• Hadoop performance measurement
◦ creators on performance limits 09
◦ superlinear effect 08
◦ various benchmarks on Hadoop vs Spark 07
◦ inconsistencies in measurements 11
• Hadoop/MapReduce optimization in 14 and many other papers
• the ”Do We (actually) Need Hadoop?” argument in 10 and a few recent papers
09 K.Shvachko+0 ”HDFS scalability: the limits to growth” Usenix Login (2010)
08 N.Gunther+2 ”Hadoop Superlinear Scalability” ACM Queue (2015)
07 J.Shi+6 ”Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics” Very Large Data Bases (2015)
11 M.Xia+3 ”Performance Inconsistency in Large Scale Data Processing Clusters” 10th USENIX ICAC (2013)
14 A.Rasooli+1 ”COSHH: A Classification and Optimization based Scheduler for Heterogeneous Hadoop Systems” Future Gen.Comp.Sys. (2014)
10 A.Rowstron+1 ”Nobody ever got fired for using Hadoop on a cluster” 1st HotCDP (2012)
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 2/12
Modeling Hadoop Bottlenecks
[Figure: bottleneck models for two workload types, "Big Data Processing" and "HPC, Simulators, Modeling". Stages shown: Small Data, Network (NW), Bulk Storage (BS), RAM-based Shared Memory (sSM), On-Chip Shared Memory (hSM), Core, Output. Axes: number of parallel accesses versus ability to isolate; the bottleneck is drawn as pipe width.]
Hadoop’s Answer: Rack Awareness
[Figure: rack-aware topology. Physical view: Clients, a Core Switch, and Rack Switches, each with its Datanodes. Logical view: a Client sees its Own Rack Switch and the Other Rack Switches in front of the Datanodes.]
• official Hadoop feature (not a bug) 12
• somewhat dynamic: goes off-rack when local nodes have too many jobs
• sadly, rack affiliation is configured manually (much potential here for research on virtual network coordinates: Meridian, Vivaldi...)
12 ”Hadoop: Rack Awareness” https://hadoop.apache.org (2017)
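The manual configuration mentioned above is usually done with a topology script: Hadoop calls the executable named in the `net.topology.script.file.name` property with one or more node addresses as arguments and reads one rack path per address from standard output. The following is a minimal sketch of such a script, not something taken from the slides. The subnet-to-rack table `RACKS` and the rack names are hypothetical and would have to be maintained by hand for a real cluster.

```python
#!/usr/bin/env python3
"""Hypothetical rack topology script for Hadoop rack awareness."""
import ipaddress
import sys

# Assumed mapping from subnets to rack paths; a real cluster needs its own table.
RACKS = {
    "10.0.1.0/24": "/dc1/rack1",
    "10.0.2.0/24": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def rack_of(host: str) -> str:
    """Return the rack path for one address, or the default if unknown."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return DEFAULT_RACK  # hostnames or malformed input fall back
    for subnet, rack in RACKS.items():
        if addr in ipaddress.ip_network(subnet):
            return rack
    return DEFAULT_RACK


def main(argv):
    # One rack path is printed per address passed on the command line.
    for host in argv[1:]:
        print(rack_of(host))


if __name__ == "__main__":
    main(sys.argv)
```

The static table is exactly the manual step the slide complains about; a coordinate-based scheme would replace `RACKS` with a lookup computed from measurements.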
Hadoop vs Bigdata Replay Method
• basic idea is similar to 10, but uses circuits 02 to transfer shards and multicore 01 to process them in parallel
[Figure: Hadoop and Replay architectures side by side. Hadoop: You start and use a Hadoop Client (Your Code) on a Client Machine; Name Server(s) / Name Node find data; the Hadoop Space Manager deploys many MapReduce jobs (your code) to many Storage Nodes (shards), which read/parse file A, file B, file C, ... Replay: You start and use a Client (Your Sketcher) on a Client Machine; a Manager schedules it; Storage Nodes (shards) with Time-Aware Sub-Store(s) feed many Replay Nodes running Multicore Replay. Both are split into Users and Internals (DC).]
10 A.Rowstron+1 ”Nobody ever got fired for using Hadoop on a cluster” 1st HotCDP (2012)
02 M.Zhanikeev+0 ”Circuit Emulation for Big Data Transfers in Clouds” Networking for Big Data, CRC (2015)
01 M.Zhanikeev+0 ”Streaming Algorithms for Big Data Processing on Multicore” Big Data: Algorithms, Analytics, and Applications, CRC (2015)
Replay Environment is Highly Flexible!
• replay is time-aligned, so jobs can pick any spot on the timeline
• like Spark, it goes beyond the key-value datatype, but further: the full scope of streaming algorithms 01
• massively multicore environments 04 with 100+ cores, dynamic re-packing of job batches, etc.
[Figure: replay at scale, in two parts. (1) A Replay Manager moves a cursor, Now (replay), along Time-Aligned Big Data in the time direction; Core 1 ... Core X each run one sketch with its own Start and End, reading prepared data from Shared Memory. (2) One Replay Batch: a Manager and Controller manage jobs in realtime (Report, Kill) while each job tracks its pos between the buffer head (Now) and the buffer tail; one buffer serves a group of jobs.]
01 M.Zhanikeev+0 ”Streaming Algorithms for Big Data Processing on Multicore” Big Data: Algorithms, Analytics, and Applications, CRC (2015)
04 M.Zhanikeev+0 ”Volume and Irregularity Effects on Massively Multicore Packet Processors” APNOMS (2016)
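The time-aligned replay idea can be illustrated with a small sketch: one cursor passes over time-ordered records once, and every job whose window covers the current timestamp observes the record. This is only an illustration under assumptions, not the platform's implementation. The `(timestamp, item)` record layout, the `SketchJob` class, and the exact-count "sketch" are all hypothetical stand-ins; a real job would likely keep a compact streaming summary.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

Record = Tuple[int, str]  # (timestamp, item); an assumed record layout


@dataclass
class SketchJob:
    """A job that observes records whose timestamp falls in [start, end)."""
    name: str
    start: int
    end: int
    counts: dict = field(default_factory=dict)

    def observe(self, item: str) -> None:
        # Toy summary: exact per-item counts stand in for a streaming sketch.
        self.counts[item] = self.counts.get(item, 0) + 1


def replay(records: Iterable[Record], jobs: List[SketchJob]) -> None:
    """Move one cursor over time-ordered records and feed every active job."""
    for timestamp, item in records:  # the cursor is the current timestamp
        for job in jobs:
            if job.start <= timestamp < job.end:
                job.observe(item)


records = [(0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "a")]
jobs = [SketchJob("early", 0, 3), SketchJob("late", 2, 5)]
replay(records, jobs)
```

The data is read once no matter how many jobs are attached, which is the property the comparison with Hadoop relies on.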
Performance under hotspots
The Hotspot Distribution
[Figure: hotspot distributions for Class A, Class B, Class C, Class D, Class E. x-axis: items in decreasing order (0 to 100); y-axis: log(value) (0 to 2.8).]
• models Flash/Hotspot/Killerapp/Blackswan events using extreme variance in popularity
• generation method: stick-breaking process, Dirichlet distribution with parallel beta sources 05
• final step: classify based on the number of hot/flash items
05 M.Zhanikeev+1 ”Popularity-Based Modeling of Flash Events in Synthetic Packet Traces” CQ研 (2012)
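As a rough illustration of the generation method, the sketch below uses the textbook stick-breaking construction with Beta(1, alpha) draws to produce a popularity list, then counts hot items. It is not the generator from 05: the "parallel beta sources" are not reproduced, and `alpha`, the item count, and the `factor` used to call an item hot are assumptions chosen for the example.

```python
import random


def stick_breaking(n: int, alpha: float, rng: random.Random) -> list:
    """Return n popularity weights summing to 1, in decreasing order."""
    remaining = 1.0
    weights = []
    for _ in range(n - 1):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # the last item takes whatever is left
    return sorted(weights, reverse=True)


def count_hot(weights: list, factor: float = 10.0) -> int:
    """Count items whose weight exceeds `factor` times the uniform share."""
    uniform = 1.0 / len(weights)
    return sum(1 for w in weights if w > factor * uniform)


rng = random.Random(42)  # fixed seed so runs are repeatable
popularity = stick_breaking(100, alpha=5.0, rng=rng)
hot_items = count_hot(popularity)
```

Smaller `alpha` concentrates popularity in fewer items; a classification step like the one on the slide could then bin traces by `hot_items`.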
The Binary ”Till Contention” Metric
• not a common, but a very realistic, way to model performance under load
• note: even more applicable under hotspot-y input
[Figure: a Client reaches Data Shards in Racks through a Border (switch); as Volume grows, the path crosses the threshold from contention-free to contention-ful.]
• example: server response time as a function of load can be expressed as:
$$T = \frac{1}{2}\left[(L - n) + \sqrt{(L - n)^2 + \frac{k}{1 - L}}\right]$$
• ...where T is response time, L is load, and k is the knee = contention point!
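A binary metric of this kind can be sketched as a function that scans a series of per-step volumes and reports the first step at which the threshold is crossed. This is a minimal sketch under an assumption: the slides do not say whether the threshold is compared against instantaneous or accumulated volume, so the instantaneous reading is used here, and the function name is hypothetical.

```python
from typing import Iterable, Optional


def time_till_contention(volumes: Iterable[float], threshold: float) -> Optional[int]:
    """Return the first time step whose volume exceeds the threshold.

    Returns None if the whole series stays contention-free.
    """
    for step, volume in enumerate(volumes):
        if volume > threshold:
            return step
    return None
```

For an accumulated reading, the same loop would compare a running total against the threshold instead.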
Performance Models
• shard size is S and the ratio of in-job traffic to shard size is r
◦ so, Hadoop jobs generate rS versus always strictly S under Replay
• contention threshold is C (covers contention and/or capacity)
• list of shard hotness (popularity) $\{h_1, h_2, h_3, \dots, h_n\}$ and sizes $\{S_1, S_2, S_3, \dots, S_n\}$
• then we have (job/traffic) volume for Hadoop:
$$V_{hadoop} = \sum_{i=1}^{n} r\, h_i S_i$$
• ... and for Replay method:
$$V_{replay} = \sum_{i=1}^{n} S_i \tag{1}$$
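The two volume formulas translate directly into code. The sketch below computes both sums for one made-up input; the hotness values (read here as jobs per shard, which is an assumption), the shard sizes, and r = 0.1 are illustrative numbers, not data from the study.

```python
def hadoop_volume(hotness, sizes, r):
    """V_hadoop: sum over shards of r * h_i * S_i."""
    return sum(r * h * s for h, s in zip(hotness, sizes))


def replay_volume(sizes):
    """V_replay: sum over shards of S_i, independent of hotness."""
    return sum(sizes)


hotness = [50, 5, 1, 1]            # assumed jobs per shard; one hot shard
sizes = [10.0, 10.0, 10.0, 10.0]   # assumed shard sizes, arbitrary units
r = 0.1                            # assumed in-job traffic to shard size ratio

v_hadoop = hadoop_volume(hotness, sizes, r)
v_replay = replay_volume(sizes)
```

With these assumed numbers the Hadoop volume is larger than the Replay volume, because the hot shard contributes once per job while Replay reads each shard once.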
Results
[Figure: three panels, one each for a replay period (step) of 10, 50, and 200, comparing hadoop and replay. y-axis: log(1 + time till contention). Sub-plots cover hotspot classes A to E at values 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, and 0.5; x-axis ticks run from 10 to 10000.]
That’s all, thank you ...
