Scheduling in distributed systems - Andrii Vozniuk

Scheduling In Distributed Systems
Candidacy exam

 Andrii Vozniuk
 EPFL
 July 4, 2012

Big Data
 Data explosion
 Processing gets more complicated

[Figure: two data sources; one generates 25 TB/day and stores 10 PB/year, the other generates 40 TB/day and stores 20 PB/year]

Resources of many computers should be used
2

Typical Data Processing Pipeline

[Figure: data processing pipeline. Inputs: log data, sensor data. Stage 1: ETL-like batch processing (clean data, analyze data), using resources of many organizations. Stage 2: efficient query execution (query data, user model). Callout: "Particle found!"]

No one-size-fits-all system currently exists
3

Outline
Gamma - parallel database
MapReduce - data-intensive system
Condor - compute-intensive system

Conclusions
Future Research

4

Scheduling In Distributed Systems
 Scheduling
 Policy: setting an ordering of tasks
 Assigning resources to tasks

How to match resources and tasks?

Scheduling is challenging in distributed systems
5
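The two parts of scheduling named on this slide, an ordering policy and the assignment of resources, can be sketched as follows. This is only an illustration: the priority-based policy, the function name `schedule`, and the first-free-resource assignment are assumptions, not something the slides prescribe.

```python
import heapq


def schedule(tasks, free_resources):
    """Order tasks by priority, then hand each one a free resource.

    tasks: list of (priority, name); a lower priority value runs first
           (hypothetical policy chosen for this sketch).
    free_resources: list of resource names that are currently idle.
    Returns (assignments, waiting): the (task, resource) pairs made now
    and the names of tasks that must wait, in policy order.
    """
    queue = list(tasks)
    heapq.heapify(queue)          # policy: ordering of tasks
    free = list(free_resources)
    assignments = []
    while queue and free:         # assignment: match tasks with resources
        _, name = heapq.heappop(queue)
        assignments.append((name, free.pop(0)))
    waiting = [name for _, name in sorted(queue)]
    return assignments, waiting
```

A real distributed scheduler would also have to account for where data lives and for node failures, which is what the systems on the following slides address.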

Matching Tasks With Resources
 Perspectives
 Data model
 Execution model

System/Perspective   Data model      Execution model
Gamma                Relational      Multioperator
MapReduce            Unconstrained   MapReduce
Condor               Unconstrained   Unconstrained

How is scheduling influenced by data and execution models?
6

Gamma Ɣ
 Pioneering parallel database
 Data model: constrained
 Relational data model
 Relations are horizontally partitioned
 Execution model: constrained
 Multioperator queries
 Operators employ hash-based algorithms

7

Gamma: Scheduler Ɣ
[Figure: Gamma query execution. Query: SELECT r FROM R WHERE r < 'k'. Host Machine: Query Manager and Catalog (optimizes query, compiles plan). Gamma Database: Scheduler Process (schedules operators) and Operator Processes on Node 1 (a-m) and Node 2 (n-z). Execution on relevant nodes.]

Scheduling is done at the operator level
8
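As a rough illustration of running operators only on relevant nodes, the sketch below assumes relation R is range-partitioned as in the figure (a-m on Node 1, n-z on Node 2) and picks the nodes that can hold tuples with r < 'k'. The function `relevant_nodes` and the dictionary layout are hypothetical, not Gamma's actual interfaces.

```python
def relevant_nodes(partitions, upper_bound):
    """Return the nodes whose key range can contain keys < upper_bound.

    partitions: dict mapping node name -> (low, high) inclusive key range
                (assumed range partitioning, as in the slide's figure).
    upper_bound: the constant from a predicate of the form r < upper_bound.
    """
    return sorted(
        node
        for node, (low, _high) in partitions.items()
        if low < upper_bound      # the range starts below the bound
    )


# Partitioning taken from the figure: Node 1 holds a-m, Node 2 holds n-z.
partitions = {"node1": ("a", "m"), "node2": ("n", "z")}
nodes_for_query = relevant_nodes(partitions, "k")
```

With this partitioning, only the node holding a-m needs an operator process for the example query.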

Gamma: Batch Scheduling Ɣ
 Exploit sharing by scheduling in a batch
 Example of selection sharing

[Figure: σ1 and σ2 each scanning A separately, versus σ1 and σ2 fed by one shared scan of A]

 Reads of A can be shared by applying the predicates in turn
 Shared relation A is scanned only once

Batch scheduling trades latency for throughput
9
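A minimal sketch of selection sharing, assuming relation A is an in-memory list of rows and each selection is a Python predicate. The function names and the read counter are illustrative only; they show that the shared version touches each row of A once, no matter how many selections are in the batch.

```python
def separate_scans(relation, predicates):
    """Baseline: every selection scans the relation on its own."""
    reads = 0
    results = []
    for pred in predicates:
        matched = []
        for row in relation:
            reads += 1
            if pred(row):
                matched.append(row)
        results.append(matched)
    return results, reads


def shared_scan(relation, predicates):
    """Batch version: scan once and apply each predicate in turn."""
    reads = 0
    results = [[] for _ in predicates]
    for row in relation:
        reads += 1                          # each row is read only once
        for i, pred in enumerate(predicates):
            if pred(row):
                results[i].append(row)
    return results, reads
```

The cost of sharing is that queries must wait for the batch to form, which is the latency-for-throughput trade named on the slide.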

Gamma: Batch Scheduling Joins Ɣ
 Several hash-joins in a batch of queries
 Hash table for the same relation can be shared
 Example assumes 100% selectivity of σ
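[Figure: query plans joining σ(A) with σ(B) and σ(A) with σ(C). Without sharing, each join builds its own hash table for A; with sharing, both joins use one shared hash table for A.]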

 Sharing reduces I/O and memory usage

Sharing among joins reduces total execution time
10
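The idea of sharing a hash table among hash joins can be sketched as below. This is a simplified, single-node illustration under assumed names (`build_hash_table`, `probe`, `batch_joins_on_a`); rows are dictionaries, the join is an equi-join on one key, and non-key column names are assumed not to collide.

```python
from collections import defaultdict


def build_hash_table(relation, key):
    """Build phase of a hash join: index the rows by the join key."""
    table = defaultdict(list)
    for row in relation:
        table[row[key]].append(row)
    return table


def probe(table, relation, key):
    """Probe phase: look up each row of the other relation in the table."""
    joined = []
    for row in relation:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


def batch_joins_on_a(a, others, key):
    """Join A with every relation in `others`, building A's table once."""
    table = build_hash_table(a, key)        # shared by all joins in the batch
    return [probe(table, other, key) for other in others]
```

Building the table once means A is read once and only one copy of its hash table is held in memory for the whole batch.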

Limitations Of Gamma Ɣ
 Gamma offers
 Efficient query execution
 Sharing in a batch of queries
 Gamma operates on structured data
 Gamma is not suitable for
 Unstructured data processing
 ETL type of workload
 Running on large scale

A different system for ETL processing is needed
11

MapReduce
 System for data-intensive applications
 Execution model: constrained
 Job is a set of map and reduce tasks
 Tasks are independent
 Data model: unconstrained
 Arbitrary data format
 Files are partitioned into chunks
 Each chunk is replicated several times

12

MapReduce: Scheduling
Example: MapReduce job with 4 map tasks and 2 reduce tasks
[Figure: Map 1 to Map 4 read Chunk1 to Chunk4 and write Temp1 to Temp4; Reduce 1 and Reduce 2 produce Result1 and Result2]
 Tasks are scheduled close to data
 Execution is scalable and fault-tolerant
 Execution is elastic
Fine-grained scheduling improves fault tolerance and elasticity
13
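Scheduling tasks close to data can be sketched as follows: when a node asks for work, prefer a map task whose input chunk has a replica on that node, and fall back to a remote chunk otherwise. The function `pick_map_task` and its data structures are hypothetical and much simpler than a real MapReduce master.

```python
def pick_map_task(node, pending, replicas):
    """Choose the next map task for `node`, preferring local data.

    node: name of the node asking for work.
    pending: list of chunk ids that still need a map task (mutated).
    replicas: dict mapping chunk id -> set of nodes holding a replica.
    Returns (chunk, is_local), or (None, False) if nothing is pending.
    """
    for chunk in pending:
        if node in replicas.get(chunk, set()):
            pending.remove(chunk)
            return chunk, True              # input is already on this node
    if pending:
        return pending.pop(0), False        # input must be read remotely
    return None, False
```

Because each chunk is replicated several times, several nodes qualify as local for the same task, which makes a data-local assignment more likely.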

MapReduce: Speculative Execution
 Nodes may become slow
 Speculative execution minimizes job’s response time
 Launch a backup task if progress is 20% less than average
[Figure: a straggler on a temporarily slow node gets a backup task on a normal node]

Speculative execution works well in a homogeneous environment
14
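The baseline rule on this slide can be sketched as below. The sketch assumes "20% less than average" means a progress score more than 0.2 below the average score of the running tasks; that reading, the function name `stragglers`, and the `gap` parameter are assumptions made for illustration.

```python
def stragglers(progress, gap=0.2):
    """Return tasks whose progress score is below (average - gap).

    progress: dict mapping task name -> progress score in [0, 1].
    gap: how far below the average a task must be to get a backup
         (0.2 is assumed here from the slide's "20%").
    """
    if not progress:
        return []
    average = sum(progress.values()) / len(progress)
    return sorted(t for t, score in progress.items() if score < average - gap)
```

Note that every task below the threshold is treated the same, regardless of how long it still needs to run.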

Emerging Heterogeneous Infrastructures
 Replacement of failed components
 Extending existing cluster with new machines
 Virtualized data centers of cloud providers
 CPU and RAM are isolated
 Contention for disk and network
[Figure: I/O performance per VM (MB/s, axis 0 to 60) versus number of VMs on a physical host (1 to 7)]

In many real-life cases the infrastructure is heterogeneous
15

MapReduce: Heterogeneous Cluster
[Figure: task execution on a fast node and on a slow node]

 Performance degrades on a heterogeneous cluster
 Slow nodes are wasted
 Backup tasks on slow nodes
 All straggling tasks are treated equally
 Thrashing due to excessive speculative execution

Speculative execution should be improved for heterogeneous clusters
16

MapReduce: LATE Scheduler
 Idea: back up the task with the largest estimated finish
time (Longest Approximate Time to End)
progress rate = progress score / execution time
estimated time left = (1 - progress score) / progress rate
 Thresholds
 Limit the number of backup tasks
 Launch backup tasks on fast nodes
 Backup only sufficiently slow tasks
LATE looks forward to prioritize tasks to speculate
17
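The two formulas and the three thresholds on this slide can be combined into a small sketch. The `Task` class, the function names, and the way the thresholds are passed in as parameters are assumptions for illustration; the sketch also assumes every task has a positive execution time.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    progress_score: float   # fraction of work done, in [0, 1]
    execution_time: float   # time since the task started, must be > 0


def progress_rate(task):
    return task.progress_score / task.execution_time


def estimated_time_left(task):
    rate = progress_rate(task)
    if rate == 0:
        return float("inf")             # no progress yet: unknown finish time
    return (1 - task.progress_score) / rate


def pick_task_to_back_up(tasks, running_backups, max_backups,
                         slow_rate_threshold, node_is_fast):
    """Return the task to speculate on, or None.

    running_backups / max_backups: limit the number of backup tasks.
    node_is_fast: only launch backups on fast nodes.
    slow_rate_threshold: only tasks with a lower progress rate qualify.
    """
    if not node_is_fast or running_backups >= max_backups:
        return None
    slow = [t for t in tasks if progress_rate(t) < slow_rate_threshold]
    if not slow:
        return None
    # Longest Approximate Time to End: largest estimated time left wins.
    return max(slow, key=estimated_time_left)
```

Ranking by estimated time left rather than by current progress is what lets the scheduler prefer a task that started recently on a slow node over one that is nearly done.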

MapReduce: LATE Example
 Back up the task with Longest Approximate Time to End
[Figure: task timelines on three nodes, snapshot at 2 min; axis: Time (min); label: improvement]
Node 1: 1 task/min
Node 2: 3x slower, progress = 66%, estimated time left: (1-0.66) / (1/3) = 1 min
Node 3: 1.9x slower, progress = 5.3%, estimated time left: (1-0.05) / (1/1.9) = 1.8 min

LATE correctly identifies the task that hurts the response time the most
18

Limitations Of MapReduce
 MapReduce offers
 High scalability
 Good fault tolerance
 Handling of unstructured data
 MapReduce is not suitable for
 Running on multi-organization infrastructure
 Harvesting idle resources in an organization

A different system for multi-organization infrastructure is needed
19

Condor
 Compute-intensive system harvesting idle resources
 Data model: arbitrary
 Execution model: arbitrary
How to increase utilization
and respect the owners?

[Figure: jobs placed on idle machines]
Increase resource utilization by scheduling jobs on idle machines
20

Condor Scheduler: Centralized?
[Figure: a single scheduler assigns all jobs]
Efficient but not reliable, possible bottleneck
21

Condor Scheduler: Distributed?
[Figure: multiple schedulers assign jobs]
Reliable but inefficient
22

Condor Scheduler: Hybrid!

[Figure: schedulers send information about tasks and information about nodes to the Matchmaker; arrows are numbered 1 to 4]
Hybrid approach has the best of both worlds
23

ClassAds: Describing Jobs and Resources
Job Description
[MyType = "Job"
 TargetType = "Machine"
 Department = "CompSci"
 Requirements = (other.OpSys == "LINUX" && other.Disk > 10000000)
 Rank = Memory]

Machine Description
[MyType = "Machine"
 TargetType = "Job"
 Machine = "nostos.cs.wisc.edu"
 OpSys = "LINUX"
 Disk = 3076077
 Requirements = (LoadAvg <= 0.3) && (KeyboardIdle > (15*60))
 Rank = other.Department == self.Department]
 Requirements should be satisfied
 Candidate with the highest rank is returned
Matchmaker is suitable for heterogeneous shared clusters
24
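The two rules on this slide (requirements must be satisfied, the highest-ranked candidate is returned) can be sketched as below. This is not the ClassAd language or Condor's matchmaker: ads are modeled as Python dictionaries whose `requirements` and `rank` entries are functions of (self, other), only the job's rank is used to order candidates, and all names and sample values other than the attributes shown on the slide are hypothetical.

```python
def matches(job, machine):
    """Both sides' requirements must hold for a match."""
    return (job["requirements"](job, machine)
            and machine["requirements"](machine, job))


def best_machine(job, machines):
    """Return the matching machine the job ranks highest, or None."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job["rank"](job, m))


def make_machine(name, disk, memory, load_avg, keyboard_idle):
    """Build a machine ad with the owner policy from the slide."""
    return {
        "Machine": name, "OpSys": "LINUX", "Disk": disk, "Memory": memory,
        "LoadAvg": load_avg, "KeyboardIdle": keyboard_idle,
        # Owner policy: machine must be lightly loaded and idle for 15 min.
        "requirements": lambda self, other: (
            self["LoadAvg"] <= 0.3 and self["KeyboardIdle"] > 15 * 60),
    }


job = {
    "Department": "CompSci",
    "requirements": lambda self, other: (
        other["OpSys"] == "LINUX" and other["Disk"] > 10000000),
    "rank": lambda self, other: other["Memory"],   # prefer more memory
}
```

Letting both the job and the machine state requirements is what allows machine owners to keep control over when their resources are used.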

Conclusions
 Scheduling is done at different levels
 Gamma: operator level scheduling enables sharing
 MR and Condor: arbitrary code => sharing is hard
 Condor: matchmaking gives control on job placement

 Hybrid approaches are promising for big data processing
 Scheduling in heterogeneous deployments is challenging

25

Thank you for your attention!

Feedback & Questions?
Andrii.Vozniuk@epfl.ch

26

References
 Matchmaking: Distributed Resource Management for
High Throughput Computing by Rajesh Raman, Miron
Livny and Marvin Solomon.
 Batch Scheduling in Parallel Database Systems by Manish
Mehta, Valery Soloviev and David J. DeWitt.
 Improving MapReduce performance in heterogeneous
environments by Matei Zaharia, Andy Konwinski, Anthony
D. Joseph, Randy Katz and Ion Stoica.
 Slides 14 and 18 exploit presentation ideas from the LATE
slides for OSDI 2008 by Matei Zaharia.

27
