Adaptive Replication for Elastic Data Stream Processing

An Adaptive Replication Scheme for Elastic Data Stream Processing
Thomas Heinze, Mariam Zia, Robert Krahn, Zbigniew Jerzak, Christof Fetzer
July 02, 2015

© 2015 SAP SE or an SAP affiliate company. All rights reserved.
Elasticity
• Utilization is below 30% in most cloud data centers
• Users need to reserve the required resources
• Limited understanding of the performance of the system
• Limited knowledge of the characteristics of the workload
[Figure: workload/resources over time. The load curve is compared with static provisioning and elastic provisioning; regions of underprovisioning and overprovisioning are marked.]

Configuring an Elastic Scaling System
• Data stream processing is highly suited for elasticity due to highly variable load and small state size (e.g. StreamCloud [1] or SEEP [2])
• Key challenge: minimize overprovisioning and the number of SLA violations by optimizing scaling decisions

Enabling Fault Tolerance
• Elasticity requires horizontal scaling → we need fault tolerance
• Two mechanisms: Active Replication vs. Upstream Backup
[Figure: recovery time (in sec.) and monetary cost ($) for the Financial workload, comparing Active Replication and Upstream Backup against a user-defined threshold.]

Outline
1. Introduction
2. An Adaptive Replication Scheme
3. Evaluation
4. Conclusion and Future Work

Related Work
a) Improving Upstream Backup:
• Sweeping checkpoints, …
• Faster recovery by using micro-batch processing (D-Stream [3], TimeStream [4])
• But: no user-configurable recovery time threshold
b) Combination of both mechanisms:
• Already proposed for Borealis by Hwang et al. [5]
• Static optimizer proposed by Upadhyaya et al. [6]
• Dynamic switching to handle the overload/fault case by Martin et al. [7] / Zhang et al. [8]
• But: static or without user-configurable recovery time threshold

Adaptive Replication Scheme
• Dynamically switch between upstream backup and active replication at runtime
• The replication scheme describes the current replication mode of all operators
[Figure: transitions between replication modes for two operator instances. States: Active Replication (process / process) and Upstream Backup (process / passive, reserved). Transitions: Switch Roles, Switch Instance 1, Switch Instance 2.]
Key Questions:
1) When do we need to switch the replication mode?
→ Estimation Model for Upstream Backup
2) How to integrate with our elastic scaling system?
→ Adaptation Algorithm

Recovery Time Estimation
• Many factors influence the recovery time:
  - Operator type and checkpointing time (static)
  - State size and queue length (changing with the current workload)
• Our solution: estimation based on historical samples
• Accuracy: 0.3 sec. error for 10 sec. recovery time (sample size: 1000)
[Figure: estimation pipeline with the stages Clustering and Estimation. Labels: Historical Samples, Clustered Samples, Current workload characteristics, Estimated Recovery Time.]
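The estimation idea can be sketched as follows. The slides do not name the clustering method, so this sketch assumes a simple grid clustering over state size and queue length; the function names, the step sizes, and the sample values are all hypothetical and only illustrate the flow from historical samples to an estimate.

```python
from collections import defaultdict
from statistics import mean


def cluster_samples(samples, state_step, queue_step):
    """Group (state_size, queue_length, recovery_time) samples into grid cells.

    Assumption: a fixed grid stands in for the unspecified clustering step.
    Each cluster is represented by the mean recovery time of its members.
    """
    cells = defaultdict(list)
    for state_size, queue_length, recovery_time in samples:
        key = (state_size // state_step, queue_length // queue_step)
        cells[key].append(recovery_time)
    return {key: mean(times) for key, times in cells.items()}


def estimate_recovery_time(clusters, state_size, queue_length, state_step, queue_step):
    """Return the mean recovery time of the cluster nearest to the current workload."""
    target = (state_size // state_step, queue_length // queue_step)
    nearest = min(
        clusters,
        key=lambda k: (k[0] - target[0]) ** 2 + (k[1] - target[1]) ** 2,
    )
    return clusters[nearest]


# Illustrative historical samples: (state size, queue length, recovery time in sec.)
history = [(100, 10, 1.0), (120, 12, 1.2), (900, 80, 8.0), (950, 90, 9.0)]
clusters = cluster_samples(history, state_step=500, queue_step=50)
estimate = estimate_recovery_time(clusters, 880, 70, state_step=500, queue_step=50)
print(f"estimated recovery time: {estimate:.1f} sec.")
```

A finer grid or a different distance measure would trade estimation accuracy against the number of samples needed per cluster.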

Single Operator Scenario
• Observe the current state size and queue length of all operators
• Adapt the replication scheme if the user threshold is not met
[Figure: estimated recovery time over time t, compared with the recovery time threshold. Labels: Operating interval, Recovery Time Threshold, Estimated Recovery Time, markers 1 and 2, Active Replication, Upstream Backup.]
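The single-operator rule can be sketched as a per-interval decision. This is a minimal illustration, assuming the estimate refers to recovery under upstream backup and that the mode is re-evaluated once per operating interval; the names and the numbers are hypothetical.

```python
ACTIVE_REPLICATION = "active replication"
UPSTREAM_BACKUP = "upstream backup"


def choose_mode(estimated_recovery_time, threshold):
    """Use active replication only when upstream backup would miss the threshold."""
    if estimated_recovery_time > threshold:
        return ACTIVE_REPLICATION
    return UPSTREAM_BACKUP


def replay(estimates, threshold):
    """Return the mode per operating interval and the intervals where it changes."""
    modes = [choose_mode(e, threshold) for e in estimates]
    switches = [i for i in range(1, len(modes)) if modes[i] != modes[i - 1]]
    return modes, switches


# Illustrative estimates (in sec.) for five consecutive operating intervals.
modes, switches = replay([2.0, 4.0, 6.5, 7.0, 3.0], threshold=5.0)
print(modes)
print("mode switches at intervals:", switches)
```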

Integration with Elastic Scaling System
• Architecture of an elastic scaling system
• Processes many queries in parallel
• Places operators on a varying number of hosts based on CPU and network consumption
• Scaling requires moving operators between hosts
• Integration
• Replication-aware operator placement
• System recovery time = max(recovery time per query)
• Monitor the recovery time for the crash of host h

Example: Multi Query Scenario
System Recovery Time: max(trec(q1), trec(q2))
q1: trec(q1) = max(trec(F1), trec(A1), trec(D1))
q2: trec(q2) = max(trec(F2), trec(A2), trec(D2))
[Figure: query graphs of q1 (S, F1, A1, D1 with replicas F1', A1', D1') and q2 (S, F2, A2, D2 with replicas F2', A2', D2').]
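The two max rules above translate directly into code. The recovery times below are invented for illustration only; the operator and query names follow the slide.

```python
def query_recovery_time(operator_times):
    """trec(q) = max over the recovery times of the query's operators."""
    return max(operator_times.values())


def system_recovery_time(queries):
    """System recovery time = max over the recovery times of all queries."""
    return max(query_recovery_time(ops) for ops in queries.values())


# Illustrative per-operator recovery time estimates in seconds.
queries = {
    "q1": {"F1": 0.4, "A1": 2.1, "D1": 1.3},
    "q2": {"F2": 0.5, "A2": 3.7, "D2": 0.9},
}
for name, ops in queries.items():
    print(name, query_recovery_time(ops))
print("system", system_recovery_time(queries))
```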

Example: Operator Placement
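[Figure: queries q1 (S, F1, A1, D1 with replicas F1', A1', D1') and q2 (S, F2, A2, D2 with replicas F2', A2', D2') placed on four hosts.]
Placement: Host 1: F1, F2 | Host 2: A2, D1', F1' | Host 3: D2', D1, A1' | Host 4: A2', D2, F1', A1
Recovery Time (max): trec(F1), trec(F2), trec(A1) | trec(A2) | trec(D1) | trec(A2), trec(D2)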

Example: Too High Recovery Time
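[Figure: queries q1 (S, F1, A1, D1 with replicas F1', A1', D1') and q2 (S, F2, A2, D2 with replicas F2', A2', D2') placed on four hosts.]
Placement: Host 1: F1, F2 | Host 2: A2, D1', F1' | Host 3: D2', D1, A1' | Host 4: A2', D2, F1', A1
Recovery Time (max): trec(F1), trec(F2), trec(A1) | trec(A2) | trec(D1) | trec(A2), trec(D2)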

Example: Too High Recovery Time
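[Figure: queries q1 (S, F1, A1, D1 with replicas F1', A1', D1') and q2 (S, F2, A2, D2 with replicas F2', A2', D2') placed on four hosts.]
Placement: Host 1: F1, F2 | Host 2: A2, D1', F1' | Host 3: D2', D1, A1' | Host 4: A2', D2, F1', A1
Recovery Time (max): trec(F1), trec(F2), trec(A1) | trec(A2), trec(D1') | trec(D1), trec(A1)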

Example: Too Low Recovery Time
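[Figure: queries q1 (S, F1, A1, D1 with replicas F1', A1', D1') and q2 (S, F2, A2, D2 with replicas F2', A2', D2') placed on four hosts.]
Placement: Host 1: F1, F2 | Host 2: A2, D1', F1' | Host 3: D2', D1, A1' | Host 4: A2', D2, F1', A1
Recovery Time (max): trec(F1), trec(F2), trec(A1)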

Example: Too Low Recovery Time
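[Figure: queries q1 (S, F1, A1, D1 with replicas F1', A1', D1') and q2 (S, F2, A2, D2 with replicas F2', A2', D2') placed on four hosts.]
Placement: Host 1: F1, F2 | Host 2: A2, D1', F1' | Host 3: D2', D1, A1' | Host 4: A2', D2, F1', A1
Recovery Time (max): trec(F1), trec(F2), trec(A1) | trec(D2)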
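The "too high" and "too low" cases can be sketched as one adaptation pass over a placement. The slides do not spell out the adaptation algorithm, so this is only a simplified stand-in under two stated assumptions: an actively replicated operator adds no recovery delay, and the recovery time for a host crash is the maximum over the upstream-backup operators on that host. The placement (primaries only), the estimates, and all names are hypothetical.

```python
ACTIVE_REPLICATION = "active replication"
UPSTREAM_BACKUP = "upstream backup"


def host_recovery_time(operators, scheme, estimates):
    """Estimated recovery time if the host running `operators` crashes.

    Assumption: only operators in upstream backup mode need recovery.
    """
    pending = [estimates[op] for op in operators if scheme[op] == UPSTREAM_BACKUP]
    return max(pending, default=0.0)


def adapt_scheme(placement, scheme, estimates, threshold):
    """Return a scheme in which every host crash stays within the threshold."""
    new_scheme = dict(scheme)
    for operators in placement.values():
        for op in operators:
            if estimates[op] > threshold:
                # Too high: upstream backup would miss the threshold.
                new_scheme[op] = ACTIVE_REPLICATION
            else:
                # Too low: upstream backup suffices and saves resources.
                new_scheme[op] = UPSTREAM_BACKUP
    return new_scheme


# Hypothetical placement of primary operators and estimates in seconds.
placement = {"host1": ["F1", "F2", "A1"], "host2": ["A2"], "host3": ["D1"], "host4": ["D2"]}
estimates = {"F1": 0.5, "F2": 0.6, "A1": 6.0, "A2": 2.0, "D1": 1.5, "D2": 1.0}
scheme = {op: UPSTREAM_BACKUP for op in estimates}
scheme["D2"] = ACTIVE_REPLICATION  # replicated although its estimate is low

new_scheme = adapt_scheme(placement, scheme, estimates, threshold=5.0)
for host, ops in placement.items():
    print(host, host_recovery_time(ops, new_scheme, estimates))
```

In this example A1 moves to active replication because its estimate exceeds the threshold, while D2 falls back to upstream backup because its estimate is well below it.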

Setup
• Private cloud environment with up to 12 hosts
• Three workloads: Financial, Twitter, Energy Sensors
• Measure characteristics such as CPU load, latency, etc. in 10-second intervals
• 20 crashes of a random host (each immediately triggers the recovery process)
• Recovery time measured as the maximal latency peak observed after a host crash
• Two baseline algorithms: Active Replication and Upstream Backup
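The recovery time metric from the setup can be sketched as the highest latency sample at or after the crash. The sample format, the function name, and the values are assumptions for illustration; the setup only states that latency is sampled every 10 seconds and that the maximal peak after a crash is taken.

```python
def recovery_time_from_latency(samples, crash_time):
    """Return the maximal latency observed at or after `crash_time`.

    `samples` is a list of (timestamp_sec, latency_sec) pairs.
    """
    after_crash = [latency for timestamp, latency in samples if timestamp >= crash_time]
    if not after_crash:
        raise ValueError("no latency samples after the crash")
    return max(after_crash)


# Illustrative latency samples taken every 10 seconds; the crash happens at t = 15.
samples = [(0, 0.2), (10, 0.3), (20, 4.8), (30, 1.1), (40, 0.2)]
print(recovery_time_from_latency(samples, crash_time=15))
```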

Recovery Time For Different Thresholds

Adaptive Replication Scheme

Summary
• Choosing between active replication and upstream backup forces a hard trade-off between resource overhead and recovery time
• Our adaptive replication scheme allows customizing this trade-off based on user configuration
Future work
• Formalize the approach for replication degree > 2
• Network-bound workloads
• Replication placement
