An Adaptive Replication Scheme For Elastic
Data Stream Processing
Thomas Heinze, Mariam Zia, Robert Krahn, Zbigniew Jerzak, Christof Fetzer
July 02, 2015
© 2015 SAP SE or an SAP affiliate company. All rights reserved.
Elasticity
• Utilization below 30% in most cloud data centers
• Users need to reserve required resources
• Limited understanding of the performance of the system
• Limited knowledge of the characteristics of the workload
[Figure: workload/resources over time, comparing the load with static provisioning and elastic provisioning; regions of underprovisioning and overprovisioning are marked.]
Configuring an Elastic Scaling System
• Data stream processing is highly suited for elasticity due to its highly variable load and small state size (e.g., StreamCloud [1] or SEEP [2])
• Key challenge: minimize overprovisioning and the number of SLA violations by optimizing scaling decisions
Enabling Fault Tolerance
• Elasticity requires horizontal scaling → we need fault tolerance
• Two mechanisms: Active Replication vs. Upstream Backup
[Figure: recovery time (in sec.) plotted against monetary cost ($) for Active Replication and Upstream Backup on the Financial workload, with the user-defined threshold marked.]
Outline
1. Introduction
2. An Adaptive Replication Scheme
3. Evaluation
4. Conclusion and Future Work
Related Work
a) Improving Upstream Backup:
• Sweeping checkpoints, …
• Faster recovery by using micro-batch processing (D-Stream [3], TimeStream [4])
• But: no user-configurable recovery time threshold
b) Combination of both mechanisms:
• Already proposed for Borealis by Hwang et al. [5]
• Static optimizer proposed by Upadhyaya et al. [6]
• Dynamic switching to handle the overload/fault case by Martin et al. [7] and Zhang et al. [8]
• But: static or without a user-configurable recovery time threshold
Adaptive Replication Scheme
• Dynamically switch between upstream backup and active replication at runtime
• The replication scheme describes the current replication mode of all operators
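[Figure: state diagram of the replication modes of an operator with two instances: Active Replication and two Upstream Backup configurations, with instances labeled process, passive, or reserved, and transitions labeled Switch Roles, Switch Instance 1, and Switch Instance 2.]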
Key Questions:
1) When do we need to switch the replication mode?
→ Estimation Model for Upstream Backup
2) How do we integrate this with our elastic scaling system?
→ Adaptation Algorithm
Recovery Time Estimation
• Many factors influence the recovery time:
  • Operator type and checkpointing time (static)
  • State size and queue length (changing with the current workload)
• Our solution: estimation based on historical samples
• Accuracy: 0.3 sec. error for a 10 sec. recovery time (sample size: 1000)
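[Figure: estimation pipeline. Historical Samples are grouped by Clustering into Clustered Samples; the Estimation step combines them with the current workload characteristics to produce the Estimated Recovery Time.]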
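The estimation step can be illustrated with a minimal sketch. The slides do not name the clustering algorithm or the exact features, so everything here is an assumption made for illustration: plain k-means over (state size, queue length), deterministic initialization, and the mean recovery time of the nearest cluster as the estimate. All function names and sample values are hypothetical.

```python
from math import dist
from statistics import mean

# A historical sample: (state_size, queue_length, observed_recovery_time_sec).
# The two features are the workload-dependent factors named on the slide.


def kmeans(points, k, iterations=20):
    """Plain k-means; initialized with the first k points (requires k <= len(points))."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(mean(axis) for axis in zip(*members))
    return centroids, assignment


def build_model(samples, k):
    """Cluster the samples and store the mean recovery time per cluster."""
    points = [(size, queue) for size, queue, _ in samples]
    centroids, assignment = kmeans(points, k)
    model = []
    for c in range(k):
        times = [t for (_, _, t), a in zip(samples, assignment) if a == c]
        if times:
            model.append((centroids[c], mean(times)))
    return model


def estimate_recovery_time(model, state_size, queue_length):
    """Return the mean recovery time of the cluster nearest to the current workload."""
    _, avg_time = min(model, key=lambda m: dist(m[0], (state_size, queue_length)))
    return avg_time


# Made-up samples, only to show the call sequence.
history = [(1.0, 10.0, 1.0), (2.0, 12.0, 1.2), (50.0, 400.0, 9.8), (52.0, 420.0, 10.2)]
model = build_model(history, k=2)
print(estimate_recovery_time(model, state_size=48.0, queue_length=390.0))
```

In practice the two features would likely need normalization, since state size and queue length are measured on different scales; the sketch omits this for brevity.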
Single Operator Scenario
• Observe the current state size and queue length of all operators
• Adapt the replication scheme if the user threshold is not met
[Figure: estimated recovery time over time t, compared against the recovery time threshold and an operating interval; markers 1 and 2 indicate switches between Active Replication and Upstream Backup.]
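The decision for a single operator can be sketched as a threshold check with a hysteresis band. This is an illustration under assumptions, not the authors' algorithm: the slides mention an operating interval but give no bounds, so the `lower_fraction` parameter and the function name `next_mode` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    UPSTREAM_BACKUP = "upstream_backup"
    ACTIVE_REPLICATION = "active_replication"


def next_mode(current, estimated_recovery_sec, threshold_sec, lower_fraction=0.8):
    """Pick the replication mode for one operator.

    estimated_recovery_sec is the estimated recovery time the operator
    would have under upstream backup. lower_fraction is an assumed
    parameter: it defines the lower edge of the operating interval and
    prevents switching back and forth around the threshold.
    """
    if estimated_recovery_sec > threshold_sec:
        # Upstream backup would violate the user threshold.
        return Mode.ACTIVE_REPLICATION
    if estimated_recovery_sec < lower_fraction * threshold_sec:
        # Comfortably below the threshold: the cheaper mode suffices.
        return Mode.UPSTREAM_BACKUP
    # Inside the operating interval: keep the current mode.
    return current


mode = Mode.UPSTREAM_BACKUP
for estimate in [3.0, 4.5, 6.0, 4.5, 3.0]:  # made-up estimates in seconds
    mode = next_mode(mode, estimate, threshold_sec=5.0)
    print(estimate, mode.value)
```

The band trades a little extra resource usage for stability: an operator that just dropped below the threshold keeps its active replica until the estimate falls clearly below it.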
Integration with Elastic Scaling System
• Architecture of an elastic scaling system
  • Processes many queries in parallel
  • Places operators on a varying number of hosts based on CPU and network consumption
  • Scaling requires moving operators between hosts
• Integration
  • Replication-aware operator placement
  • System recovery time = max(recovery time per query)
  • Monitor the recovery time for the crash of host h
Example: Multi Query Scenario
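System recovery time: max(trec(q1), trec(q2))
q1: trec(q1) = max(trec(F1), trec(A1), trec(D1))
q2: trec(q2) = max(trec(F2), trec(A2), trec(D2))
[Figure: query graphs of q1 (operators F1, A1, D1 with replicas F1', A1', D1') and q2 (operators F2, A2, D2 with replicas F2', A2', D2'), each connected to a node labeled S.]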
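The two formulas above compose into a nested maximum, which a short sketch makes concrete. The function names and all recovery time values are hypothetical and chosen only for illustration.

```python
def query_recovery_time(operators, trec):
    """trec(q): the slowest operator determines the recovery time of a query."""
    return max(trec[op] for op in operators)


def system_recovery_time(queries, trec):
    """System recovery time: the slowest query determines the overall value."""
    return max(query_recovery_time(ops, trec) for ops in queries.values())


# Operators per query as on the slide; recovery times (sec.) are made up.
queries = {"q1": ["F1", "A1", "D1"], "q2": ["F2", "A2", "D2"]}
trec = {"F1": 0.5, "A1": 4.0, "D1": 1.5, "F2": 0.4, "A2": 6.5, "D2": 2.0}

print(query_recovery_time(queries["q1"], trec))
print(system_recovery_time(queries, trec))
```

Because the system value is a maximum, a single operator with a long recovery time is enough to violate the user threshold for the whole system.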
Example: Operator Placement
[Figure: query graphs of q1 and q2 and the placement of their operators and replicas (marked with ') on Hosts 1 to 4; each host is annotated with its recovery time (max), expressed through the trec values of individual operators.]
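Monitoring the recovery time for the crash of a host can be sketched as follows. The sketch rests on assumptions the slides do not spell out: only primary instances that rely on upstream backup contribute to the recovery time, while the crash of a replica or of an actively replicated operator is treated as causing no recovery delay. The placement, the recovery times, and the function names are hypothetical and do not reproduce the slide's figure.

```python
def host_crash_recovery_time(host, placement, trec, actively_replicated):
    """Worst-case recovery time if `host` crashes.

    Assumption: replicas (names ending in ') and actively replicated
    operators do not need to be recovered from upstream backups.
    """
    times = [
        trec[op]
        for op in placement[host]
        if not op.endswith("'") and op not in actively_replicated
    ]
    return max(times, default=0.0)


def worst_host_crash(placement, trec, actively_replicated):
    """Maximum over all hosts, i.e. the value compared to the user threshold."""
    return max(
        host_crash_recovery_time(h, placement, trec, actively_replicated)
        for h in placement
    )


# Hypothetical placement of primaries and replicas on four hosts.
placement = {
    "host1": ["F1", "F2", "A1"],
    "host2": ["A2", "D1'", "F2'"],
    "host3": ["D1", "D2'", "A1'"],
    "host4": ["D2", "A2'", "F1'"],
}
trec = {"F1": 0.5, "A1": 4.0, "D1": 1.5, "F2": 0.4, "A2": 6.5, "D2": 2.0}

print(worst_host_crash(placement, trec, actively_replicated=set()))
# Switching A2 to active replication removes it from the worst case.
print(worst_host_crash(placement, trec, actively_replicated={"A2"}))
```

Evaluating the crash of every host after each placement change is what makes the placement replication-aware in this sketch: moving an operator changes which hosts dominate the maximum.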
Example: Too High Recovery Time
[Figure: first step. Placement of the operators and replicas of q1 and q2 on Hosts 1 to 4, annotated with the recovery time (max) per host, for the case where the recovery time is too high.]
Example: Too High Recovery Time
[Figure: second step. The same placement on Hosts 1 to 4 with updated per-host recovery time (max) annotations, which now also include trec(D1').]
Example: Too Low Recovery Time
[Figure: first step. Placement of the operators and replicas of q1 and q2 on Hosts 1 to 4, annotated with the recovery time (max) per host, for the case where the recovery time is too low.]
Example: Too Low Recovery Time
[Figure: second step. The same placement on Hosts 1 to 4 with updated per-host recovery time (max) annotations; one annotation is reduced from trec(A2), trec(D2) to trec(D2).]
Evaluation
Setup
• Private cloud environment with up to 12 hosts
• Three workloads: Financial, Twitter, Energy Sensors
• Measure characteristics such as CPU load and latency in 10-second intervals
• 20 crashes of a random host (each immediately triggers the recovery process)
• Recovery time measured as the maximal latency peak observed after a host crash
• Two baseline algorithms: Active Replication and Upstream Backup
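The recovery time metric from the setup can be sketched as the largest latency sample observed after a crash. The slides do not say how long after a crash the latency is observed, so the `window_sec` parameter is an assumption, as are the function name and the sample values.

```python
def measured_recovery_time(latency_samples, crash_time, window_sec=60.0):
    """Maximal latency peak observed after a host crash.

    latency_samples: iterable of (timestamp_sec, latency_sec) pairs, e.g.
    one per 10-second measurement interval.
    window_sec: assumed length of the observation window after the crash.
    Returns None if no sample falls into the window.
    """
    after_crash = [
        latency
        for timestamp, latency in latency_samples
        if crash_time <= timestamp <= crash_time + window_sec
    ]
    return max(after_crash, default=None)


# Made-up latency trace sampled every 10 seconds; the host crashes at t = 30.
trace = [(0, 0.2), (10, 0.2), (20, 0.3), (30, 0.3), (40, 7.9), (50, 3.1), (60, 0.3)]
print(measured_recovery_time(trace, crash_time=30))
```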
Recovery Time For Different Thresholds
Adaptive Replication Scheme
Summary
• Active replication and upstream backup force a hard trade-off between resource overhead and recovery time
• Our adaptive replication scheme allows this trade-off to be customized based on user configuration
Future work
• Formalize the approach for a replication degree > 2
• Network-bound workloads
• Replication placement
Thank you
Contact information:
Thomas Heinze
Research Associate
thomas.heinze@sap.com

