CloudStack Scalability

     By Alex Huang
Current Status
• 10k resources managed per management server
  node
• Scales out horizontally (must disable stats
  collector)
• Real production deployment of tens of thousands
  of resources
• Internal testing with software simulators up to
  30k physical resources with 30k VMs managed by
  4 management server nodes
• We believe we can at least double that scale per
  management server node
Balancing Incoming Requests
• Each management server has two worker thread pools for incoming
  requests: effectively two servers in one.
   – Executor threads provided by Tomcat
   – Job threads waiting on job queue
• Incoming requests that require mostly DB operations are short in
  duration and are executed directly by executor threads, because incoming
  requests are already load balanced by the load balancer.
• Incoming requests needing resources, which are often long-running,
  are checked against the ACL by the executor threads and then queued
  and picked up by job threads.
• The number of job threads is scaled to the number of DB connections
  available to the management server.
• Requests may take a long time, depending on the constraints of the
  resources, but they don’t fail.
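A minimal Python sketch of the two-pool idea, not CloudStack's actual code. The names `handle_request`, `job_worker`, `DB_CONNECTIONS` and the pool sizes are assumptions made for illustration: executor threads answer short requests directly, while requests that need a resource pass an ACL check and are then queued for job threads.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

DB_CONNECTIONS = 4          # assumed size of the DB connection pool
job_queue = queue.Queue()   # long-running work waits here for a job thread
results = {}                # job id -> outcome, filled in by job threads
results_lock = threading.Lock()


def job_worker():
    """Job thread: picks queued work off the job queue until told to stop."""
    while True:
        job = job_queue.get()
        if job is None:                 # sentinel: shut this worker down
            job_queue.task_done()
            return
        job_id, work = job
        outcome = work()
        with results_lock:
            results[job_id] = outcome
        job_queue.task_done()


def handle_request(job_id, needs_resource, is_allowed, work):
    """Executor thread: ACL check first, then run inline or queue."""
    if not is_allowed:
        return "denied"
    if not needs_resource:
        return work()                   # short, DB-only request
    job_queue.put((job_id, work))       # long-running: hand to job threads
    return "queued"


def run_demo():
    # One job thread per available DB connection (the assumption above).
    workers = [threading.Thread(target=job_worker) for _ in range(DB_CONNECTIONS)]
    for worker in workers:
        worker.start()
    with ThreadPoolExecutor(max_workers=8) as executors:
        futures = [
            executors.submit(handle_request, 1, False, True, lambda: "listed"),
            executors.submit(handle_request, 2, True, True, lambda: "vm started"),
            executors.submit(handle_request, 3, True, False, lambda: "never runs"),
        ]
        replies = [future.result() for future in futures]
    job_queue.join()                    # wait for queued jobs to finish
    for _ in workers:
        job_queue.put(None)
    for worker in workers:
        worker.join()
    return replies, dict(results)


if __name__ == "__main__":
    print(run_demo())
```

In this sketch a queued request never fails for lack of a job thread; it only waits longer in the queue.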
The Much Harder Problem
• CloudStack performs a number of tasks on behalf of
  the users, and those tasks increase with the number of
  virtual and physical resources available
   –   VM Sync
   –   SG Sync
   –   Hardware capacity monitoring
   –   Virtual resource usage statistics collection
   –   More to come
• With resources numbering in the hundreds, this is no big deal.
• As numbers increase, this problem magnifies.
• How to scale this horizontally across management
  servers?
Comparison of two Approaches
• Stats Collector – collects capacity statistics
   – Fires every five minutes to collect stats about host CPU and
     memory capacity
   – Smart server and dumb client model: the resource only
     collects info and the management server processes it
   – Runs the same way on every management server
• VM Sync
   – Fires every minute
   – Peer-to-peer model: the resource does a full sync on
     connection and delta syncs thereafter. The management
     server trusts the resource for correct information.
   – Only runs against resources connected to the management
     server node
Numbers
•   Assume 10k hosts and 500k VMs (50 VMs per host)
•   Stats Collector
     – Fires off 10k requests every 5 minutes, or 33 requests a second.
     – Bad but not too bad: occupies 33 threads every second.
     – But just wait:
          •   2 management servers: 66 requests a second
          •   3 management servers: 99 requests a second
     – It gets worse as the number of management servers increases, because it does not
       auto-balance across management servers
     – Oh, but it gets worse still: the 10k hosts are now spread across 3 management servers.
       While 99 requests are generated, the number of threads involved is threefold, because
       requests need to be routed to the right management server.
     – It keeps the management server 20% busy even with no load from incoming requests
•   VM Sync
     – Fires off 1 request at resource connection to sync about 50 VMs
     – After that, the resource pushes: it knows what it has pushed before and only pushes
       changes that are out-of-band.
     – So essentially no threads occupied for a much larger data set.
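The Stats Collector figures above can be reproduced with a few lines of arithmetic. This is only a worked calculation; the function name is made up for illustration. The exact rate is about 33.3 requests a second per management server, and the slide's 66 and 99 come from multiplying the rounded 33.

```python
HOSTS = 10_000
POLL_INTERVAL_SECONDS = 5 * 60   # stats collector fires every five minutes


def stats_requests_per_second(management_servers):
    """Every management server polls every host, so the load multiplies."""
    return management_servers * HOSTS / POLL_INTERVAL_SECONDS


for servers in (1, 2, 3):
    rate = stats_requests_per_second(servers)
    print(f"{servers} management server(s): {rate:.1f} requests a second")
```

VM Sync, by contrast, costs one request per host at connection time, regardless of how many management servers are in the cluster.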
What’s the Down Side?
• Resources must reconcile VM states caused by
  management server commands with the VM states
  they collect from the physical hardware, so
  they require more CPU
• Resources must use more memory to keep
  track of what amounts to a journal of changes
  since the last sync point.
• But data centers are full of these two
  resources.
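A rough sketch of the bookkeeping a resource might do for delta syncs, written in Python for illustration only. The class `VmSyncTracker`, its methods, and the `"Missing"` marker are hypothetical and not taken from CloudStack. The resource remembers what the management server already knows and reports only out-of-band changes.

```python
class VmSyncTracker:
    """Resource-side record of the VM states the management server knows."""

    def __init__(self):
        self._known = {}   # vm name -> state the management server knows

    def full_sync(self, observed):
        """On connection: report everything and remember it."""
        self._known = dict(observed)
        return dict(observed)

    def record_command(self, vm, state):
        """A management server command caused this state; no need to report it."""
        self._known[vm] = state

    def delta_sync(self, observed):
        """Report only the states the management server does not know about."""
        delta = {}
        for vm, state in observed.items():
            if self._known.get(vm) != state:
                delta[vm] = state
        for vm in self._known:
            if vm not in observed:
                delta[vm] = "Missing"   # hypothetical marker for a vanished VM
        self._known = dict(observed)
        return delta


if __name__ == "__main__":
    tracker = VmSyncTracker()
    tracker.full_sync({"vm-1": "Running", "vm-2": "Stopped"})
    tracker.record_command("vm-2", "Running")   # started by the management server
    print(tracker.delta_sync({"vm-1": "Stopped", "vm-2": "Running"}))
```

The extra memory is the `_known` map and the extra CPU is the comparison on each sync, which is the trade-off the slide describes.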
Resource Load Balancing
• As management servers are added to the cluster, resources are rebalanced
  seamlessly.
    –   MS2 signals to MS1 to hand over a resource
    –   MS1 waits for the commands on the resource to finish
    –   MS1 holds further commands in a queue
    –   MS1 signals to MS2 to take over
    –   MS2 connects
    –   MS2 signals to MS1 to complete transfer
    –   MS1 discards its resource and flows the commands being held to MS2
• Listeners are provided to business logic to listen on connection status and
  adjust work based on who’s connected.
• By only working on resources that are connected to the management
  server the process is on, work is auto-balanced between management
  servers.
• Also reduces the message routing between the management servers.
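The handover steps above can be sketched as a small single-threaded simulation. This is an illustration under simplifying assumptions, not the CloudStack implementation: the class and function names are invented, commands run synchronously (so there is never an in-flight command to wait for), and the signals between MS1 and MS2 are plain function calls.

```python
from collections import deque


class ManagementServer:
    def __init__(self, name):
        self.name = name
        self.connected = set()   # resources this server is connected to
        self.held = {}           # resource -> commands held during a handover
        self.executed = []       # (resource, command) pairs, in execution order

    def send(self, resource, command):
        if resource in self.held:
            self.held[resource].append(command)      # hold during handover
        elif resource in self.connected:
            self.executed.append((resource, command))
        else:
            raise LookupError(f"{resource} is not connected to {self.name}")


def begin_handover(resource, source):
    """MS2 asks MS1 to hand over; MS1 starts holding further commands."""
    source.held[resource] = deque()


def complete_handover(resource, source, target):
    """MS2 connects, then MS1 discards the resource and flows held commands."""
    target.connected.add(resource)
    source.connected.discard(resource)
    for command in source.held.pop(resource):
        target.send(resource, command)


if __name__ == "__main__":
    ms1, ms2 = ManagementServer("MS1"), ManagementServer("MS2")
    ms1.connected.add("host-1")
    ms1.send("host-1", "start vm")
    begin_handover("host-1", ms1)
    ms1.send("host-1", "stop vm")        # arrives mid-handover, gets held
    complete_handover("host-1", ms1, ms2)
    print(ms1.executed, ms2.executed)
```

Because held commands are replayed in arrival order, callers of MS1 see a delay rather than an error while the resource moves.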
Designing for Scalability
• Take advantage of the most abundant resources in a data center
  (CPU, RAM)
• Auto-scale to the least abundant resource (DB)
• Do not hold DB connections/Transactions across resource calls.
   – Use lock table implementation (Merovingian2 or
     GenericDao.acquireLockInTable() call) over database row locks in this
     situation.
   – Database row locks are still fine for quick, short lockouts.
• Balance the resource intensive tasks as # of management server
  nodes increases and decreases
   – Use job queues to balance long running processes across management
     servers
   – Make use of resource rebalancing in CloudStack to auto-balance your
     workload.
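To illustrate the lock table idea in general terms, here is a minimal Python sketch using the standard `sqlite3` module. It is not how Merovingian2 or `GenericDao.acquireLockInTable()` work internally; the `op_lock` table, its columns, and the function names are assumptions. The point it shows is that the lock is a committed row, so no transaction or row lock stays open while the slow resource call runs.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE op_lock (lock_key TEXT PRIMARY KEY, owner TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def acquire_lock(conn, key, owner):
    """Insert a lock row and commit at once; a duplicate key means it is taken."""
    try:
        with conn:   # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO op_lock (lock_key, owner) VALUES (?, ?)", (key, owner)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def release_lock(conn, key, owner):
    with conn:
        conn.execute(
            "DELETE FROM op_lock WHERE lock_key = ? AND owner = ?", (key, owner)
        )


def run_with_lock(conn, key, owner, resource_call):
    """Run a long resource call under the lock without holding a transaction."""
    if not acquire_lock(conn, key, owner):
        return None
    try:
        return resource_call()   # no DB transaction is open while this runs
    finally:
        release_lock(conn, key, owner)


if __name__ == "__main__":
    db = open_db()
    print(run_with_lock(db, "vm-42", "ms1", lambda: "vm started"))
```

A real lock table would also need a way to expire locks left behind by a crashed owner, which this sketch leaves out.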
Reliability

By Alex Huang
The Five W’s of Unreliability
• What is unreliable? Everything
• Who is unreliable? Developers & administrators
• When does unreliability happen? 3:04 a.m. no
  matter which time zone… Any time.
• Where does unreliability happen? In carefully
  planned, everything-has-been-considered data
  centers.
• How does unreliability happen? Rather
  nonchalantly
Dealing with Unreliability
•   Don’t assume!
•   Don’t bang your head against the wall!
•   Know when you don’t know any better.
•   Ask for help!
Designs against Unreliability
• Management servers keep a heartbeat with the DB: one
  ping a minute.
• A management server self-fences if it cannot write the
  heartbeat
• Other management servers wait to make sure the down
  management server is no longer writing to the heartbeat
  and then signal interested software to recover
• Checkpoints at every call to a resource, and code to deal
  with recovering from those checkpoints
• Database records are not actually deleted to help with
  manual recovery when needed
• Write code that is idempotent
• Respect modularity when writing your code
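The heartbeat and self-fencing bullets above could look roughly like the following Python sketch. Everything here is illustrative: the class name, the callbacks, and especially the `missed_beats=3` threshold are assumptions, since the slides only say that peers wait until the down server is no longer writing its heartbeat.

```python
class ManagementServerHeartbeat:
    """Writes a heartbeat and fences this server if the write fails."""

    def __init__(self, write_heartbeat, stop_work):
        self._write_heartbeat = write_heartbeat   # e.g. an UPDATE on the DB
        self._stop_work = stop_work               # what self-fencing does
        self.fenced = False

    def beat(self, now):
        if self.fenced:
            return False
        try:
            self._write_heartbeat(now)
        except Exception:
            # Cannot reach the DB: stop working rather than risk conflicts.
            self.fenced = True
            self._stop_work()
            return False
        return True


def peer_is_down(last_heartbeat, now, interval_seconds=60, missed_beats=3):
    """Peers treat a server as down only after several missed heartbeats."""
    return now - last_heartbeat > interval_seconds * missed_beats


if __name__ == "__main__":
    written = []
    heartbeat = ManagementServerHeartbeat(written.append, lambda: print("fenced"))
    heartbeat.beat(0.0)
    print(written, peer_is_down(last_heartbeat=0.0, now=200.0))
```

Waiting for several missed beats before recovering gives a self-fencing server time to actually stop, so two servers do not act on the same resource at once.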
