© DataStax, All Rights Reserved. Confidential
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cassandra
Designing Fault-Tolerant Applications
1. Why does it matter?
2. What can be done about it?
3. Let’s see it.
Why does it matter?
Outages Happen
June 17, 2019
CEO apologizes for register outages: 'A tough weekend'
Saturday’s glitch was due to an internal technical issue that caused
registers to stop working in stores nationwide, while
Sunday’s malfunction was due to problems at its vendor’s data
center.
System was down for several hours as employees worked to sort out
the situation on the biggest shopping days of the week … confusion
and long delays at stores across the U.S.
Source: Fox Business
Outages Happen
August 7, 2019
Airline resuming services after latest IT meltdown
More than 60 flights to and from Heathrow and Gatwick were
canceled and more than 100 were delayed, according to the
departure boards at the two airports. The problems started when people
tried to check in for the first flights of the day and lasted for about
12 hours.
Airline would not confirm how many people have been affected but
said it had experienced a “systems issue” affecting check-in
and flight departures at Heathrow, Gatwick and London City
airports.
Source: Reuters
Outages Happen
August 11, 2019
Banking services across Mexico go down due to data center outage
An electronic transaction services firm responsible for processing card
payments, said an electrical fault at its data centre in Santa Fe,
Mexico, was responsible for the outage that affected customers.
Customers were unable to make purchases or withdraw cash
using their credit and debit cards for several hours on Saturday.
Several Mexican media outlets reported chaos in supermarkets as
hapless shoppers were forced to abandon shopping trolleys full of
food.
Source: Techerati
Outages are Expensive
“Analysts also predicted the retailer most likely lost hundreds of millions of dollars in sales
due to the glitches.” - Fox Business
“Said a power outage that led to the cancellation of hundreds of flights last month probably
cost it about 80 million pounds ($102 million) in lost revenue and the expense of
accommodating, re-booking and compensating thousands of passengers” - Bloomberg
“Uptime Institute’s 2018 Data Center Survey polled nearly 1,500 respondents and key
findings revealed that nearly a third of all reported outages cost more than $250K; 41
respondents reported a single outage cost over $1M; and one specific incident cost over
$50M”
What is the root cause of these outages?
- Network
- Config change
- Maintenance
- Humans!
Why does it matter?
Designing fault-tolerant applications matters because outages
happen and they are expensive
… and the incidents show up in the news
What can be done about it?
Infrastructure Terminology
Cloud
Instances
Availability Zones (AZ)
Regions
Infrastructure Terminology
On-Premises
Servers
- Maps to Cloud concept of Instances
Racks
- Maps to Cloud concept of AZ
Physical Data Centers
- Maps to Cloud concept of Regions
[Diagram: two physical data centers, Virginia and California]
Cassandra Concepts
Logical Groupings
Nodes
Racks
- controls placement of data replicas
within a data center / region
Data Centers
- controls placement of data replicas
across data centers / regions
Clusters
- encompasses one to many data centers
Configured in cassandra-rackdc.properties, for example:
dc=dc1
rack=rack1
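A minimal sketch of how these two settings might be derived from cloud placement, assuming the data center name mirrors the region and the rack name mirrors the Availability Zone (the mapping the terminology slides suggest). The class and method names are hypothetical; only the `dc` and `rack` keys come from the slide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class RackDcExample {
    // Hypothetical helper: renders the two lines of cassandra-rackdc.properties
    // for one node, assuming dc = cloud region and rack = availability zone.
    static String render(String region, String availabilityZone) {
        return "dc=" + region + "\n" + "rack=" + availabilityZone + "\n";
    }

    // Parses properties text with the standard java.util.Properties loader.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(render("us-east-1", "us-east-1a"));
        System.out.println("dc=" + props.getProperty("dc"));
        System.out.println("rack=" + props.getProperty("rack"));
    }
}
```

Naming racks after AZs is only one possible convention; what matters for replica placement is that nodes sharing a failure domain share a rack name.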
Cassandra Concepts
Masterless
Cassandra Concepts
Data Distribution
Replication Factor
- Defines the number of replicas for a single row
- Defines the data centers in which the data should live
- Using “tokens”, Cassandra evenly distributes the data replicas across the logical racks in each data center
CREATE KEYSPACE shopping WITH REPLICATION = {'class':'NetworkTopologyStrategy','dc1': 3,'dc2': 3};
Token allocation setting (cassandra.yaml): allocate_tokens_for_local_replication_factor
Protecting against each scope of failure
- Single node outage: have multiple replicas
- Availability zone outage: distribute data across multiple racks or AZs
- Region outage: distribute data across multiple physical data centers or regions
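The effect of spreading replicas across racks can be checked with simple counting. The sketch below is an illustration only, not Cassandra's placement logic: it assumes RF=3 in one data center, with hypothetical node and rack names, and counts how many replicas of a row stay reachable after a node or rack (AZ) outage.

```java
import java.util.Map;

class ReplicaSurvival {
    // Counts replicas still reachable when every node in failedRack is down.
    // The map is replica node name -> rack (AZ) name for a single row.
    static long survivingAfterRackOutage(Map<String, String> replicaNodeToRack, String failedRack) {
        return replicaNodeToRack.values().stream()
            .filter(rack -> !rack.equals(failedRack))
            .count();
    }

    // Counts replicas still reachable when a single node is down.
    static long survivingAfterNodeOutage(Map<String, String> replicaNodeToRack, String failedNode) {
        return replicaNodeToRack.keySet().stream()
            .filter(node -> !node.equals(failedNode))
            .count();
    }

    public static void main(String[] args) {
        // One replica per rack (hypothetical names).
        Map<String, String> spread = Map.of("node1", "rack1", "node2", "rack2", "node3", "rack3");
        // Same replication factor, but every replica sits in the same rack.
        Map<String, String> stacked = Map.of("node1", "rack1", "node2", "rack1", "node3", "rack1");

        System.out.println(survivingAfterRackOutage(spread, "rack1"));  // 2
        System.out.println(survivingAfterRackOutage(stacked, "rack1")); // 0
        System.out.println(survivingAfterNodeOutage(stacked, "node1")); // 2
    }
}
```

With one replica per rack, a rack outage leaves two of three replicas, which is still a majority; with all replicas in one rack, the same outage leaves none.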
Cassandra Concepts
Consistency Levels
- The number of replicas that must acknowledge a read or write to the
coordinator of the query for the operation to succeed.
- Tunable consistency model: a trade-off between availability and data consistency
Cassandra Concepts
Consistency Levels
Local Data Center
- Best practice: pin each driver instance to a single data center using the DCAwareRoundRobinPolicy
- This affects coordinator selection and the group of nodes that must respond for LOCAL consistency
levels (LOCAL_ONE, LOCAL_QUORUM)
- The driver creates connection pools to each node in the local data center
DCAwareRoundRobinPolicy.builder().withLocalDc("us-east-1").build()
(Java driver 3.x API; in driver 4.x the local data center is set with CqlSession.builder().withLocalDatacenter("us-east-1"))
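To make the idea of pinning to a local data center concrete, here is a toy model of data-center-aware round-robin coordinator selection. It is a sketch under stated assumptions, not the driver's implementation: the class name, node names, and the sorted ordering are all hypothetical, and the class is not thread-safe.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class LocalDcRoundRobin {
    private final List<String> localNodes = new ArrayList<>();
    private int next = 0;

    // Keeps only nodes whose data center equals localDc; remote nodes are never chosen.
    LocalDcRoundRobin(Map<String, String> nodeToDc, String localDc) {
        for (Map.Entry<String, String> entry : nodeToDc.entrySet()) {
            if (entry.getValue().equals(localDc)) {
                localNodes.add(entry.getKey());
            }
        }
        if (localNodes.isEmpty()) {
            throw new IllegalArgumentException("no nodes in local data center " + localDc);
        }
        // Sort so the rotation order does not depend on map iteration order.
        Collections.sort(localNodes);
    }

    // Returns the next local node in rotation to act as coordinator.
    String nextCoordinator() {
        String node = localNodes.get(next);
        next = (next + 1) % localNodes.size();
        return node;
    }

    public static void main(String[] args) {
        Map<String, String> nodeToDc = Map.of(
            "east-a", "us-east-1",
            "east-b", "us-east-1",
            "west-a", "us-west-2");
        LocalDcRoundRobin policy = new LocalDcRoundRobin(nodeToDc, "us-east-1");
        System.out.println(policy.nextCoordinator()); // east-a
        System.out.println(policy.nextCoordinator()); // east-b
        System.out.println(policy.nextCoordinator()); // east-a
    }
}
```

Because only local nodes are ever picked, the coordinator for every request sits in the same data center as the application, which is what makes the LOCAL consistency levels meaningful.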
Cassandra Concepts
Consistency Levels
LOCAL_ONE
- A single replica in the local data center must receive and confirm the
write, so only one local node needs to be available for the request to
succeed.
- The coordinator of the query sends the write request to replicas in all
data centers regardless; the consistency level only determines whether the
coordinator waits for remote replicas to acknowledge the write before the
operation succeeds.
- For RF=3 … LOCAL_ONE is 1 replica
Cassandra Concepts
Consistency Levels
LOCAL_QUORUM
- A majority of replicas in the local data center must receive and confirm
the write, so only those local nodes need to be available for the request
to succeed.
- The coordinator of the query sends the write request to replicas in all
data centers regardless; the consistency level only determines whether the
coordinator waits for remote replicas to acknowledge the write before the
operation succeeds.
- For RF=3 … LOCAL_QUORUM is 2 replicas
Cassandra Concepts
Consistency Levels
EACH_QUORUM
- A majority of replicas in each data center must receive and confirm the
write, so a quorum of nodes must be available in every data center for
the request to succeed.
- For RF=3 in 2 data centers … EACH_QUORUM is 4 replicas in total (2 per data center)
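The replica counts quoted for LOCAL_QUORUM and EACH_QUORUM follow from the usual majority formula, floor(RF / 2) + 1. The sketch below reproduces that arithmetic; the class and method names are hypothetical, and the per-data-center replication factors are the ones from the `shopping` keyspace example.

```java
import java.util.Map;

class QuorumMath {
    // A quorum is a majority of replicas: floor(rf / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // LOCAL_QUORUM: a quorum of the replicas in the coordinator's data center.
    static int localQuorum(Map<String, Integer> rfByDc, String localDc) {
        return quorum(rfByDc.get(localDc));
    }

    // EACH_QUORUM: a quorum in every data center, summed across data centers.
    static int eachQuorumTotal(Map<String, Integer> rfByDc) {
        int total = 0;
        for (int rf : rfByDc.values()) {
            total += quorum(rf);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> rfByDc = Map.of("dc1", 3, "dc2", 3);
        System.out.println(localQuorum(rfByDc, "dc1")); // 2
        System.out.println(eachQuorumTotal(rfByDc));    // 4
    }
}
```

With RF=3, a local quorum of 2 means one replica in the local data center can be down without failing LOCAL_QUORUM requests, whereas EACH_QUORUM also depends on two replicas being available in the remote data center.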
Application Architecture
Availability Zone Outage
Region Outage
What can be done about it?
1. Understand the failure domains
2. Prepare and design to protect against scopes of failure
3. Use DataStax and Cassandra for a fault-tolerant database
Let’s see it.
Demo
https://github.com/datastax/dc-failover-demo
● 6 EC2 instances for DataStax Distribution of Apache Cassandra nodes, split across two
data centers:
○ Region us-east-1: 3 EC2 m5.2xlarge instances across 3 Availability Zones (AZ).
○ Region us-west-2: 3 EC2 m5.2xlarge instances across 3 AZs.
● 6 EC2 m5.large instances to be used for application services, one in each AZ.
● 2 Elastic Load Balancers (ELB), one per region, with health checks enabled.
● 1 AWS Global Accelerator as ELBs anycast frontend, with health checks enabled.
● 2 EC2 t2.small instances to be used as clients, one in each region. ( Locust )
Demo
Schema
CREATE KEYSPACE IF NOT EXISTS shopping WITH REPLICATION =
{'class':'NetworkTopologyStrategy','us-east-1': 3,'us-west-2': 3};
CREATE TABLE IF NOT EXISTS shopping.carts (
    username text,
    item_id int,
    date_added timestamp,
    item_name text,
    PRIMARY KEY (username, item_id, date_added));
Reads
SELECT * FROM shopping.carts WHERE username = ?;
Writes
INSERT INTO shopping.carts (username, item_id, date_added, item_name) VALUES (?,
?, toTimestamp(now()), ?);
Uses DataStax Java Driver for Cassandra, version 4.x
Demo
https://locust.io - “An Open Source Load Testing Tool”
Questions?
Thank you!

Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cassandra

  • 1. © DataStax, All Rights Reserved.Confidential Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cassandra 1 © DataStax, All Rights Reserved. Confidential
  • 2. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. Designing Fault Tolerant Applications 1. Why does it matter? 2. What can be done about it? 3. Let’s see it.
  • 3. © DataStax, All Rights Reserved.Confidential Why does it matter?
  • 4. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. Outages Happen June 17, 2019 CEO apologizes for register outages: 'A tough weekend' Saturday’s glitch was due to an internal technical issue that caused registers to stop working in stores nationwide, while Sunday’s malfunction was due to problems at its vendor’s data center. System was down for several hours as employees worked to sort out the situation on the biggest shopping days of the week … confusion and long delays at stores across the U.S. Source: Fox Business
  • 5. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. Outages Happen More than 60 flights to and from Heathrow and Gatwick were canceled and more than 100 were delayed, according to the departure boards at the two airports. The problems started when people tried to check in for the first flights of the day and lasted for about 12 hours. Airline would not confirm how many people have been affected but said it had experienced a “systems issue” affecting check-in and flight departures at Heathrow, Gatwick and London City airports. Source: Reuters August 7, 2019 Airline resuming services after latest IT meltdown
  • 6. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. Outages Happen August 11, 2019 Banking services across Mexico go down due to data center outage An electronic transaction services firm responsible for processing card payments, said an electrical fault at its data centre in Santa Fe, Mexico, was responsible for the outage that affected customers. Customers were unable to make purchases or withdraw cash using their credit and debit cards for several hours on Saturday. Several Mexican media outlets reported chaos in supermarkets as hapless shoppers were forced to abandon shopping trolleys full of food. Source: Techerati
  • 7. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. Outages are Expensive “Analysts also predicted the retailer most likely lost hundreds of millions of dollars in sales due to the glitches.” - Fox Business “Said a power outage that led to the cancellation of hundreds of flights last month probably cost it about 80 million pounds ($102 million) in lost revenue and the expense of accommodating, re-booking and compensating thousands of passengers” - Bloomberg “Uptime Institute’s 2018 Data Center Survey polled nearly 1,500 respondents and key findings revealed that nearly a third of all reported outages cost more than $250K; 41 respondents reported a single outage cost over $1M; and one specific incident cost over $50M”
  • 8. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. What is the root cause of these outages?
  • 9. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. What is the root cause of these outages? Network
  • 10. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. What is the root cause of these outages? Network Config change
  • 11. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. What is the root cause of these outages? Network Config change
  • 12. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. What is the root cause of these outages? Network Config change Maintenance
  • 13. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. What is the root cause of these outages? Network Config change Maintenance
  • 14. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. What is the root cause of these outages? Network Config change Maintenance Humans!
  • 15. © DataStax, All Rights Reserved.Confidential Why does it matter? Designing fault tolerant applications matters because outages happen and they are expensive … and the incidents show up in the news
  • 16. © DataStax, All Rights Reserved.Confidential What can be done about it?
  • 17. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. Infrastructure Terminology Cloud Instances Availability Zones (AZ) Regions
  • 18. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. On-Premises Servers - Maps to Cloud concept of Instances Racks - Maps to Cloud concept of AZ Physical Data Centers - Maps to Cloud concept of Regions Physical Data Center Infrastructure Terminology Virginia California Physical Data Center
  • 19. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. Cassandra Concepts Logical Groupings Nodes Racks - controls placement of data replicas within a data center / region Data Centers - controls placement of data replicas across data centers / regions Clusters - encompases one to many data centers Configured in cassandra-rackdc.properties dc=dc1 rack=rack1example
  • 20. © DataStax, All Rights Reserved.ConfidentialConfidential© DataStax, All Rights Reserved. Cassandra Concepts Masterless
  • 21-24. Cassandra Concepts: Data Distribution
      - Replication factor: defines the number of replicas for a single row and the data centers in which the data should live
      - Using "tokens", Cassandra evenly distributes the data replicas across the logical racks of each data center (related setting: allocate_tokens_for_local_replication_factor)
      - Example: CREATE KEYSPACE shopping WITH REPLICATION = {'class':'NetworkTopologyStrategy','dc1': 3,'dc2': 3};
      - To protect against a single node outage, have multiple replicas
      - To protect against an availability zone outage, distribute data across multiple racks or AZs
      - To protect against a region outage, distribute data across multiple physical data centers or regions
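The rack-aware placement described on these slides can be illustrated with a deliberately simplified model. This is a sketch, not Cassandra's actual NetworkTopologyStrategy code: the six-node ring, the token values, and the names `RING` and `replicas_for` are all hypothetical, and real clusters derive tokens from the partition key with a partitioner rather than taking them as input.

```python
from bisect import bisect_left

# Hypothetical single data center: (token, node, rack), sorted by token.
RING = [
    (0, "n1", "rack1"), (100, "n2", "rack2"), (200, "n3", "rack3"),
    (300, "n4", "rack1"), (400, "n5", "rack2"), (500, "n6", "rack3"),
]
RACK_OF = {node: rack for _, node, rack in RING}


def replicas_for(token, rf):
    """Walk the ring clockwise from the token's owner, preferring unseen racks."""
    tokens = [t for t, _, _ in RING]
    # The owner is the first node whose token is >= the key's token (wrapping around).
    start = bisect_left(tokens, token) % len(RING)
    walk = [RING[(start + i) % len(RING)] for i in range(len(RING))]

    chosen = []
    seen_racks = set()
    # First pass: at most one replica per rack.
    for _, node, rack in walk:
        if len(chosen) < rf and rack not in seen_racks:
            chosen.append(node)
            seen_racks.add(rack)
    # Second pass: if rf exceeds the number of racks, fill with remaining nodes.
    for _, node, _ in walk:
        if len(chosen) < rf and node not in chosen:
            chosen.append(node)
    return chosen
```

With three racks and a replication factor of 3, every row in this model ends up with one replica per rack, which is why losing a single rack (or availability zone) still leaves two replicas reachable.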
  • 25-28. Cassandra Concepts: Consistency Levels
      - The number of replicas that must acknowledge a read or write operation to the coordinator of the query for the operation to succeed
      - Tunable consistency model: a trade-off between availability and data consistency
  • 29. Cassandra Concepts: Consistency Levels, Local Data Center
      - Best practice: pin driver instances to a single data center using the DCAwareRoundRobinPolicy
      - This affects coordinator selection and the group of nodes that must respond for LOCAL consistency levels (LOCAL_ONE, LOCAL_QUORUM)
      - The driver creates connection pools to each node in the local data center
      - Example: DCAwareRoundRobinPolicy.builder().withLocalDc("us-east-1")
  • 30. Cassandra Concepts: Consistency Levels, LOCAL_ONE
      - A write succeeds once a single replica in the local data center has acknowledged it, so only one local replica needs to be available
      - The coordinator of the query still sends the write request to replicas in all data centers; the difference lies in whether the coordinator waits for remote replicas to acknowledge the write before the operation succeeds
      - For RF=3, LOCAL_ONE is 1 replica
  • 31. Cassandra Concepts: Consistency Levels, LOCAL_QUORUM
      - A write succeeds once a majority of replicas in the local data center have acknowledged it, so only those local replicas need to be available
      - The coordinator of the query still sends the write request to replicas in all data centers; the difference lies in whether the coordinator waits for remote replicas to acknowledge the write before the operation succeeds
      - For RF=3, LOCAL_QUORUM is 2 replicas
  • 32. Cassandra Concepts: Consistency Levels, EACH_QUORUM
      - A write succeeds once a majority of replicas in each data center have acknowledged it, so a quorum of replicas must be available in every data center
      - For RF=3 in 2 data centers, EACH_QUORUM is 4 total replicas
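The replica counts quoted on these slides follow from the usual quorum formula, floor(RF / 2) + 1. The sketch below works through that arithmetic; the function names and the `REPLICATION` map are illustrative assumptions (the data center names mirror the keyspace example), not part of any driver API.

```python
# Hypothetical replication map, matching the slides' two data centers with RF=3 each.
REPLICATION = {"dc1": 3, "dc2": 3}


def quorum(rf):
    """Majority of rf replicas."""
    return rf // 2 + 1


def required_acks(level, replication, local_dc):
    """Return {data center: acknowledgements the coordinator waits for}."""
    if level == "LOCAL_ONE":
        return {local_dc: 1}
    if level == "LOCAL_QUORUM":
        return {local_dc: quorum(replication[local_dc])}
    if level == "EACH_QUORUM":
        return {dc: quorum(rf) for dc, rf in replication.items()}
    raise ValueError(f"unsupported consistency level: {level}")
```

For example, `required_acks("EACH_QUORUM", REPLICATION, "dc1")` yields two acknowledgements per data center, which is the "4 total replicas" figure from slide 32.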
  • 33. Application Architecture
  • 34. Application Architecture
  • 35. Availability Zone Outage
  • 36. Region Outage
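The two outage slides can be reasoned about with a small availability check. This is a rough model under stated assumptions: each region holds three replicas, one per availability zone, and the helper `can_satisfy` and the scenario dictionaries are hypothetical names introduced only for illustration.

```python
# Assumed layout: RF=3 per region, one replica in each of three AZs.
REPLICATION = {"us-east-1": 3, "us-west-2": 3}


def quorum(rf):
    return rf // 2 + 1


def can_satisfy(level, local_dc, live):
    """live maps each data center to its number of reachable replicas."""
    if level == "LOCAL_ONE":
        return live[local_dc] >= 1
    if level == "LOCAL_QUORUM":
        return live[local_dc] >= quorum(REPLICATION[local_dc])
    if level == "EACH_QUORUM":
        return all(live[dc] >= quorum(rf) for dc, rf in REPLICATION.items())
    raise ValueError(f"unsupported consistency level: {level}")


# One AZ lost in us-east-1: two of its three replicas remain.
AZ_OUTAGE = {"us-east-1": 2, "us-west-2": 3}
# Whole us-east-1 region lost.
REGION_OUTAGE = {"us-east-1": 0, "us-west-2": 3}
```

In this model an AZ outage leaves every level satisfiable, while a region outage breaks EACH_QUORUM everywhere and LOCAL_QUORUM only for clients pinned to the failed region. Clients routed to us-west-2 can keep using LOCAL_QUORUM.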
  • 37. What can be done about it?
      1. Understand the failure domains
      2. Prepare and design to protect against scopes of failure
      3. Use DataStax and Cassandra for a fault-tolerant database
  • 38. Let's see it.
  • 39. Demo: https://github.com/datastax/dc-failover-demo
      ● 6 EC2 instances for DataStax Distribution of Apache Cassandra nodes, segregated in two data centers:
          ○ Region us-east-1: 3 EC2 m5.2xlarge instances across 3 Availability Zones (AZs)
          ○ Region us-west-2: 3 EC2 m5.2xlarge instances across 3 AZs
      ● 6 EC2 m5.large instances for application services, one in each AZ
      ● 2 Elastic Load Balancers (ELBs), one per region, with health checks enabled
      ● 1 AWS Global Accelerator as the anycast frontend for the ELBs, with health checks enabled
      ● 2 EC2 t2.small instances used as clients (Locust), one in each region
  • 40. Demo
  • 41. Demo (uses DataStax Java Driver for Cassandra, version 4.x)
      - Schema: CREATE KEYSPACE IF NOT EXISTS shopping WITH REPLICATION = {'class':'NetworkTopologyStrategy','us-east-1': 3,'us-west-2': 3}; CREATE TABLE IF NOT EXISTS shopping.carts ( username text, item_id int, date_added timestamp, item_name text, PRIMARY KEY (username, item_id, date_added))
      - Reads: SELECT * FROM shopping.carts WHERE username = ?
      - Writes: INSERT INTO shopping.carts (username, item_id, date_added, item_name) VALUES (?, ?, toTimestamp(now()), ?)
  • 42. Demo: https://locust.io, "An Open Source Load Testing Tool"
  • 43. Questions?
  • 44. Thank you!

Editor's Notes

  • #26-#29: Applications that process transactions must always read the most recent write, which requires stronger consistency levels. For product ratings, or for a Spark batch job, "old" reads may be tolerable, so looser, more eventually consistent levels can be used.
  • #34: The client in this diagram hits the load balancer (LB) after the domain name system (DNS) resolves the name of the host. The LB then distributes the traffic within the region to the API gateway service instances in an availability zone. The API gateway routes the traffic to each microservice instance, which in turn sends its database requests to the nodes in the local data center that the DataStax driver is connected to. To keep the diagram simple, we show only a single service type, "Order Service", but the same principles apply to applications composed of many services.
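The trade-off in the note for slides 26-29 is commonly summarized by the rule that a read is guaranteed to overlap the latest acknowledged write when read replicas plus write replicas exceed the replication factor (R + W > RF). The sketch below applies that rule within a single data center; the function name and the pairings shown are illustrative assumptions, not recommendations from the deck.

```python
def reads_see_latest_write(rf, write_acks, read_acks):
    """True when every read set must intersect every acknowledged write set."""
    return read_acks + write_acks > rf


RF = 3
QUORUM = RF // 2 + 1  # 2 replicas for RF=3

# Transaction-style workload: quorum writes and quorum reads overlap.
STRICT = reads_see_latest_write(RF, QUORUM, QUORUM)
# Ratings or batch-style workload: single-replica reads and writes may return older data.
RELAXED = reads_see_latest_write(RF, 1, 1)
```

Here `STRICT` is true and `RELAXED` is false, which matches the note: the stricter pairing costs availability (two of three local replicas must respond), while the relaxed pairing tolerates more failures at the price of possibly stale reads.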