Building a Multi-Region Cluster at Target
Presenters: Andrew From and Aaron Ploetz
1 Introduction
2 Target’s DCD Cluster
3 Problems
4 Solutions
5 Current State
6 Lessons Learned
© DataStax, All Rights Reserved.
TTS - CPE Cassandra
Data model consulting
Deployment
Operations
Aaron Ploetz
$ whoami_
• B.S.-MCS University of Wisconsin-Whitewater.
• M.S.-SED Regis University.
• 10+ years of experience with distributed storage tech.
• Supporting Cassandra in production since v0.8.
• Contributor to the Apache Cassandra project (cqlsh).
• Contributor to the Cassandra tag on Stack Overflow.
• 3x DataStax MVP for Apache Cassandra (2014-17).
Andrew From
$ whoami_
• B.S. Computer Engineering, University of Minnesota.
• Using Cassandra in production since v2.0.
• Contributor to the Ratpack Framework project.
• Contributor to the Cassandra tag on Stack Overflow.
• Maintainer of the statsd-kafka-backend plugin in the Target public GitHub org.
• Very passionate about metrics, monitoring, and alerting (not just on Cassandra).
Introduction
Cassandra at Target
Cassandra clusters at Target
• Cartwheel
• Personalization
• GAM
• DCD
• Subscriptions
• Enterprise Services
• GDM
• Item Locations
• Checkout
• Adaptive-SEO
Versions used at Target
• Depends on experience of the application team.
• Most clusters run DSE 4.0.3.
• Some clusters have been built with Apache Cassandra 2.1.12 through 2.1.15.
• Most new clusters are built on Apache Cassandra 2.2.7.
Target’s DCD Cluster
• Data footprint = 350GB
• Multi-tenant cluster, supporting several teams:
• Rating and Reviews
• Pricing
• Item
• Promotions
• Content
• A/B Testing
• Back To School, Shopping Lists
• …and others
2015 Peak (Black Friday -> Cyber Monday)
• DataStax Enterprise 4.0.3 (Cassandra 2.0.7.31)
• Java 1.7
• CMS GC
• 18 nodes (256GB RAM, 6-core, 24 HT CPUs, 1TB disk):
• 6 nodes at two Target data centers (3 on-the-metal nodes each).
• 12 nodes at two Tel-co data centers (6 on-the-metal nodes each).
• 500 Mbit connection with Tel-co data centers.
• Sustained between 5000 and 5500 TPS during Peak.
• Zero downtime!
2016 Q1: Expand DCD to the cloud
Plan to expand DCD Cassandra cluster to cloud East and West regions:
• Add six cloud instances in each region.
• VPN connection to Target network.
• Data locality: Support teams spinning up scalable application servers in the cloud.
• Went “live” early March 2016:
• Cloud-West – 6 nodes
• Cloud-East – 6 nodes
• TTC – 3 nodes
• TTCE – 3 nodes
• Tel-co 1 – 6 nodes
• Tel-co 2 – 6 nodes
Problems
• VPN not stable between Cloud and Target.
• GC Pauses causing application latency.
• Orphaned repair streams building up over time.
• Data inconsistency between Tel-co and Cloud.
• Issues with large batches.
• Issues with poor data models.
Solutions
• VPN not stable between Cloud and Target
• Gossip between Tel-co and Cloud was unreliable.
• Nodes reporting as “DN” that were actually “UN.”
• Worked with Cloud Provider admins, reviewed architecture.
• Set up increased monitoring to determine DCD Cassandra downtimes, then cross-referenced those with VPN connection logs together with both our (Target) Cloud Platform Engineering (CPE) Network team and the Cloud Provider admins.
• Our CPE Network team worked with the Cloud Provider to:
• Ensure proper network configuration.
• Enable “bi-directional” dead peer detection (DPD).
• Have a “DPD Responder” handle dead peer requests from the Cloud endpoint.
• Upgrade our VPN connections to 1Gbit.
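One way to surface the “DN that is actually UN” symptom is to compare how different nodes see the ring. The sketch below is illustrative only and is not the monitoring the team built: it assumes `nodetool status` output has already been collected as text from several vantage nodes (how it is collected, and the vantage names, are hypothetical) and reports addresses whose state differs between views.

```python
import re

# Matches node rows of `nodetool status`, e.g. "UN  10.0.0.1  ...".
# First letter: U(p)/D(own); second: N(ormal)/L(eaving)/J(oining)/M(oving).
NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def parse_status(text):
    """Return {address: state} parsed from `nodetool status` output text."""
    states = {}
    for line in text.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


def disagreements(views):
    """views maps a vantage node name (hypothetical label) to {address: state}.

    Returns {address: {vantage: state}} for every address that the
    vantage points do not agree on. "??" marks an address a vantage
    point did not list at all.
    """
    addresses = set()
    for states in views.values():
        addresses.update(states)
    result = {}
    for address in sorted(addresses):
        opinions = {v: s.get(address, "??") for v, s in views.items()}
        if len(set(opinions.values())) > 1:
            result[address] = opinions
    return result
```

Timestamps of such disagreements could then be lined up against VPN connection logs, which is the kind of exact data a network team can act on.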
• GC Pauses causing application latency
• Stop-the-world (STW) GC pauses of 10-20 seconds (or more) rendered nodes unresponsive during nightly batch jobs (9pm – 6am), most often around 2am.
• Was a small issue just prior to 2015 Peak Season; became more of a problem once we expanded to the cloud.
• Worked with our teams to refine their data models and write patterns.
• Upgraded to Java 1.8.
• Enabled G1GC: the biggest “win.”
• Orphaned repair streams building up over time
• Due to VPN instability, repair streams between Cloud and Tel-co could not complete consistently.
• Orphaned streams (pending repairs) built up over time, system load average on Tel-co and Target nodes rose, and nodes eventually became unresponsive.
• Examined our use cases and scheduled “focused” repair jobs, only for certain applications:
nodetool -h <node_ip> repair -pr -hosts <source_ip1,source_ip2,etc…>
• Only run repairs between Cloud and Target or Tel-co and Target.
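The focused repair rule above could be encoded in a small scheduling helper. This is a minimal sketch under stated assumptions, not the team's tooling: the group labels (`target`, `telco`, `cloud`) and both function names are hypothetical, and the command it builds simply mirrors the `nodetool` invocation shown on the slide.

```python
# Hypothetical labels for the three kinds of data center in the talk.
TARGET, TELCO, CLOUD = "target", "telco", "cloud"

# Per the slide: repairs run only Cloud<->Target or Tel-co<->Target.
ALLOWED_PAIRS = {frozenset({CLOUD, TARGET}), frozenset({TELCO, TARGET})}


def repair_allowed(group_a, group_b):
    """True if a repair between these two data center groups is permitted."""
    return frozenset({group_a, group_b}) in ALLOWED_PAIRS


def repair_command(node_ip, source_ips):
    """Build the argv for the focused repair command from the slide.

    The list can be handed to subprocess.run() by a scheduler; nothing
    is executed here.
    """
    if not source_ips:
        raise ValueError("at least one source host is required")
    return ["nodetool", "-h", node_ip, "repair", "-pr",
            "-hosts", ",".join(source_ips)]
```

Building the argument list separately from running it keeps the pairing rule easy to check before any repair stream is started.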
• Data inconsistency between Tel-co and Cloud
• Data inconsistency issues were causing problems for certain teams.
• Upgraded our Tel-co connection to 1Gbit, expandable to 10Gbit on request.
• But this was ultimately the wall we could not get around.
• Problems with repairs, but also issues with bootstrapping and decommissioning nodes.
• Met with ALL application teams:
• Discussed a future plan to split off the Cloud nodes (with six new Target nodes) into a new “DCD Cloud” cluster.
• Talked through application requirements and determined who would need to move/split to “DCD Cloud.”
• Also challenged the application teams on who really needed to be in both Cloud and Tel-co. It turned out that only the Pricing team needed both; the rest could serve their requirements from one or the other.
Issues with large batches
• Teams unknowingly using batches:
Poor Data Models
• Using Cassandra in a batch way:
• Reading large tables entirely with SELECT * FROM table;
• Re-inserting data to a table on a schedule, even when data has not changed
• Lack of de-normalized tables to support specific queries
• Queries using ALLOW FILTERING
• “XREF” tables
• Queue-like usage of tables
• Extremely large row sizes
• Abuse of Collection data types
• Read before writing (or writing then immediately reading)
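For the “re-inserting data on a schedule, even when data has not changed” pattern, one possible mitigation is to skip unchanged rows on the loader side. The sketch below is an illustration, not something described in the talk: it assumes rows arrive as dicts of JSON-serializable values, and that the loader keeps the digests from its previous run itself (for example in a local file), so Cassandra is never read before writing. All names are hypothetical.

```python
import hashlib
import json


def row_digest(row):
    """Stable SHA-256 digest of a row given as a dict of JSON-serializable values."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_rows(rows, key_field, previous_digests):
    """Return (rows_to_write, new_digests).

    previous_digests maps primary key -> digest from the last load.
    Only rows that are new or whose content changed are returned for
    writing; new_digests should be saved for the next scheduled run.
    """
    to_write, new_digests = [], {}
    for row in rows:
        digest = row_digest(row)
        new_digests[row[key_field]] = digest
        if previous_digests.get(row[key_field]) != digest:
            to_write.append(row)
    return to_write, new_digests
```

Fewer redundant writes means less compaction work and fewer overwritten cells to reconcile during repair.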
Current State
August 2016
• DataStax Enterprise 4.0.3 (Cassandra 2.0.7.31); upgrade to 2.1 planned.
• Java 1.8
• G1GC, 32GB heap, MaxGCPauseMillis=500
• DCD Classic - 18 nodes (256GB RAM, 24 HT CPUs, 1TB disk):
• 6 nodes at two Target data centers (3 on-the-metal nodes each).
• 12 nodes at two Tel-co data centers (6 on-the-metal nodes each).
• DCD Cloud - 18 nodes (256GB RAM, 24 HT CPUs):
• 6 nodes at two Target data centers (3 on-the-metal nodes each).
• 12 nodes at two Cloud data centers (6 i2.2xlarge nodes each).
• Upgraded connection (up to 10 Gbit) with Tel-co.
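The GC settings listed above map onto standard HotSpot flags. As a rough sketch, the helper below renders them in the `JVM_OPTS="$JVM_OPTS ..."` form used by `cassandra-env.sh`. Setting the minimum heap equal to the maximum is an assumption made here, not something the slide states, and both function names are hypothetical. Any existing CMS flags in that file would presumably need to be removed as well.

```python
def g1_jvm_flags(heap_gb=32, max_pause_ms=500):
    """Return HotSpot flags for a G1 setup like the one on the slide.

    Defaults come from the slide (32GB heap, MaxGCPauseMillis=500).
    Pinning -Xms to the same value as -Xmx is an assumption.
    """
    return [
        f"-Xms{heap_gb}G",
        f"-Xmx{heap_gb}G",
        "-XX:+UseG1GC",
        f"-XX:MaxGCPauseMillis={max_pause_ms}",
    ]


def as_env_lines(flags):
    """Render flags as lines suitable for appending to cassandra-env.sh."""
    return [f'JVM_OPTS="$JVM_OPTS {flag}"' for flag in flags]
```

`MaxGCPauseMillis` is a target that G1 tries to meet rather than a guarantee, which fits the later caveat that G1GC is not a silver bullet.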
Lessons Learned
• Spend time on your data model!
• Most overlooked aspect of Cassandra architecting.
• Build relationships with your application teams.
• Build tables to suit the query patterns.
• Talk about consistency requirements with your app teams.
• Ask the questions:
• Do you really need to replicate everywhere?
• Do you really need to read/write at LOCAL_QUORUM?
• What is your anticipated read/write ratio?
• Watch for tombstones! TTL-ing data helps, but TTLs are not free!
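The replication and consistency questions above come down to simple arithmetic: a quorum is a majority of replicas, so it is `floor(RF / 2) + 1`. The sketch below works that out for `LOCAL_QUORUM` (replicas in the local data center only) and `QUORUM` (replicas across all data centers). The replication factors used in the usage note are hypothetical, since the talk does not state them.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a quorum."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


def local_quorum(rf_by_dc, dc):
    """Replicas required by LOCAL_QUORUM for a coordinator in `dc`."""
    return quorum(rf_by_dc[dc])


def global_quorum(rf_by_dc):
    """Replicas required by QUORUM, counted across every data center."""
    return quorum(sum(rf_by_dc.values()))
```

For example, with a hypothetical replication factor of 3 in each of four data centers, `local_quorum` needs 2 replicas in the local data center, while `global_quorum` needs 7 of 12 and therefore depends on cross-data-center links. Each extra data center in the replication map also adds write and repair traffic over those links.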
Lessons Learned (part 2)
• Involve your network team.
• Use a metrics solution (Graphite/Grafana, OpsCenter, etc.).
• Give them exact data to work with.
• When building a new cluster with data centers in the cloud, thoroughly test out the operational aspects of your cluster:
• Bootstrapping/decommissioning a node.
• Running repairs.
• Trigger a GC (cassandra-stress can help with that).
• G1GC helps for larger heap sizes, but it’s not a silver bullet.
Questions?

Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra Summit 2016