a talk
Nelson Elhage, @nelhage
Operating Consul
As an Early Adopter
This Talk
• consul @ Stripe
• War Stories
• Lessons Learned
Consul at Stripe
The Good, The Bad, The Outages
Why Consul?
• Early 2014
• Stripe Infra gaining complexity
• Nightmarish in-house service registry
• Host lists distributed via puppet
Why Consul?
• Wanted a better service/host store
• consul had everything baked in
• Decided to do some test deployments
Initial Rollout
• Rolled out across all servers
• (started with bake-in in QA)
• No clients at all
What Could Go Wrong?
• We worried about memory leaks
Our First Production Issue
• Noticed one node taking >100M RAM
• (others all <50M)
• Reached out to armon for advice
• bug in the stats framework:
• https://github.com/armon/go-metrics/commit/02567bbc4f518a43853d262b651a3c8257c3f141
Started Adding Clients
• Hooked into our deploy tool
• kept a manual emergency fallback
• Generated LB config from consul
• Noticed a surprising rate of errors
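
Generating load-balancer config from consul can be sketched roughly as below. This is a minimal illustration, not Stripe's actual tooling: it assumes a local agent at 127.0.0.1:8500, uses consul's `/v1/health/service/<name>?passing` HTTP endpoint, and emits an haproxy-style backend stanza. The function names and the optional `fetch` parameter are hypothetical, added so the logic can be exercised without a live agent.

```python
import json
from urllib.request import urlopen


def healthy_instances(service, consul="http://127.0.0.1:8500", fetch=None):
    """Return sorted (node, address, port) tuples for instances passing their checks.

    `fetch` is a hypothetical hook: a callable taking a URL and returning the
    JSON body, so the function can be run against canned data.
    """
    url = f"{consul}/v1/health/service/{service}?passing"
    raw = fetch(url) if fetch else urlopen(url).read()
    instances = []
    for entry in json.loads(raw):
        # Fall back to the node address when the service has no address of its own.
        addr = entry["Service"].get("Address") or entry["Node"]["Address"]
        instances.append((entry["Node"]["Node"], addr, entry["Service"]["Port"]))
    return sorted(instances)


def render_backend(service, instances):
    """Render an haproxy-style backend stanza, one server line per instance."""
    lines = [f"backend {service}"]
    for node, addr, port in instances:
        lines.append(f"    server {node} {addr}:{port} check")
    return "\n".join(lines) + "\n"
```

Sorting the instances keeps the generated file stable between runs, so a config reload is only needed when membership actually changes.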
Raft Instability
• Seeing >1 failover/minute
• Reached out to Armon
• “Try 0.3”
• “consul is not optimized for spinning disk”
Rolling out 0.3
• Roll to QA first
• Nothing works!
• Check logs: TLS verification errors
Rolling out 0.3
• 0.3 changed TLS verification to check the
cert name
• Change our SSL issuing to add SANs
• 2014/06/16 16:52:57 [ERR] raft: Failed to make RequestVote
RPC to 10.100.29.175:8300: x509: certificate is valid for
[remote host], not [local host]
0.3 TLS Woes
• Whoops! consul was checking the remote
cert against the local node name
• armon> we just use "demo.consul.io" as
the CN for all of them
• 0.3 essentially completely broke TLS
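
The mistake described above can be illustrated with a small sketch. This is not consul's Go code; it is a simplified Python model (exact-match names only, no wildcard handling) that uses the certificate dictionary layout returned by Python's `ssl.SSLSocket.getpeercert()`. The host names and function names are made up for illustration.

```python
def cert_dns_names(cert):
    """Names a certificate claims: DNS SANs if present, else the subject CN."""
    sans = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    if sans:
        return sans
    return [value
            for rdn in cert.get("subject", ())
            for key, value in rdn
            if key == "commonName"]


def name_matches(cert, expected_name):
    """True if expected_name is one of the names the certificate is valid for."""
    wanted = expected_name.lower()
    return any(wanted == name.lower() for name in cert_dns_names(cert))


# Hypothetical certificate presented by the *remote* server.
remote_cert = {
    "subject": ((("commonName", "consul-b.example"),),),
    "subjectAltName": (("DNS", "consul-b.example"),),
}

# Correct check: compare the remote certificate against the remote name.
ok = name_matches(remote_cert, "consul-b.example")
# The bug described in the slides: comparing it against the local node name.
buggy = name_matches(remote_cert, "consul-a.example")
```

With every node holding its own certificate, the second check can only pass if all nodes share one name, which matches the upstream setup quoted above.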
0.3.1
• I wrote a patch to restore the 0.2 behavior
and got it merged
• Rolled forward to 0.3.1
• Upgraded to SSD-backed servers
Increasing Rollout
• Switched various operational tools from
flatfile to consul
• Main app started using consul at startup
Consensus is Hard
consul-template
• Generating haproxy config using consul-template
• https://github.com/hashicorp/consul-template/issues/168:
`consul-template` takes O(N²) time with N services
consul-template
• Got that fixed, turned it on
• consul immediately fell over
• multiple elections/minute
• 2M allocations/minute
consul-template
• Service Watches churn when any service
changes health state
• Watching services on a large cluster →
self-DDOS
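
A watch is a loop of blocking queries, and a rough sketch of one shows where the churn comes from. The code below is an assumption-laden illustration, not consul-template's implementation: it uses consul's `index` and `wait` query parameters and the `X-Consul-Index` response header, and it counts how often the watcher is woken versus how often the result actually differs. `watch_service`, `open_url`, and `max_rounds` are hypothetical names added so the loop can be run against canned responses.

```python
import json
from urllib.request import urlopen


def watch_service(service, handle, consul="http://127.0.0.1:8500",
                  wait="5m", open_url=urlopen, max_rounds=None):
    """Long-poll one service; return (wakeups, changes).

    A wakeup is any response carrying a new index. A change is a wakeup
    whose body differs from the previous one; only those reach `handle`.
    """
    index, last_body = 0, None
    wakeups = changes = rounds = 0
    while max_rounds is None or rounds < max_rounds:
        url = f"{consul}/v1/health/service/{service}?index={index}&wait={wait}"
        with open_url(url) as resp:
            new_index = int(resp.headers["X-Consul-Index"])
            body = json.loads(resp.read())
        rounds += 1
        if new_index == index:
            continue  # the wait timed out with nothing new
        index = new_index
        wakeups += 1  # the server did the work of answering this query
        if body != last_body:
            last_body = body
            changes += 1
            handle(body)
    return wakeups, changes
```

If the index advances whenever any service changes health state, as the slides describe, every watcher on every host re-queries the servers on each flap, so load grows with the number of watchers times the rate of health changes.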
consul-template
• We use `consul-template -once` in cron
now
• Worse latency, but it works reliably
consul for leader election
• Our data team wanted a leader-election
primitive
• Built on top of consul, cribbing example
code
Sometime Later…
goroutine leak
• consul would rapidly eat all memory
• larger heap -> large GC pauses -> raft
instability
• manually restarted cluster 1/day
goroutine leak
• Reached out to Armon
• Very helpful in debugging
• Found several unrelated memory leaks
goroutine leak
• Tried to figure out what changed
• Eventually correlated to a session leak in
our leader election code
goroutine leak
• Fixed our leader-election code
• New policy: No non-discovery uses of
consul
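
For readers who do build leader election on consul sessions, one way to avoid leaking them is sketched below. The slides do not show the original code, so this is only an illustration of the general pattern: create a session, try to acquire a KV key with it, and always destroy the session afterwards. It uses the `/v1/session/create`, `/v1/kv/<key>?acquire=<session>`, and `/v1/session/destroy/<id>` endpoints; the agent address, the key name, the `put` hook, and the session TTL (which assumes a consul version that supports session TTLs) are assumptions.

```python
import json
from urllib.request import Request, urlopen

CONSUL = "http://127.0.0.1:8500"  # assumed local agent address


def http_put(path, body=b""):
    """PUT to the local agent and decode the JSON reply."""
    req = Request(CONSUL + path, data=body, method="PUT")
    with urlopen(req) as resp:
        return json.loads(resp.read())


def run_if_leader(key, work, put=http_put, ttl="15s"):
    """Run work() only if we acquire `key`; return True if it ran.

    The TTL is a backstop (assumes session TTL support): if this process
    dies before the finally block runs, the session should still expire.
    """
    payload = json.dumps({"Name": key, "TTL": ttl}).encode()
    session = put("/v1/session/create", payload)["ID"]
    try:
        if put(f"/v1/kv/{key}?acquire={session}", session.encode()):
            work()
            return True
        return False
    finally:
        # Destroy the session on every path, including when work() raises,
        # so sessions do not accumulate on the servers.
        put(f"/v1/session/destroy/{session}")
```

The `try`/`finally` is the point of the sketch: every created session has exactly one matching destroy call, regardless of whether the lock was won.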
consul DNS
• Increasingly reliant on consul for internal
discovery
• Unhappy at exposure to periodic instability
• Still have fallbacks, but outages remain painful
consul DNS
• Solution: Use consul-template to compile
consul DNS to a zone file
• Serve that out of a normal DNS server
• Refresh every 15s
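
The rendering step can be sketched as follows. The talk used consul-template for this; the Python below is a stand-in that only shows the shape of the output, assuming names of the form `<service>.service.consul.` and a record TTL equal to the 15s refresh interval (the TTL choice is an assumption, not stated in the slides). The SOA and NS records a real zone file needs are omitted, and `render_zone_records` is a hypothetical name.

```python
def render_zone_records(services, ttl=15, domain="consul."):
    """Render A records for a mapping of service name -> list of addresses.

    Output is sorted and de-duplicated so the file only changes when the
    underlying data does.
    """
    lines = []
    for name in sorted(services):
        for addr in sorted(set(services[name])):
            lines.append(f"{name}.service.{domain} {ttl} IN A {addr}")
    return "\n".join(lines) + "\n"
```

For example, `render_zone_records({"api": ["10.0.0.2", "10.0.0.1"]})` yields two lines, one A record per address under `api.service.consul.`. Serving a static file like this keeps request-path lookups working even while the consul servers are unstable, at the cost of data that is up to one refresh interval stale.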
Current Status
• Run consul everywhere
• Register all services
• Request-path lookups hit cached DNS
• Operational tools use HTTP interface
• Also generate config from consul-template
Final Stability Note
• consul 0.5.2 fixed our memory leaks
• consul has been quite stable for us of late
• consul-template watches still don’t scale
• 0.6 should help
Lessons Learned
being an early adopter without bringing down the site
(too many times)
Expect It To Be Rough
Monitoring, Monitoring, Monitoring
(graph all the things)
Incremental Rollout
Limit Scope
Isolation
Upgrade Aggressively
Get To Know Upstream
Be Willing to Dive In
Questions?
