Lessons learned running large
real-world Docker environments
Oct 27th 2015
Alois Mayr
@mayralois
alois.mayr@ruxit.com
Dec 3rd 2015
Source: http://www.schoonoart.de/
What is a “large” environment?
Campfire stories
#1 – The Death Star of Service Dependencies
Load-balanced service
System-wide service dependencies
Reverse proxies are essential
App #1 depends on App #2. Where is this specified?
Unwanted dependencies break architecture
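One way to answer "where is this specified?" is to keep an explicit list of allowed dependencies per service and compare it with the calls actually observed. The following is a minimal sketch under that assumption; the manifest layout, the service names, and the `undeclared_calls` helper are hypothetical and not taken from the talk.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical manifest: each service lists the services it may call.
DECLARED: Dict[str, Set[str]] = {
    "app1": {"app2"},
    "app2": set(),
}


def undeclared_calls(
    declared: Dict[str, Set[str]],
    observed: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return observed (caller, callee) pairs that the manifest does not allow."""
    return sorted(
        (caller, callee)
        for caller, callee in observed
        if callee not in declared.get(caller, set())
    )


if __name__ == "__main__":
    # app2 calling app1 is not declared anywhere, so it is reported.
    observed = [("app1", "app2"), ("app2", "app1")]
    print(undeclared_calls(DECLARED, observed))
```

How the observed calls are collected (proxy logs, tracing, monitoring) is left open here.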
Use proper versioning for services, APIs, and images
#2 – The Network Retransmission Episode
Retransmissions
What was the problem?
• Hardware defect in a single network interface card (NIC)
• NIC worked well under low load
• Retransmissions occurred only under heavy load
• Communication with other machines in the datacenter was affected
• Exact defect on the NIC is still unknown
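The talk does not say how the retransmissions were detected. As one possible check on a Linux host, the TCP counters `OutSegs` and `RetransSegs` in `/proc/net/snmp` can be sampled twice and compared. The sketch below assumes that file format; the function names are illustrative, and what counts as "too high" is left to the reader.

```python
from typing import Dict


def tcp_counters(snmp_text: str) -> Dict[str, int]:
    """Parse the two 'Tcp:' lines (header and values) of /proc/net/snmp."""
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    if len(rows) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    header, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def retransmission_ratio(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Share of segments sent between two samples that were retransmissions."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


if __name__ == "__main__":
    import time

    with open("/proc/net/snmp") as handle:
        first = tcp_counters(handle.read())
    time.sleep(5)  # sampling interval chosen arbitrarily for this sketch
    with open("/proc/net/snmp") as handle:
        second = tcp_counters(handle.read())
    print(f"retransmission ratio: {retransmission_ratio(first, second):.4f}")
```

Because the defect in the story showed up only under heavy load, a check like this is more telling when sampled during peak traffic than on an idle host.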
Co-locate related containers.
Check network infrastructure.
#3 – The Hungry Container Breakdown
Low disk space
What was the problem?
• Shared /logs partition on the host
• No log rotation and no archiving for app logs
• No proper log management in place for the Docker environment
• The shared /logs partition on a single host ran out of space
How the problem evolved over time
• Container health checks failed
• Marathon terminated the task and rescheduled a new one
• Still no free space on /logs
• Termination and rescheduling repeated
• /var/lib/docker ran out of space
• Mesos slave was unable to run Docker tasks
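A chain like this can be caught early by watching free space on the partitions involved. The sketch below uses Python's standard `shutil.disk_usage`; the 10% threshold and the `low_space_paths` helper are assumptions made for illustration, not values from the talk.

```python
import shutil
from typing import Iterable, List


def low_space_paths(paths: Iterable[str], min_free_fraction: float = 0.10) -> List[str]:
    """Return the paths whose filesystem has less free space than the threshold."""
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total and usage.free / usage.total < min_free_fraction:
            low.append(path)
    return low


if __name__ == "__main__":
    # The two locations named in the story; both must exist on the host,
    # otherwise shutil.disk_usage raises an error.
    for path in low_space_paths(["/logs", "/var/lib/docker"]):
        print(f"low disk space on {path}")
```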
How the problem could have been avoided
• Use log management tools for app logs, e.g. Fluentd and Logstash
  --log-driver=none|syslog
• Remove the container when it exits
  --rm=true
• Run the Mesos slave with
  --docker_remove_delay=VALUE
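To show where the first two flags go, here is a small sketch that assembles a `docker run` command line without executing it. The image name `example/app:1.0` and the `docker_run_args` helper are hypothetical; only the `--log-driver` and `--rm` flags come from the slide.

```python
from typing import List

ALLOWED_LOG_DRIVERS = {"none", "syslog"}  # the two values named on the slide


def docker_run_args(image: str, log_driver: str = "syslog", remove_on_exit: bool = True) -> List[str]:
    """Build (but do not run) a docker command line with the flags from the slide."""
    if log_driver not in ALLOWED_LOG_DRIVERS:
        raise ValueError(f"unsupported log driver in this sketch: {log_driver}")
    args = ["docker", "run", f"--log-driver={log_driver}"]
    if remove_on_exit:
        args.append("--rm=true")
    args.append(image)
    return args


if __name__ == "__main__":
    # "example/app:1.0" is a placeholder image name.
    print(" ".join(docker_run_args("example/app:1.0")))
```

When containers are launched through Marathon and Mesos, these options would be set in the app definition or slave configuration rather than typed by hand.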
Use log management tools
Empty /var/lib/docker
#4 – The Day Orchestration Stood Still
Queue and deployment methods are slow
What was the problem?
• Marathon 0.8.x keeps all versions of applications for recovery (by default)
• High frequency of microservice deployments
• Slowdown through ZooKeeper (zk) overload
How the problem could have been avoided
• The respective parameter (zk_max_versions) was not set to a proper limit
  --zk_max_versions=20
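The effect of such a limit can be illustrated with a few lines of Python. This is a sketch of the general idea (keep only the newest N versions), not Marathon's actual implementation; it assumes versions are ISO-8601 timestamp strings, which sort chronologically, and `prune_versions` is a hypothetical helper.

```python
from typing import Iterable, List


def prune_versions(versions: Iterable[str], max_versions: int = 20) -> List[str]:
    """Keep only the newest max_versions entries, oldest first."""
    if max_versions < 1:
        raise ValueError("max_versions must be at least 1")
    return sorted(versions)[-max_versions:]


if __name__ == "__main__":
    # Thirty deployments of one app, one per day (made-up data).
    history = [f"2015-10-{day:02d}T00:00:00Z" for day in range(1, 31)]
    kept = prune_versions(history, max_versions=20)
    print(len(history), "versions stored without a limit,", len(kept), "with the limit")
```

Without a bound, stored state grows with every deployment, which fits the slowdown described above for frequent microservice deployments.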
Track orchestration layer performance
Separate Mesos clusters
#5 – The Mushroom Cloud Effect
Way too many components involved
820 BILLION dependencies!
What was the problem?
• Massive load testing in preparation for Black Friday
• Tests ran for 3 days
• No impact on real users; only backend services were affected
• Many components to take into account
[Diagram: Service, Container, Host with multiplicities 1, 1..*, *, 1; counts shown: 174 / 3.4k and 22 / 13.3k]
Automation needed for problem analysis in large environments
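As a toy illustration of what automated analysis can mean here, the sketch below walks a service dependency graph to find everything that transitively depends on a failed component. The graph, the service names, and the `impacted_by` helper are made up for this example; a real environment would derive the graph from monitoring data.

```python
from collections import deque
from typing import Dict, Set


def impacted_by(failed: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each component, who depends on it?
    dependents: Dict[str, Set[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    # Breadth-first walk upwards from the failed component.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


if __name__ == "__main__":
    # Hypothetical graph: service -> services it calls.
    graph = {
        "web": {"cart", "search"},
        "cart": {"db"},
        "search": {"index"},
        "db": set(),
        "index": set(),
    }
    print(sorted(impacted_by("db", graph)))
```

With thousands of containers and hosts behind each service, doing this walk by hand is what becomes impractical.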
Free trial - https://ruxit.com/docker-monitoring/
Blog - https://blog.ruxit.com/
@ruxit
What lessons have you learned?

