How does the Cloud Foundry
Diego Project Run at Scale?
and updates on .NET Support
Who’s this guy?
• Amit Gupta
• https://akgupta.ca
• @amitkgupta84
Who’s this guy?
• Berkeley math grad school… dropout
• Rails consulting… deserter
• now I do BOSH, Cloud Foundry, Diego, etc.
Testing Diego Performance at Scale
• current Diego architecture
• performance testing approach
• test specifications
• test implementation and tools
• results
• bottom line
• next steps
Current Diego Architecture
What’s new-ish?
• consul for service discovery
• receptor (API) to decouple from CC
• SSH proxy for container access
• NATS-less auction
• garden-windows for .NET applications
Current Diego Architecture
Main components:
• etcd ephemeral data store
• consul service discovery
• receptor Diego API
• nsync sync CC desired state w/Diego
• route-emitter sync with gorouter
• converger health mgmt & consistency
• garden containerization
• rep sync garden actual state w/Diego
• auctioneer workload scheduling
Performance Testing Approach
• full end-to-end tests
• do a lot of stuff:
– is it correct, is it performant?
• kill a lot of stuff:
– is it correct, is it performant?
• emit logs and metrics (business as usual)
• plot & visualize
• fix stuff, repeat at higher scale*
Test Specifications
[Figure: four experiments (#1 to #4), each repeated at scale n = 1, 2, 5, 10 (x 1, x 2, x 5, x 10)]
Test Specifications
• Diego does tasks and long-running processes
• launch 10n, …, 400n tasks:
– workload distribution?
– scheduling time distribution?
– running time distribution?
– success rate?
– growth rate?
• launch 10n, …, 400n-instance LRP:
– same questions…
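The questions above can be answered from per-task records. A minimal sketch, assuming each record is a hypothetical `(cell_id, scheduling_seconds, succeeded)` tuple extracted from logs; the record shape and the `summarize` helper are illustrative and not taken from the Diego test suite:

```python
from collections import Counter
from statistics import median


def summarize(records):
    """records: list of (cell_id, scheduling_seconds, succeeded) tuples."""
    per_cell = Counter(cell for cell, _, _ in records)
    times = [t for _, t, _ in records]
    return {
        "tasks_per_cell": dict(per_cell),  # workload distribution
        "success_rate": sum(1 for _, _, ok in records if ok) / len(records),
        "median_scheduling_s": median(times),
        "max_scheduling_s": max(times),
    }


# Made-up sample data, only to show the shape of the summary.
sample = [
    ("cell-0", 0.2, True),
    ("cell-0", 0.4, True),
    ("cell-1", 0.3, True),
    ("cell-1", 1.5, False),
]
print(summarize(sample))
```

Running the same summary at 10n, ..., 400n tasks and comparing the numbers is one way to look at the growth rate.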
Test Specifications
• Diego+CF stages and runs apps
• > cf push
• upload source bits
• fetch buildpack and stage droplet (task)
• fetch droplet and run app (LRP)
• dynamic routing
• streaming logs
Test Specifications
• bring up n nodes in parallel
– from each node, push a apps in parallel
– from each node, repeat this for r rounds
• a is always ≈ 20
• r is always = 40
• n starts out = 1
Test Specifications
• the pushed apps have varying characteristics:
– 1-4 instances
– 128M-1024M memory
– 1M-200M source code payload
– 1-20 log lines/second
– crash never vs. every 30 s
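One way to generate such a mix is to sample each characteristic from its stated range. This is a sketch only: the `AppProfile` type, uniform sampling, and the 10% crash probability are assumptions, not the distribution the test suite actually used.

```python
import random
from dataclasses import dataclass


@dataclass
class AppProfile:
    instances: int           # 1-4
    memory_mb: int           # 128-1024
    payload_mb: int          # 1-200 (source code payload)
    log_lines_per_sec: int   # 1-20
    crashes_every_30s: bool  # False means the app never crashes


def random_profile(rng: random.Random, crash_probability: float = 0.1) -> AppProfile:
    """Draw one app profile; uniform ranges are an assumption."""
    return AppProfile(
        instances=rng.randint(1, 4),
        memory_mb=rng.randint(128, 1024),
        payload_mb=rng.randint(1, 200),
        log_lines_per_sec=rng.randint(1, 20),
        crashes_every_30s=rng.random() < crash_probability,
    )


rng = random.Random(42)  # fixed seed so a test run is repeatable
profiles = [random_profile(rng) for _ in range(20)]
print(profiles[0])
```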
Test Specifications
• starting with n=1:
– app instances ≈ 1k
– instances/cell ≈ 100
– memory utilization across cells ≈ 90%
– app instances crashing (by-design) ≈ 10%
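As a rough sanity check, the push parameters (n nodes, a apps per round, r rounds) and the stated density can be multiplied out. The helper names are illustrative, and the slide values are approximate, so the results are orders of magnitude rather than exact counts.

```python
def total_pushes(n: int, a: int, r: int) -> int:
    """Number of pushes: n nodes x a apps per round x r rounds."""
    return n * a * r


def implied_cells(app_instances: int, instances_per_cell: int) -> int:
    """Cells needed to hold the instances at the stated density (rounded up)."""
    return -(-app_instances // instances_per_cell)  # ceiling division


pushes = total_pushes(n=1, a=20, r=40)                              # 800
cells = implied_cells(app_instances=1000, instances_per_cell=100)   # 10
print(pushes, cells)
```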
Test Specifications
• evaluate:
– workload distribution
– success rate of pushes
– success rate of app routability
– times for all the things in the push lifecycles
– crash recovery behaviour
– all the metrics!
Test Specifications
• kill 10% of cells
– watch metrics for recovery behaviour
• kill moar cells… and etcd
– does system handle excess load gracefully?
• revive everything with > bosh cck
– does system recover gracefully…
– with no further manual intervention?
Test Specifications
– Figure Out What’s Broke –
– Fix Stuff –
– Move On Scale Up & Repeat –
Test Implementation and Tools
• S3 log, graph, plot backups
• ginkgo & gomega testing DSL
• BOSH parallel test-lab deploys
• tmux & ssh run test suites remotely
• papertrail log archives
• datadog metrics visualizations
• cicerone (custom) log visualizations
Results
400 tasks’ lifecycle timelines, dominated by container creation
Results
Maybe some cells’ gardens were running slower?
Results
Grouping by cell shows uniform container creation slowdown
Results
So that’s not it…
Also, what’s with the blue steps?
Let’s visualize logs a couple more ways
Then take stock of the questions raised
Results
Let’s just look at scheduling (ignore container creation, etc.)
Results
Scheduling again, grouped by which API node handled the request
Results
And how about some histograms of all the things?
Results
From the 400-task request from “Fezzik”:
• only 3-4 (out of 10) API nodes handle reqs?
• recording task reqs take increasing time?
• submitting auction reqs sometimes slow?
• later auctions take so long?
• outliers wtf?
• container creation takes increasing time?
Results
• only 3-4 (out of 10) API nodes handle reqs?
– when multiple requests look up the same address concurrently,
Golang returns a single DNS response to all of them; this results in
only 3-4 API endpoint lookups for the whole set of tasks
• recording task reqs take increasing time?
– API servers use an etcd client with throttling on # of concurrent
requests
• submitting auction reqs sometimes slow?
– auction requests require the API node to look up the auctioneer
address in etcd, using the throttled etcd client
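The effect of a throttled client on a burst of requests can be illustrated with a simple queueing model. This is a sketch under stated assumptions: all requests arrive at once, every request takes the same time, and the limit of 10 and 50 ms service time are made-up values, not the real client's settings.

```python
def completion_times_ms(num_requests: int, limit: int, service_time_ms: int) -> list:
    """Completion time of each request when at most `limit` run concurrently.

    Requests are served in batches of `limit`, so request i finishes at the
    end of batch number i // limit.
    """
    return [(i // limit + 1) * service_time_ms for i in range(num_requests)]


times = completion_times_ms(num_requests=400, limit=10, service_time_ms=50)
print(times[0], times[-1])  # first request: 50 ms, last request: 2000 ms
```

Under this model the time to record a request grows linearly with its position in the burst, which matches the shape of the slowdown described above.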
Results
• later auctions take so long?
– reps were taking longer to report their state to auctioneer,
because they were making expensive calls to garden,
sequentially, to determine current resource usage
• outliers wtf?
– combination of missing logs due to papertrail lossiness, +
cicerone handling missing data poorly
• container creation takes increasing time?
– garden team tasked with investigation
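The cost of sequential per-container calls can be seen by comparing them with a concurrent version. A minimal sketch: `container_usage` is a hypothetical stand-in for an expensive garden call, and running the calls concurrently is only one possible remedy; the slides do not say this is how the reps were changed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def container_usage(handle: str) -> dict:
    """Stand-in for an expensive per-container usage call (hypothetical)."""
    time.sleep(0.01)  # simulate I/O latency
    return {"handle": handle, "memory_mb": 128}


def usage_sequential(handles):
    # Total time is roughly len(handles) x per-call latency.
    return [container_usage(h) for h in handles]


def usage_concurrent(handles, workers: int = 8):
    # Calls overlap; pool.map preserves the input order of results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(container_usage, handles))


handles = [f"container-{i}" for i in range(20)]
print(usage_sequential(handles) == usage_concurrent(handles))
```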
Results
Problems can come from:
• our software
– throttled etcd client
– sequential calls to garden
• software we consume
– garden container creation
• “experiment apparatus” (tools and services):
– papertrail lossiness
– cicerone sloppiness
• language runtime
– Golang’s DNS behaviour
Results
Fixed what we could control, and now it’s all garden
Results
Okay, so far, that’s just been
[Figure: experiment grid, #1 to #4 at x 1, x 2, x 5, x 10]
Results
Next, the timelines of pushing 1k app instances
Results
• for the fastest pushes
– dominated by red, blue, gold
– i.e. upload source & CC emit “start”, staging process,
upload droplet
• pushes get slower
– growth in green, light blue, fuchsia, teal
– i.e. schedule staging, create staging container,
schedule running, create running container
• main concern: why is scheduling slowing down?
Results
• we had a theory (blame app log chattiness)
• reproduced experiment in BOSH-Lite
– with chattiness turned on
– with chattiness turned off
• appeared to work better
• tried it on AWS
• no improvement
Results
• spelunked through more logs
• SSH’d onto nodes and tried hitting services
• eventually pinpointed it:
– auctioneer asks cells for state
– cell reps ask garden for usage
– garden gets container disk usage → bottleneck
Results
Garden stops sending disk usage stats, scheduling time disappears
Results
Let’s let things stew between [experiment icon] and [experiment icon]
Results
Right after all app pushes, decent workload distribution
Results
… an hour later, something pretty bad happened
Results
• cells heartbeat their presence to etcd
• if ttl expires, converger reschedules LRPs
• cells may reappear after their workloads have
been reassigned
• they remain underutilized
• but why do cells disappear in the first place?
• added more logging, hope to catch in n=2 round
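The heartbeat and rescheduling behaviour described above can be simulated in a few lines. This is an illustrative model, not Diego's code: the `PresenceRegistry` class, the explicit `now` clock, and the least-loaded placement rule are assumptions (Diego places work through its auction).

```python
class PresenceRegistry:
    """Tracks the last heartbeat per cell; a cell is gone once its TTL lapses."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.last_seen = {}

    def heartbeat(self, cell_id: str, now: float) -> None:
        self.last_seen[cell_id] = now

    def live_cells(self, now: float) -> list:
        return sorted(c for c, t in self.last_seen.items() if now - t <= self.ttl)


def converge(placements: dict, live: list) -> dict:
    """Reassign LRPs whose cell is no longer live to the least-loaded live cell."""
    if not live:
        return dict(placements)  # nowhere to move work; leave as-is
    load = {c: 0 for c in live}
    for cell in placements.values():
        if cell in load:
            load[cell] += 1
    result = dict(placements)
    for lrp in sorted(placements):
        if placements[lrp] not in load:
            target = min(live, key=lambda c: (load[c], c))
            result[lrp] = target
            load[target] += 1
    return result


registry = PresenceRegistry(ttl=5)
registry.heartbeat("cell-a", now=0)
registry.heartbeat("cell-b", now=0)
registry.heartbeat("cell-a", now=4)  # cell-b misses its heartbeat
placements = {"lrp-1": "cell-a", "lrp-2": "cell-b", "lrp-3": "cell-b"}
placements = converge(placements, registry.live_cells(now=6))
registry.heartbeat("cell-b", now=7)  # cell-b comes back, but with no work
print(placements, registry.live_cells(now=7))
```

After the run, cell-b is live again but holds no LRPs, which is the underutilization the slide describes.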
Results
With the one lingering question about cell disappearance, on to n=2
[Figure: experiment grid (#1 to #4 at x 1, x 2, x 5, x 10); status marks as extracted: ✓✓ ✓ ✓ ?]
Results
With 800 concurrent task reqs, found container cleanup garden bug
Results
With 800-instance LRP, found API node request scheduling serially
Results
• we added a story to the garden backlog
• the serial request issue was an easy fix
• then, with n=2 parallel test-lab nodes, we
pushed 2x the apps
– things worked correctly
– system was performant as a whole
– but individual components showed signs of scale
issues
Results
Our “bulk durations” doubled
Results
• nsync fetches state from CC and etcd to make
sure CC desired state is reflected in diego
• converger fetches desired and actual state
from etcd to make sure things are consistent
• route-emitter fetches state from etcd to keep
gorouter in sync
• bulk loop times doubled from n=1
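Each of these bulk loops amounts to fetching two full views of the system and reconciling them. A minimal sketch, assuming desired and actual state are plain mappings from a process id to an instance count; the real components work with richer records, and `diff_state` is a hypothetical helper:

```python
def diff_state(desired: dict, actual: dict):
    """Return (to_start, to_stop): instance counts to add or remove per process."""
    to_start = {
        pid: want - actual.get(pid, 0)
        for pid, want in desired.items()
        if want > actual.get(pid, 0)
    }
    to_stop = {
        pid: have - desired.get(pid, 0)
        for pid, have in actual.items()
        if have > desired.get(pid, 0)
    }
    return to_start, to_stop


desired = {"app-a": 2, "app-b": 1}
actual = {"app-a": 1, "app-c": 3}
print(diff_state(desired, actual))
```

Because every pass walks the full state, the work per loop grows with the number of apps, which is consistent with the loop times doubling when the app count doubled.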
Results
… and this happened again
Results
– the etcd and consul story –
Results
Fast-forward to today
[Figure: experiment grid (#1 to #4 at x 1, x 2, x 5, x 10); status marks as extracted: three groups of ✓✓ ✓ ✓ ?, followed by ✓ ???]
Bottom Line
At the highest scale:
• 4000 concurrent tasks ✓
• 4000-instance LRP ✓
• 10k “real app” instances @ 100 instances/cell:
– etcd (ephemeral data store) ✓
– consul (service discovery) ? (… it’s a long story)
– receptor (Diego API) ? (bulk JSON)
– nsync (CC desired state sync) ? (because of receptor)
– route-emitter (gorouter sync) ? (because of receptor)
– garden (containerizer) ✓
– rep (garden actual state sync) ✓
– auctioneer (scheduler) ✓
Next Steps
• Security
– mutual SSL between all components
– encrypting data-at-rest
• Versioning
– handle breaking API changes gracefully
– production hardening
• Optimize data models
– hand-in-hand with versioning
– shrink payload for bulk reqs
– investigate faster encodings; protobufs > JSON
– initial experiments show 100x speedup
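The payload-size argument for a binary encoding can be illustrated with the standard library. In this sketch `struct` is only a stand-in for protobufs (which are not in the standard library), the record fields are hypothetical, and the example makes no attempt to reproduce the speedup figure quoted above.

```python
import json
import struct

# Fixed binary layout: index as uint32, memory_mb as uint16, state as uint8.
RECORD = struct.Struct("<IHB")


def encode_json(index: int, memory_mb: int, state: int) -> bytes:
    """Self-describing text encoding: field names travel with every record."""
    return json.dumps({"index": index, "memory_mb": memory_mb, "state": state}).encode()


def encode_binary(index: int, memory_mb: int, state: int) -> bytes:
    """Schema-based binary encoding: only the values are sent."""
    return RECORD.pack(index, memory_mb, state)


def decode_binary(payload: bytes) -> tuple:
    return RECORD.unpack(payload)


as_json = encode_json(3, 256, 1)
as_binary = encode_binary(3, 256, 1)
print(len(as_json), len(as_binary))
```

For bulk requests that carry thousands of such records, the per-record overhead of field names and text parsing is where a schema-based encoding would be expected to help.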
Updates on .NET Support
Updates on .NET Support
• what’s currently supported?
– ASP.NET MVC
– nothing too exotic
– most CF/Diego features, e.g. security groups
– VisualStudio plugin, similar to the Eclipse CF plugin for
Java
• what are the limitations?
– some newer Diego features, e.g. SSH
– in α/β stage, dev-only
Updates on .NET Support
• what’s coming up?
– make it easier to deploy Windows cell
– more VisualStudio plugin features
– hardening testing/CI
• further down the line?
– remote debugging
– the “Spring experience”
Updates on .NET Support
• shout outs
– CenturyLink
– HP
• feedback & questions?
– Mark Kropf (PM): mkropf@pivotal.io
– David Morhovich (Lead): dmorhovich@pivotal.io
