1
Transitioning from Ticketing to LBaaS
November 12, 2019 - Amsterdam
William Dauchy
@wdauchy
Network-LB Team Lead
Pierre Cheynier
@pierrecdn
Network-LB Senior SRE
2
Criteo infrastructure in a nutshell
• Focus on cost-efficiency, agility and regaining control
over infrastructure software
• Examples:
• Commoditize hardware, challenge vendors by establishing
direct ODM relationships
• BIOS, BMCs and switch operating systems: move to OSS
3
Baremetal, microservices, containers
register
4
Transparent communication
5
Network is not to be outdone
• CLOS Matrix (RFC7938)
• Network Services still
in specialized racks
6
User, 2016: My container is running, and so what?
• Step 1: application is deployed at git push
• Step 2: Come on, I’m missing some things here…
• Public or private VIP?
• DNS entries?
• TLS certificates?
• IPv6?
• Traffic engineering?
• Security policies enforcement?
• Metrics?
• Step 3: Don’t you want to... file a ticket?
7
8
API extensions for LB
• Our team writes DSL or API
extensions
• Same primitives everywhere
• Linked to Consul registration
• Features, not technologies
(vendor agnostic)
• Use Consul storage (KV/Metas)
► Ownership of app network
config has been transferred!
9
Health-checks
• Business HC can be non-trivial
• Multiple HCs on a service
• Will every technology support
running them... remotely?
• Multiple sources of checks = lack
of consistency
► Consul as a state reference (>=1.4)
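A minimal sketch of what using Consul as the single state reference could look like for a consumer: instead of probing backends itself, a provisioner reads the checks Consul already holds and derives one verdict per instance. The entry shape follows what Consul's health endpoint returns (a `Checks` list whose items carry a `Status`). The function name, the sample data and the rule that any non-passing check removes the instance are assumptions, not something stated in the talk.

```python
from typing import Dict, List


def instance_is_healthy(entry: Dict) -> bool:
    """Return True only if every check attached to the instance is passing."""
    checks: List[Dict] = entry.get("Checks", [])
    # Assumption: an instance without any check is treated as unhealthy.
    return bool(checks) and all(c.get("Status") == "passing" for c in checks)


# Hypothetical entries, shaped like items of a Consul health query result.
entries = [
    {"Service": {"ID": "web-1"},
     "Checks": [{"Status": "passing"}, {"Status": "passing"}]},
    {"Service": {"ID": "web-2"},
     "Checks": [{"Status": "passing"}, {"Status": "critical"}]},
]

# Only instances whose checks all pass would be handed to the load balancer.
healthy_ids = [e["Service"]["ID"] for e in entries if instance_is_healthy(e)]
```

Every load balancer technology then consumes the same verdict, whichever checks (business or otherwise) produced it.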
14
Here comes a Control-plane
• Abstract internal dependencies
• Resource reservation logic
• External system provisioning
• Give some controls to admins
• Consume events, produce events
• Device Provisioners:
• One contract
• N implementations
(vendor/technologies)
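The "one contract, N implementations" idea can be sketched as an abstract interface with one class per technology. All class names, method names and returned strings below are hypothetical and only illustrate the pattern. They are not Criteo's actual control-plane code.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DeviceProvisioner(ABC):
    """The single contract every load balancer backend has to honor."""

    @abstractmethod
    def apply(self, vip: str, backends: List[str]) -> str:
        """Provision the VIP and return a description of what was pushed."""


class HAProxyProvisioner(DeviceProvisioner):
    def apply(self, vip: str, backends: List[str]) -> str:
        return f"haproxy: frontend {vip} -> {' '.join(backends)}"


class VendorApplianceProvisioner(DeviceProvisioner):
    def apply(self, vip: str, backends: List[str]) -> str:
        return f"appliance: virtual-server {vip} pool-size {len(backends)}"


# The control plane only knows the contract, so switching technology is a
# registry lookup rather than a change in the calling code.
registry: Dict[str, DeviceProvisioner] = {
    "haproxy": HAProxyProvisioner(),
    "appliance": VendorApplianceProvisioner(),
}


def provision(kind: str, vip: str, backends: List[str]) -> str:
    return registry[kind].apply(vip, backends)
```

A call such as `provision("haproxy", "10.0.0.1:443", ["a:80", "b:80"])` goes through the same code path whichever implementation is selected.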
16
17
User experience: metrology
• Get network things done, within seconds.
• Also, with great powers (…)
• Diagnostic API endpoints report the cause of errors
• Subscribe to alerts on bad service health
• Retrieve KPIs and logs autonomously
18
User experience: Zero-config LB
• Add a Consul http tag
• Get a free LB service configured with sane
defaults in seconds
• Controlled namespace
• Private visibility
• TLS enforced with redirects
• Self-diagnostics on errors
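As a rough illustration of the zero-config flow, the sketch below turns a service's Consul tags into a load balancer configuration filled with defaults. The tag name `http` comes from the slide. The default values, the `example.internal` namespace and the function name are assumptions made for the example.

```python
from typing import Dict, List

# Assumed defaults mirroring the bullets above: private, TLS with redirects.
SANE_DEFAULTS = {
    "visibility": "private",
    "tls": True,
    "redirect_http_to_https": True,
}


def lb_config_for(service: str, tags: List[str],
                  namespace: str = "example.internal") -> Dict:
    """Return an LB config for services tagged `http`, else an empty dict."""
    if "http" not in tags:
        return {}
    config = dict(SANE_DEFAULTS)
    # Controlled namespace: the hostname is derived, never user-chosen.
    config["hostname"] = f"{service}.{namespace}"
    return config
```

With this shape, `lb_config_for("billing", ["http"])` yields a private, TLS-enforced entry for `billing.example.internal`, while a service without the tag gets nothing.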
19
User experience: Zero-config LB
• Users are using it extensively
• Remember East-West without LB?
• Rate limit!
• timeout tarpit 2s
20
21
Agility at scale
• Ability to transparently switch LB vendor
• Safe and progressive rollout
• 50K servers
• Between 50 and 100 LBs per datacenter
• Easy and frequent platform upgrade
• HAProxy deployments, several times a week
• Transparent Linux kernel upgrades
• > 530 deployments in less than two years(!)
22
Incidents at scale
• Doubling maxconn, what could go wrong?
• Silently failing changes are not welcome
• strict-limits, introduced in v2.1
23
Probing the general service
• End-to-end probing provides fast feedback
• Revealed many regressions and bugs
• ... despite the simplicity of the checks
24
Log everything
• http-response set-log-level silent if { rand(100) ge 1 }
• log 127.0.0.1 format rfc5424 local0 info
• log-format-sd
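The sampling rule above can be checked with a small simulation. HAProxy's `rand(100)` draws an integer from 0 to 99, so the condition `ge 1` silences 99 draws out of 100 and leaves about 1% of responses logged. The helper names below are hypothetical; the code only models the arithmetic and does not talk to HAProxy.

```python
import random


def is_silenced(sample: int) -> bool:
    """Mirror of the ACL `rand(100) ge 1`: silence unless the draw is 0."""
    return sample >= 1


def logged_fraction(requests: int, seed: int = 0) -> float:
    """Simulate `requests` responses and return the share that gets logged."""
    rng = random.Random(seed)
    logged = sum(
        1 for _ in range(requests) if not is_silenced(rng.randrange(100))
    )
    return logged / requests


# Exhaustive view over the 100 possible draws: only 0 escapes the rule.
logged_draws = [s for s in range(100) if not is_silenced(s)]
```

Changing the threshold in the ACL (for instance `ge 10`) would raise the logged share accordingly.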
25
Observability example
• TLS: %sslv - %sslc
26
27
Load balancing disaggregation
• GeoDNS > L3 > L7
• “Hyper-converged” LB
28
Load balancing disaggregation
• GeoDNS > L3 > L4 > L7
• Support layer=L4
29
Load balancing disaggregation
• GeoDNS > L3 > L4 > L7
• Let L7 LBs become clients of this
• Translates into “L7 LB, on behalf of VIP X,
asks for an L4 LB service”
• L4 > L7 specifics
• DSR: invest for ingress only ($)
• Consistent hashing
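To make the consistent hashing bullet concrete, here is a minimal hash ring sketch. It is a generic textbook construction with virtual nodes, not the algorithm of any specific L4 load balancer. The node names, the flow strings and the replica count are made up for the example.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    """Stable 64-bit hash, identical across processes and hosts."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        # Each node owns `replicas` points on the ring to even out the load.
        self._ring: Dict[int, str] = {}
        for node in nodes:
            for i in range(replicas):
                self._ring[_hash(f"{node}#{i}")] = node
        self._points = sorted(self._ring)

    def lookup(self, flow: str) -> str:
        """Return the node owning the first ring point after the flow hash."""
        idx = bisect.bisect(self._points, _hash(flow)) % len(self._points)
        return self._ring[self._points[idx]]


flows = [f"198.51.100.{i}:{40000 + i}" for i in range(200)]
before = HashRing(["l7-a", "l7-b", "l7-c"])
after = HashRing(["l7-a", "l7-b"])  # l7-c is taken out of rotation

# Flows that change owner when l7-c disappears.
moved = [f for f in flows if before.lookup(f) != after.lookup(f)]
```

The property that matters for an L4 tier in front of L7 LBs is that only flows previously mapped to the removed node get remapped. Every other flow keeps its L7 LB, so established connections survive the change.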
31
Load balancing disaggregation
• GeoDNS > L3 > L4 > L7
• Let’s redo this with layer=L3
• Translates into “L4 LB, on behalf of VIP X,
asks for a BGP ECMP configuration”
• Identify the switch to peer with and configure it
• L3 > L4 specifics
• BGP ECMP => placement constraints
• DDoS
32
Load balancing disaggregation
• GeoDNS > L3 > L4 > L7
33
Load balancing disaggregation
• Moved from Hyper-converged devices
to commodity hardware
• Consequences:
• Cost efficient
• LBs = plain servers in the CLOS matrix!
34
Edge PoPs
• GeoDNS > Edge PoP > L3 > L4 > L7
• Terminate TLS early = huge latency win
• pool-purge-delay (HAProxy >= v2.0)
35
Edge PoPs
• GeoDNS > Edge PoP > L3 > L4 > L7
• Reach large population base, keep costs
under control
• Control plane role
• Point GeoDNS zones to the closest PoP
• Call third-party API
• Configure HAProxy accordingly
37
GeoDNS on steroids
• Adapt routing based on performance metrics,
a.k.a. traffic engineering
• “Someone told me these subnets perform better
using another AS path”
• Build our own GeoDNS dataset?
• Reduce latencies by directing user bases to the closest
location with more accuracy
• A decision-making tool?
• Data-center location and topology assessment
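One way to picture a home-grown GeoDNS dataset is a table of measured latencies per client subnet, from which the resolver picks the best PoP. The sketch below assumes such measurements exist. The subnets are documentation ranges, and the PoP names, latency figures and function name are invented for the example.

```python
import ipaddress
from typing import Dict

# Hypothetical dataset: median latency in ms from a client subnet to each PoP.
MEASURED_MS: Dict[str, Dict[str, float]] = {
    "203.0.113.0/24": {"ams": 12.0, "par": 19.5},
    "198.51.100.0/24": {"ams": 31.0, "par": 8.2},
}


def best_pop(client_ip: str, default: str = "ams") -> str:
    """Pick the PoP with the lowest measured latency for the client subnet."""
    addr = ipaddress.ip_address(client_ip)
    for subnet, latencies in MEASURED_MS.items():
        if addr in ipaddress.ip_network(subnet):
            return min(latencies, key=latencies.get)
    # No measurement for this client: fall back to a static choice.
    return default
```

Decisions then follow observed performance rather than geographic distance alone, which is what separates traffic engineering from plain geolocation.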
38
39
Feedback
• Loading more objects at runtime?
• Several changes per second!
• Never flush stats (at least some counters)
40
Feedback
• Overall stability
• Very high traffic pressure
• Latency sensitive
• Losing requests during reloads is
not acceptable!
• Small memory footprint
• TLS certificates at scale like a
public hosting service
• Share certificates between bind lines
(in v2.1)
41
Feedback
• Overall stability
• Small memory footprint
• Community + enterprise support
• "I'm investigating an issue we have with some fetch we're
adding on a custom tcp_info structure, and I'm wondering if
I might have discovered a broader issue[...]"
• Deployed worldwide two hours later
Thank you!


HAProxyconf 2019 - Criteo - Transitioning from Ticketing to LBaaS
