Metrics-Driven Engineering

Mike Brittain        @mikebrittain
Director of Engineering, Infrastructure

                                          October 13, 2011
Tools and Process at Etsy
How do people use Etsy?
  How many new visits?
  How many listings created?
  How many registrations?
  How many convos sent?
  How many purchases?
  How many new shops?
What is the application doing?
  Search indexing?
  How fast are pages generating?
  Async tasks currently in queue?
  Developer API auth and rate limiting?
  Images resized and stored?
  Error and warning rates?
Are the servers in good shape?
  Replication slave lag?
  Memcache hits/misses?
  Available connections?
  Database queries per second?
  Total outgoing bandwidth?
  CPU, Memory, I/O?
Business Metrics
Application Metrics
System Metrics
Visibility EVERYWHERE
Constant Change
$314 Million GMS 2010
  $180 Million GMS 2009
  $87 Million GMS 2008

  $26 Million GMS 2007




credit: pentarux (flickr)
25 Million Unique Visitors
  1 Billion page views per month




credit: pentarux (flickr)
Engineering team grew 500%
                        over 18 months


credit: martin_heigan (flickr)
Less talk, more do.
Always Be Shipping
                             (even if it’s your first day)




credit: ibailemon (flickr)
90+ Engineers
                     40+ Deploys / day

credit: misswired (flickr)
credit: digidave (flickr)
Code Reviews
Automated Tests
$cfg = array(
   'checkout' => array('enabled' => 'on'),
   'homepage' => array('enabled' => 'on'),
   'profiles' => array('enabled' => 'on'),
   'new_search' => array('enabled' => 'off'),
);


                          Config Flags
Enable and disable features quickly
Plus “admin-only,” percentage ramp-up, A/B testing,
whitelists, blacklists, etc...
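A minimal sketch of how a flag lookup with "admin-only" and percentage ramp-up might work, written in Python rather than the PHP shown on the slide. The flag names `new_profiles` and `new_admin_tool`, the `percent` key, and the `rampup` and `admin` states are illustrative assumptions, not Etsy's actual schema.

```python
import zlib

# Hypothetical flag table mirroring the $cfg array on the slide.
CFG = {
    "checkout": {"enabled": "on"},
    "new_search": {"enabled": "off"},
    "new_profiles": {"enabled": "rampup", "percent": 10},
    "new_admin_tool": {"enabled": "admin"},
}


def bucket(feature: str, user_id: int) -> int:
    """Map (feature, user) to a stable bucket in 0..99."""
    return zlib.crc32(f"{feature}:{user_id}".encode()) % 100


def is_enabled(feature: str, user_id: int, is_admin: bool = False) -> bool:
    """Decide whether a feature is on for this user."""
    flag = CFG.get(feature, {"enabled": "off"})  # unknown flags default to off
    state = flag["enabled"]
    if state == "on":
        return True
    if state == "admin":
        return is_admin
    if state == "rampup":
        # The same user always lands in the same bucket, so the
        # experience is stable as the percentage is raised.
        return bucket(feature, user_id) < flag["percent"]
    return False
```

Hashing the feature name together with the user ID keeps ramp-ups for different features independent of each other.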
Failure is not an option
Failure is inevitable!
Failure is a learning opportunity!
Failure is DETECTABLE!
Access
Detect problems quickly
CONFIDENCE
A:    Well, the Ops team manages the network, racks
     the servers, installs the monitoring tools, wears
     the pagers, blah, blah, blah...
Engineers build the application
OPS and ENG
  Logging
  Graphing
  Trending
  Alerting
“Engineers are too busy writing
  features to build metrics.”
Metrics are part of every feature
        ...and so are config flags
Dead Simple
Simple, open source tools
Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)
                             Logging
                             Logster
                               StatsD
Ganglia
Cluster-oriented
Huge library of community-contributed recipes
Custom metrics (gmetric)
Graphite
Single-instance
Create new metrics on the fly
Customize via URLs and display functions
Logging
It’s 2:48 PM.
Do you know where your
       logs are?
Logger::log_error("User login failed. Reason: $msg for $username",
                  "login");
web0054 [Fri Mar 04 16:27:48 2011]
[error] [login] [mk04gw1p71] User login
 failed. Reason: wrong password for ...
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
apache_note()
grep "/listing/" access.log | 
awk '{sum=sum+$(NF-2)} END {print sum/NR}'
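The same one-liner can be sketched in Python for cases where the filter or the arithmetic needs to grow. This assumes, as the awk does, that the value of interest is the third-from-last whitespace-separated field. The function name and its parameters are illustrative.

```python
def average_field(lines, needle="/listing/", offset_from_end=2):
    """Average a numeric field over the lines that contain `needle`.

    `offset_from_end=2` selects the same field as awk's $(NF-2).
    """
    total = 0.0
    count = 0
    for line in lines:
        if needle not in line:
            continue
        fields = line.split()
        if len(fields) <= offset_from_end:
            continue  # line is too short to hold the field
        try:
            total += float(fields[-1 - offset_from_end])
        except ValueError:
            continue  # e.g. "-" logged when the note was never set
        count += 1
    return total / count if count else None


# Usage, assuming an access.log in the current directory:
#   with open("access.log") as f:
#       print(average_field(f))
```

Unlike the awk version, this skips lines whose field is not numeric instead of counting them as zero.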
web0001   [04:28:54   2011]   [error] [client 10.101.x.x] Help me, Rhonda.
web0001   [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
web0001   [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!
web0001   [04:28:54   2011]   [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001   [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
web0001   [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201   [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
web0034   [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
web0001   [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101   [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201   [04:28:54   2011]   [error] [client 10.101.x.x] You've been eaten by a grue.
web0055   [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002   [04:28:54   2011]   [warning] [client 10.101.x.x] Sky is falling.
web0089   [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020   [04:28:54   2011]   [error] [client 10.101.x.x] Sky is falling.
web1101   [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055   [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
web0001   [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
web0001   [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034   [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087   [04:28:54   2011]   [fatal] [client 10.101.x.x] Sky is falling.
web0002   [04:28:54   2011]   [error] [client 10.101.x.x] Oh noooooo!
web0201   [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077   [04:28:54   2011]   [warning] [client 10.101.x.x] Gaaaaahhh!
web0355   [04:28:54   2011]   [warning] [client 10.101.x.x] Oh nooooooooooo
web0052   [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001   [04:28:54   2011]   [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003   [04:28:54   2011]   [error] [client 10.101.x.x] You've been eaten by a grue.
web0066   [04:28:54   2011]   [fatal] [client 10.101.x.x] Gaaaaahhh!!!
Logster
Fatals       Errors   Warnings
Logster
Run by cron
Keeps a cursor on your log file
Aggregate lines any way you want
Output to Ganglia or Graphite
Simple parsers
                                  github.com/etsy
web0054 [Fri Mar 04 16:27:48 2011]
[error] [login] [mk04gw1p71] User login
 failed. Reason: wrong password for ...
^.+? \[.+?\] \[(?P<log_level>.+?)\]
if (fields['log_level'] == "fatal"):
   self.fatals += 1

elif (fields['log_level'] == "error"):
   self.errors += 1

elif (fields['log_level'] == "warning"):
   self.warnings += 1

...
MetricObject("fatals",
  (self.fatals / self.duration), "per sec")

MetricObject("errors",
  (self.errors / self.duration), "per sec")

MetricObject("warning",
  (self.warnings / self.duration), "per sec")
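Put together, the parse-and-report cycle might look like the self-contained sketch below. It does not import Logster: `ErrorLogCounter` and its method names are hypothetical stand-ins for a parser class, and plain tuples stand in for `MetricObject`.

```python
import re
from collections import Counter

# Capture the second bracketed field (the log level); escaped brackets
# and negated character classes keep the match from running too far.
LINE_RE = re.compile(r"^\S+\s+\[[^\]]+\]\s+\[(?P<log_level>[^\]]+)\]")


class ErrorLogCounter:
    """Hypothetical stand-in for a Logster-style parser."""

    def __init__(self):
        self.counts = Counter()

    def parse_line(self, line):
        """Count one log line by level; ignore lines that do not match."""
        match = LINE_RE.match(line)
        if match:
            self.counts[match.group("log_level")] += 1

    def get_state(self, duration):
        """Return (metric name, rate per second) pairs for the interval."""
        pairs = (("fatals", "fatal"), ("errors", "error"), ("warnings", "warning"))
        return [(name, self.counts[level] / duration) for name, level in pairs]
```

Dividing by the interval turns raw counts into rates, so the graph stays comparable if the cron schedule changes.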
Fatals   Errors   Warnings
StatsD
Network daemon (node.js)
Accepts data over UDP
Flushes to Graphite every 10 sec
One line of code
github.com/etsy
StatsD::increment("logins.success");
                                  logins
StatsD::timing("gearman.time", $msec);
                                 90th pct

                                 average

                                 lower
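Under the hood, each of those calls amounts to one small UDP datagram. A minimal Python sketch of the wire format follows (`name:value|c` for counters, `name:value|ms` for timers); the helper names are illustrative and this is not the Etsy PHP client.

```python
import socket


def statsd_packet(name: str, value, kind: str) -> bytes:
    """Format one StatsD datagram: '<name>:<value>|<type>'."""
    return f"{name}:{value}|{kind}".encode("ascii")


def increment(sock, addr, name: str, by: int = 1) -> None:
    """Send a counter increment ('c')."""
    sock.sendto(statsd_packet(name, by, "c"), addr)


def timing(sock, addr, name: str, msec: int) -> None:
    """Send a timer sample in milliseconds ('ms')."""
    sock.sendto(statsd_packet(name, msec, "ms"), addr)


# Usage, assuming a StatsD daemon on localhost at its default port 8125:
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   increment(sock, ("127.0.0.1", 8125), "logins.success")
#   timing(sock, ("127.0.0.1", 8125), "gearman.time", 320)
```

Because UDP is fire-and-forget, a slow or missing daemon does not block the request that emits the metric.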
Ad hoc
name value timestamp
echo "events.deploy.site 1 `date +%s`" \
     | nc graphite.etsycorp.com 2003
Vertical Line Technology!
target=drawAsInfinite(events.deploy.site)
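A deploy script could emit the same event without shelling out to `nc`. This is a sketch under assumptions: the function names are illustrative, and the only thing taken from the slides is the "name value timestamp" line format and port 2003.

```python
import socket
import time


def plaintext_line(name, value, timestamp=None):
    """Build one '<name> <value> <timestamp>' line, newline-terminated."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_event(host, name, value=1, port=2003, timestamp=None):
    """Open a TCP connection, write one line, and close (like `nc`)."""
    line = plaintext_line(name, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line


# Usage, with a placeholder hostname:
#   send_event("graphite.example.com", "events.deploy.site")
```

Plotting that metric with `drawAsInfinite()` then puts a vertical line on the graph at each deploy.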
We could stare at graphs all day...
http://graphite/render?
   from=-1hours&width=600&height=200
   &target=webs.errorLog.warning&rawData=1

webs.errorLog.warning,1318444930,1318448530,60|
5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,
1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,
1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.
0,1.0,1.0,None
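That raw output is easy to consume from a script, which is what makes alerting possible. The sketch below parses the `name,start,end,step|v1,v2,...` records; it assumes one record per line (the wrapping above is only slide layout), and the function names are illustrative.

```python
def parse_raw(text):
    """Parse Graphite rawData output into a list of series dicts."""
    series = []
    for record in text.strip().splitlines():
        header, _, body = record.partition("|")
        # Split from the right so commas inside the target name survive.
        name, start, end, step = header.rsplit(",", 3)
        values = [None if v == "None" else float(v) for v in body.split(",")]
        series.append({
            "name": name,
            "start": int(start),
            "end": int(end),
            "step": int(step),
            "values": values,
        })
    return series


def latest(values):
    """Return the most recent non-null datapoint, or None."""
    for v in reversed(values):
        if v is not None:
            return v
    return None
```

The trailing `None` marks the current, not yet flushed interval, so a check usually wants the last non-null value.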
Holt-Winters Confidence Bands

upper

         lower
Holt-Winters Aberration
Business metrics + Confidence bands = Alertable metrics
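One way to turn bands into an alert, sketched in Python. It assumes the metric and its upper and lower bands have already been fetched (Graphite exposes `holtWintersConfidenceBands()` and `holtWintersAberration()` as render functions); the `min_points` threshold is a hypothetical tuning knob, not something from the talk.

```python
def aberration(values, lower, upper):
    """Signed distance outside the [lower, upper] band; 0.0 when inside."""
    out = []
    for v, lo, hi in zip(values, lower, upper):
        if v is None or lo is None or hi is None:
            out.append(None)      # no data for this interval
        elif v > hi:
            out.append(v - hi)    # above the band: positive
        elif v < lo:
            out.append(v - lo)    # below the band: negative
        else:
            out.append(0.0)
    return out


def should_alert(values, lower, upper, min_points=3):
    """Alert when the last `min_points` known datapoints are all outside the band."""
    known = [d for d in aberration(values, lower, upper) if d is not None]
    recent = known[-min_points:]
    return len(recent) == min_points and all(d != 0 for d in recent)
```

Requiring several consecutive points outside the band trades a little detection speed for fewer pages on a single noisy sample.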
40,000+ metrics at Etsy
  Systems, Applications, Business
Dashboards
Kind of Hard :-/
<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or
+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite
%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production
%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite
%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,
%23ff0000,%23006633,%23cc6600">
     <img src="http://graphite.etsycorp.com/render?
from=-1hours&width=280&height=220&title=File+or+Script+Not
+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite
%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production
%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite
%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,
%23ff0000,%23006633,%23cc6600">
</a>
Super Easy!
$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);
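The helper's main job is to hide the URL encoding. Here is a rough Python sketch of that idea, not a port of Etsy's `Graphite` class: `render_url` and its parameters are illustrative names, and only the query parameters seen in the hand-written URL are used.

```python
from urllib.parse import urlencode


def render_url(base, targets, title=None, width=280, height=220,
               frm="-1hours", colors=None, **extra):
    """Build a Graphite render URL with properly encoded parameters."""
    params = [("from", frm), ("width", width), ("height", height)]
    if title:
        params.append(("title", title))
    # Repeating the 'target' key draws several series on one graph.
    params += [("target", t) for t in targets]
    if colors:
        params.append(("colorList", ",".join(colors)))
    params += sorted(extra.items())  # e.g. yMin=0, hideLegend=1
    return f"{base}/render?{urlencode(params)}"
```

Calling it twice with different sizes would yield the thumbnail and full-size URLs for the link-and-image pair, with parentheses, `#`, and spaces encoded by the library rather than by hand.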
Metrics!
Metrics + Events
Metrics + Alerts
Metrics + Metrics
High-level, real-time visibility
Detect problems quickly
CONFIDENCE
Make them required features
Make them dead simple
Make them accessible
Make them!
Homework
codeascraft.etsy.com
github.com/etsy

Get in touch
mike @ etsy . com
@mikebrittain

We’re always looking for people who are interested in this kind of stuff...



Thank You
etsy.com/careers