A/B Testing @ Internet Scale
Ya Xu
8/12/2014 @ Coursera
A/B Testing in One Slide
[Diagram: users are split between Control and Treatment (an 80% / 20% split), each seeing a different version of the "Join now" page; collect results to determine which one is better]
Outline
§ Culture Challenge
–  Why A/B testing
–  What to A/B test
§ Building a scalable experimentation system
§ Best practices
Why A/B Testing
Amazon Shopping Cart Recommendation
•  At Amazon, Greg Linden had this idea of showing
recommendations based on cart items
•  Trade-offs
•  Pro: cross-sell more items (increase average basket size)
•  Con: distract people from checking out (reduce conversion)
•  HiPPO (Highest Paid Person’s Opinion): stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
MSN Real Estate
§ “Find a house” widget variations
§ Revenue to MSN is generated every time a user clicks the search/find button
[Screenshots: widget variants A and B]
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
Take-away
Experiments are the only way to prove causality.
Use A/B testing to:
§ Guide product development
§ Measure impact (assess ROI)
§ Gain “real” customer feedback
What to A/B Test
Ads CTR Drop
[Chart: CTR of the profile top ads shows a sudden drop on 11/11/2013]
Root Cause
5 pixels!
[Screenshots: navigation bar and profile top ads]
What to A/B Test
§ Evaluating new ideas:
–  Visual changes
–  Complete redesign of web page
–  Relevance algorithms
–  …
§ Platform changes
§ Code refactoring
§ Bug fixes
Test Everything!
Startups vs. Big Websites
§ Do startups have enough users to A/B test?
–  Startups typically look for larger effects
–  5% vs. 0.5% difference → 100 times more users!
§ Startups should establish A/B testing culture
early
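The 100x figure follows from the fact that the required sample size grows with one over the squared effect size. A minimal sketch (not from the talk), using the standard normal-approximation formula for comparing two proportions at 5% significance and 80% power:

```python
# Back-of-envelope sample size per arm for detecting a relative lift
# on a baseline conversion rate.
from scipy.stats import norm

def n_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    p1, p2 = p_base, p_base * (1 + rel_lift)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * variance / (p1 - p2) ** 2

big = n_per_arm(0.10, 0.05)     # detect a 5% relative lift on a 10% baseline
small = n_per_arm(0.10, 0.005)  # detect a 0.5% relative lift
print(round(big), round(small), round(small / big))  # the ratio is roughly 100
```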
A Scalable Experimentation System
A/B Testing 3 Steps
Design
•  What/Whom to experiment on
Deploy
•  Code deployment
Analyze
•  Impact on metrics
A/B Testing Platform Architecture
1.  Experiment Management
2.  Online Infrastructure
3.  Offline Analysis
Example: Bing A/B
1. Experiment Management
§ Define experiments
–  Whom to target?
–  How to split traffic?
§ Start/stop an experiment
§ Important addition:
–  Define success criteria
–  Power analysis
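As an illustration of what an experiment definition captures, here is a hypothetical config sketch; the field names are made up and are not the schema of any real platform:

```python
# Hypothetical experiment definition -- illustrative field names only.
experiment = {
    "name": "contacts-page-redesign",
    "targeting": {"interface_locale": "en_US"},      # whom to target
    "traffic": {"control": 0.5, "treatment": 0.5},   # how to split traffic
    "success_metrics": ["pageviews", "revenue"],     # pre-declared success criteria
    "min_detectable_effect": 0.01,                   # input to the power analysis
    "state": "running",                              # start / stop
}
```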
2. Online Infrastructure
1)  Hash & partition: random & consistent
2)  Deploy: server-side, as a change to
–  The default configuration (Bing)
–  The default code path (LinkedIn)
3)  Data logging
[Diagram: Hash(ID) maps each user to a point on a 0–100% range, which is partitioned into Treatment1 (20%), Treatment2 (20%), and Control (the remainder)]
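A minimal sketch of hash & partition (assumed hash function and bucket boundaries, matching the diagram above): hashing the member ID makes the assignment random across users yet consistent for the same user on every request.

```python
import hashlib

def bucket(member_id: str, salt: str = "exp-salt") -> float:
    """Map an ID to a stable point in [0, 100)."""
    digest = hashlib.md5(f"{salt}:{member_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 / 100.0

def assign(member_id: str) -> str:
    b = bucket(member_id)
    if b < 20:
        return "treatment1"   # 0-20%
    if b < 40:
        return "treatment2"   # 20-40%
    return "control"          # remaining 60%
```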
Hash & Partition @ Scale (I)
§ Pure bucket system (Google/Bing before 200X)
[Diagram: a single 0–100% range is carved up across experiments, e.g. Exp. 1 (20%), Exp. 2 (20%), Exp. 3 (60%), with Exp. 3's slice further split into red / green / yellow variants (15% / 15% / 30%)]
•  Does not scale
•  Traffic management
Hash & Partition @ Scale (II)
§ Fully overlapping system
[Diagram: Exp. 1 and Exp. 2 each span the full 0–100% range with their own independent splits, e.g. Exp. 1 into A1 / B1 / control and Exp. 2 into A2 / B2 / control]
•  Each experiment gets 100% traffic
•  A user is in “all” experiments simultaneously
•  Randomization between experiments is independent (each experiment hashes on a unique hashID; see the sketch below)
•  Cannot avoid interactions between experiments
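A sketch of how that independence can be achieved (reusing the bucket() helper from the previous sketch): salting the hash with the experiment ID gives every experiment its own randomization over 100% of traffic, so the same user can land in one experiment's treatment and another's control.

```python
def assign_overlapping(member_id: str, experiment_id: str, treatment_pct: float) -> str:
    # Per-experiment salt -> independent randomization across experiments.
    b = bucket(member_id, salt=experiment_id)
    return "treatment" if b < treatment_pct else "control"

assign_overlapping("member-42", "exp-1", 50.0)
assign_overlapping("member-42", "exp-2", 20.0)  # independent of the exp-1 assignment
```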
Hash & Partition @ Scale (III)
§ Hybrid: Layer + Domain
•  Centralized management (Bing)
•  Central exp. team creates/manages layers/domains
•  De-centralized management (LinkedIn)
•  Each experiment is one “layer” by default
•  Experimenter controls hashID to create a “domain”
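A simplified sketch of the layer idea (again reusing bucket(); domains, which partition traffic into mutually exclusive slices, are omitted): experiments in different layers overlap freely, while experiments inside the same layer split that layer's traffic and therefore never share a user.

```python
def assign_layers(member_id: str, layers: dict) -> dict:
    """layers = {layer_id: [(experiment_id, pct), ...]} with pct sums <= 100."""
    assignments = {}
    for layer_id, experiments in layers.items():
        b = bucket(member_id, salt=layer_id)    # independent hash per layer
        low = 0.0
        for experiment_id, pct in experiments:  # experiments partition the layer
            if low <= b < low + pct:
                assignments[layer_id] = experiment_id
                break
            low += pct
        else:
            assignments[layer_id] = None        # not in any experiment in this layer
    return assignments
```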
Data Logging
§  Trigger: the condition under which a member is actually exposed to the experiment
§  Trigger-based logging
–  Log whether a request is actually affected by the
experiment
–  Log for both factual & counter-factual
[Venn diagram: of all 300MM+ LinkedIn members, only the members visiting the contacts page are triggered into the experiment]
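A sketch of trigger-based logging (reusing assign_overlapping() from the sketch above; the renderers are hypothetical placeholders): the trigger event is emitted at the decision point in both arms, so the control logs the counter-factual exposure and the triggered populations stay comparable.

```python
import json, time

def render_new_contacts(member_id):   # hypothetical placeholder
    return f"new contacts page for {member_id}"

def render_old_contacts(member_id):   # hypothetical placeholder
    return f"old contacts page for {member_id}"

def log_trigger(member_id, experiment_id, variant):
    # In production this would feed a tracking pipeline; print() stands in here.
    print(json.dumps({"ts": time.time(), "member": member_id,
                      "experiment": experiment_id, "variant": variant}))

def serve_contacts_page(member_id: str):
    variant = assign_overlapping(member_id, "contacts-redesign", 50.0)
    log_trigger(member_id, "contacts-redesign", variant)  # logged for both arms
    if variant == "treatment":
        return render_new_contacts(member_id)
    return render_old_contacts(member_id)
```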
3. Automated Offline Analysis
§  Large-scale data processing, e.g. daily @LinkedIn
–  200+ experiments
–  700+ metrics
–  Billions of experiment trigger events
§  Statistical analysis
–  Metrics design
–  Statistical significance test (p-value, confidence interval)
–  Deep-dive: slicing & dicing capability
§  Monitoring & alerting
–  Data quality
–  Early termination
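A minimal sketch of the per-metric significance test run in such an analysis pipeline: a two-sample comparison of means with a p-value and a confidence interval on the delta (real pipelines handle ratio metrics, variance estimation via the delta method, and more).

```python
import numpy as np
from scipy import stats

def compare(control: np.ndarray, treatment: np.ndarray, alpha: float = 0.05):
    """Return (delta, p-value, confidence interval) for treatment vs. control means."""
    delta = treatment.mean() - control.mean()
    se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
                 control.var(ddof=1) / len(control))
    z = delta / se
    p_value = 2 * stats.norm.sf(abs(z))
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return delta, p_value, (delta - z_crit * se, delta + z_crit * se)
```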
Best Practices
Example: Unified Search
What to Experiment?
Measure one change at a time.
[Diagram: En-US traffic is split 50% / 50%: one half gets unified search plus experiments 1+2+…+N, the other half stays on pre-unified search]
What to Measure?
§ Success metrics: summarize whether
treatment is better
§ Puzzling example:
–  Key metrics for Bing: number of searches &
revenue
–  Ranking bug in experiment resulted in poor search
results
–  Number of searches up +10% and revenue up
+30%
Success metrics should reflect long-term impact
Scientific Experiment Design
§ How long to run the experiment?
§ How much traffic to allocate to treatment?
Story:
§  Site speed matters
–  Bing: +100msec = -0.6% revenue
–  Amazon: +100msec = -1.0% revenue
–  Google: +100msec = -0.2% queries
§  But not for Etsy.com?
“Faster results better? … meh”
Power
§ Power: the chance of detecting a
difference when there really is one.
§ Two reasons your feature doesn’t move
metrics
1.  No “real” impact
2.  Not enough power
Properly power up your experiment!
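A sketch of a power calculation under the usual normal approximation for two proportions (illustrative numbers): at a small sample size the test is nearly blind to a real 1% relative lift, while a much larger sample gives roughly 80% power.

```python
from scipy.stats import norm
import numpy as np

def power_two_proportions(p_control, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test for the difference of two rates."""
    se = np.sqrt(p_control * (1 - p_control) / n_per_arm +
                 p_treatment * (1 - p_treatment) / n_per_arm)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - abs(p_treatment - p_control) / se)

power_two_proportions(0.100, 0.101, n_per_arm=10_000)     # severely underpowered
power_two_proportions(0.100, 0.101, n_per_arm=1_400_000)  # roughly 0.8
```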
Statistical Significance
§ Which experiment has a bigger impact?
              Experiment 1   Experiment 2
Pageviews     1.5%           12.9%
Revenue       0.8%           2.4%
Statistical Significance
§ Which experiment has a bigger impact?
              Experiment 1              Experiment 2
Pageviews     1.5%                      12.9%
Revenue       0.8% (stat. significant)  2.4%
Statistical Significance
§ Must consider statistical significance
–  A 12.9% delta can still be noise!
–  Identify signal from noise; focus on the “real” movers
–  Ensure results are reproducible
              Experiment 1              Experiment 2
Pageviews     1.5%                      12.9%
Revenue       0.8% (stat. significant)  2.4%
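To see how a large delta can be noise while a small one is real, here is a worked illustration with made-up counts (the sample sizes behind the table are not given in the talk):

```python
from scipy.stats import norm
import math

def two_proportion_p(x_c, n_c, x_t, n_t):
    """Two-sided p-value for the difference of two conversion rates."""
    p_c, p_t = x_c / n_c, x_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return 2 * norm.sf(abs(p_t - p_c) / se)

# A ~13% relative lift on 400 users per arm: p ~ 0.30, not significant.
print(two_proportion_p(100, 400, 113, 400))
# A 0.8% relative lift on 400,000 users per arm: p ~ 0.04, significant.
print(two_proportion_p(100_000, 400_000, 100_800, 400_000))
```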
Multiple Testing
§ Famous xkcd comic on Jelly Beans
Multiple Testing Concerns
§ Multiple ramps
–  Pre-decide a ramp to base decision on (e.g. 50/50)
§ Multiple “peeks”
–  Rely on “full”-week results
§ Multiple variants
–  Choose the best, then rerun to see if it replicates
§ Multiple metrics
An irrelevant metric is statistically
significant. What to do?
§  Which metric?
§  How “significant”? (p-value)
[Diagram: tiered p-value thresholds. The less directly a metric is expected to be impacted by the experiment, the stricter the bar: 1st-order metrics (directly impacted by the exp.) at p < 0.05, 2nd-order metrics (maybe impacted) at p < 0.01, all other metrics at p < 0.001]
Watch out for multiple testing
With 100 metrics, how many would you see stat. significant even if your experiment does NOTHING? About 5 (at p < 0.05).
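A quick simulation of that trap (simulated data with no real effect anywhere), plus the simplest guard, a Bonferroni-style stricter per-metric threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_metrics, n_users = 100, 10_000

p_values = []
for _ in range(n_metrics):
    control = rng.normal(0.0, 1.0, n_users)
    treatment = rng.normal(0.0, 1.0, n_users)   # no true difference
    p_values.append(stats.ttest_ind(control, treatment).pvalue)
p_values = np.array(p_values)

print((p_values < 0.05).sum())               # typically around 5 false positives
print((p_values < 0.05 / n_metrics).sum())   # Bonferroni threshold: almost always 0
```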
References
§  Tang, Diane, et al. "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
§  Kohavi, Ron, et al. "Online Controlled Experiments at Large Scale." Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
§  LinkedIn blog post:
http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
Additional Resources: RecSys’14 A/B testing workshop