A/B Testing @ Internet Scale
Ya Xu
8/12/2014 @ Coursera
A/B Testing in One Slide
[Diagram: users are split between Control and Treatment (an 80% / 20% split), each seeing a different version of the "Join now" page; collect results to determine which one is better]
Outline
§ Culture Challenge
–  Why A/B testing
–  What to A/B test
§ Building a scalable experimentation system
§ Best practices
Why A/B Testing
Amazon Shopping Cart Recommendation
•  At Amazon, Greg Linden had this idea of showing
recommendations based on cart items
•  Trade-offs
•  Pro: cross-sell more items (increase average basket size)
•  Con: distract people from checking out (reduce conversion)
•  HiPPO (Highest Paid Person’s Opinion): stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
MSN Real Estate
§ “Find a house” widget variations
§ Revenue to MSN is generated every time a user clicks the search/find button
[Screenshots: widget variants A and B]
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
Take-away
Experiments are the only way to prove causality.
Use A/B testing to:
§ Guide product development
§ Measure impact (assess ROI)
§ Gain “real” customer feedback
What to A/B Test
Ads CTR Drop
[Chart: CTR of the profile top ads shows a sudden drop on 11/11/2013]
Root Cause
5 pixels!
[Screenshots: navigation bar and profile top ads]
What to A/B Test
§ Evaluating new ideas:
–  Visual changes
–  Complete redesign of web page
–  Relevance algorithms
–  …
§ Platform changes
§ Code refactoring
§ Bug fixes
Test Everything!
Startups vs. Big Websites
§ Do startups have enough users to A/B test?
–  Startups typically look for larger effects
–  5% vs. 0.5% difference → 100 times more users!
§ Startups should establish A/B testing culture
early
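The 100x figure follows from the fact that the required sample size grows with one over the squared effect size. A minimal sketch (not from the talk), using the standard normal-approximation formula for comparing two proportions at 5% significance and 80% power:

```python
# Back-of-envelope sample size per arm for detecting a relative lift
# on a baseline conversion rate.
from scipy.stats import norm

def n_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    p1, p2 = p_base, p_base * (1 + rel_lift)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * variance / (p1 - p2) ** 2

big = n_per_arm(0.10, 0.05)     # detect a 5% relative lift on a 10% baseline
small = n_per_arm(0.10, 0.005)  # detect a 0.5% relative lift
print(round(big), round(small), round(small / big))  # the ratio is roughly 100
```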
A Scalable Experimentation System
A/B Testing 3 Steps
Design
•  What/Whom to experiment on
Deploy
•  Code deployment
Analyze
•  Impact on metrics
A/B Testing Platform Architecture
1.  Experiment Management
2.  Online Infrastructure
3.  Offline Analysis
Example: Bing A/B
1. Experiment Management
§ Define experiments
–  Whom to target?
–  How to split traffic?
§ Start/stop an experiment
§ Important addition:
–  Define success criteria
–  Power analysis
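As an illustration of what an experiment definition captures, here is a hypothetical config sketch; the field names are made up and are not the schema of any real platform:

```python
# Hypothetical experiment definition -- illustrative field names only.
experiment = {
    "name": "contacts-page-redesign",
    "targeting": {"interface_locale": "en_US"},      # whom to target
    "traffic": {"control": 0.5, "treatment": 0.5},   # how to split traffic
    "success_metrics": ["pageviews", "revenue"],     # pre-declared success criteria
    "min_detectable_effect": 0.01,                   # input to the power analysis
    "state": "running",                              # start / stop
}
```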
2. Online Infrastructure
1)  Hash & partition: random & consistent
2)  Deploy: server-side, as a change to
–  The default configuration (Bing)
–  The default code path (LinkedIn)
3)  Data logging
[Diagram: Hash(ID) maps each user to a point on a 0–100% range, which is partitioned into Treatment1 (20%), Treatment2 (20%), and Control (the remainder)]
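A minimal sketch of hash & partition (assumed hash function and bucket boundaries, matching the diagram above): hashing the member ID makes the assignment random across users yet consistent for the same user on every request.

```python
import hashlib

def bucket(member_id: str, salt: str = "exp-salt") -> float:
    """Map an ID to a stable point in [0, 100)."""
    digest = hashlib.md5(f"{salt}:{member_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 / 100.0

def assign(member_id: str) -> str:
    b = bucket(member_id)
    if b < 20:
        return "treatment1"   # 0-20%
    if b < 40:
        return "treatment2"   # 20-40%
    return "control"          # remaining 60%
```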
Hash & Partition @ Scale (I)
§ Pure bucket system (Google/Bing before 200X)
[Diagram: a single 0–100% range is carved up across experiments, e.g. Exp. 1 (20%), Exp. 2 (20%), Exp. 3 (60%), with Exp. 3's slice further split into red / green / yellow variants (15% / 15% / 30%)]
•  Does not scale
•  Traffic management
Hash & Partition @ Scale (II)
§ Fully overlapping system
[Diagram: Exp. 1 and Exp. 2 each span the full 0–100% range with their own independent splits, e.g. Exp. 1 into A1 / B1 / control and Exp. 2 into A2 / B2 / control]
•  Each experiment gets 100% traffic
•  A user is in “all” experiments simultaneously
•  Randomization between experiments is independent (each experiment hashes on a unique hashID; see the sketch below)
•  Cannot avoid interactions between experiments
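A sketch of how that independence can be achieved (reusing the bucket() helper from the previous sketch): salting the hash with the experiment ID gives every experiment its own randomization over 100% of traffic, so the same user can land in one experiment's treatment and another's control.

```python
def assign_overlapping(member_id: str, experiment_id: str, treatment_pct: float) -> str:
    # Per-experiment salt -> independent randomization across experiments.
    b = bucket(member_id, salt=experiment_id)
    return "treatment" if b < treatment_pct else "control"

assign_overlapping("member-42", "exp-1", 50.0)
assign_overlapping("member-42", "exp-2", 20.0)  # independent of the exp-1 assignment
```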
Hash & Partition @ Scale (III)
§ Hybrid: Layer + Domain
•  Centralized management (Bing)
•  Central exp. team creates/manages layers/domains
•  De-centralized management (LinkedIn)
•  Each experiment is one “layer” by default
•  Experimenter controls hashID to create a “domain”
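A simplified sketch of the layer idea (again reusing bucket(); domains, which partition traffic into mutually exclusive slices, are omitted): experiments in different layers overlap freely, while experiments inside the same layer split that layer's traffic and therefore never share a user.

```python
def assign_layers(member_id: str, layers: dict) -> dict:
    """layers = {layer_id: [(experiment_id, pct), ...]} with pct sums <= 100."""
    assignments = {}
    for layer_id, experiments in layers.items():
        b = bucket(member_id, salt=layer_id)    # independent hash per layer
        low = 0.0
        for experiment_id, pct in experiments:  # experiments partition the layer
            if low <= b < low + pct:
                assignments[layer_id] = experiment_id
                break
            low += pct
        else:
            assignments[layer_id] = None        # not in any experiment in this layer
    return assignments
```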
Data Logging
§  Trigger: the condition under which a member is actually exposed to the experiment
§  Trigger-based logging
–  Log whether a request is actually affected by the
experiment
–  Log for both factual & counter-factual
[Venn diagram: of all 300MM+ LinkedIn members, only the members visiting the contacts page are triggered into the experiment]
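A sketch of trigger-based logging (reusing assign_overlapping() from the sketch above; the renderers are hypothetical placeholders): the trigger event is emitted at the decision point in both arms, so the control logs the counter-factual exposure and the triggered populations stay comparable.

```python
import json, time

def render_new_contacts(member_id):   # hypothetical placeholder
    return f"new contacts page for {member_id}"

def render_old_contacts(member_id):   # hypothetical placeholder
    return f"old contacts page for {member_id}"

def log_trigger(member_id, experiment_id, variant):
    # In production this would feed a tracking pipeline; print() stands in here.
    print(json.dumps({"ts": time.time(), "member": member_id,
                      "experiment": experiment_id, "variant": variant}))

def serve_contacts_page(member_id: str):
    variant = assign_overlapping(member_id, "contacts-redesign", 50.0)
    log_trigger(member_id, "contacts-redesign", variant)  # logged for both arms
    if variant == "treatment":
        return render_new_contacts(member_id)
    return render_old_contacts(member_id)
```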
3. Automated Offline Analysis
§  Large-scale data processing, e.g. daily @LinkedIn
–  200+ experiments
–  700+ metrics
–  Billions of experiment trigger events
§  Statistical analysis
–  Metrics design
–  Statistical significance test (p-value, confidence interval)
–  Deep-dive: slicing & dicing capability
§  Monitoring & alerting
–  Data quality
–  Early termination
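A minimal sketch of the per-metric significance test run in such an analysis pipeline: a two-sample comparison of means with a p-value and a confidence interval on the delta (real pipelines handle ratio metrics, variance estimation via the delta method, and more).

```python
import numpy as np
from scipy import stats

def compare(control: np.ndarray, treatment: np.ndarray, alpha: float = 0.05):
    """Return (delta, p-value, confidence interval) for treatment vs. control means."""
    delta = treatment.mean() - control.mean()
    se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
                 control.var(ddof=1) / len(control))
    z = delta / se
    p_value = 2 * stats.norm.sf(abs(z))
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return delta, p_value, (delta - z_crit * se, delta + z_crit * se)
```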
Best Practices
Example: Unified Search
What to Experiment?
Measure one change at a time.
[Diagram: En-US traffic is split 50% / 50%: one half gets unified search plus experiments 1+2+…+N, the other half stays on pre-unified search]
What to Measure?
§ Success metrics: summarize whether
treatment is better
§ Puzzling example:
–  Key metrics for Bing: number of searches &
revenue
–  Ranking bug in experiment resulted in poor search
results
–  Number of searches up +10% and revenue up
+30%
Success metrics should reflect long-term impact
Scientific Experiment Design
§ How long to run the experiment?
§ How much traffic to allocate to treatment?
Story:
§  Site speed matters
–  Bing: +100msec = -0.6% revenue
–  Amazon: +100msec = -1.0% revenue
–  Google: +100msec = -0.2% queries
§  But not for Etsy.com?
“Faster results better? … meh”
Power
§ Power: the chance of detecting a
difference when there really is one.
§ Two reasons your feature doesn’t move
metrics
1.  No “real” impact
2.  Not enough power
Properly power up your experiment!
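A sketch of a power calculation under the usual normal approximation for two proportions (illustrative numbers): at a small sample size the test is nearly blind to a real 1% relative lift, while a much larger sample gives roughly 80% power.

```python
from scipy.stats import norm
import numpy as np

def power_two_proportions(p_control, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test for the difference of two rates."""
    se = np.sqrt(p_control * (1 - p_control) / n_per_arm +
                 p_treatment * (1 - p_treatment) / n_per_arm)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - abs(p_treatment - p_control) / se)

power_two_proportions(0.100, 0.101, n_per_arm=10_000)     # severely underpowered
power_two_proportions(0.100, 0.101, n_per_arm=1_400_000)  # roughly 0.8
```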
Statistical Significance
§ Which experiment has a bigger impact?
              Experiment 1   Experiment 2
Pageviews     1.5%           12.9%
Revenue       0.8%           2.4%
Statistical Significance
§ Which experiment has a bigger impact?
              Experiment 1              Experiment 2
Pageviews     1.5%                      12.9%
Revenue       0.8% (stat. significant)  2.4%
Statistical Significance
§ Must consider statistical significance
–  A 12.9% delta can still be noise!
–  Identify signal from noise; focus on the “real” movers
–  Ensure results are reproducible
              Experiment 1              Experiment 2
Pageviews     1.5%                      12.9%
Revenue       0.8% (stat. significant)  2.4%
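To see how a large delta can be noise while a small one is real, here is a worked illustration with made-up counts (the sample sizes behind the table are not given in the talk):

```python
from scipy.stats import norm
import math

def two_proportion_p(x_c, n_c, x_t, n_t):
    """Two-sided p-value for the difference of two conversion rates."""
    p_c, p_t = x_c / n_c, x_t / n_t
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return 2 * norm.sf(abs(p_t - p_c) / se)

# A ~13% relative lift on 400 users per arm: p ~ 0.30, not significant.
print(two_proportion_p(100, 400, 113, 400))
# A 0.8% relative lift on 400,000 users per arm: p ~ 0.04, significant.
print(two_proportion_p(100_000, 400_000, 100_800, 400_000))
```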
Multiple Testing
§ Famous xkcd comic on Jelly Beans
Multiple Testing Concerns
§ Multiple ramps
–  Pre-decide a ramp to base decision on (e.g. 50/50)
§ Multiple “peeks”
–  Rely on “full”-week results
§ Multiple variants
–  Choose the best, then rerun to see if it replicates
§ Multiple metrics
An irrelevant metric is statistically
significant. What to do?
§  Which metric?
§  How “significant”? (p-value)
[Diagram: tiered p-value thresholds. The less directly a metric is expected to be impacted by the experiment, the stricter the bar: 1st-order metrics (directly impacted by the exp.) at p < 0.05, 2nd-order metrics (maybe impacted) at p < 0.01, all other metrics at p < 0.001]
Watch out for multiple testing
With 100 metrics, how many would you see stat. significant even if your experiment does NOTHING? About 5 (at p < 0.05).
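A quick simulation of that trap (simulated data with no real effect anywhere), plus the simplest guard, a Bonferroni-style stricter per-metric threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_metrics, n_users = 100, 10_000

p_values = []
for _ in range(n_metrics):
    control = rng.normal(0.0, 1.0, n_users)
    treatment = rng.normal(0.0, 1.0, n_users)   # no true difference
    p_values.append(stats.ttest_ind(control, treatment).pvalue)
p_values = np.array(p_values)

print((p_values < 0.05).sum())               # typically around 5 false positives
print((p_values < 0.05 / n_metrics).sum())   # Bonferroni threshold: almost always 0
```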
References
§  Tang, Diane, et al. "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
§  Kohavi, Ron, et al. "Online Controlled Experiments at Large Scale." Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
§  LinkedIn blog post:
http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
Additional Resources: RecSys’14 A/B testing workshop