WHAT’S THE SCIENCE IN
DATA SCIENCE?
Skipper Seabold, Director of Data Science R&D, Product Lead
Civis Analytics
@jseabold
PyData LA
Los Angeles, CA
October 23, 2018
All great ideas come while you’re focused on something else. All
motivation to act comes from Twitter.
Four decades of econom(etr)ic history
In 120 seconds.
A focus on research design takes the con out of econometrics
1983: “Let’s Take the Con out of Econometrics.” Criticism of the current state of econometric research. Calls for a focus on threats to validity (sensitivity analysis) of non-randomized control trials.
1980s-2000s: Quasi-Experimental Design. Lends credibility, especially to micro, through “a clear-eyed focus on research design … [such as] it would command in a real experiment.”
1990s-2000s: Randomized Control Trials. In parallel, increasing use of randomized control trials in economics, especially in development micro.
2000s: Worries about External Validity. “Counterrevolutionary” worries that the focus on experimental and quasi-experimental research design comes at the expense of external validity.
“Good designs have a beneficial side effect: they
typically lend themselves to a simple explanation
of empirical methods and a straightforward
presentation of results.”
These changes have led to increased relevance in policy decisions
What does this have to do with data science?
First, some context.
Obtain, Scrub, Explore, Model, and iNterpret (OSEMN)
Mason and Wiggins, 2010
Have the “ability to [create] prototype-level versions of ... the steps needed to
derive new insights or build data products”
Analyzing the Analyzers, 2013
Use multidisciplinary methods to understand and have a measurable impact on
a business process or product
Me, Today
Data science exists to drive better business outcomes
Multidisciplinary teams use the (Data) Scientific Method to
measure impacts
Question
Start with a product or business question. E.g., how do our marketing
campaigns perform? What’s driving employee attrition?
Hypothesize
Research, if necessary, and write a falsifiable hypothesis. E.g.,
the ROI on our marketing campaigns is greater than break-even.
Research
Design
Design a strategy that allows you to test your hypothesis,
noting all threats to validity.
Analyze
Analyze all data and evidence. Test for threats to
validity.
Communicate
or Productize
Communicate results in a way that stakeholders
will understand or engineering can use.
The current and coming credibility crisis in data science
Question Objectives are often unclear and success is left undefined.
Hypothesize
Research, if necessary, and write a falsifiable hypothesis. E.g.,
the ROI on our marketing campaigns is greater than break-even.
Research
Design
Black-box(-ish) predictive models are often the focus.
Threats to validity are an afterthought or brushed aside.
Analyze
Analyze all data and evidence. Test for threats to
validity.
Communicate
or Productize
Decision-makers don’t understand the value of
data science projects.
So, data science is about running experiments?
Well, kind of.
The randomized control trial is the gold standard of scientific
discovery
[Diagram: Population → Random Assignment to Treatment → Treated vs. Control → Split & Measure → Higher Productivity]
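As a minimal sketch (not from the talk), this is roughly what analyzing such an experiment looks like in Python with simulated data: randomize assignment, then compare average outcomes across arms.

```python
# A minimal sketch (synthetic data, not the talk's analysis): simulate a
# randomized control trial and estimate the average treatment effect with a
# difference in means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

treated = rng.integers(0, 2, size=n).astype(bool)            # random assignment
productivity = 10 + 0.5 * treated + rng.normal(0, 2, size=n)  # true effect is 0.5

ate = productivity[treated].mean() - productivity[~treated].mean()
t_stat, p_value = stats.ttest_ind(productivity[treated], productivity[~treated])
print(f"Estimated ATE: {ate:.3f} (p = {p_value:.3g})")
```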
But sometimes a randomized control trial is a hard internal sell
(yes, these are often strawman arguments)
The results will be difficult
to interpret.
“We’ll never really know
whether what we tried was
the real reason people
stayed.”
Running a control ad is too
expensive.
“I have to pay what? Just to
buy space for some PSA?”
No one wants to be “B.”
“That’s taking food out of my
mouth.”
And even if we could, sometimes experiments can go wrong
Threats to Validity: Insufficient Randomization, Partial Compliance, Attrition, Spillovers
This is where the social
science comes in handy.
And I’m not just justifying my life
choices. This is mostly true.
In the absence of an experiment, we might try to measure an
intervention by running a regression
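A minimal sketch of that regression approach, on synthetic data (the column names `trained`, `tenure`, and `productivity` are hypothetical, not from the talk):

```python
# A minimal sketch (synthetic data): regress an outcome on an intervention
# flag plus observed controls with statsmodels. The coefficient on `trained`
# is only causal if the threats to validity discussed next are addressed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "trained": rng.integers(0, 2, size=n),
    "tenure": rng.normal(5, 2, size=n),
})
df["productivity"] = (10 + 0.4 * df["trained"] + 0.2 * df["tenure"]
                      + rng.normal(0, 1, size=n))

fit = smf.ols("productivity ~ trained + tenure", data=df).fit()
print(fit.params["trained"], fit.bse["trained"])
```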
Here are some threats to validity in a regression approach
Is our sample truly representative of our population of interest? E.g., we didn’t track who attended trainings but did a survey after; 40% responded, and 65% of those attended a training.
Are we sure that we understand the direction of causation? E.g., low-productivity offices may have been targeted for training programs.
Did we omit variables that could plausibly explain our outcome of interest? E.g., people may pursue training or further education on their own.
Avoiding threats to validity
Research design strategies
from the social sciences
Use instrumental variables for overcoming simultaneity or
omitted variable biases
Can get around problems like “reverse causation,” where the outcome causes changes in the treatment, or common unobserved confounders, where an unobserved variable affects both the outcome and the variable of interest, leading to spurious correlation.
We can replace X with “instruments” that are
correlated with X but not caused by Y or that
affect Y but only through X.
An example from the Fulton fish market: stormy weather as an
instrument
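A minimal two-stage least squares sketch in the spirit of that example, on synthetic data: stormy weather shifts supply (and hence price) but shouldn’t affect demand directly.

```python
# A minimal manual 2SLS sketch (synthetic data): stormy weather as an
# instrument for price in a demand equation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2_000
stormy = rng.integers(0, 2, size=n)          # instrument z
demand_shock = rng.normal(0, 1, size=n)      # unobserved confounder
price = 1.0 + 0.8 * stormy + 0.5 * demand_shock + rng.normal(0, 1, size=n)
quantity = 5.0 - 1.0 * price + demand_shock + rng.normal(0, 1, size=n)

# Stage 1: project the endogenous price onto the instrument.
first = sm.OLS(price, sm.add_constant(stormy)).fit()
price_hat = first.fittedvalues

# Stage 2: regress quantity on the predicted price.
second = sm.OLS(quantity, sm.add_constant(price_hat)).fit()
print("Naive OLS slope:", sm.OLS(quantity, sm.add_constant(price)).fit().params[1])
print("2SLS slope:     ", second.params[1])
# Note: standard errors from this two-step shortcut are not correct; a
# dedicated IV estimator (e.g., linearmodels' IV2SLS) handles that.
```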
Use matching to overcome non-random treatment assignment
Makes up for the lack of a properly randomized
control group by “matching” treated people to
untreated people who look almost exactly like the
treated group to create a “pseudo-control” group.
Then look at the average treatment effect across
groups.
An illustration of a matched control experiment
[Diagram: Population → Non-Random Treatment → Treated vs. Pseudo-Control → Match & Measure → Higher Productivity]
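A minimal sketch of one common flavor of matching (propensity scores with nearest neighbors), on synthetic data; other matching schemes differ in detail.

```python
# A minimal sketch (synthetic data): estimate a propensity score, match each
# treated unit to its nearest untreated neighbor, and compare outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=(n, 3))                         # observed covariates
p_treat = 1 / (1 + np.exp(-(x[:, 0] - 0.5 * x[:, 1])))
treated = rng.random(n) < p_treat                   # non-random assignment
y = 2.0 + 0.5 * treated + x @ np.array([1.0, -0.5, 0.2]) + rng.normal(0, 1, n)

# Propensity scores from observed covariates.
ps = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]

# Match each treated unit to the untreated unit with the closest score.
nn = NearestNeighbors(n_neighbors=1).fit(ps[~treated].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
att = (y[treated] - y[~treated][idx.ravel()]).mean()
print("Matched estimate of the effect on the treated:", round(att, 3))
```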
Use difference-in-differences for repeated measures to account
for underlying trends
Used when you don’t have an RCT, but you
observe one or more groups over two or more
periods of time, and there is an intervention
between time periods.
A simple example of difference-in-differences
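A minimal difference-in-differences sketch on a synthetic two-group, two-period setup; the interaction coefficient is the estimate.

```python
# A minimal diff-in-diff sketch (synthetic data): two groups, two periods,
# intervention hits the treated group in the second period.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 4_000
df = pd.DataFrame({
    "treated_group": rng.integers(0, 2, size=n),
    "post": rng.integers(0, 2, size=n),
})
# Shared time trend (+1.0), group difference (+0.5), true effect (+0.3).
df["y"] = (2.0 + 0.5 * df["treated_group"] + 1.0 * df["post"]
           + 0.3 * df["treated_group"] * df["post"] + rng.normal(0, 1, n))

fit = smf.ols("y ~ treated_group * post", data=df).fit()
print(fit.params["treated_group:post"])   # the diff-in-diff estimate
```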
Regression discontinuity exploits thresholds that determine
treatment
Exploits technical, natural, or randomly occurring
cutoffs in treatment assignment to measure the
treatment effect. Those observations close to but
on opposite sides of the cutoff should be alike in
all but whether they are treated or not.
A simple example of a regression discontinuity design
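A minimal sharp regression discontinuity sketch on synthetic data: fit separate lines on each side of the cutoff within a bandwidth and take the jump.

```python
# A minimal sharp RDD sketch (synthetic data): treatment switches on at a
# known cutoff of the running variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
running = rng.uniform(-1, 1, size=n)      # e.g., a score used in a threshold rule
treated = running >= 0.0                  # cutoff at zero
y = 1.0 + 2.0 * running + 0.4 * treated + rng.normal(0, 0.5, size=n)

bandwidth = 0.25
window = np.abs(running) <= bandwidth
left, right = window & ~treated, window & treated

fit_l = sm.OLS(y[left], sm.add_constant(running[left])).fit()
fit_r = sm.OLS(y[right], sm.add_constant(running[right])).fit()
# Jump at the cutoff = difference of the two intercepts evaluated at 0.
print("RD estimate:", fit_r.params[0] - fit_l.params[0])
```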
Survey experiments can be another great tool
Somewhere in between a “true experiment” and a quasi-experiment, a survey experiment gives slightly changed survey instruments to randomly selected survey participants. They provide an interesting way to test many hypotheses and are heavily and increasingly used in the social and data sciences, particularly in political science.
Some types of survey experiments
Priming & Framing: Change the context or wording of questions. This can be used, for example, to understand how some message or creative content influences a survey taker vs. a control group.
List: Change whether a sensitive item is included in a list of response options. May be able to avoid types of biases that can arise in other approaches. Can be used to measure brand favorability after some bad publicity, for example.
Scenario-based: Change aspects of a hypothetical scenario. When combined with a conjoint design, can be used, for example, to understand how users view different aspects of a potential product without asking every variation of everyone.
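A minimal sketch of how a list experiment is analyzed (synthetic data, not from the talk): the treatment group’s list includes one extra, sensitive item, and the difference in mean item counts estimates the share endorsing it.

```python
# A minimal list-experiment sketch (synthetic data): difference in mean item
# counts between the two randomly assigned list versions.
import numpy as np

rng = np.random.default_rng(0)
n = 3_000
treatment = rng.integers(0, 2, size=n).astype(bool)     # which list was shown
baseline_items = rng.binomial(3, 0.4, size=n)           # counts from 3 neutral items
holds_sensitive_view = rng.random(n) < 0.25             # true rate we hope to recover
counts = baseline_items + (treatment & holds_sensitive_view)

estimate = counts[treatment].mean() - counts[~treatment].mean()
print(f"Estimated prevalence of the sensitive item: {estimate:.3f}")
```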
The attention to research
design focuses the discussion
By using these methods, you discuss particular threats to
validity and whether identifying assumptions are met
Matching: Assignment to treatment is random conditional on observed characteristics (Selection on observables; Unconfoundedness). You have covariate balance between the treatment and control groups (Overlap).
Regression Discontinuity: Look for clumping at the discontinuity. People may adjust their behavior if they understand the selection mechanism.
Instrumental Variables: The instrument z affects the outcome y only through the variable x, not directly (Exclusion Restriction), & the instrument z is correlated with the endogenous variable x (Relevance Restriction).
What about all the fancy
machine learning I hear
about?
We’re not limited to regression. There have been interesting
advances in machine learning aimed at causal inference.
Bayesian Additive Regression Trees (BART; Chipman, et al.)
Causal Trees & Forests (Imbens & Athey)
Interpretable Modeling (LIME, SHAP, etc.)
Double Machine Learning (Chernozhukov, Hansen, etc.)
G-Estimation (used in epidemiology)
SuperLearner (van der Laan & Polley)
Post-Selection Inference (Taylor, Barber, Imai & Ratkovic, etc.)
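As one minimal sketch of the ideas in this list, here is the double / debiased machine learning recipe on synthetic data: partial the confounders out of both the outcome and the treatment with a flexible learner, then regress residual on residual.

```python
# A minimal sketch of the double machine learning idea (synthetic data):
# cross-fitted nuisance predictions, then a residual-on-residual regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 3_000
x = rng.normal(size=(n, 5))                              # confounders
treatment = x[:, 0] + 0.5 * x[:, 1] ** 2 + rng.normal(0, 1, n)
y = 0.5 * treatment + np.sin(x[:, 0]) + x[:, 2] + rng.normal(0, 1, n)

# Cross-fitted predictions guard against overfitting bias.
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100), x, y, cv=5)
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=100), x, treatment, cv=5)

fit = sm.OLS(y - y_hat, sm.add_constant(treatment - t_hat)).fit()
print("Estimated treatment effect:", fit.params[1])      # true value is 0.5
```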
A few examples of research design-based thinking at work
A non-random walk through some (mostly) recent papers. Pro tip: start or join a journal club at work.
“Here, There, and Everywhere: Correlated
Online Behaviors Can Lead to Overestimates of
the Effects of Advertising”
A retreat to experiments
by Randall A. Lewis, Justin M. Rao, and David H. Reiley
Observational experiment methods often overestimate the
causal impact of advertising due to “activity bias”
○ 200M impressions for Major Firm on Y! w/ 10% control
○ Tracked sign-ups on a Competitor Firm
○ Both saw more sign-ups for both ads! (And Y! usage)
○ Survey experiment on Amazon Mechanical Turk
○ Treatment (800) and control (800) of ad on Y! activity
○ Treatment and control had the same effects
○ 250m impressions on Yahoo! FP w/ 5% control
○ 5.4% increase in search traffic via experiment
○ Matching (1198%), Regression (800%), Diff-in-diff (100%)
“The Unfavorable Economics of Measuring the
Returns to Advertising”
The limits of measurement
by Randall A. Lewis and Justin M. Rao
It is extremely difficult but not impossible to measure the effect
of an advertising campaign
Noisy Effects: The standard deviation of sales vs. the average is around 10:1 across many industries.
Small Effect Sizes: Campaigns have a low spend per person, and the ROI on a break-even campaign is small.
Experiments & Observational Methods: Designing an experiment to measure this effect is hard. Observational methods look attractive but overestimate effects due to targeted advertising.
The power of an experiment is the likelihood of finding an effect
if it exists in the population
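A minimal sketch of what such a power calculation looks like with statsmodels; the numbers here are hypothetical, chosen only to echo the paper’s point that tiny effects buried in noisy outcomes require enormous samples.

```python
# A minimal power-calculation sketch: sample size per arm needed to detect a
# small standardized effect with 80% power in a two-sample t-test.
from statsmodels.stats.power import tt_ind_solve_power

effect_size = 0.02   # hypothetical Cohen's d: effect relative to the outcome's std. dev.
n_per_arm = tt_ind_solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"Required sample size per arm: {n_per_arm:,.0f}")
```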
Experimental design for control vs. treatment digital ads
Measured Outcome: New Sales & New Account Sign-Ups
25 Experiments: Ran for 2-135 days (median 14); cost $10k-$600k+; measured 100k to 10M impressions
Industries: Retail Sales (including department stores) & Financial Services
Not a single experiment studied had enough power to measure a 10% return precisely
Profitable Campaign: 12 of 25 are severely underpowered; only 3 had sufficient power to conclude they were “wildly profitable” (ROI of 50%).
Greater than 10% Return: Every experiment is underpowered.
No Impact vs. Positive Return: 9 out of 25 fail to reject the null of no impact; 10 out of 25 have enough power to test.
Results: The powered experiments generally reveal a significant, positive effect for advertising. The median campaign would have to be 9 times larger to have sufficient power; the median retail campaign would have to be 61 times larger, and for financial services, 1241 times larger (!).
“Courtyard by Marriott: Designing a Hotel
Facility with Consumer-Based Marketing Models”
Change the course of industry
By Jerry Wind, Paul E. Green, Douglas Shifflet, and Marsha
Scarbrough
“Marriott used conjoint analysis to design a new
hotel chain [in the early 80s].”
A well-designed survey experiment resulted in a wildly
successful product ideation for Marriott
The survey design asked two target segments about 7 facets,
considering 50 attributes with 2 to 8 levels each
7 FACETS: External Factors, Rooms, Food, Lounge, Services, Leisure, Security
How successful was this experiment?
As of 1989, Marriott had gone from 3 test cases in 1983 to 90 hotels in 1987, with more than $200 million in sales. The chain was expected to grow to 300 hotels by 1994 with an expected $1 billion in sales.
Captured market share within 4% of the prediction.
Created 3,000 new jobs within Marriott, with an expected 14,000 new jobs by 1994.
Prompted a restructuring among all competitors in the midprice-level lodging industry (upgrades, prices, amenities, and followers).
Wrapping Up
Yes, Judea Pearl is a keynote speaker, and I talked about causal inference but didn’t mention do-calculus.
A credibility crisis in data science is preventable
Question
Work with stakeholders to focus on business-relevant, measurable outcomes from the start.
Hypothesize Be clear about the causal mechanisms that you will test.
Research
Design
Keep it sophisticatedly simple. A well thought out
research design leads to better communication of results.
Analyze
Be honest about any threats to validity. You can test
these in the future. This is scientific progress.
Communicate
or Productize
Go the last mile. This step enables data science
to have decision-making relevance.
Did I miss some good papers? Come @ me on Twitter. Until then, here are some more papers and books on my desk.
“A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook”
“Confounding in Survey Experiments”
“Using Big Data to Estimate Consumer Surplus: The Case of Uber”
Mastering Metrics: The Path from Cause to Effect [Looks good!]
“A/B Testing” [Experiments at Microsoft using Bing EXP Platform]
“Measuring Consumer Sensitivity to Audio Advertising: A Field Experiment on
Pandora Internet Radio”
“Attribution Inference for Digital Advertising using Inhomogeneous Poisson
Models”
