Mathematically Elegant Answers to Research Questions No One is Asking (meta-analysis, random effects models, and Bayes factors)

Mathematically Elegant Answers
to Questions No One is Asking
Uri Simonsohn

The overarching concern motivating this talk
• Reality check
• Stat folks: sorry, we have mere supporting roles
• Our research has no intrinsic value
• Extrinsic value: help researchers answer their questions
• As a JDMer, I worry
• "Do we study things we find interesting, but aren't useful?"
• As a methodologist, I worry
• "Do we study things we find interesting, but aren't useful?"
• But it's worse
• Most MBA students can decide whether 'embodied cognition'* is silly
• Most researchers can't decide whether 'Random Effects' are silly
• It's on us to be more transparent about what a method actually does
• Stop taking the math literally
• Start taking researchers seriously
Web: http://urisohn.com | Blog: http://datacolada.org

I think of it as a transparency issue
• Important that other methodologists can check our work
• Also important: researchers can evaluate if our work is useful
• Need to transparently (non-technically) explain actual trade-offs
• Not philosophical platitudes (likely to be misinterpreted)

How do researchers study things?
How do they choose study designs?
(meta-analytical mean; random effects; Bayes factors)
Taking math literally | Taking researchers seriously
Drawn at random | Carefully curated, actively non-random
From defined populations | From undefined/nonexistent populations (generally)
With known distributions | No population → no distribution; if distributions exist, each researcher has their own
Goal: estimate population mean effect | Goal: local test of this effect; qualitative generalization based on thinking

Outline
My Claim: Researchers don't want the answers provided by these tools
1. Mixed models (Platonic generalizability)
2. Meta-analysis (Overall means or subgroup means)
3. Bayes Factors (testing some average hypothesis)

Non-Random Effects:
Designing & Analyzing Experiments with
Multiple Stimuli (in The Real World)
Uri Simonsohn
ESADE, Barcelona
Andres Montealegre
Cornell (PhD Student)
Ioannis Evangelidis
ESADE, Barcelona

Psychology’s unique experimental challenge
• Hard & applied sciences
• What’s the impact of this vaccine? Randomize vaccine → Got Covid?
• What’s the impact of defaults? Randomize default → % organ donors?
• Psychology
• What’s the impact of disgust on moral judgments? → Moral judgment: Is this ok? (1-7)

Psychology experiments produce mere correlations
(this seems simultaneously obvious and earth shattering)
• We randomly assign stimuli to participants
• We do not randomly assign attributes to stimuli
• Stimuli are confounded
⇒ psychology experiments are confounded
• Example 1: Rubenstein et al. (1971)
• Homophonic words: slower recognition
• Participants randomly shown words, e.g. Pray & Pest
• Pray NOT randomly assigned to have homophone
• Reaction time to Pray vs Pest is confounded

Psychology experiments produce mere correlations
(this seems simultaneously obvious and earth shattering)
Example 2: Emotion induction papers
[Figure: participants are randomly assigned to watch the Trainspotting toilet scene (Disgust) or The Champ dead father scene (No Disgust); outcome: judged morality of incestuous sex. Other labels in the diagram: Arousal; Study perceived as objectionable; Unfairness; Nuclear family reminder; Endowment effect.]

Mixed-model consensus
• Concern is external validity
• Generalize beyond chosen stimuli
• Recommendations:
• Many stimuli
• Use mixed models
• Says nothing on:
• How to select stimuli (beyond, choose many, at random)
• How to learn from stimuli variation
[Clark 1971] >2,900 citations
[Baayen, Davidson, & Bates, 2008] >8,400 citations
[Barr, Levy, Scheepers, & Tily, 2013] >8,100 citations
[Judd, Westfall, & Kenny, 2012] >1,100 citations

Skipping:
Our paper proposes "Match-and-Mix 1.0"
6 steps to choosing (a few) stimuli
For this talk:
Let's focus on the statistical analysis of multi-stimuli experiments

Analyzing Studies with Many Stimuli
Example: Endowment effect (values shown on slide: $75, $14, $6)
TOOLS \ CHALLENGES | Dependence | Variation across stimuli | Generalizability
t-test | aggregate at subject level | cancels out, or can’t do this | researcher does it (critical thinking)
regression | cluster SE at subject level | stimuli fixed effects | researcher does it (critical thinking)
mixed model | subject random effect | stimuli random intercept | Platonic Generalizability (stimuli random slopes)

Platonic Generalizability
1. Assume a population of all possible stimuli exists
• All goods that exist
• All goods one could imagine
→ Now average them
• People. 50:50 Women:Men
• Endowment effect:
• x% Mugs
• y% Obama dinners
• z% refurbished iPhone 11
• "The" effect we estimate: weighted mean
2. Assume stimuli were chosen at random from it
3. Assume researcher wants to generalize / estimate (1)
(1) exists in theory only → We call it platonic generalizability.
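
To make the weighted-mean point concrete, here is a minimal sketch. The stimulus-level effects and the population shares (the x%, y%, z% above) are made-up numbers for illustration, not values from the talk, and `platonic_mean` is a hypothetical helper name.

```python
def platonic_mean(effects, weights):
    """Weighted mean of stimulus-level effects under assumed population shares."""
    total_weight = sum(weights.values())
    return sum(effects[s] * weights[s] for s in effects) / total_weight

# Hypothetical stimulus-level endowment effects (illustrative only).
effects = {"mug": 0.60, "obama_dinner": 0.10, "refurbished_iphone_11": 0.30}

# Two equally arbitrary guesses about the shares of each good in "the" population.
mostly_mugs = {"mug": 0.8, "obama_dinner": 0.1, "refurbished_iphone_11": 0.1}
mostly_dinners = {"mug": 0.1, "obama_dinner": 0.8, "refurbished_iphone_11": 0.1}

mean_a = platonic_mean(effects, mostly_mugs)
mean_b = platonic_mean(effects, mostly_dinners)
print(f"'The' effect: {mean_a:.2f} under one set of shares, {mean_b:.2f} under another")
```

Same data, same stimuli, different "population" effect: the estimand moves with shares that nobody can observe.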

If it were free to get platonic generalizability we may buy it.
But it is very expensive.
Next. Simulations for statistical power
1) Participants see n out of n stimuli
2) Participants see 2 out of n stimuli

Case 1. Subjects see n out of n
Takeaways:
- Nothing beats t-test
- Platonic mixed model has real power costs (but recovers as the number of stimuli increases)
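
The flavor of this comparison can be sketched with the standard library alone. This is not the simulation from the paper: as a stand-in for the mixed model with stimulus random slopes, it uses a by-stimulus t-test (treating the stimuli as the sample), and every parameter value below is an assumption chosen for illustration.

```python
import math
import random
import statistics

# Two-sided 5% critical values of Student's t (standard table values).
T_CRIT = {3: 3.182, 39: 2.023}

def one_sample_t(values):
    """t statistic for H0: mean = 0."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))

def simulate_power(n_sims=400, n_subj=40, n_stim=4, mu=0.3,
                   stim_sd=0.3, noise_sd=1.0, seed=1):
    """Share of simulated studies rejecting H0 under each analysis."""
    rng = random.Random(seed)
    hits_subject = hits_stimulus = 0
    for _ in range(n_sims):
        # Each simulated study draws its own stimulus-specific effects.
        stim_effects = [rng.gauss(0, stim_sd) for _ in range(n_stim)]
        # d[i][j]: treatment-minus-control difference, subject i, stimulus j.
        d = [[mu + stim_effects[j] + rng.gauss(0, noise_sd)
              for j in range(n_stim)] for _ in range(n_subj)]
        by_subject = [statistics.mean(row) for row in d]         # aggregate at subject level
        by_stimulus = [statistics.mean(col) for col in zip(*d)]  # stimuli as the sample
        hits_subject += abs(one_sample_t(by_subject)) > T_CRIT[n_subj - 1]
        hits_stimulus += abs(one_sample_t(by_stimulus)) > T_CRIT[n_stim - 1]
    return hits_subject / n_sims, hits_stimulus / n_sims

power_subject, power_stimulus = simulate_power()
print(f"subject-level t-test: {power_subject:.2f}, by-stimulus test: {power_stimulus:.2f}")
```

With only four stimuli, the analysis that generalizes across stimuli has three degrees of freedom to work with, which is where the power goes.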

Case 2. Subjects see 2 out of n
Takeaways:
- Controlling for stimuli increases power when subjects see k of n
- Mixed model still expensive
- Same pattern

Mixed model advocates know about power
• But they don’t care
• They worry t-tests have too many false-positives
• We sure care about false-positives
• But not about those
• We think they are true-positives.
• This can get philosophical…
…let's make it super concrete.

Next. Let’s contrast those two perspectives in a figure.

[Figure: effects of six disgust stimuli, labeled S1 to S6. x-axis: effect of stimulus on dependent variable (effect of disgust stimulus on immorality of cousin-sex). y-axis: proportion of stimuli.]
Stimuli in the figure:
• Watching bestiality video
• Read story of organs thief who BBQs them
• Imagine man clipping nails in metro
• Watch video of man clipping nails in metro
• Watch scene from “Trainspotting”
• Hold bucket of vomit for 3 minutes
Platonic generalizability: Is the mean of all possible stimuli 0?
Construct validity: Do we generally get the effect when expected?
Again, "the" population mean does not exist:
• Men:Women 50:50
• Nail-clipping men vs Trainspotting videos x%:y%?
S4 and S5 are true-positive effects

• Our interest as researchers should guide the tools we use
• Not vice versa
• We thus propose a tool to assess if you generally get an effect when you expect it: ‘Stimuli Plots’

Stimuli Plots
• Compute effect for each matched-pair of stimuli in control condition
• Assess if effect is obtained in general
• Assess if variation identifies
• Possible confounds
• Interesting moderators
• Ideas for the next study
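
The numbers behind such a plot can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout, the `stimulus_effects` helper, and the 1-7 ratings are all made up, and the plotting itself is left out.

```python
import math
import statistics

def stimulus_effects(data):
    """Return (stimulus, effect, standard error) tuples, sorted by effect.

    data maps each matched pair of stimuli to its treatment and control responses.
    """
    rows = []
    for stim, cells in data.items():
        t, c = cells["treatment"], cells["control"]
        effect = statistics.mean(t) - statistics.mean(c)
        # Standard error of a difference between two independent means.
        se = math.sqrt(statistics.variance(t) / len(t) + statistics.variance(c) / len(c))
        rows.append((stim, effect, se))
    return sorted(rows, key=lambda row: row[1])

# Made-up ratings on a 1-7 scale for three hypothetical matched pairs.
data = {
    "pair_A": {"treatment": [5, 6, 4, 5], "control": [3, 4, 4, 3]},
    "pair_B": {"treatment": [4, 4, 5, 3], "control": [4, 3, 5, 4]},
    "pair_C": {"treatment": [6, 5, 6, 7], "control": [4, 4, 3, 5]},
}

rows = stimulus_effects(data)
for stim, effect, se in rows:
    print(f"{stim}: effect = {effect:+.2f} (SE {se:.2f})")
```

Sorting by effect makes it easy to see whether the effect shows up in general and which pairs stand out as possible confounds or moderators.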
Next: stimuli plots for three published papers

Paper 1. Kupfer et al. (2020)
[Figure panels: Means by Stimuli; Effects by Stimuli. p = .0046]

Paper 2. Salerno & Slepian (2022)
No effect

Paper 3. Rottman & Young (2019)
Two points from our perspective:
1) Stimuli not matched purity/harm
2) Within harm: deer-hunting is an outlier

• Contrast information provided by t-test & stimuli-level data
• With mixed-model results
Generalizability

Outline
1. Mixed models
2. Meta-analysis
3. Bayes Factors (average hypothesis)

Also makes sense if taking math literally
1. Population of effects exists
2. Researchers sample at random
3. Estimand: overall mean

Why meaningless?
1) No quality control (skip here)
2) Combining incommensurate results

Example #1 of Incommensurate Findings
PNAS Nudge Meta-analysis
http://datacolada.org/105

Estimate #1
Effect size: d = -.12

Estimate #2
Effect size: d = 1.18

Meta-analysis
• Our estimate of 'the' effect of reminders:
(-.12 + 1.18) / 2 → d = .53
Yeah. That's what we wanted to know

Example #2 of Incommensurate Findings
Econometrica Nudge Meta-analysis
http://datacolada.org/106
+ 51%
+ 7%
+ 4%
The average environment nudge: ~21%
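
The arithmetic behind both averages is easy to reproduce. The sketch below uses only the estimates quoted in the two examples and a simple unweighted mean; the published meta-analyses weight studies, so treat this as an illustration of the logic rather than a replication.

```python
import statistics

# Estimates quoted in the two examples.
reminder_effects_d = [-0.12, 1.18]      # Example #1 (Cohen's d)
environment_nudges_pct = [51, 7, 4]     # Example #2 (percent change)

mean_d = statistics.mean(reminder_effects_d)
mean_pct = statistics.mean(environment_nudges_pct)

# The spread shows how little any single input resembles the average.
range_d = max(reminder_effects_d) - min(reminder_effects_d)
range_pct = max(environment_nudges_pct) - min(environment_nudges_pct)

print(f"mean d = {mean_d:.2f} (inputs span {range_d:.2f})")
print(f"mean nudge = {mean_pct:.0f}% (inputs span {range_pct} points)")
```

In both cases the average sits far from every estimate that went into it.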

That average only makes sense if we take the math literally.

• There is no population of effects
(What % of nudges involve website defaults vs researchers stopping by?)
• Researchers do not run studies at random
• Readers do not want to know the average effect

Outline
1. Mixed models
2. Meta-analysis
3. Bayes Factors (the average hypothesis)

• Let's do that for every possible hypothesis
• Not just t-shirt sizes

Uri's claims
1) Confidently: Many researchers would like this chart, and it would speak to their question.
2) Semi-confidently: But they could probably be persuaded that confidence intervals actually have the info they want.
3) Most confidently: Nobody wants the average blue number, especially not weighted by an assumed N(0, .71), i.e., the Bayes Factor.

Bayes Factors
• Taking math literally
• Assume there is a population of effect sizes
• Assume it is centered at 0 and symmetric
• Assume researchers draw studies at random
• Assume they wish to know if any particular study is more consistent with:
A) that family of all possible effects (including 0), or
B) the null of d = 0.
What researcher would read that and say "that's exactly what I want" ?
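
To see why a Bayes factor is "the average hypothesis", here is a minimal sketch under stated assumptions: the observed effect is treated as normally distributed around the true effect with a known standard error, the prior under the alternative is N(0, .71) with .71 read as a standard deviation, and the observed d = 0.4 with SE = 0.2 is made up. The function names are hypothetical, and this is not the default Bayes factor of any particular package.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf10_closed_form(d_obs, se, prior_sd=0.71):
    """BF10 for H0: delta = 0 vs H1: delta ~ N(0, prior_sd^2), with d_obs ~ N(delta, se^2)."""
    marginal_h1 = normal_pdf(d_obs, 0.0, math.sqrt(se ** 2 + prior_sd ** 2))
    return marginal_h1 / normal_pdf(d_obs, 0.0, se)

def bf10_as_average(d_obs, se, prior_sd=0.71, step=0.001, half_width=6.0):
    """Same quantity as a prior-weighted average of likelihood ratios over a grid of effects."""
    n_steps = round(2 * half_width / step)
    total = 0.0
    for i in range(n_steps + 1):
        delta = -half_width + i * step
        # Likelihood ratio of this one candidate effect against the null.
        likelihood_ratio = normal_pdf(d_obs, delta, se) / normal_pdf(d_obs, 0.0, se)
        # Weight it by how much of the prior sits at this candidate effect.
        total += normal_pdf(delta, 0.0, prior_sd) * likelihood_ratio * step
    return total

bf_exact = bf10_closed_form(d_obs=0.4, se=0.2)
bf_avg = bf10_as_average(d_obs=0.4, se=0.2)
print(f"closed form: {bf_exact:.3f}, weighted average: {bf_avg:.3f}")
```

The two numbers agree: the Bayes factor is the evidence for each candidate effect against the null, averaged with weights set by the assumed prior, negative effects included.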

Discussions
Math literally
• Ha ha, that's not "evidence"
• This or that paradox
• Don't you want to have a principled guide for inference?
Researchers seriously
• Does your research question involve an average of hypotheses with these particular weights?

Shortcomings to my argument
• I am equating my own take with taking researchers seriously
• It is possible to make common-sense arguments against many ideas
• That's OK. We can have those arguments.
• The meta-point:
→ We need methods arguments that real researchers can play jury to
