Are Observational Causal Studies Hard To Get Right?
In last week's post, Ron Kohavi cited a 2022 Facebook Ads paper to demonstrate that, despite using massive data, observational methods can lead to biased estimates of causal effects.
The authors show that two "observational" methods, stratified propensity score matching (SPSM) and double machine learning (DML), lead to significant overestimation of the causal effect of ads on conversion compared to a reference A/B test (6x and 2.9x, respectively).
Assuming the reference A/B test has been conducted and analyzed correctly, these numbers indeed don't look good for SPSM and DML.
I agree with Ronny that these results are a "massive overestimation", especially when it comes to SPSM.
Yet...
More Is Better. Or Is It?
I have a hard time agreeing with Ronny's diagnosis that "there was an enormous effort (on the part of the authors) to address selection bias with 5,000 user-level features (yes, five thousand)."
Here's why:
The authors correctly note in their paper that DML requires two assumptions to be fulfilled in order to work: unconfoundedness and positivity.
Both are basic causal inference assumptions, required by a vast majority of standard causal estimators.
However, what the authors miss is that these are not the only assumptions required to get unbiased estimates from DML.
Unconfoundedness is one of the assumptions required to obtain causal identification for interventions under the standard structural causal model framework.
But causal identification is not only about the variables we should include in the model.
It's also about the ones we should avoid.
When we're interested in the total causal effect of, let's say, X on Y, we should avoid including in our model variables that act as "middlemen" between X and Y, so-called mediators.
We should also avoid variables that are descendants of both the treatment and the outcome (or, for that matter, of any pair drawn from the treatment, a mediator, and the outcome).
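Before getting to what the paper did, here's a toy sketch of the first of these rules in action. The numbers and the SCM are entirely made up (nothing from the paper): the moment the mediator enters the adjustment set, we recover only the direct effect instead of the total effect we were after.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy SCM (illustrative numbers only, not the paper's data):
# X -> M -> Y and X -> Y, so the total effect of X on Y is 1.0 + 0.8 * 0.5 = 1.4
x = rng.normal(size=n)
m = 0.8 * x + rng.normal(size=n)             # mediator
y = 1.0 * x + 0.5 * m + rng.normal(size=n)   # outcome

def slope_of_x(design):
    """OLS coefficient of the first column (X), with an intercept."""
    X = np.column_stack([design, np.ones(n)])
    return np.linalg.lstsq(X, y, rcond=None)[0][0]

print(f"Y ~ X:      {slope_of_x(x):.2f}   (total effect, truth: 1.40)")
print(f"Y ~ X + M:  {slope_of_x(np.column_stack([x, m])):.2f}   (only the direct effect, 1.00)")
```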
The authors seem not to take this into account.
In fact, they explicitly state that their modeling used features like "estimated probability of conversion given exposure", which looks like it opens a (noisy) collider path between a mediator and the outcome (a seeming paradox: the better the estimate of this probability, the more bias we get!).
It also seems that, among the four categories of features mentioned by the authors, at least some would mediate the impact of ads on conversion, even if only partially (I assume this would also hold if we change the estimand from ATE to ATT).
Both colliders and mediators introduce bias in causal estimates, and this bias can be arbitrarily more severe than the bias reduction we gain from controlling for confounders (or their proxies).
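A collider can be even nastier than a mediator. In the toy simulation below (again, purely illustrative and not tied to the paper's data), X has no effect on Y at all, yet adjusting for their common descendant C manufactures a strong spurious "effect" out of thin air.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Toy SCM (illustrative only): X has NO causal effect on Y,
# but C is a common descendant (a collider) of X and Y.
x = rng.normal(size=n)
y = rng.normal(size=n)                    # true effect of X on Y: exactly 0
c = x + y + 0.5 * rng.normal(size=n)      # collider

def slope_of_x(design):
    """OLS coefficient of the first column (X), with an intercept."""
    X = np.column_stack([design, np.ones(n)])
    return np.linalg.lstsq(X, y, rcond=None)[0][0]

print(f"Y ~ X:      {slope_of_x(x):+.2f}   (truth: 0.00)")
print(f"Y ~ X + C:  {slope_of_x(np.column_stack([x, c])):+.2f}   (spurious 'effect' appears)")
```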
What I read as an unfortunate implicit conclusion from both the paper and Ronny's post is that the number of variables we include in causal models is a somewhat reliable indicator of the effort put into causal modeling.
Practice seems to support what we know from modern causal theory: models with a small(er) number of well-selected features almost always outperform models with a massive, unselected feature set.
The Narrow Temptation Of Extrapolation Or...
The idea that "the more features, the better" in causal modeling seems to be an extrapolation from predictive modeling, where adding more predictors to our machine learning models often (though not always) improves in-distribution performance.
As it turns out, this idea does not extrapolate very well to causal modeling.
My dream is that one day we'll start teaching causality as a framework, rather than a set of loosely connected methods, procedures and tricks.
I briefly outlined this idea in Lecture 9 of Causal Secrets, and I believe it can immunize us against popular misconceptions about causal modeling, which, along with the increase in popularity of causal modeling itself, seem to be on the rise.
...The Broader View
My belief is that looking at causality as a framework gives us more opportunities to honestly evaluate which methods and techniques can help us answer the questions we care about.
Does this make causal inference, and particularly observational causal inference easy to get right?
In my opinion, not necessarily.
Causality is difficult!
Yet, the framework perspective allows us to approach causal modeling more thoughtfully and to answer questions that are often impossible to address, or even invisible, when we choose to have only a partial view.
Closing
I've never met anyone whom understanding causal identification or design of experiments (DoE) made a worse experimenter, or anyone who learned more about experimentation and became a worse causal data scientist.
I've met many who, by learning these topics as part of a bigger framework, were able to answer questions they could never answer before.
I've prepared a notebook for you demonstrating how controlling for different types of variables impacts the magnitude of bias in DML.
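I can't reproduce the whole notebook here, but to give you a flavor of what it shows, below is a minimal, self-contained sketch of the same idea: a hand-rolled partialling-out DML (cross-fitted residuals via scikit-learn) on a toy SCM of my own invention. All variable names and coefficients are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
n = 20_000

# Toy SCM (illustrative only): W confounds T and Y, M mediates T -> Y,
# and C is a descendant of both M and Y (a collider).
w = rng.normal(size=(n, 3))                                 # confounders
t = w @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)     # treatment
m = 0.8 * t + rng.normal(size=n)                            # mediator
y = 1.0 * t + 0.5 * m + w @ np.array([0.4, 0.4, -0.2]) + rng.normal(size=n)
c = m + y + rng.normal(size=n)                              # collider
true_total_effect = 1.0 + 0.5 * 0.8                         # = 1.4

def dml_ate(controls):
    """Partially linear DML: regress cross-fitted Y-residuals on T-residuals."""
    ml = lambda: RandomForestRegressor(n_estimators=100, min_samples_leaf=25, random_state=0)
    y_res = y - cross_val_predict(ml(), controls, y, cv=2)
    t_res = t - cross_val_predict(ml(), controls, t, cv=2)
    return (t_res @ y_res) / (t_res @ t_res)

print(f"truth (total effect):      {true_total_effect:.2f}")
print(f"confounders only:          {dml_ate(w):.2f}")                           # close to 1.4
print(f"+ mediator:                {dml_ate(np.column_stack([w, m])):.2f}")     # drops toward 1.0
print(f"+ mediator and collider:   {dml_ate(np.column_stack([w, m, c])):.2f}")  # even further off
```

The exact numbers will vary with the seed and the learners you plug in, but the pattern is the point: the richer the set of bad controls, the further the estimate drifts from the effect we actually wanted.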
Comments

Servant | Strategist | Pragmatist | Skeptical Empiricist:
Aleksander Molak, thanks for sharing this interesting paper and the discussion/debate with Ron Kohavi. I appreciate how your commentary highlights the importance of careful variable choice in causal inference. I think the issue may lie in model development. To me, the paper's central problem is one of model specification: it assumed a flawed causal structure, which then manifested as a feature selection problem by including post-treatment variables. The paper leaned on a predictive mindset, which is why the authors thought adding strong signals from the user action category, many of which are bad controls, made sense, since they improve classification metrics. Many of these variables are post-treatment constructs that, once included, open spurious paths and bias the estimated effect (and can cause even more operational concerns). I see your notebook illustrating this point, but I wonder whether the actual Facebook data used in the paper is available anywhere, or if it remains restricted to the Meta research team. This absolutely relates to model development, and it can have costly consequences if not addressed or prevented. If curious, please see here: https://www.linkedin.com/feed/update/urn:li:activity:7363957674721308675/
Staff Applied Scientist @ TCG | Harvard grad (Statistics & Machine Learning) | Causal Inference/Experimentation+ML (Causal Discovery, G-Methods, TMLE, DML, RL, PGM) | AI Teaching @ Stanford | E-comm, FinTech, PE/VC:
(Part 1): Nice post, Aleksander Molak. Though may I suggest an update to your collider discussion. As fragmented as the Causal Inference space is, I've noticed many fields only think about colliders in the context of conditioning on downstream common effects of treatment and outcome. Many assume that if all potential covariates causally precede treatment and outcome, then collider bias need not be worried about. Some comments in this thread imply this. This viewpoint is incorrect, and most collider bias in large, complex real-world datasets occurs in structures and variables that causally precede both treatment and outcome. This has been known in the Biostatistics space for decades but is yet to catch up in other disciplines. Attached is a canonical paper on the subject by Jamie Robins' group from the 2000s. Consider Figure 9 with the causal DAG E<-U->C<-L->D. Conditioning on C is collider bias and opens a non-causal associational path from E to D (a path that was previously closed when not conditioning on C). Yet notice that C causally precedes both treatment and outcome. This structure was named "M-bias" by Sander Greenland in 2003. M-bias (in my opinion) is likely everywhere in the Facebook paper. (below: Part 2 and paper link)
Ph.D. in Economics:
I think this is the issue raised by Andrew Rothman last year, and I agree with all of these comments by him. Here is the link: https://www.linkedin.com/feed/update/urn:li:activity:7251287256005513217?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7251287256005513217%2C7252044202664996864%29&replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7251287256005513217%2C7252055539218571266%29&dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287252044202664996864%2Curn%3Ali%3Aactivity%3A7251287256005513217%29&dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287252055539218571266%2Curn%3Ali%3Aactivity%3A7251287256005513217%29
Staff Data Scientist | Causal Inference, Experimentation:
Thanks Aleksander Molak - this is the type of response that the original one needed, imo.
Thanks for clarifying where the errors arose from. I was beginning to question the generalisability of the methods 😅