(2022-03-23: This is the second iteration of this post. The original post was an off-the-cuff Slack message that I posted here, and it described causal inference as “black magic”. Since writing that, I have read “The Book Of Why” by Judea Pearl and I now have a clearer understanding.)

Does smoking cause lung cancer? Here are three approaches to answering this question.

Causal diagram with 'Smoking', 'Cancer', and 'Confounders', and three strategies listed: Intervene on 'Smoking', control for 'Confounders', or let the universe intervene on the arrow from 'Smoking' to 'Cancer'

Approach 3 is particularly interesting.


Approach 1: Intervene

The gold standard for answering causal questions is the randomized controlled experiment. Choose a population, split it into three random groups, force one group to smoke, force one group not to smoke, and use the third as a control group. We can denote the results of these experiments as \(P(\text{cancer} \mid do(\text{smoke}))\), \(P(\text{cancer} \mid do(\neg \text{smoke}))\), and \(P(\text{cancer})\), respectively. This technique always works (as long as you’re careful), but it is often impractical or unethical.

In terms of a causal diagram, this intervention removes one of the arrows.

Causal diagram with only two arrows. The one from 'Confounders' to 'Smoking' is removed.

Thus, with intervention, we can be confident that any correlation between smoking and cancer indicates causation (in some direction), not confounding.
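Here is a minimal sketch of why randomization helps, using exact arithmetic on a toy model. All of the numbers and names (`P_Z1`, `P_Y1_GIVEN_XZ`, `p_cancer_given_smoking`) are made up for illustration. The only thing that differs between the two scenarios is whether the confounder influences who smokes.

```python
# Hypothetical toy model: z = confounder, x = smoking, y = cancer (all binary).
P_Z1 = 0.3                                   # P(z = 1), invented for illustration
P_Y1_GIVEN_XZ = {(0, 0): 0.05, (1, 0): 0.15,  # P(y = 1 | x, z), keyed by (x, z)
                 (0, 1): 0.30, (1, 1): 0.40}


def p_cancer_given_smoking(p_x1_given_z, x):
    """P(y = 1 | x) when people with confounder value z smoke with
    probability p_x1_given_z[z]."""
    numerator = 0.0
    denominator = 0.0
    for z in (0, 1):
        p_z = P_Z1 if z == 1 else 1.0 - P_Z1
        p_x = p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]
        numerator += p_z * p_x * P_Y1_GIVEN_XZ[(x, z)]
        denominator += p_z * p_x
    return numerator / denominator


# Observational world: the confounder makes smoking more likely.
observational = {0: 0.2, 1: 0.8}
# Randomized trial: a coin flip decides, so the arrow from z to x is gone.
randomized = {0: 0.5, 1: 0.5}

obs_gap = p_cancer_given_smoking(observational, 1) - p_cancer_given_smoking(observational, 0)
rct_gap = p_cancer_given_smoking(randomized, 1) - p_cancer_given_smoking(randomized, 0)
print("observational gap:", obs_gap)
print("randomized gap:   ", rct_gap)
```

In this toy model the observational gap is more than twice the randomized one, because part of the observed correlation comes from the confounder rather than from smoking.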

Is it possible to infer causality from observational data? Sometimes, yes, depending on the causal graph. Read on.


Approach 2: The back-door adjustment

The most well-known strategy for answering causal questions using observational data is to find all of the possible confounding factors and control for them. The simplest way to control for these factors is to split the population into subpopulations, one for each value of the “Confounders” variable. Then, each subpopulation has its own causal diagram

Causal diagram with only 'Smoking' and 'Cancer' (no 'Confounders')

and any correlation will indicate causation (in some direction), because the confounders don’t vary within the subpopulation.

We can combine the results for subpopulations to predict the results of a randomized controlled trial for the full population:

\[\begin{align} P(y \mid do(x)) & = \sum\limits_{z}{P(z) P(y \mid do(x), z)} \\ & = \sum\limits_{z}{P(z) P(y \mid x, z)} \end{align}\]
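As a sanity check, the formula can be applied to a small made-up joint distribution. This is only a sketch: the probabilities and helper names (`bern`, `prob`, `back_door`) are my own assumptions, and it assumes a single binary confounder that is fully observed.

```python
from itertools import product

# Hypothetical model, invented for illustration: z -> x, z -> y, x -> y.
P_Z1 = 0.3                                    # P(z = 1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}               # P(x = 1 | z)
P_Y1_GIVEN_XZ = {(0, 0): 0.05, (1, 0): 0.15,  # P(y = 1 | x, z), keyed by (x, z)
                 (0, 1): 0.30, (1, 1): 0.40}


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def observed_joint():
    """Observational joint P(z, x, y), keyed by (z, x, y)."""
    return {
        (z, x, y): bern(P_Z1, z) * bern(P_X1_GIVEN_Z[z], x) * bern(P_Y1_GIVEN_XZ[(x, z)], y)
        for z, x, y in product((0, 1), repeat=3)
    }


def prob(joint, z=None, x=None, y=None):
    """Marginal probability of the variables that are not None."""
    wanted = (z, x, y)
    return sum(p for cell, p in joint.items()
               if all(w is None or w == c for w, c in zip(wanted, cell)))


def naive(joint, x):
    """P(y = 1 | x): plain conditioning, ignoring the confounder."""
    return prob(joint, x=x, y=1) / prob(joint, x=x)


def back_door(joint, x):
    """sum_z P(z) P(y = 1 | x, z), computed from the observational joint only."""
    return sum(
        prob(joint, z=z) * prob(joint, z=z, x=x, y=1) / prob(joint, z=z, x=x)
        for z in (0, 1)
    )


joint = observed_joint()
print("naive difference:   ", naive(joint, 1) - naive(joint, 0))
print("adjusted difference:", back_door(joint, 1) - back_door(joint, 0))
```

With these invented numbers, plain conditioning overstates the effect of smoking, while the adjusted estimate matches what the model says an intervention would produce.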

Accounting for all possible confounders is difficult. Are smokers also more likely to drink alcohol? Is their socioeconomic status lower? We must find all of these relevant factors and test whether smoking correlates with lung cancer within each subpopulation. Exhaustively listing these factors is sometimes possible, but often is not an option. As Fisher pointed out, it’s possible there is a genetic factor that causes cancer and also makes you more likely to smoke. How would you control for that?


Approach 3: The front-door adjustment

Here is a clever alternate strategy that allows us to infer causality from purely observational data. Rather than observing upstream factors like alcohol or genetic factors, we observe downstream factors. We find some other observable factor that is within the causal chain between “smoking” and “lung cancer”, for example, “tar in lungs”. If “tar in lungs” is a noisy mediator between smoking and lung cancer, then we can now infer causality by observing all three variables {smoking, tar in lungs, cancer} for the population. For “tar in lungs” to be a mediator, it must be the case that all causal influence of smoking on lung cancer is by way of tar in the lungs.

Causal diagram with 'Tar in lungs' inserted between 'Smoking' and 'Cancer'


There are a few useful ways of describing this technique:

  • We use “smoking” to signal whether the unknown upstream factor is present, and we use “tar in lungs” as the more direct test of smoking-induced cancer.
  • Rather than directly attacking the question “Does smoking cause cancer?”, we answer the question “Does tar in the lungs cause cancer?”, and we perform the back-door adjustment (Approach 2) treating “smoking” as a confounder. We combine this information with “Does smoking cause tar in the lungs?” to get our final answer.
  • We let the universe perform a randomized controlled trial for us! The variable “tar in lungs” is a noisy mediator – it won’t always be equal to the truth value of the “smoking” variable. Some smokers won’t have tar. Perhaps even some non-smokers will have tar (due to non-modeled causes like pollution or job hazards). According to our causal diagram, there are no confounders causing these random occurrences that also influence lung cancer, so within a smoking or non-smoking subpopulation, this random noise is functionally equivalent to random interventions.

We can estimate the result of a randomized controlled trial as follows:

\[\begin{align} P(y \mid do(x)) & = \sum\limits_{z \in \{\text{tar}, \neg \text{tar}\}}{P(z \mid do(x)) P(y \mid do(z))} \\ & = \sum\limits_{z \in \{\text{tar}, \neg \text{tar}\}}{P(z \mid x)\sum\limits_{x' \in \{ \text{smoke}, \neg \text{smoke} \}}{P(x')P(y \mid x', z)}} \end{align}\]

and each of the terms in this expanded summation can be measured from observational data.
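To convince myself that this works, here is a sketch on a made-up model with a hidden confounder `u` (think of Fisher's genetic factor). The numbers and helper names are assumptions for illustration. The estimator `front_door` only ever sees the observed joint over (smoking, tar, cancer), while `true_effect` cheats by using the hidden variable directly.

```python
from itertools import product

# Hypothetical model, invented for illustration:
# u -> x, u -> y (hidden confounder), x -> z -> y (tar mediates everything).
P_U1 = 0.5                                    # P(u = 1)
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}               # P(x = 1 | u)
P_Z1_GIVEN_X = {0: 0.1, 1: 0.9}               # P(z = 1 | x), a noisy mediator
P_Y1_GIVEN_ZU = {(0, 0): 0.1, (0, 1): 0.3,    # P(y = 1 | z, u), keyed by (z, u)
                 (1, 0): 0.4, (1, 1): 0.6}


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def observed_joint():
    """P(x, z, y) with the hidden confounder u summed out."""
    return {
        (x, z, y): sum(
            bern(P_U1, u) * bern(P_X1_GIVEN_U[u], x)
            * bern(P_Z1_GIVEN_X[x], z) * bern(P_Y1_GIVEN_ZU[(z, u)], y)
            for u in (0, 1)
        )
        for x, z, y in product((0, 1), repeat=3)
    }


def prob(joint, x=None, z=None, y=None):
    """Marginal probability of the variables that are not None."""
    wanted = (x, z, y)
    return sum(p for cell, p in joint.items()
               if all(w is None or w == c for w, c in zip(wanted, cell)))


def front_door(joint, x):
    """sum_z P(z | x) sum_x' P(x') P(y = 1 | x', z), from observed data only."""
    total = 0.0
    for z in (0, 1):
        p_z_given_x = prob(joint, x=x, z=z) / prob(joint, x=x)
        p_y_do_z = sum(
            prob(joint, x=xp) * prob(joint, x=xp, z=z, y=1) / prob(joint, x=xp, z=z)
            for xp in (0, 1)
        )
        total += p_z_given_x * p_y_do_z
    return total


def true_effect(x):
    """P(y = 1 | do(x)) computed directly from the model, using hidden u."""
    return sum(
        bern(P_Z1_GIVEN_X[x], z)
        * sum(bern(P_U1, u) * P_Y1_GIVEN_ZU[(z, u)] for u in (0, 1))
        for z in (0, 1)
    )


joint = observed_joint()
for x in (0, 1):
    print(x, front_door(joint, x), true_effect(x))
```

In this toy model the two columns agree up to floating-point error, even though `front_door` never touches `u`.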

Note that the mediator must be at least somewhat noisy. The expanded formula requires estimates of \(P(\text{cancer} \mid \text{tar}, \neg \text{smoke})\) and \(P(\text{cancer} \mid \neg \text{tar}, \text{smoke})\), so the population must contain some non-smokers with tar and some smokers without tar. If we just have one of these, we can use it to estimate the other. If we have neither, then “tar in lungs” is a redundant variable, identical to “smoking”, and it is of no use. In that case we would need to find a different mediator.
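A quick way to check this condition on a dataset is to count how many people fall into each (smoking, tar) combination. The sketch below uses invented counts and a hypothetical helper, `empty_cells`. Any combination it reports has no data, so the corresponding conditional cancer probability cannot be estimated directly.

```python
def empty_cells(counts):
    """Return the (smoke, tar) combinations that nobody in the data falls into.

    `counts` maps (smoke, tar, cancer) triples of 0/1 values to head counts.
    """
    empty = []
    for smoke in (0, 1):
        for tar in (0, 1):
            n = sum(counts.get((smoke, tar, cancer), 0) for cancer in (0, 1))
            if n == 0:
                empty.append((smoke, tar))
    return empty


# Invented counts: a noisy mediator populates all four (smoke, tar) cells.
noisy = {
    (0, 0, 0): 520, (0, 0, 1): 30, (0, 1, 0): 40, (0, 1, 1): 10,
    (1, 0, 0): 35, (1, 0, 1): 5, (1, 1, 0): 250, (1, 1, 1): 110,
}
# Invented counts: tar is present exactly when the person smokes.
deterministic = {
    (0, 0, 0): 560, (0, 0, 1): 40, (1, 1, 0): 285, (1, 1, 1): 115,
}

print(empty_cells(noisy))
print(empty_cells(deterministic))
```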

Noisy mediators give us interesting capabilities, and they come in many forms. In a medical study, a patient may decide to receive a treatment, then be prevented from receiving it by a car breakdown. To practitioners of causal inference, car breakdowns are gifts from the universe.