I was writing a quick informal chat message summary of causal inference, and I decided to post it here instead. This post is inspired by Michael Nielsen’s “If correlation doesn’t imply causation, then what does?”. To complement that post, I’m going to avoid using figures and will instead try to state some basic intuition — I recommend Michael’s post if you want to dive deeper.

Does smoking cause lung cancer?

The gold standard for answering causal questions is the randomized controlled experiment. Choose a population, split it into three random groups, force one group to smoke, force one group not to smoke, and use the third as a control group. We can denote the results of these experiments as $P(\text{cancer} \mid \text{do}(\text{smoking}))$, $P(\text{cancer} \mid \text{do}(\neg\text{smoking}))$, and $P(\text{cancer})$, respectively. This technique always works (as long as you’re careful) but is often impractical / unethical.
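To make the contrast concrete, here is a minimal Python sketch of the idea, with every probability invented for illustration: a hidden factor raises both the tendency to smoke and the cancer risk, so the passive smoker vs. non-smoker comparison overstates the effect, while randomly assigning smoking does not.

```python
# A toy simulation of the experiment above (all probabilities are invented).
# A hidden "gene" raises both the tendency to smoke and the cancer risk, so
# the passive smoker vs. non-smoker comparison is confounded; the randomized
# assignment is not.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

gene = rng.random(n) < 0.3                         # hidden confounder

def cancer_given(smokes, gene):
    # structural model: cancer risk depends on smoking and on the hidden gene
    risk = 0.01 + 0.04 * smokes + 0.05 * gene
    return rng.random(len(smokes)) < risk

# Observational world: the gene makes smoking more likely.
smokes_obs = rng.random(n) < np.where(gene, 0.6, 0.2)
cancer_obs = cancer_given(smokes_obs, gene)

# Randomized experiment: smoking is assigned by coin flip, ignoring the gene.
smokes_rct = rng.random(n) < 0.5
cancer_rct = cancer_given(smokes_rct, gene)

print("observational P(cancer | smoking)        ", round(cancer_obs[smokes_obs].mean(), 4))
print("observational P(cancer | no smoking)     ", round(cancer_obs[~smokes_obs].mean(), 4))
print("experimental  P(cancer | do(smoking))    ", round(cancer_rct[smokes_rct].mean(), 4))
print("experimental  P(cancer | do(no smoking)) ", round(cancer_rct[~smokes_rct].mean(), 4))
```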

Answering this type of question from purely observational data is hard. The best-known strategy is to find all of the possible confounding factors and control for them. Are smokers also more likely to drink alcohol? Is their socioeconomic status lower? We must find every relevant factor and test whether smoking correlates with lung cancer within each subpopulation. Exhaustively listing these factors is sometimes possible, but often it isn’t. As Fisher pointed out, there could be a genetic factor that causes cancer and also makes you more likely to smoke. How would you control for that?
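As a sketch of what “control for them” means in practice (again with invented numbers), suppose drinking is the only real cause of cancer in a toy model and is correlated with smoking. The naive comparison makes smoking look harmful, but stratifying on the observed confounder removes the spurious association:

```python
# Controlling for an *observed* confounder by stratifying on it.
# Hypothetical setup: drinking causes cancer and is correlated with smoking;
# smoking itself has no effect in this toy model.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

drinks = rng.random(n) < 0.4
smokes = rng.random(n) < np.where(drinks, 0.5, 0.2)
cancer = rng.random(n) < (0.01 + 0.06 * drinks)    # only drinking matters

# Naive comparison: smoking looks associated with cancer.
print("P(cancer | smoking)    =", round(cancer[smokes].mean(), 4))
print("P(cancer | no smoking) =", round(cancer[~smokes].mean(), 4))

# Stratify on the confounder: within each stratum the association vanishes.
for d in (True, False):
    stratum = drinks == d
    print(f"drinks={d}: "
          f"P(cancer | smoking) = {cancer[stratum & smokes].mean():.4f}, "
          f"P(cancer | no smoking) = {cancer[stratum & ~smokes].mean():.4f}")
```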

Causal inference researchers have shown that there is another way to infer causality from observational data. The trick is: rather than observing upstream factors like alcohol or genetics, observe downstream factors. If we can find some other observable factor that lies on the causal chain between “smoking” and “lung cancer”, for example “tar in lungs”, then we can account for the unknown upstream factors. We use “tar in lungs” as the more direct test of smoking-induced cancer, and we use “smoking” to signal whether the unknown upstream factor is present. So if our assumption is correct that smoking’s causal influence on cancer acts entirely through tar in the lungs, then we can infer causation from purely observational data. In fact, we can infer the result of a randomized experiment without performing the experiment! The probability of cancer given a smoking intervention ends up being

$$P(\text{cancer} \mid \text{do}(\text{smoking})) = \sum_{t \,\in\, \{\text{tar},\, \neg\text{tar}\}} P(t \mid \text{smoking}) \sum_{s \,\in\, \{\text{smoking},\, \neg\text{smoking}\}} P(\text{cancer} \mid t, s)\, P(s),$$

and each of the terms in this expanded summation can be measured from observational data.
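This calculation is usually called the front-door adjustment. Here is a hedged Python sketch of it on simulated data (the structural probabilities are invented, and, as the argument assumes, smoking affects cancer only through tar in this toy model): the double sum, computed from purely observational frequencies, matches the result of a simulated intervention.

```python
# Front-door-style estimate of P(cancer | do(smoking)) from observational
# data, checked against a simulated intervention. All probabilities invented.
# Structure: hidden gene -> smoking, gene -> cancer, smoking -> cancer only via tar.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

def simulate(force_smoking=None):
    gene = rng.random(n) < 0.3
    if force_smoking is None:
        smokes = rng.random(n) < np.where(gene, 0.6, 0.2)     # observational
    else:
        smokes = np.full(n, force_smoking)                    # do(smoking=...)
    tar = rng.random(n) < np.where(smokes, 0.9, 0.1)          # smoking -> tar
    cancer = rng.random(n) < 0.01 + 0.1 * tar + 0.05 * gene   # tar, gene -> cancer
    return smokes, tar, cancer

# Purely observational data: the gene is never recorded or used below.
smokes, tar, cancer = simulate()

def P(event, given=None):
    # empirical (conditional) probability from the observational sample
    return event.mean() if given is None else event[given].mean()

# P(cancer | do(smoking)) = sum_t P(t | smoking) * sum_s P(cancer | t, s) P(s)
estimate = 0.0
for t in (True, False):
    p_t_given_smoking = P(tar == t, given=smokes)
    inner = sum(P(cancer, given=(tar == t) & (smokes == s)) * P(smokes == s)
                for s in (True, False))
    estimate += p_t_given_smoking * inner

# Ground truth from the (simulated) randomized intervention.
_, _, cancer_do = simulate(force_smoking=True)

print("front-door estimate of P(cancer | do(smoking)): ", round(estimate, 4))
print("simulated interventional P(cancer | do(smoking)):", round(cancer_do.mean(), 4))
print("naive observational P(cancer | smoking):         ", round(P(cancer, given=smokes), 4))
```

Notice that the hidden gene is generated inside the simulation but never touched when computing the estimate: the double sum uses only the observed smoking, tar, and cancer values, which is the whole point.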

One important detail of this equation: it depends on having data for populations meeting the condition $P(\text{tar}, \neg\text{smoking}) > 0$. That is, there must be people who don’t smoke but have tar in their lungs due to other causes (like pollution or being a firefighter). This adds a constraint to the type of “downstream” factors that we can use – if the only way for tar to appear in lungs were through smoking, then we would need to find a different downstream factor.
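That requirement is easy to check directly in a dataset. A toy sketch (the records are invented; in practice these would be your recorded observations):

```python
# Check whether the data contains any non-smokers with tar (invented records).
import numpy as np

smokes = np.array([True, True, False, False, False])
tar    = np.array([True, True, True,  False, False])   # one non-smoker with tar

n_nonsmokers_with_tar = int(np.sum(~smokes & tar))
if n_nonsmokers_with_tar == 0:
    print("No non-smokers with tar: P(cancer | tar, no smoking) cannot be "
          "estimated, so a different downstream factor would be needed.")
else:
    print(f"{n_nonsmokers_with_tar} non-smoker(s) with tar in the sample; "
          "the required conditional probabilities can be estimated.")
```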

For me, this is still black magic. Like Michael (circa 2012), I don’t yet completely understand the logic of the equation, even though I know how to derive it.

Causal inference is the study of this black magic. Without tricks like this, an AI will be prone to learning bad theories like “packing an umbrella causes rain”. Can we build machines that infer causality in principled ways? One big open question is whether the human brain is inherently capable of causal inference, or if causal inference is more of a recent cognitive / cultural ability. On one hand, we build advanced models of the world; maybe causal inference is precisely what children are doing as they play. On the other hand, people tend to incorrectly infer causation from correlation. When I was a kid, I thought that the wind was created by trees waving their branches. (I grew up in a rural area surrounded by tall trees — it was a good theory.) So it is possible that understanding causal inference is not important to building something as smart as a 5-year-old. Whether or not that’s true, an AI that is capable of causal inference at its core is a compelling idea.