Randomization and Causal Sparseness
Suppose I'm running a randomized study: Treatment group A gets the medicine; control group B gets a placebo; later, I test both groups for disease X. I've randomized perfectly, the study is double-blind, there's perfect compliance, my disease measure is flawless, and no one drops out. After the intervention, 40% of the treatment group have disease X and 80% of the control group do. Statistics confirm that the difference is very unlikely to be chance (p < .001). Yay! Time for FDA approval!
There's an assumption behind the optimistic inference that I want to highlight. I will call it the Causal Sparseness assumption. This assumption is required for us to be justified in concluding that randomization has achieved what we want randomization to achieve.
So, what is randomization supposed to achieve?
Dice roll, please....
Randomization is supposed to achieve this: a balancing of other causal influences that might bear on the outcome. Suppose that the treatment works only for women, but we the researchers don't know that. Randomization helps ensure that approximately as many women are in treatment as in control. Suppose that the treatment works twice as well for participants with genetic type ABCD. Randomization should also balance that difference (even if we the researchers do no genetic testing and are completely oblivious to this influence). Maybe the treatment works better if the medicine is taken after a meal. Randomization (and blinding) should balance that too.
But here's the thing: Randomization balances such influences only in expectation. Of course, it could end up, randomly, that substantially more women are in treatment than control. It's just unlikely if the number of participants N is large enough. If we had an N of 200 in each group, the odds are excellent that the number of women will be similar between the groups, though of course there remains a minuscule chance (6 x 10^-61, assuming 50% women) that all 200 women are randomly assigned to treatment and none to control.
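The arithmetic behind that tiny number can be checked directly. If each participant is independently assigned to a group by a fair coin flip (a simple assignment model I'm adding for illustration), the chance that all ~200 women land in treatment is 0.5^200, while a near-even split is overwhelmingly the typical outcome:

```python
from math import comb

# Coin-flip assignment model (illustrative): each of ~200 women has an
# independent 50% chance of landing in the treatment group.
p_all_women_treated = 0.5 ** 200
print(f"P(all 200 women in treatment) = {p_all_women_treated:.2e}")  # ~6.2e-61

# The typical case: the number of women in treatment is Binomial(200, 0.5),
# so it almost always lands close to 100.
n_women = 200
p_within_15 = sum(comb(n_women, k) for k in range(85, 116)) * 0.5 ** n_women
print(f"P(85 <= women in treatment <= 115) = {p_within_15:.3f}")
```

So with N = 200 per group, any *particular* severe imbalance is wildly unlikely; the trouble comes from how many properties there are to balance.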
And here's the other thing: People (or any other experimental unit) have infinitely many properties. For example: hair length (cf. Rubin 1974), dryness of skin, last name of their kindergarten teacher, days since they've eaten a burrito, nearness of Mars on their 4th birthday....
Combine these two things and this follows: For any finite N, there will be infinitely many properties that are not balanced between the groups after randomization -- just by chance. If any of these properties are properties that need to be balanced for us to be warranted in concluding that the treatment had an effect, then we cannot be warranted in concluding that the treatment had an effect.
Let me restate in a less infinitary way: In order for randomization to warrant the conclusion that the intervention had an effect, N must be large enough to ensure balance of all other non-ignorable causes or moderators that might have a non-trivial influence on the outcome. If there are 200 possible causes or moderators to be balanced, for example, then we need sufficient N to balance all 200.
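A quick back-of-the-envelope calculation shows how hard that is. Suppose we call a covariate "unbalanced" when its imbalance exceeds the conventional one-sided 5% level (a threshold I'm choosing for illustration, not from anything above), and suppose the 200 candidate causes are independent. Then after a single randomization:

```python
# 200 independent candidate causes, each with a 5% chance of ending up
# "unbalanced" by chance (illustrative threshold and independence assumption).
n_causes, q = 200, 0.05
p_all_balanced = (1 - q) ** n_causes
print(f"P(all 200 balanced) = {p_all_balanced:.1e}")       # ~3.5e-05
print(f"expected number unbalanced = {n_causes * q:.0f}")  # 10
```

On these assumptions, some imbalance is practically guaranteed; the live question is whether the unbalanced causes matter.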
Treating all other possible and actual causes as "noise" is one way to deal with this. This is just to take everything that's unmeasured and make one giant variable out of it. Suppose that there are 200 unmeasured causal influences that actually do have an effect. Unless N is huge, some will be unbalanced after randomization. But it might not matter, since we ought to expect them to be unbalanced in a balanced way! A, B, and C are unbalanced in a way that favors a larger effect in the treatment condition; D, E, and F are unbalanced in a way that favors a larger effect in the control condition. Overall it just becomes approximately balanced noise. It would be unusual if all of the unbalanced factors A-F happened to favor a larger effect in the treatment condition.
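That cancellation can be illustrated with a small simulation (all numbers here are made up for illustration): 200 unmeasured coin-flip traits, each with a small additive effect on the outcome, and no true treatment effect at all. The spurious group difference produced by chance imbalances hovers around zero:

```python
import random

random.seed(0)

N_PER_GROUP, N_CAUSES, EFFECT = 200, 200, 0.1  # illustrative numbers

def spurious_group_difference():
    """Mean outcome difference (treatment - control) produced purely by
    chance imbalance in unmeasured causes; there is no true effect."""
    diff = 0.0
    for _ in range(N_CAUSES):
        # Count how many people in each group carry this coin-flip trait.
        in_treatment = sum(random.random() < 0.5 for _ in range(N_PER_GROUP))
        in_control = sum(random.random() < 0.5 for _ in range(N_PER_GROUP))
        diff += EFFECT * (in_treatment - in_control) / N_PER_GROUP
    return diff

diffs = [spurious_group_difference() for _ in range(100)]
mean_bias = sum(diffs) / len(diffs)
print(f"average spurious difference across simulations: {mean_bias:+.4f}")
```

Each individual cause is almost never exactly balanced, but because the imbalances point in both directions, their aggregate behaves like mean-zero noise, at least so long as no single cause dominates.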
That helps the situation, for sure. But it doesn't eliminate the problem. To see why, consider an outcome with many plausible causes, a treatment that's unlikely to actually have an effect, and a low-N study that barely passes the significance threshold.
Here's my study: I'm interested in whether silently thinking "vote" while reading through a list of registered voters increases the likelihood that the targets will vote. It's easy to randomize! One hundred get the think-vote treatment and another one hundred are in a control condition in which I instead silently think "float". I preregister the study as a one-tailed two-proportion test in which that's the only hypothesis: no p-hacking, no multiple comparisons. Come election day, in the think-vote condition 60 people vote and in the control condition only 48 vote (p = .04)! That's a pretty sizable effect for such a small intervention. Let's hire a bunch of volunteers?
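The numbers work out as described under a standard pooled one-tailed two-proportion z-test (the example doesn't name the exact test statistic, so this is one natural reading):

```python
from math import erf, sqrt

def one_tailed_two_proportion_p(x1, n1, x2, n2):
    """One-tailed pooled z-test: is group 1's proportion greater than group 2's?"""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 0.5 * (1 - erf(z / sqrt(2)))  # upper-tail normal probability

# 60/100 voted in the think-vote condition vs. 48/100 in control.
p = one_tailed_two_proportion_p(60, 100, 48, 100)
print(f"one-tailed p = {p:.3f}")  # ~0.04
```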
Suppose also that there are at least 40 variables that plausibly influence voting rate: age, gender, income, political party, past voting history.... The odds are good that at least one of these variables will be unequally distributed after randomization in a way that favors higher voting rates in the treatment condition. And -- as the example is designed to suggest -- it's surely more plausible, despite the preregistration, to think that that unequally distributed factor better explains the different voting rates between the groups than the treatment does. (This point obviously lends itself to Bayesian analysis.)
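The "odds are good" claim can be sanity-checked by Monte Carlo, under assumptions I'm adding for illustration: the 40 covariates are independent coin-flip traits, and a covariate "favors treatment" when its voting-promoting level is overrepresented in the treatment group at roughly the one-sided 5% level:

```python
import random

random.seed(1)

N_PER_GROUP, N_COVARIATES, SIMS = 100, 40, 1000  # illustrative setup
# Threshold ~ one-sided 5% level for the count difference between groups;
# that difference has standard deviation sqrt(2 * N * 0.25).
threshold = 1.645 * (2 * N_PER_GROUP * 0.25) ** 0.5

hits = 0
for _ in range(SIMS):
    for _ in range(N_COVARIATES):
        t = sum(random.random() < 0.5 for _ in range(N_PER_GROUP))
        c = sum(random.random() < 0.5 for _ in range(N_PER_GROUP))
        if t - c > threshold:  # covariate overrepresented in treatment
            hits += 1
            break
p_hat = hits / SIMS
print(f"P(at least one treatment-favoring imbalance) ~ {p_hat:.2f}")
```

On these assumptions, a treatment-favoring imbalance in at least one covariate is the expected outcome, not a fluke, which is what makes the alternative explanation so hard to dismiss in a low-N study of an implausible treatment.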
We can now generalize back, if we like, to the infinite case: If there are infinitely many possible causal factors that we ought to be confident are balanced before accepting the experimental conclusion, then no finite N will suffice. No finite N can ensure that they are all balanced after randomization.
We need an assumption here, which I'm calling Causal Sparseness. (Others might have given this assumption a different name. I welcome pointers.) It can be thought of as either a knowability assumption or a simplicity assumption: We can know, before running our study, that there are few enough potentially unbalanced causes of the outcome that, if our treatment gives a significant result, the effectiveness of the treatment is a better explanation than one of those unbalanced causes. The world is not dense with plausible alternative causes.
As the think-vote example shows, the plausibility of the Causal Sparseness assumption varies with the plausibility of the treatment and the plausibility that there are many other important causal factors that might be unbalanced. Assessing this plausibility is a matter of theoretical argument and verbal justification.
Making the Causal Sparseness assumption more plausible is one important reason we normally try to make the treatment and control conditions as similar as possible. (Otherwise, why not just trust randomness and leave the rest to a single representation of "noise"?) The plausibility of Causal Sparseness cannot be assessed purely mechanically through formal methods. It requires a theory-grounded assessment in every randomized experiment.