Understanding: definition, examples;
Representing selection bias in a DAG;
General solution;
Selection bias occurs when some part of the target population is not in the sampled population, or, more generally, when some population units are sampled at a different rate than intended by the investigator. A good sample will be as free from selection bias as possible.
— Sharon L. Lohr, Sampling: Design and Analysis 2nd
Selection bias arises in many fields, and the label is applied loosely: many different problems get called “selection bias”.
Several other names are also used for “selection bias”.
\(X\): exposure of interest
\(Y\): outcome of interest
\(S\): selection into study (S = 1 if selected)
We can estimate \[ RR^s_{XY} = \frac{Pr(Y = 1 \mid X = 1; S = 1)}{Pr(Y = 1 \mid X = 0; S = 1)} \]
which may not equal \(RR^t_{XY}\)
\(s\): subject to bias
\(t\): true
What is \(RR^t_{XY}\)? \(RR^t_{XY}\) is the true causal effect in the target population.
We will assume that if we estimated \(\frac{Pr(Y =1\mid X=1)}{Pr(Y =1\mid X=0)}\) in the full target population, with no selection, we would obtain \(RR^t_{XY}\).
— (Hernán, Hernández-Díaz, and Robins 2004)
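As a minimal illustration of the two quantities, the sketch below computes a risk ratio from 0/1 vectors, once for everyone and once restricted to \(S = 1\). The helper `risk_ratio()` and the eight-person data set are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Risk ratio Pr(Y = 1 | X = 1) / Pr(Y = 1 | X = 0) from 0/1 vectors.
# Hypothetical helper, not part of any package.
risk_ratio <- function(y, x) {
  mean(y[x == 1]) / mean(y[x == 0])
}

# Made-up toy data: 8 people, purely illustrative.
x <- c(1, 1, 1, 1, 0, 0, 0, 0)   # exposure
y <- c(1, 1, 0, 0, 1, 0, 0, 0)   # outcome
s <- c(1, 1, 1, 0, 1, 1, 0, 0)   # 1 = selected into the study

rr_all      <- risk_ratio(y, x)                  # everyone
rr_selected <- risk_ratio(y[s == 1], x[s == 1])  # only those with S = 1

c(all = rr_all, selected = rr_selected)
```

In these toy data the ratio for everyone is 0.5 / 0.25 = 2, while the ratio among the selected is (2/3) / 0.5, roughly 1.33, so restricting to \(S = 1\) changes the estimate.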
Consider a randomized trial of anti-retroviral therapy (\(X\)) among people living with HIV, with a goal of preventing the development of AIDS (\(Y\))
\(\frac{Pr(Y = 1 \mid X = 1)}{Pr(Y = 1 \mid X = 0)}\) is the risk ratio among people randomized to the intervention arm vs. standard of care
If some people drop out of the study, we can only estimate \(\frac{Pr(Y = 1 \mid X = 1; S = 1)}{Pr(Y = 1 \mid X = 0; S = 1)}\), the risk ratio among those who remain (\(S = 1\)).
library(DiagrammeR)  # provides grViz() for rendering Graphviz DOT graphs
grViz("
digraph causal{
node[shape=none]
X
node [shape = box,
fontname = Helvetica]
S
node[shape=none]
Y
}")
Target population The complete collection of observations we want to study. Defining the target population is an important and often difficult part of the study. For example, in a political poll, should the target population be all adults eligible to vote? All registered voters? All persons who voted in the last election? The choice of target population will profoundly affect the statistics that result.
— Sharon L. Lohr, Sampling: Design and Analysis, 2nd
The study participants are not a random sample of all people living with HIV … is that a problem?
It may be a problem for generalizing the results, but not for estimating valid causal effects. With complete follow-up, we can estimate the effect of the drug in the target population from which the participants came.
The participants who were lost to follow-up are not a random sample of all participants
grViz("
digraph causal{
node [shape = box]
S
node[shape=none]
X; Y; U;
subgraph U{
rankdir=TB; edge[dir=back]
S -> U
Y -> U
}
subgraph C{
rank=same;
X -> S
S -> Y [color = white]
edge[color=gray]
X -> Y
}
}")
Selection bias can occur when conditioning on \(S\) opens a non-causal path between \(X\) and \(Y\), for example when \(S\) is a collider (or a descendant of a collider) on that path.
grViz("
digraph causal{
node [shape = box]
S
node[shape=none]
X; Y; U;
subgraph U{
rankdir=TB; edge[dir=back, color = red]
S -> U
Y -> U
}
subgraph C{
rank=same;
X -> S [color = red]
S -> Y [color = white]
edge[color=gray]
X -> Y
}
}")
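The highlighted path can be reproduced in a small simulation. This is a sketch under assumed values, not an analysis from the cited papers: treatment is randomized, an unmeasured factor \(U\) raises the risk of \(Y\), both \(X\) and \(U\) lower the probability of staying in the study, and \(X\) is deliberately given no effect on \(Y\) so that any departure from a risk ratio of 1 comes from selection alone. All probabilities are made up for illustration.

```r
set.seed(2021)
n <- 200000

# Hypothetical helper: Pr(Y = 1 | X = 1) / Pr(Y = 1 | X = 0)
risk_ratio <- function(y, x) mean(y[x == 1]) / mean(y[x == 0])

x <- rbinom(n, 1, 0.5)                       # randomized treatment
u <- rbinom(n, 1, 0.5)                       # unmeasured risk factor (assumed)
y <- rbinom(n, 1, 0.1 + 0.3 * u)             # outcome depends on U only: X has no effect
s <- rbinom(n, 1, 0.9 - 0.4 * x - 0.4 * u)   # X and U both make dropout more likely

rr_full     <- risk_ratio(y, x)                  # complete follow-up
rr_selected <- risk_ratio(y[s == 1], x[s == 1])  # conditioning on S = 1

round(c(full = rr_full, selected = rr_selected), 2)
```

With these assumed values the full-cohort ratio is close to 1, while the ratio among those who remain falls well below 1 (about 0.72 in expectation): the treated who stay are disproportionately low-risk, so a drug with no effect looks protective.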
Does Zika virus infection (\(X\)) increase the risk of microcephaly (\(Y\))?
grViz("
digraph causal{
node[shape=none]
X
node [shape = box,
fontname = Helvetica]
S
node[shape=none]
Y
}")
We might assume that
Intuitively, pregnancies are either in our study if:
The already low-risk pregnancies also have lower exposure to the virus, so it looks like exposure to Zika virus is associated with microcephaly.
If people at higher risk (\(U_2\)) of the outcome \(Y\) also have more side effects (\(U_1\)), they are more likely to discontinue the drug and not be included in the study (\(S = 0\)).
grViz("
digraph causal{
node [shape = box]
S
node[shape=none]
X; Y; U1; U2
subgraph U{
rankdir=TB; edge[dir=back]
U1 -> U2
Y -> U2
}
subgraph C{
rank=same;
X -> U1 -> S
S -> Y [color = white]
edge[color=gray]
X -> Y
}
}")
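The same check works when selection depends on a descendant of the collider rather than the collider itself. In the sketch below, side effects (\(U_1\)) are caused by both the drug and the underlying risk (\(U_2\)), and only \(U_1\) drives discontinuation. As an assumption for illustration, the drug has no effect on \(Y\), and every probability is invented.

```r
set.seed(2021)
n <- 200000

# Hypothetical helper: Pr(Y = 1 | X = 1) / Pr(Y = 1 | X = 0)
risk_ratio <- function(y, x) mean(y[x == 1]) / mean(y[x == 0])

x  <- rbinom(n, 1, 0.5)                           # drug, randomized
u2 <- rbinom(n, 1, 0.5)                           # underlying risk of the outcome
u1 <- rbinom(n, 1, 0.02 + 0.45 * x + 0.45 * u2)   # side effects: caused by X and U2
s  <- rbinom(n, 1, 0.98 - 0.90 * u1)              # side effects lead to discontinuation
y  <- rbinom(n, 1, 0.1 + 0.3 * u2)                # outcome depends on U2 only

rr_full     <- risk_ratio(y, x)
rr_selected <- risk_ratio(y[s == 1], x[s == 1])

round(c(full = rr_full, selected = rr_selected), 2)
```

Although \(S\) is not itself a collider here, conditioning on it partially conditions on \(U_1\), which is enough to push the selected risk ratio below 1 under these assumptions.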
Selection is based on case status (\(Y\)). If the controls are patients with gastrointestinal disease (\(U\)), who are more likely to avoid coffee (\(X\)), coffee can appear to cause cancer.
grViz("
digraph causal{
node [shape = box]
S
node[shape=none]
X; Y; U
subgraph U{
rankdir=TB; edge[dir=back]
X -> U
S -> U
}
subgraph C{
rank=same;
edge[color=gray]
X -> Y
edge[color=black]
Y -> S
}
}")
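A case-control version can be sketched the same way, using the exposure odds ratio because sampling depends on \(Y\). The assumptions are loud ones: coffee has no effect on cancer, people with gastrointestinal disease drink less coffee, all cases are included, and controls are drawn only from people with gastrointestinal disease. The prevalences are invented.

```r
set.seed(2021)
n <- 200000

# Hypothetical helpers: exposure odds ratio comparing cases (y = 1) with non-cases
odds       <- function(p) p / (1 - p)
odds_ratio <- function(y, x) odds(mean(x[y == 1])) / odds(mean(x[y == 0]))

u <- rbinom(n, 1, 0.2)             # gastrointestinal disease
x <- rbinom(n, 1, 0.7 - 0.5 * u)   # coffee drinking, less common with GI disease
y <- rbinom(n, 1, 0.05)            # cancer, independent of coffee by construction

# Selection: every case, plus controls recruited only among GI patients
s <- as.integer(y == 1 | u == 1)

or_full     <- odds_ratio(y, x)                  # whole population
or_selected <- odds_ratio(y[s == 1], x[s == 1])  # the case-control sample

round(c(full = or_full, selected = or_selected), 2)
```

The population odds ratio stays near 1, but the sampled controls drink far less coffee than the cases do, so the study odds ratio lands far above 1.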
If depression (\(U_1\)) causes \(X\) and \(S\), and smoking (\(U_2\)) causes \(S\) and \(Y\), selection bias (“\(M\)-bias”) can result.
grViz("
digraph causal{
node [shape = box]
S
node[shape=none]
X; Y; U1; U2
subgraph C{
rank=same;
edge[color=gray]
X -> Y
}
subgraph U{
rankdir=TB; edge[dir=back]
X -> U1
Y -> U2
S -> U1
S -> U2
}
}")
— (Hernán, Hernández-Díaz, and Robins 2004) and (Smith 2020)
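The \(M\)-structure can also be simulated. In this sketch neither \(X\) nor \(Y\) affects selection directly; \(S\) depends only on \(U_1\) and \(U_2\). The null effect of \(X\) on \(Y\) and all probabilities are assumptions made for illustration.

```r
set.seed(2021)
n <- 200000

# Hypothetical helper: Pr(Y = 1 | X = 1) / Pr(Y = 1 | X = 0)
risk_ratio <- function(y, x) mean(y[x == 1]) / mean(y[x == 0])

u1 <- rbinom(n, 1, 0.5)                           # e.g. depression
u2 <- rbinom(n, 1, 0.5)                           # e.g. smoking
x  <- rbinom(n, 1, 0.2 + 0.6 * u1)                # exposure caused by U1 only
y  <- rbinom(n, 1, 0.1 + 0.4 * u2)                # outcome caused by U2 only
s  <- rbinom(n, 1, 0.05 + 0.45 * u1 + 0.45 * u2)  # selection caused by U1 and U2

rr_full     <- risk_ratio(y, x)
rr_selected <- risk_ratio(y[s == 1], x[s == 1])

round(c(full = rr_full, selected = rr_selected), 2)
```

Among the selected, \(U_1\) and \(U_2\) become negatively associated, which carries over to \(X\) and \(Y\): the selected risk ratio drops below 1 even though \(X\) and \(Y\) are independent in the full population.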
OR
Sensitivity analysis!
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nguyen (2021, Aug. 25). HaiBiostat: Selection bias. Retrieved from https://hai-mn.github.io/posts/2021-08-25-selection-bias/
BibTeX citation
@misc{nguyen2021selection,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Selection bias},
  url = {https://hai-mn.github.io/posts/2021-08-25-selection-bias/},
  year = {2021}
}