Selection bias

Causal Inference DiagrammeR

Understanding: definition, examples;
Applying in DAG;
General solution;

Hai Nguyen
August 25, 2021

Selection bias occurs when some part of the target population is not in the sampled population, or, more generally, when some population units are sampled at a different rate than intended by the investigator. A good sample will be as free from selection bias as possible.

— Sharon L. Lohr, Sampling: Design and Analysis 2nd

Selection bias arises in many different fields, and the label is applied so broadly that almost any problem can end up being called “selection bias.”

Some other names refer to “selection bias”

Selection bias: set up

\(X\): exposure of interest
\(Y\): outcome of interest
\(S\): selection into study (S = 1 if selected)

We can estimate \[ RR^s_{XY} = \frac{Pr(Y = 1 \mid X = 1, S = 1)}{Pr(Y = 1 \mid X = 0, S = 1)} \]

which may not equal \(RR^t_{XY}\)

\(s\): subject to bias
\(t\): true

What is \(RR^t_{XY}\)? \(RR^t_{XY}\) is the true causal effect in the target population.

We will assume that if we could estimate \(\frac{Pr(Y = 1 \mid X = 1)}{Pr(Y = 1 \mid X = 0)}\) in the whole target population, we would obtain \(RR^t_{XY}\).
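To make the difference between \(RR^s_{XY}\) and \(RR^t_{XY}\) concrete, here is a minimal sketch in R on a made-up target population. The counts and the selection probabilities `p_sel` are assumptions chosen purely for illustration, not data from any study; in this setup, exposed people who develop the outcome are the most likely to be selected.

```r
# Hypothetical target population, stored as counts by exposure X and outcome Y.
pop <- data.frame(
  X = c(1, 1, 0, 0),
  Y = c(1, 0, 1, 0),
  n = c(200, 800, 100, 900)
)

# Assumed selection probabilities Pr(S = 1 | X, Y); illustrative values only.
pop$p_sel <- c(0.9, 0.5, 0.5, 0.5)

# Expected number of people in each cell who end up in the study (S = 1).
pop$n_sel <- pop$n * pop$p_sel

# Risk ratio Pr(Y = 1 | X = 1) / Pr(Y = 1 | X = 0) from cell counts.
risk_ratio <- function(X, Y, n) {
  risk1 <- sum(n[X == 1 & Y == 1]) / sum(n[X == 1])
  risk0 <- sum(n[X == 0 & Y == 1]) / sum(n[X == 0])
  risk1 / risk0
}

rr_t <- risk_ratio(pop$X, pop$Y, pop$n)      # whole target population
rr_s <- risk_ratio(pop$X, pop$Y, pop$n_sel)  # selected units only

print(c(RR_t = rr_t, RR_s = rr_s))
```

With these numbers the ratio is 2 in the target population and about 3.1 among the selected, so the quantity we can estimate differs from the one we want.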

When does selection bias happen? Examples

(Hernán, Hernández-Díaz, and Robins 2004)

Consider a randomized trial of anti-retroviral therapy (\(X\)) among people living with HIV, with a goal of preventing the development of AIDS (\(Y\))

library(DiagrammeR) #grViz
grViz("
digraph causal{

node[shape=none]
X 

node [shape = box,
      fontname = Helvetica]
S

node[shape=none]
Y
}")

What’s the target population?

Target population: The complete collection of observations we want to study. Defining the target population is an important and often difficult part of the study. For example, in a political poll, should the target population be all adults eligible to vote? All registered voters? All persons who voted in the last election? The choice of target population will profoundly affect the statistics that result.
— Sharon L. Lohr, Sampling: Design and Analysis, 2nd

The study participants are not a random sample of all people living with HIV … is that a problem?

Not when it comes to estimating valid causal effects. With complete follow-up, we can estimate the effect of the drug in the target population from which the participants came.

Why not?

The participants who were lost to follow-up are not a random sample of all participants

grViz("
digraph causal{

node [shape = box]
S

node[shape=none]
X; Y; U;

subgraph U{
  rankdir=TB; edge[dir=back]
  S -> U
  Y -> U
}

subgraph C{
  rank=same;
  X -> S
  S -> Y [color = white]
  edge[color=gray]
  X -> Y 
}
}")

Conditioning on a collider

Selection bias can occur when a non-causal X-Y path is opened by conditioning on \(S\)

grViz("
digraph causal{

node [shape = box]
S

node[shape=none]
X; Y; U;

subgraph U{
  rankdir=TB; edge[dir=back, color = red]
  S -> U 
  Y -> U
}

subgraph C{
  rank=same;
  X -> S [color = red]
  S -> Y [color = white]
  edge[color=gray]
  X -> Y 
}
}")
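The collider structure in the diagram can be checked with a small simulation. This is only a sketch under assumed parameters: `U` stands for an unmeasured factor (say, disease severity) that affects both staying in the study and the outcome, treatment `X` affects staying in the study, and `X` is given no effect on `Y` at all, so any association among those who remain comes from conditioning on `S`.

```r
set.seed(2021)
n <- 200000

X <- rbinom(n, 1, 0.5)              # randomized treatment
U <- rbinom(n, 1, 0.5)              # unmeasured cause of S and Y (assumed)
Y <- rbinom(n, 1, 0.1 + 0.3 * U)    # outcome depends on U only: true RR = 1

# Staying in the study (S = 1) is less likely for the treated and for U = 1.
S <- rbinom(n, 1, 0.9 - 0.4 * X - 0.4 * U)

# Risk ratio comparing X = 1 with X = 0.
rr <- function(y, x) mean(y[x == 1]) / mean(y[x == 0])

rr_all <- rr(Y, X)                  # everyone randomized
rr_sel <- rr(Y[S == 1], X[S == 1])  # only those not lost to follow-up

print(c(RR_all = rr_all, RR_selected = rr_sel))
```

Under these made-up parameters the ratio in the full cohort should be close to 1, while the ratio among those who remain is roughly 0.72 in expectation: the treated participants who stay are disproportionately those with `U = 0`, which makes a drug with no effect look protective.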

Common structure

Does Zika virus infection (\(X\)) increase the risk of microcephaly (\(Y\))?

grViz("
digraph causal{

node[shape=none]
X 

node [shape = box,
      fontname = Helvetica]
S

node[shape=none]
Y
}")

Is the selected group different?

We might assume that

Conditioning on a collider

Intuitively, pregnancies are either in our study if:

The already low-risk pregnancies also have lower exposure to the virus, so within the study it looks like exposure to Zika virus is associated with microcephaly.

A note about confounding

More examples

If people more at risk (\(U_2\)) of outcome \(Y\) also have more side effects (\(U_1\)), they are more likely to discontinue the drug and not be included in the study (\(S = 0\)).

grViz("
digraph causal{

node [shape = box]
S

node[shape=none]
X; Y; U1; U2

subgraph U{
  rankdir=TB; edge[dir=back]
  U1 -> U2 
  Y -> U2
}

subgraph C{
  rank=same;
  X -> U1 -> S
  S -> Y [color = white]
  
  edge[color=gray]
  X -> Y 
}
}")

Selection is based on case status (\(Y\)). If controls with gastrointestinal disease (\(U\)) are used, the fact that they are more likely to avoid coffee (\(X\)) can make coffee look like it causes cancer.

grViz("
digraph causal{

node [shape = box]
S

node[shape=none]
X; Y; U

subgraph U{
  rankdir=TB; edge[dir=back]
  X -> U 
  S -> U
}

subgraph C{
  rank=same;
  edge[color=gray]
  X -> Y 
  edge[color=black]
  Y -> S
}
}")
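A rough simulation of this case-control structure is sketched below. Every number is an assumption for illustration: coffee (`X`) has no effect on cancer (`Y`), gastrointestinal disease (`U`) lowers coffee drinking, all cases are selected, and controls are far more likely to be selected if they have gastrointestinal disease.

```r
set.seed(2021)
n <- 500000

U <- rbinom(n, 1, 0.2)             # gastrointestinal disease
X <- rbinom(n, 1, 0.7 - 0.4 * U)   # coffee drinking, less common when U = 1
Y <- rbinom(n, 1, 0.02)            # cancer, independent of coffee by design

# Selection: every case is included; controls are sampled mostly
# from people with gastrointestinal disease (assumed probabilities).
p_sel <- ifelse(Y == 1, 1, ifelse(U == 1, 0.10, 0.01))
S <- rbinom(n, 1, p_sel)

# Exposure odds ratio comparing cases (y = 1) with controls (y = 0).
odds <- function(p) p / (1 - p)
odds_ratio <- function(y, x) odds(mean(x[y == 1])) / odds(mean(x[y == 0]))

or_all <- odds_ratio(Y, X)                  # whole population
or_sel <- odds_ratio(Y[S == 1], X[S == 1])  # selected cases and controls

print(c(OR_all = or_all, OR_selected = or_sel))
```

In the whole population the odds ratio should sit near 1, but among the selected it is about 2.3 in expectation under these parameters, because the controls under-represent coffee drinkers.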

If depression (\(U_1\)) causes \(X\) and \(S\), and smoking (\(U_2\)) causes \(S\) and \(Y\), selection bias (“\(M\)-bias”) can result.

grViz("
digraph causal{

node [shape = box]
S

node[shape=none]
X; Y; U1; U2

subgraph C{
  rank=same;
  edge[color=gray]
  X -> Y 
}

subgraph U{
  rankdir=TB; edge[dir=back]
  X -> U1
  Y -> U2
  S -> U1
  S -> U2
}
}")
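The \(M\)-bias structure can be sketched the same way. The parameters below are assumptions, not estimates: depression (`U1`) raises the probability of exposure and of selection, smoking (`U2`) raises the probability of selection and of the outcome, and `X` has no effect on `Y`.

```r
set.seed(2021)
n <- 500000

U1 <- rbinom(n, 1, 0.3)              # depression
U2 <- rbinom(n, 1, 0.3)              # smoking
X  <- rbinom(n, 1, 0.2 + 0.5 * U1)   # exposure depends on U1 only
Y  <- rbinom(n, 1, 0.1 + 0.4 * U2)   # outcome depends on U2 only: true RR = 1

# Selection depends on both U1 and U2, so S is a collider on X <- U1 -> S <- U2 -> Y.
S <- rbinom(n, 1, 0.05 + 0.45 * U1 + 0.45 * U2)

# Risk ratio comparing X = 1 with X = 0.
rr <- function(y, x) mean(y[x == 1]) / mean(y[x == 0])

rr_all <- rr(Y, X)                   # whole population
rr_sel <- rr(Y[S == 1], X[S == 1])   # selected units only

print(c(RR_all = rr_all, RR_selected = rr_sel))
```

Among the selected, `U1` and `U2` become negatively associated, so the risk ratio drops to about 0.8 in expectation under these made-up parameters, even though it is close to 1 in the whole population.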

(Hernán, Hernández-Díaz, and Robins 2004) and (Smith 2020)

What to do?

OR

Sensitivity analysis!

References

Hernán, Miguel A., Sonia Hernández-Díaz, and James M. Robins. 2004. “A Structural Approach to Selection Bias.” Epidemiology (Cambridge, Mass.) 15 (5): 615–25. https://doi.org/10.1097/01.ede.0000135174.63482.43.
Smith, Louisa H. 2020. “Selection Mechanisms and Their Consequences: Understanding and Addressing Selection Bias.” Current Epidemiology Reports 7 (4): 179–89. https://doi.org/10.1007/s40471-020-00241-6.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Nguyen (2021, Aug. 25). HaiBiostat: Selection bias. Retrieved from https://hai-mn.github.io/posts/2021-08-25-selection-bias/

BibTeX citation

@misc{nguyen2021selection,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Selection bias},
  url = {https://hai-mn.github.io/posts/2021-08-25-selection-bias/},
  year = {2021}
}