library(tidyverse)
library(patchwork)
set.seed(11)
## Fork
# n ... number of observations
n <- 500
# d ... binary confounder
d <- rbinom(n, 1, 0.5)
x <- 1.5 * d + rnorm(n)
y <- 0.4 + 2 * d + rnorm(n)
data_fork <- data.frame(x, y, d = factor(d, levels = c(0, 1), labels = c("Yes", "No")))
plt_fork <- ggplot(data_fork, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  ggtitle("Relation due to omitted confounder")
## Pipe
set.seed(11)
x <- 1 * rnorm(n)
z <- rbinom(n, 1, boot::inv.logit(2 * x + rnorm(n)))
y <- 2 * z + rnorm(n)
data_pipe <- data.frame(x, z = factor(z, levels = c(0, 1), labels = c("Yes", "No")), y)
plt_pipe <- ggplot(data_pipe, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  ggtitle("Relation through mediator")
## Collider
set.seed(11)
x <- rnorm(n)
y <- rnorm(n)
a <- rbinom(n, 1, boot::inv.logit(9 * x - 9 * y + rnorm(n)))
data_collider <- data.frame(x, y, a = factor(a, levels = c(0, 1), labels = c("No", "Yes")))
data_collider$x_a <- resid(lm(x ~ 0 + a))
data_collider$y_a <- resid(lm(y ~ 0 + a))
plt_collider <- ggplot(data_collider, aes(x_a, y_a)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(x = "x", y = "y") +
  theme(legend.position = "top") +
  ggtitle("Relation due to included collider")

plt_fork + plt_pipe + plt_collider
Causal Pitchfork - The Data
Introduction
This document deals with a fundamental question of causal inference: which variables should be included in a causal model (see Cinelli, Forney, and Pearl 2020)? To answer this question, two points need to be clear:
- In general, each causal model investigates the causal effect of only a single independent variable, \(x_k\), on the dependent variable \(y\). The coefficients associated with all other variables, \(x_{j\neq k}\), cannot (automatically) be interpreted as causal relationships. Since regression coefficients are commonly presented in a single table, it is often unclear to the reader which coefficients can be interpreted as causal (see Westreich and Greenland 2013).
- Statistical significance (or any other statistical test) does not tell us anything about the causal model. To illustrate this, the following figure shows three statistically significant relationships between the variables \(x\) and \(y\) (all t-stats \(> 9\)). However, by construction there is no causal relationship between them in two of these examples. Even more concerning: in one case the exclusion of a control variable leads to spurious correlation (leftmost plot), while in the other the inclusion of a control variable does the same (rightmost plot). The short sketch below reruns these three regressions.
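The sketch assumes the data frames data_fork, data_pipe, and data_collider created in the code chunk above and simply reproduces the three estimates behind the figure:

## Fork: the association is driven entirely by the omitted confounder d
broom::tidy(lm(y ~ x, data_fork))
## Pipe: a real effect of x on y, transmitted through the mediator z
broom::tidy(lm(y ~ x, data_pipe))
## Collider: the association only appears because we condition on a
broom::tidy(lm(y ~ x + a, data_collider))

All three \(x\) coefficients are highly significant, yet only the pipe reflects a causal effect of \(x\) on \(y\).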
The Fork (Good control)
set.seed(42)
library(ggdag)
library(gt)
library(dagitty)
confounder <- dagify(x ~ d, y ~ d,
  coords = list(
    x = c(x = 1, y = 2, d = 1.5),
    y = c(x = 1, y = 2, d = 2)
  )
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "d", "Confounder", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(
    legend.title = element_blank(),
    legend.position = "top"
  )

confounder
A typical dataset with a confounder will exhibit correlation between the treatment \(X\) and outcome \(y\). This relationship is not causal! In the example below we have a binary confounder \(d\) (Yes/No) that is d-connected with both \(X\) and \(y\); \(X\) and \(y\) themselves are connected only through \(d\) and become d-separated once we condition on \(d\).
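These separation statements can be checked programmatically. The following is a small sketch that rebuilds the fork DAG with dagify (from the ggdag package loaded above) and queries dagitty directly:

fork_dag <- dagify(x ~ d, y ~ d)
## Without conditioning, x and y are associated through the open path x <- d -> y
dseparated(fork_dag, "x", "y")
## Conditioning on the confounder d blocks that path
dseparated(fork_dag, "x", "y", "d")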
set.seed(11)
# n ... number of observations
n <- 500
# d ... binary confounder
d <- rbinom(n, 1, 0.5)
x <- 1.5 * d + rnorm(n)
y <- 0.4 + 2 * d + rnorm(n)
data_fork <- data.frame(x, y, d = factor(d, levels = c(0, 1), labels = c("Yes", "No")))

ggplot(data_fork, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

lm(y ~ x) |>
  broom::tidy() |>
  gt() |>
  fmt_number(
    columns = estimate:p.value,
    decimals = 4
  )
| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | 1.0086   | 0.0671    | 15.0351   | 0.0000  |
| x           | 0.4672   | 0.0464    | 10.0576   | 0.0000  |
However, once we take the confounder into account, the association vanishes, which reflects the lack of a causal relationship in this case (note that, for simplicity, the regression lines in the plot are not the same as the model output shown).
# options(scipen = 10)
ggplot(data_fork, aes(x, y, color = d)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(legend.position = "top")

lm(y ~ x * d, data_fork) |>
  broom::tidy() |>
  gt() |>
  fmt_number(
    columns = estimate:p.value,
    decimals = 4
  )
| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | 0.4426   | 0.0633    | 6.9879    | 0.0000  |
| x           | 0.0526   | 0.0617    | 0.8519    | 0.3947  |
| dNo         | 1.9350   | 0.1364    | 14.1815   | 0.0000  |
| x:dNo       | −0.0616  | 0.0914    | −0.6741   | 0.5006  |
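dagitty can also tell us which controls are required here. As a small sketch (the fork DAG is rebuilt with an explicit arrow from \(x\) to \(y\) representing the effect we want to estimate), the only minimal adjustment set is {d}, which is exactly why the confounder is a good control:

adjustmentSets(dagify(y ~ x + d, x ~ d), exposure = "x", outcome = "y")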
The Pipe (Bad control)
med <- dagify(z ~ x, y ~ z,
  coords = list(x = c(x = 1, z = 1.5, y = 2), y = c(x = 1, y = 1, z = 1))
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "z", "Mediator", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(
    legend.title = element_blank(),
    legend.position = "top"
  )

med
If we have a mediator in our data, the picture looks very similar to the previous one. In addition, taking the mediator into account has a similar effect: it removes the association between \(X\) and \(y\). However, in this case that is not what we want, since \(X\) and \(y\) are d-connected: \(X\) causes \(y\) through \(z\) (note that, for simplicity, the regression lines in the second plot are not the same as the model output shown).
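The same kind of check as for the fork works here, as a sketch on the rebuilt chain DAG:

pipe_dag <- dagify(z ~ x, y ~ z)
## x and y are d-connected: the causal chain x -> z -> y is open
dseparated(pipe_dag, "x", "y")
## Conditioning on the mediator z blocks the chain, and with it the effect we want to measure
dseparated(pipe_dag, "x", "y", "z")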
set.seed(11)
x <- 1 * rnorm(n)
z <- rbinom(n, 1, boot::inv.logit(2 * x + rnorm(n)))
y <- 2 * z + rnorm(n)
data_pipe <- data.frame(x, z = factor(z, levels = c(0, 1), labels = c("Yes", "No")), y)

ggplot(data_pipe, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

lm(y ~ x) |>
  broom::tidy() |>
  gt() |>
  fmt_number(
    columns = estimate:p.value,
    decimals = 4
  )
| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | 0.9028   | 0.0586    | 15.4170   | 0.0000  |
| x           | 0.5804   | 0.0593    | 9.7944    | 0.0000  |
ggplot(data_pipe, aes(x, y, color = z)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  theme(legend.position = "top")

lm(y ~ x * z) |>
  broom::tidy() |>
  gt() |>
  fmt_number(
    columns = estimate:p.value,
    decimals = 4
  )
| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | −0.0393  | 0.0755    | −0.5205   | 0.6029  |
| x           | −0.0185  | 0.0787    | −0.2349   | 0.8143  |
| z           | 2.0562   | 0.1137    | 18.0855   | 0.0000  |
| x:z         | −0.0324  | 0.1146    | −0.2826   | 0.7776  |
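Asking dagitty for the adjustment sets makes the same point (again a sketch on the rebuilt chain DAG): the minimal adjustment set is empty, i.e. no control is needed to estimate the total effect of \(x\) on \(y\), and the mediator \(z\) is a bad control.

adjustmentSets(dagify(z ~ x, y ~ z), exposure = "x", outcome = "y")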
The Collider (Bad control)
dagify(a ~ x, a ~ y,
  coords = list(x = c(x = 1, y = 2, a = 1.5), y = c(x = 1, y = 0, a = 0))
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "a", "Collider", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(
    legend.title = element_blank(),
    legend.position = "top"
  )
The collider is a special case. There is no association between \(X\) and \(y\) as long as we do not account for the collider in the model. However, by accounting for the collider we implicitly learn about \(y\) as well (since we use \(X\) as a predictor). Because the collider is caused by both \(X\) and \(y\), we can figure out what \(y\) must be once we know \(X\) and the collider, similar to solving a simple equation you would see in high school.
set.seed(11)
x <- rnorm(n)
y <- rnorm(n)
a <- rbinom(n, 1, boot::inv.logit(9 * x - 9 * y + rnorm(n)))
data_collider <- data.frame(x, y, a = factor(a, levels = c(0, 1), labels = c("No", "Yes")))

ggplot(data_collider, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()

lm(y ~ x) |>
  broom::tidy() |>
  gt() |>
  fmt_number(columns = estimate:p.value, decimals = 4)
| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | 0.0203   | 0.0449    | 0.4519    | 0.6515  |
| x           | 0.0307   | 0.0455    | 0.6754    | 0.4997  |
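The mechanism becomes visible once we split the data by the collider. As a small sketch using data_collider from above: the overall correlation between \(x\) and \(y\) is essentially zero, but within each level of \(a\) a clear association appears.

## Overall correlation: close to zero
cor(data_collider$x, data_collider$y)
## Correlation within each level of the collider a
data_collider |>
  group_by(a) |>
  summarise(cor_xy = cor(x, y))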
data_collider$x_a <- resid(lm(x ~ 0 + a))
data_collider$y_a <- resid(lm(y ~ 0 + a))

ggplot(data_collider, aes(x_a, y_a)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(x = "x after accounting for a", y = "y after accounting for a") +
  theme(legend.position = "top")

lm(y ~ x + a, data_collider) |>
  broom::tidy() |>
  gt() |>
  fmt_number(columns = estimate:p.value, decimals = 4)
| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | 0.8614   | 0.0538    | 16.0142   | 0.0000  |
| x           | 0.5328   | 0.0422    | 12.6293   | 0.0000  |
| aYes        | −1.6663  | 0.0834    | −19.9822  | 0.0000  |
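The d-separation check mirrors the fork, with the roles reversed (a sketch on the rebuilt collider DAG): \(x\) and \(y\) start out d-separated, and conditioning on the collider opens the path between them.

collider_dag <- dagify(a ~ x, a ~ y)
## No open path between x and y: they are d-separated
dseparated(collider_dag, "x", "y")
## Conditioning on the collider a opens the path x -> a <- y
dseparated(collider_dag, "x", "y", "a")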
Appendix: How the sausage is made
The fork, mediator, and collider were generated as binary variables to make visualization easier. Binary variables can be drawn from a so-called Bernoulli distribution, which is a special case of the binomial distribution with size = 1. The distribution takes the probability of getting a \(1\) as input.
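As a minimal sketch of what this means in R (not part of the data-generating code below): with size = 1, rbinom() returns zeros and ones, and the share of ones approaches the supplied probability.

draws <- rbinom(10000, size = 1, prob = 0.3)
table(draws)
mean(draws) # close to 0.3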
The Fork
## Make code reproducible
set.seed(11)
## Number of observations
n <- 1500
## Random draw from Bernoulli with p(1) = 0.5, p(0) = 0.5
d <- rbinom(n, 1, 0.5)
## X is caused by d
x <- 2 * d + rnorm(n)
## y is caused by d
y <- 0.4 + 2.5 * d + rnorm(n)
fork <- data.frame(x, y, d = factor(d,
  levels = c(0, 1),
  labels = c("No", "Yes")
))

ggplot(fork, aes(x, y, color = d)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "top")
The Pipe
## Generate random X
x <- rnorm(n)
## inv.logit ensures that values are between 0 and 1
ggplot(data.frame()) +
  stat_function(fun = boot::inv.logit, xlim = c(-10, 10)) +
  theme_minimal() +
  labs(title = "Inverse Logit function", x = "x", y = "inv.logit(x)")
## z is caused by X
z <- rbinom(n, 1, boot::inv.logit(2 * x + rnorm(n)))
## y is caused by z
y <- 2 * z + rnorm(n)
pipe <- data.frame(x, y, z = factor(z,
  levels = c(0, 1),
  labels = c("Yes", "No")
))

ggplot(pipe, aes(x, y, color = z)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "top")
The Collider
## Generate random x
x <- rnorm(n)
## Generate random y
y <- rnorm(n)
## a is caused by both X and y
a <- rbinom(n, 1, boot::inv.logit(9 * x - 9 * y + rnorm(n)))
collider <- data.frame(x, y, a = factor(a,
  levels = c(0, 1),
  labels = c("No", "Yes")
))

ggplot(collider, aes(x, y)) +
  geom_point() +
  theme_minimal()
In order to get the partial correlation of \(X\) and \(y\) after accounting for \(a\), we first regress both \(X\) and \(y\) on \(a\) and use the unexplained parts (the residuals) in the plot. This is equivalent to a regression that has both \(X\) and \(a\) as explanatory variables.
collider$x_a <- residuals(lm(x ~ 0 + a))
collider$y_a <- residuals(lm(y ~ 0 + a))

ggplot(collider, aes(x_a, y_a)) +
  geom_point() +
  theme_minimal() +
  labs(x = "x after accounting for a", y = "y after accounting for a")