Necessary to decide which variables to use in model
“d” stands for “directional”
Usually we are dealing with more than two variables
Complication: causation flows only in the direction of the arrows, but association can also flow against them
Code
library(tidyverse)
library(ggdag)

dagify(
  z ~ x, y2 ~ z,  # pipe: x -> z -> y2
  a ~ x, a ~ y3,  # collider: x -> a <- y3
  x ~ d, y1 ~ d,  # fork: x <- d -> y1
  coords = list(
    x = c(x = 1, z = 1.5, y2 = 2, a = 1.5, y3 = 2, d = 1.5, y1 = 2),
    y = c(x = 1, y2 = 1, z = 1, a = 0, y3 = 0, d = 2, y1 = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggdag(text_size = 3, node_size = 5) +
  geom_dag_edges() +
  theme_dag() +
  labs(
    title = "Causal Pitchfork",
    subtitle = "x is d-connected to y2 (pipe) and y1 (fork) but d-separated from y3 (collider)"
  ) +
  theme(title = element_text(size = 8))
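The d-connection statements for this DAG can be checked mechanically. A minimal sketch, assuming the dagitty package is installed; the edges are restated in dagitty's own string syntax, and the object name `g` is introduced here only for illustration:

```r
library(dagitty)

# Same edges as the pitchfork DAG, written in dagitty syntax
g <- dagitty("dag {
  x -> z
  z -> y2
  x -> a
  y3 -> a
  d -> x
  d -> y1
}")

# The last argument is the conditioning set; c() means "condition on nothing"
pipe_open       <- dconnected(g, "x", "y2", c())  # x -> z -> y2
fork_open       <- dconnected(g, "x", "y1", c())  # x <- d -> y1 (non-causal)
collider_closed <- dseparated(g, "x", "y3", c())  # x -> a <- y3

# Conditioning changes which paths are open
fork_blocked    <- dseparated(g, "x", "y1", c("d"))
pipe_blocked    <- dseparated(g, "x", "y2", c("z"))
collider_opened <- dconnected(g, "x", "y3", c("a"))

c(pipe_open, fork_open, collider_closed,
  fork_blocked, pipe_blocked, collider_opened)
```

Each of the six checks corresponds to one of the three elementary structures (pipe, fork, collider), once without and once with conditioning on the middle node.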
Analyzing DAGs: Fork
Good Control
Code
med <- dagify(
  x ~ d, y1 ~ d,
  coords = list(
    x = c(x = 1, z = 1.5, y = 2, a = 1.5, b = 2, d = 1.5, y1 = 2),
    y = c(x = 1, y = 1, z = 1, a = 0, b = 0, d = 2, y1 = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name == "d", "Confounder", "variables of interest")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
med
d causes both x and y1
Paths that start with an arrow pointing into x are called “back-door” paths
Eliminated by randomized experiment! Why?
Controlling for d “blocks” the non-causal association between x and y1 (the path x \(\leftarrow\) d \(\rightarrow\) y1)
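A small simulation can make the fork concrete. This is only a sketch under assumed values: the sample size, the effect sizes of 0.8, and the normal errors are arbitrary choices, and x has no causal effect on y1 by construction.

```r
set.seed(1)
n <- 10000

# Fork: d causes both x and y1; there is no arrow from x to y1
d  <- rnorm(n)
x  <- 0.8 * d + rnorm(n)
y1 <- 0.8 * d + rnorm(n)

# Without d, the open back-door path shows up as a clearly non-zero slope
naive <- unname(coef(lm(y1 ~ x))["x"])

# With d in the model, the back-door path is blocked and the slope is near zero
adjusted <- unname(coef(lm(y1 ~ x + d))["x"])

round(c(naive = naive, adjusted = adjusted), 3)
```

The naive slope reflects only the shared cause d, which is why d counts as a good control here.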
Analyzing DAGs: Pipe
Bad Control (possibly use mediation analysis)
Code
med <- dagify(
  z ~ x, y2 ~ z,
  coords = list(
    x = c(x = 1, z = 1.5, y2 = 2),
    y = c(x = 1, y2 = 1, z = 1)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name == "z", "Mediator", "variables of interest")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
med
x causes y2 through z
Controlling for z blocks the causal association x \(\rightarrow\) y2
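The same kind of simulation illustrates why the mediator is a bad control when the total effect is of interest. The effect sizes (0.8 on each arrow, so a total effect of 0.64) and the sample size are assumptions made for this sketch.

```r
set.seed(2)
n <- 10000

# Pipe: x affects y2 only through z
x  <- rnorm(n)
z  <- 0.8 * x + rnorm(n)
y2 <- 0.8 * z + rnorm(n)

# Total effect of x on y2: close to 0.8 * 0.8 = 0.64
total <- unname(coef(lm(y2 ~ x))["x"])

# Adding the mediator blocks the causal path; the slope of x drops to about zero
with_mediator <- unname(coef(lm(y2 ~ x + z))["x"])

round(c(total = total, with_mediator = with_mediator), 3)
```

A near-zero coefficient in the second model does not mean x is irrelevant; it only means that all of its effect runs through z.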
Analyzing DAGs: Collider
Bad control
Code
dagify(
  a ~ x, a ~ y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5),
    y = c(x = 1, y = 0, a = 0)
  )
) |>
  tidy_dagitty() |>
  mutate(fill = ifelse(name == "a", "Collider", "variables of interest")) |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 7, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "top")
x & y cause a
x & y are d-separated and uncorrelated
Adding a to the model introduces a spurious correlation between x & y
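Collider bias is easy to reproduce with simulated data. In this sketch x and y are generated independently, and a is their sum plus noise; the coefficients and the sample size are assumed values chosen for illustration.

```r
set.seed(3)
n <- 10000

# Collider: x and y are independent, both cause a
x <- rnorm(n)
y <- rnorm(n)
a <- x + y + rnorm(n)

# Without the collider there is no association between x and y
without_a <- unname(coef(lm(y ~ x))["x"])

# Conditioning on the collider opens the path x -> a <- y
with_a <- unname(coef(lm(y ~ x + a))["x"])

round(c(without_a = without_a, with_a = with_a), 3)
```

With these values the induced slope is clearly negative: for a fixed value of a, a larger x implies a smaller y.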
Exercise
Which variables should be included?
Effect of x on y
Effect of z on y
Code
library(ggdag)
library(dagitty)
library(tidyverse)

dagify(
  y ~ n + z + b + c,
  x ~ z + a + c,
  n ~ x,
  z ~ a + b,
  exposure = "x",
  outcome = "y",
  coords = list(
    x = c(n = 2, x = 1, y = 3, a = 1, z = 2, c = 2, b = 3),
    y = c(x = 2, y = 2, a = 3, z = 3, c = 1, b = 3, n = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggdag(text_size = 8, node_size = 12) +
  geom_dag_edges() +
  theme_dag()
library(ggpubr)

# M-bias: a is a collider between the unobserved U1 and U2
p1 <- dagify(
  y ~ x + U2, a ~ U1 + U2, x ~ U1,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, b = 1.5, U1 = 1, U2 = 2),
    y = c(x = 1, y = 1, a = 1.5, b = 0, U1 = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name %in% c("U1", "U2"), "Unobserved", "Observed")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "M-Bias")

# Post-treatment bias: a lies on the path from x to y and shares the unobserved U with y
p2 <- dagify(
  y ~ a + U, a ~ x + U,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, b = 1.5, U = 1.7, U2 = 2),
    y = c(x = 1, y = 1, a = 1, b = 0, U = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  mutate(fill = ifelse(name %in% c("U"), "Unobserved", "Observed")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12, aes(color = fill)) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Post-treatment Bias")

ggarrange(p1, p2)
Common bad controls
Code
# Selection bias: a is caused by both x and y
p1 <- dagify(
  y ~ x, a ~ x + y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, b = 1.5, U1 = 1, U2 = 2),
    y = c(x = 1, y = 1, a = 1.5, b = 0, U1 = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Selection Bias")

# Case-control bias: a is a descendant of the outcome y
p2 <- dagify(
  y ~ x, a ~ y,
  coords = list(
    x = c(x = 1, y = 2, a = 1.5, b = 1.5, U1 = 1, U2 = 2),
    y = c(x = 1, y = 1, a = 1.5, b = 0, U1 = 2, U2 = 2)
  )
) %>%
  tidy_dagitty() %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point(size = 12) +
  geom_dag_edges(show.legend = FALSE) +
  geom_dag_text() +
  theme_dag() +
  theme(legend.title = element_blank(), legend.position = "bottom") +
  labs(title = "Case-control Bias")

ggarrange(p1, p2)
Intelligence, education, income
Case-control study: observations are selected ex post, conditional on the outcome. Ex.: Smoking \(\rightarrow\) lung cancer
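The case-control structure (y ~ x, a ~ y) can also be simulated. The sketch below assumes a true effect of 0.8 and a noisy descendant a of the outcome; all numbers are illustrative and not taken from a real study.

```r
set.seed(4)
n <- 10000

# x causes y; a is a noisy consequence of the outcome y
x <- rnorm(n)
y <- 0.8 * x + rnorm(n)
a <- y + rnorm(n)

# Correct model: recovers the effect of x on y (about 0.8)
correct <- unname(coef(lm(y ~ x))["x"])

# Controlling for the descendant of y absorbs part of the variation in y,
# which biases the estimated effect of x toward zero
biased <- unname(coef(lm(y ~ x + a))["x"])

round(c(correct = correct, biased = biased), 3)
```

Under these assumptions the biased estimate is roughly half the true effect, even though a plays no role in how x affects y.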
Exercise
Prepare a short presentation of a (potential) DAG for your thesis
Moderation
An effect of variable \(x\) on outcome \(y\) is moderated if it depends on another variable \(z\) in any way (strength, sign, …)
In regression analysis we can test for moderation by including interactions (product of two variables) in the model
E.g., the association of flipper length and body mass is moderated by the species
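Written out for a single moderator, the interaction model and the implied effect of \(x\) look as follows. The coefficient names \(\beta_0, \dots, \beta_3\) and the error term \(\varepsilon\) are notation introduced here and do not appear elsewhere in the slides:

\[
y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 (x \cdot z) + \varepsilon,
\qquad
\frac{\partial y}{\partial x} = \beta_1 + \beta_3 z
\]

The effect of \(x\) on \(y\) is the same for all values of \(z\) only if \(\beta_3 = 0\); testing that coefficient is the test for moderation.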
Code
library(palmerpenguins)
library(ggstatsplot)

lm(body_mass_g ~ species:flipper_length_mm + species + flipper_length_mm,
   data = penguins) |>
  broom::tidy() |>
  gt::gt() |>
  gt::tab_options(table.font.size = 35)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -2535.836802 | 879.467667 | -2.8833770 | 0.004187922468810483 |
| speciesChinstrap | -501.358972 | 1523.459013 | -0.3290925 | 0.742290787208460534 |
| speciesGentoo | -4251.443811 | 1427.332228 | -2.9785944 | 0.003106357935232562 |
| flipper_length_mm | 32.831690 | 4.627184 | 7.0953936 | 0.000000000007691129 |
| speciesChinstrap:flipper_length_mm | 1.741704 | 7.855734 | 0.2217112 | 0.824673398945700575 |
| speciesGentoo:flipper_length_mm | 21.790812 | 6.941167 | 3.1393586 | 0.001843295165362903 |
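To read the table, it helps to turn the coefficients into one flipper-length slope per species. Adelie is the reference level, so its slope is the plain `flipper_length_mm` coefficient; the other two add their interaction term. The sketch below only re-uses the estimates printed above, and the variable names are chosen here for illustration:

```r
# Estimates copied from the regression table
b_flipper       <- 32.831690  # flipper_length_mm (slope for Adelie)
b_int_chinstrap <- 1.741704   # speciesChinstrap:flipper_length_mm
b_int_gentoo    <- 21.790812  # speciesGentoo:flipper_length_mm

# Grams of body mass per additional mm of flipper length, by species
slopes <- c(
  Adelie    = b_flipper,
  Chinstrap = b_flipper + b_int_chinstrap,
  Gentoo    = b_flipper + b_int_gentoo
)

round(slopes, 2)
```

Only the Gentoo interaction is clearly different from zero in the table, so the evidence for moderation comes mainly from the steeper Gentoo slope.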
Mediation
References
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2020. “A Crash Course in Good and Bad Controls.” SSRN 3689437.
Imbens, Guido W. 2020. “Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics.” Journal of Economic Literature 58 (4): 1129–79. https://doi.org/10.1257/jel.20191597.