Finally understood the “doubly robust” property of TMLE through simulation. It works as long as either the outcome model OR the treatment model is correct. XGBoost + TMLE captured complex relationships without manual specification. It worked on simulated complex data; would it work in the real world? 🤔
Motivations:
I’ve always heard about Targeted Maximum Likelihood Estimation (TMLE), and I’ve read Katherine Hoffman’s blog post several times. Printed out her cheat sheet and went through it several times. Every time I thought I understood it, I found myself questioning my understanding the next time. 🤣 So, what better way than to dive a little deeper into the machinery behind it and why it is useful? Let’s go!
To set the context, let’s first estimate the average treatment effect (ATE) using g-computation as a standard approach.
Objectives:
What is TMLE?
TMLE is a statistical method used to estimate causal effects in observational studies and clinical trials. It combines elements of machine learning and traditional statistical techniques to obtain robust estimates of treatment effects while controlling for confounding variables. TMLE works in two main steps: first, it estimates the outcome model and the treatment model, and then it uses these models to adjust the estimate of the treatment effect, directly approximating the parameter of interest. This approach is especially useful in environments where standard methods can be biased or inefficient, as it allows the integration of flexible machine learning algorithms to improve the accuracy of the estimates. You will hear the term Doubly Robust about this method. What’s robust x 2 about this?
What does double robust mean?
Doubly Robust (DR) estimation refers to a statistical property of certain estimators that remain consistent when either the treatment assignment (propensity score) model or the outcome model is correctly specified, but not necessarily both. In other words, a doubly robust estimator provides two opportunities for obtaining a valid estimate of the causal effect: if one of the models is misspecified, as long as the other model is correctly specified, the estimator will still produce consistent results. This property is particularly advantageous in observational research, where there may be uncertainty about the appropriate specification of either model, because it increases the reliability of causal inferences from the data. I didn’t fully understand this until we simulated the data to test this theory. Hopefully it will become clearer as we run through the simulation. But wait, what metrics should we use for this? Bias and variance!
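To make the “two chances” idea concrete, here is a minimal base-R sketch (my own toy example, not part of the simulation below) of an AIPW-style doubly robust estimate, which combines outcome-model predictions with an IPW correction of their residuals:

```r
# Toy data with one confounder W; true ATE is 0.3 by construction
set.seed(42)
n <- 20000
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * W))
Y <- 1 + 0.3 * A + 0.5 * W + rnorm(n)

ps <- glm(A ~ W, family = "binomial")$fitted.values   # treatment model
om <- lm(Y ~ A + W)                                   # outcome model
m1 <- predict(om, newdata = data.frame(A = 1, W = W)) # predicted Y under A = 1
m0 <- predict(om, newdata = data.frame(A = 0, W = W)) # predicted Y under A = 0

# Doubly robust: consistent if EITHER ps OR m1/m0 is correctly specified
dr_ate <- mean(m1 - m0 + A / ps * (Y - m1) - (1 - A) / (1 - ps) * (Y - m0))
dr_ate
```

Here both models happen to be correct, so the estimate lands near 0.3; the DR property is what protects us when one of them is not.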
What is bias and variance?
Bias and variance are two fundamental concepts in statistics and machine learning that describe different sources of error in predictive models. Bias refers to the systematic error that occurs when a model consistently overestimates or underestimates the true value of a parameter. High bias can lead to underfitting, where the model fails to capture the underlying patterns in the data. Variance, on the other hand, refers to the variability of model predictions for different training datasets. High variance can lead to overfitting, where the model captures noise in the training data instead of the real signal. The trade-off between bias and variance is an important consideration in model selection and evaluation because it affects the overall accuracy and generalizability of predictive models.
The formula for bias is: $$ \text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta $$
Where:
- \(\hat{\theta}\) is the estimator of the parameter \(\theta\)
- \(E[\hat{\theta}]\) is the expected value of the estimator
In pseudo-R code it would look something like this:
predicted_theta <- vector(mode = "numeric", length = 1000)
for (i in 1:1000) {
  # bootstrap resample of the training data
  training_data <- dplyr::slice_sample(original_training_data, n = nrow(original_training_data), replace = TRUE)
  model <- glm(outcome ~ treatment + confounder, data = training_data)
  outcome_hat_1 <- predict(model, newdata = training_data |> mutate(treatment = 1))
  outcome_hat_0 <- predict(model, newdata = training_data |> mutate(treatment = 0))
  predicted_theta[i] <- mean(outcome_hat_1) - mean(outcome_hat_0)
}
bias <- mean(predicted_theta) - theta
In my own words: bias is how close our estimate, on average, is to the actual value.
The formula for variance is: $$ \text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] $$
Where:
- \(\hat{\theta}\) is the estimator of the parameter \(\theta\)
- \(E[\hat{\theta}]\) is the expected value of the estimator
In pseudo-R code it would look something like this:
predicted_theta <- vector(mode = "numeric", length = 1000)
for (i in 1:1000) {
  training_data <- dplyr::slice_sample(original_training_data, n = nrow(original_training_data), replace = TRUE)
  model <- glm(outcome ~ treatment + confounder, data = training_data)
  outcome_hat_1 <- predict(model, newdata = training_data |> mutate(treatment = 1))
  outcome_hat_0 <- predict(model, newdata = training_data |> mutate(treatment = 0))
  predicted_theta[i] <- mean(outcome_hat_1) - mean(outcome_hat_0)
}
variance <- mean((predicted_theta - mean(predicted_theta))^2)
# or
variance <- var(predicted_theta)
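Together, these two quantities make up the mean squared error of an estimator, which is the single number the bias-variance trade-off is really about:

$$ \text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta}) $$

So a small increase in bias can be worth accepting if it buys a large reduction in variance, and vice versa.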
We will use bias and variance to test the doubly robust theory. But first, let’s simulate some data!
Simulate data
library(tidyverse)
set.seed(1)
n <- 10000
W1 <- rnorm(n)
W2 <- rnorm(n)
W3 <- rbinom(n, 1, 0.5)
W4 <- rnorm(n)
# TRUE propensity score model
A <- rbinom(n, 1, plogis(-0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4))
# TRUE outcome model
Y <- rbinom(n, 1, plogis(-1 + 0.2*A + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2))
# Calculate TRUE ATE
logit_Y1 <- -1 + 0.2 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2
logit_Y0 <- -1 + 0 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2
Y1_true <- plogis(logit_Y1)
Y0_true <- plogis(logit_Y0)
true_ATE <- mean(Y1_true - Y0_true)
df <- tibble(W1 = W1, W2 = W2, W3 = W3, W4 = W4, A = A, Y = Y,
true_ATE = true_ATE, Y1_true = Y1_true, Y0_true = Y0_true)
Okay, our true ATE here is 0.0373518. We will see whether a doubly robust method can still estimate it when the outcome or the treatment model is misspecified.
Write a function to estimate the ATE
Let’s look at the WRONG outcome model ❌
model <- glm(Y ~ A + W1 + W2 + W3 + W4, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Y ~ A + W1 + W2 + W3 + W4, family = "binomial")
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.045765   0.041489 -25.206   <2e-16 ***
## A           -0.050142   0.047732  -1.050    0.293
## W1           0.767386   0.026058  29.449   <2e-16 ***
## W2          -0.024726   0.022807  -1.084    0.278
## W3           0.561572   0.045658  12.300   <2e-16 ***
## W4          -0.003209   0.022382  -0.143    0.886
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 12737  on 9999  degrees of freedom
## Residual deviance: 11519  on 9994  degrees of freedom
## AIC: 11531
##
## Number of Fisher Scoring iterations: 4
Ouch. A quick look at the coefficient of A shows -0.0501418, the opposite sign of the real ATE. Okay, let’s run g-computation and see if it produces a similar result.
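For reference, g-computation estimates the ATE by predicting the outcome for everyone under A = 1 and under A = 0 with the fitted outcome model, then averaging the difference over the observed confounder distribution:

$$ \widehat{\text{ATE}} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{E}(Y \mid A=1, W_i) - \hat{E}(Y \mid A=0, W_i)\right] $$

This is exactly what the function below implements.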
g-computation function
g_comp <- function(model, data, ml = FALSE) {
  if (ml) {
    # tidymodels classification fit: pull the predicted probability of Y = 1
    y1 <- predict(model, new_data = data |> mutate(A = as.factor(1)), type = "prob")[,2] |> pull()
    y0 <- predict(model, new_data = data |> mutate(A = as.factor(0)), type = "prob")[,2] |> pull()
  } else {
    y1 <- predict(model, newdata = data |> mutate(A = 1), type = "response")
    y0 <- predict(model, newdata = data |> mutate(A = 0), type = "response")
  }
  return(mean(y1 - y0))
}
g_comp(model,df)
## [1] -0.009823307
Yes, incorrect! What if we use the RIGHT Outcome model?
The right outcome model ✅
model <- glm(Y ~ A + W1 + I(W2^2) + W3 + W1:W2 + W4, family = "binomial")
g_comp(model, df)
## [1] 0.03576854
Wow, look at that! If we specify the outcome model correctly, the estimate is actually VERY close to the real ATE!
The wrong treatment model ❌
What if we now use IPW, but with the wrong treatment model, and see if we can estimate the ATE?
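For reference, one common form of the IPW estimator is the weighted difference below; the code that follows instead fits a weighted marginal glm of Y on A, which targets the same quantity:

$$ \widehat{\text{ATE}}_{IPW} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{A_i Y_i}{\hat{g}(W_i)} - \frac{(1-A_i) Y_i}{1-\hat{g}(W_i)}\right] $$

where \(\hat{g}(W) = \hat{P}(A=1 \mid W)\) is the estimated propensity score.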
ps_model <- glm(A ~ W1 + W2 + W3 + W4, family = "binomial")
ps <- ps_model$fitted.values
ps_final <- pmax(pmin(ps, 0.95), 0.05) # truncate extreme weights
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))
model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model, df)
## [1] -0.0095777
Wow, very wrong indeed! Now let’s look at the right treatment model
The Right Treatment Model ✅
ps_model <- glm(A ~ W1 + I(W2^2) + W3 + W1:W2 + W4, family = "binomial") # -0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4
ps <- ps_model$fitted.values
ps_final <- pmax(pmin(ps, 0.95), 0.05)
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))
model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model, df)
## [1] 0.03874379
Not too shabby! Very close to our real ATE! To be honest, how on earth are we supposed to know in advance what complex equation to specify for a treatment or an outcome model!?
Let’s try ML xgboost
Let’s see if xgboost can work out the outcome model without specifying all these weird interactions and quadratic relationships.
library(tidymodels)
library(doParallel)
library(future)
workers <- parallel::detectCores(logical = FALSE) - 1
plan(multisession, workers = workers)
future::nbrOfWorkers()
# Set up parallel processing
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores - 1) # Leave 1 core free
registerDoParallel(cl)
df_ml <- df |>
select(Y,A,W1,W2,W3,W4) |>
mutate(Y = as.factor(Y),
A = as.factor(A))
# Define model specification
xgb_spec <- boost_tree(
trees = 1000,
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
# sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Create workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Y ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),df_ml),
learn_rate(),
size = 20
)
# Cross-validation and tuning
set.seed(1)
folds <- vfold_cv(df_ml, v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
parallel_over = "everything")
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = df_ml)
# g-comp
g_comp(final_fit, df_ml, T)
## [1] 0.03447109
Wow, nice! ✅ Pretty close to our real ATE without specifying any interaction or quadratic relationship. Please note that this dataset is quite large.
Now let’s see if we can use xgboost to build an accurate treatment model and feed its weights into our naive glm.
# Recreate workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(A ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),df_ml |> select(-Y)),
learn_rate(),
size = 20
)
# Cross-validation and tuning
set.seed(1)
folds <- vfold_cv(df_ml |> select(-Y), v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
parallel_over = "everything")
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = df_ml |> select(-Y))
# calc ps
ps <- predict(final_fit, new_data = df_ml |> select(-Y), type = "prob")[,2] |> pull()
ps_final <- ps
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))
# glm model
model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model, df_ml)
## [1] 0.04840169
Wow, compared to the misspecified treatment model, this ATE is much closer to the real ATE. Although it’s still quite biased, right? It is some way off the real ATE. But at least we know that ML methods can probably handle this complex relationship.
Now let’s write a function to estimate bias and variance! Because that is our most important question. And then we will look at the TMLE procedure.
bias <- function(predicted_theta, theta) {
return(mean(predicted_theta - theta))
}
variance <- function(predicted_theta) {
return(var(predicted_theta))
}
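A quick sanity check of these helpers on made-up numbers (the helpers are repeated here so the snippet runs standalone):

```r
# Same helpers as above, redefined so this snippet is self-contained
bias <- function(predicted_theta, theta) mean(predicted_theta - theta)
variance <- function(predicted_theta) var(predicted_theta)

est <- c(0.02, 0.04, 0.06)  # three hypothetical ATE estimates
bias(est, 0.03)             # mean(est) - 0.03 = 0.01
variance(est)               # sample variance = 4e-04
```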
TMLE
Since we know that xgboost is able to estimate the correct outcome model, why don’t we just use plain logistic regression here instead? Let’s imagine that somehow we only get the treatment model correct, but not the outcome model. Will TMLE be able to clean this up?
Step 1. Create an outcome model
model_outcome <- glm(Y ~ A + W1 + W2 + W3 + W4, family = "binomial") # wrong
model_outcome_all <- predict(model_outcome, newdata = df, type = "response")
model_outcome_1 <- predict(model_outcome, newdata = df |> mutate(A = 1), type = "response")
model_outcome_0 <- predict(model_outcome, newdata = df |> mutate(A = 0), type = "response")
Step 2. Create a treatment model and clever covariate
model_treatment <- glm(A ~ W1 + I(W2^2) + W3 + W1:W2 + W4, family = "binomial") # -0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4
ps <- model_treatment$fitted.values
ps_final <- ps
# ps_final <- pmax(pmin(ps, 0.95), 0.05)
a_1 <- 1/(predict(model_treatment, df, type = "response"))
a_0 <- -1/(1 - predict(model_treatment, df, type = "response"))
clever_covariate <- ifelse(A == 1, 1/ps_final, -1/(1-ps_final))
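In formula form, the clever covariate built above is:

$$ H(A, W) = \frac{A}{\hat{g}(W)} - \frac{1-A}{1-\hat{g}(W)} $$

so it equals \(1/\hat{g}(W)\) for treated subjects and \(-1/(1-\hat{g}(W))\) for controls, matching the `ifelse` above.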
Step 3. Estimate the fluctuation parameter
epsilon_model <- glm(Y ~ -1 + offset(qlogis(model_outcome$fitted.values)) + clever_covariate, family = "binomial")
summary(epsilon_model)
##
## Call:
## glm(formula = Y ~ -1 + offset(qlogis(model_outcome$fitted.values)) +
##     clever_covariate, family = "binomial")
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## clever_covariate 0.041886   0.009675   4.329  1.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 11519  on 10000  degrees of freedom
## Residual deviance: 11497  on  9999  degrees of freedom
## AIC: 11499
##
## Number of Fisher Scoring iterations: 4
epsilon <- epsilon_model$coefficients
Step 4. Update the initial results
updated_outcome_1 <- plogis(qlogis(model_outcome_1) + epsilon*a_1)
updated_outcome_0 <- plogis(qlogis(model_outcome_0) + epsilon*a_0)
Step 5. Calculate ATE
(ate <- mean(updated_outcome_1-updated_outcome_0)) ## [1] 0.04068321
Wow, imagine that: specifying the wrong outcome model but the right treatment model brings our ATE much closer to the real ATE! Remember that the misspecified outcome model alone would have yielded an ATE of -0.009 here. Not too shabby!
Step 6. Estimate the standard error
se <- sqrt(var((Y-model_outcome_all)*clever_covariate+updated_outcome_1-updated_outcome_0-ate)/n)
pval <- 2*(1-pnorm(abs(ate/se)))
print(paste0("ATE: ",round(ate,3), " [95%CI ", round(ate-1.96*se,3),"-",round(ate+1.96*se,3),", p=",round(pval,3),"]"))
## [1] "ATE: 0.041 [95%CI 0.008-0.073, p=0.015]"
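The standard error above is the sample standard deviation of the estimated efficient influence curve divided by \(\sqrt{n}\); term by term it matches the code:

$$ \hat{IC}_i = \left(Y_i - \hat{Q}(A_i, W_i)\right) H(A_i, W_i) + \hat{Q}^*(1, W_i) - \hat{Q}^*(0, W_i) - \widehat{\text{ATE}} $$

where \(\hat{Q}\) is the outcome model, \(\hat{Q}^*\) the updated (targeted) outcome model, and \(H\) the clever covariate. (Note the code plugs the initial fit into \(\hat{Q}(A_i, W_i)\); strictly, the updated \(\hat{Q}^*\) is often used there as well.)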
There you have it! Our final estimate, standard error, and p-value. Thanks to khstats for the step-by-step guidance. Very useful for reproducing the framework.
Compare models
Now let’s resample 1000 times with n = 6000 (after calculating the sample size for our effect with 80% power and 5% alpha) for: 1) a correctly specified logistic regression outcome model; 2) an inverse probability weighting (IPW) approach with a correctly specified treatment model; 3) an incorrectly specified logistic regression outcome model; and 4) a hyperparameter-tuned xgboost outcome model, and see how their biases and variances differ.
My messy code
n_sample <- 1000
df_ate <- tibble(model=character(),bias=numeric(),variance=numeric())
### correct outcome model specification, logistic regression
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T)
model <- glm(Y ~ A + W1 + I(W2^2) + W3 + W1:W2 + W4, data = data, family = "binomial")
ate <- g_comp(model,data)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="logreg_correct_outcome",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
### correct treatment model specification, logistic regression ipw
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T)
ps_model <- glm(A ~ W1 + I(W2^2) + W3 + W1:W2 + W4, data=data, family = "binomial") #-0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4
ps <- ps_model$fitted.values
ps_final <- pmax(pmin(ps, 0.95), 0.05)
weights <- ifelse(data$A == 1, 1/ps_final, 1/(1-ps_final))
model <- glm(Y ~ A, data=data, family = "binomial", weights = weights)
ate <- g_comp(model,data)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="logreg_treatment_outcome",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
### no interaction or quadratic relationship outcome model
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T)
model <- glm(Y ~ A + W1 + W2 + W3 + W4, data = data, family = "binomial")
ate <- g_comp(model,data)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="logreg_wrong_outcome",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
### xgboost
library(tidymodels)
library(future)
plan(multisession, workers = 6)
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T) |> select(Y,A,W1:W4) |> mutate(Y = as.factor(Y), A = as.factor(A))
train <- data # you really don't need to do this, but I was lazy to change the other ML code
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Create workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Y ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train),
learn_rate(),
size = 5
)
# Cross-validation and tuning
folds <- vfold_cv(train, v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
verbose = T)
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = train)
# g-comp
ate <- g_comp(final_fit, train, T)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="xgboost_outcome_model",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
| model | bias | variance |
|---|---|---|
| logreg_correct_outcome | -0.0009778 | 0.0001410 |
| logreg_treatment_outcome | 0.0019444 | 0.0001581 |
| logreg_wrong_outcome | -0.0464683 | 0.0001354 |
| xgboost_outcome_model | -0.0164066 | 0.0000459 |
Wow, look at that! When the outcome or treatment model is correctly specified, logistic regression is still the best, with the lowest bias and low variance. When the outcome model was incorrectly specified, it became much more biased while the variance barely changed. When we used xgboost for the outcome model alone, it was less biased than the misspecified logistic regression outcome model and, interestingly, it had the lowest variance. Very interesting! I think this is useful because it seems a tree-based model can capture quadratic and interaction relationships without us having to specify them. Now, what if we use xgboost for both the outcome and treatment models and then apply the TMLE framework to see if that does better?
Using TMLE procedures with Xgboost
We do the same as above, resampling 1000 times with n = 6000 with replacement, then assess the bias and variance of the ATE. We will use an xgboost model for both the treatment and outcome models, then run the TMLE procedure as shown above. For tuning, we will use a space-filling grid search (grid_space_filling with size 5).
My messy code
library(tidymodels)
library(tidyverse)
library(future)
plan(multisession, workers = 4)
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in n_sample:1) {
# sample
data <- slice_sample(df, n = 6000, replace = T) |> select(Y,A,W1:W4) |> mutate(Y = as.factor(Y), A = as.factor(A))
train <- data
# outcome model
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
# sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Create workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Y ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train),
learn_rate(),
size = 5
)
# Cross-validation and tuning
folds <- vfold_cv(train, v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
# parallel_over = "resamples",
verbose = T)
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = train)
# predict
outcome <- predict(final_fit, new_data = train, type = "prob")[,2] |> pull()
outcome_0 <- predict(final_fit, new_data = train |> mutate(A = as.factor(0)), type = "prob")[,2] |> pull()
outcome_1 <- predict(final_fit, new_data = train |> mutate(A = as.factor(1)), type = "prob")[,2] |> pull()
# -------------------------------------------------------
# treatment model
train_tx <- train |> select(-Y)
xgb_wf_tx <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(A ~ .)
xgb_grid_tx <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train_tx),
learn_rate(),
size = 5
)
# Cross-validation and tuning
folds_tx <- vfold_cv(train_tx, v = 5)
xgb_res_tx <- tune_grid(
xgb_wf_tx,
resamples = folds_tx,
grid = xgb_grid_tx,
control = control_grid(save_pred = TRUE,
verbose = T)
)
# Select best model
best_xgb_tx <- select_best(xgb_res_tx, metric = "roc_auc")
# Finalize and fit
final_xgb_tx <- finalize_workflow(xgb_wf_tx, best_xgb_tx)
final_fit_tx <- fit(final_xgb_tx, data = train_tx)
ps <- predict(final_fit_tx, new_data = train_tx |> select(-A), type = "prob")[,2] |> pull()
ps_final <- ps
a_1 <- 1/(predict(final_fit_tx, new_data = train_tx, type = "prob")[,2] |> pull())
a_0 <- -1/(1 - (predict(final_fit_tx, new_data = train_tx, type = "prob")[,2] |> pull()))
clever_covariate <- ifelse(train_tx$A == 1, 1/ps_final, -1/(1-ps_final))
# step 3
epsilon_model <- glm(train$Y ~ -1 + offset(qlogis(outcome)) + clever_covariate, family = "binomial")
summary(epsilon_model)
epsilon <- epsilon_model$coefficients
#### Step 4. Update Initial Outcomes
updated_outcome_1 <- plogis(qlogis(outcome_1)+epsilon*a_1)
updated_outcome_0 <- plogis(qlogis(outcome_0)+epsilon*a_0)
#### Step 5. Compute ATE
ate <- mean(updated_outcome_1-updated_outcome_0)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="xgboost_tmle_model_size5",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
| model | bias | variance |
|---|---|---|
| logreg_correct_outcome | -0.0009778 | 0.0001410 |
| logreg_treatment_outcome | 0.0019444 | 0.0001581 |
| logreg_wrong_outcome | -0.0464683 | 0.0001354 |
| xgboost_outcome_model | -0.0164066 | 0.0000459 |
| xgboost_tmle_model_size5 | 0.0066270 | 0.0001399 |
Wow, how about that! Less biased than the xgboost outcome model alone! Although the variance appears higher than the pure xgboost outcome model, it is approximately the same as logistic regression. It appears that TMLE can reduce bias. Do you think we can do better? What if we increase the grid size to 20?
My messy code
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T) |> select(Y,A,W1:W4) |> mutate(Y = as.factor(Y), A = as.factor(A))
train <- data
# train <- df |> select(Y,A,W1:W4) |> mutate(Y = as.factor(Y), A = as.factor(A))
# outcome model
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
# sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Create workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Y ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train),
learn_rate(),
size = 20
)
# Cross-validation and tuning
folds <- vfold_cv(train, v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
# parallel_over = "resamples",
verbose = T)
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = train)
# predict
outcome <- predict(final_fit, new_data = train, type = "prob")[,2] |> pull()
outcome_0 <- predict(final_fit, new_data = train |> mutate(A = as.factor(0)), type = "prob")[,2] |> pull()
outcome_1 <- predict(final_fit, new_data = train |> mutate(A = as.factor(1)), type = "prob")[,2] |> pull()
# -------------------------------------------------------
# treatment model
train_tx <- train |> select(-Y)
xgb_wf_tx <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(A ~ .)
xgb_grid_tx <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train_tx),
learn_rate(),
size = 20
)
# Cross-validation and tuning
folds_tx <- vfold_cv(train_tx, v = 5)
xgb_res_tx <- tune_grid(
xgb_wf_tx,
resamples = folds_tx,
grid = xgb_grid_tx,
control = control_grid(save_pred = TRUE,
verbose = T)
)
# Select best model
best_xgb_tx <- select_best(xgb_res_tx, metric = "roc_auc")
# Finalize and fit
final_xgb_tx <- finalize_workflow(xgb_wf_tx, best_xgb_tx)
final_fit_tx <- fit(final_xgb_tx, data = train_tx)
ps <- predict(final_fit_tx, new_data = train_tx |> select(-A), type = "prob")[,2] |> pull()
ps_final <- ps
# ps_final <- pmax(pmin(ps, 0.95), 0.05)
a_1 <- 1/(predict(final_fit_tx, new_data = train_tx, type = "prob")[,2] |> pull())
a_0 <- -1/(1 - (predict(final_fit_tx, new_data = train_tx, type = "prob")[,2] |> pull()))
clever_covariate <- ifelse(train_tx$A == 1, 1/ps_final, -1/(1-ps_final))
# step 3
epsilon_model <- glm(train$Y ~ -1 + offset(qlogis(outcome)) + clever_covariate, family = "binomial")
summary(epsilon_model)
epsilon <- epsilon_model$coefficients
#### Step 4. Update Initial Outcomes
updated_outcome_1 <- plogis(qlogis(outcome_1)+epsilon*a_1)
updated_outcome_0 <- plogis(qlogis(outcome_0)+epsilon*a_0)
#### Step 5. Compute ATE
ate <- mean(updated_outcome_1-updated_outcome_0)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="xgboost_tmle_model_size20",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
| model | bias | variance |
|---|---|---|
| logreg_correct_outcome | -0.0009778 | 0.0001410 |
| logreg_treatment_outcome | 0.0019444 | 0.0001581 |
| logreg_wrong_outcome | -0.0464683 | 0.0001354 |
| xgboost_outcome_model | -0.0164066 | 0.0000459 |
| xgboost_tmle_model_size5 | 0.0066270 | 0.0001399 |
| xgboost_tmle_model_size20 | 0.0041615 | 0.0001645 |
Look at that! Less biased than with a grid size of 5, but you can see the variance starting to increase. This is really cool!
Is there a traditional statistical method for this?
Yes, apparently there is! The Augmented Inverse Probability Weighting (AIPW) estimator provides a doubly robust method for estimating causal effects that combines propensity score weighting with outcome modeling. AIPW remains consistent as long as either the propensity score model or the outcome model is correctly specified (sound familiar? It’s just like TMLE!), making it more reliable than traditional methods when the model specification is uncertain.
Here’s the AIPW R package, in case we need it in the future.
This double robust method also reminds me of Double Machine Learning (DML), we should compare all these methods in the future!
Opportunities for improvement
- learn to use `future_lapply` with distributed computing
- test different ML models for TMLE
- does it matter if we use all the data (as above), or does the ATE change with a train/test split?
Lessons learned
- learned to bootstrap
- learned what `offset` does in `glm`
- learned the procedure of TMLE
- learned to use `future` for parallel computing