Finally understood the “doubly robust” property of TMLE through simulation. It works as long as either the outcome model OR the treatment model is correct. XGBoost + TMLE captured complex relationships without manual specification. It worked on simulated complex data; would it work in the real world? 🤔
Motivations:
I’ve always heard about Targeted Maximum Likelihood Estimation (TMLE), and I’ve read Katherine Hoffman’s blog post several times. Printed out her cheat sheet and went through it several times. Every time I thought I understood it, I found myself questioning my understanding the next time. 🤣 So, what better way than to dive a little deeper into the machinery behind it and why it is useful? Let’s go!
To set the context, let’s first estimate the average treatment effect (ATE) using g-computation as a standard approach.
Objectives:
What is TMLE?
TMLE is a statistical method used to estimate causal effects in observational studies and clinical trials. It combines elements of machine learning and traditional statistical techniques to obtain robust estimates of treatment effects while controlling for confounding variables. TMLE works in two main steps: first, it estimates the outcome model and the treatment model, and then it uses these models to adjust the estimate of the treatment effect, directly approximating the parameter of interest. This approach is especially useful in environments where standard methods can be biased or inefficient, as it allows the integration of flexible machine learning algorithms to improve the accuracy of the estimates. You will hear the term Doubly Robust about this method. What’s robust x 2 about this?
What does double robust mean?
Doubly Robust (DR) estimation refers to a statistical property of certain estimators that remain consistent when either the treatment assignment (propensity score) model or the outcome model is correctly specified, but not necessarily both. In other words, a doubly robust estimator provides two opportunities for obtaining a valid estimate of the causal effect: if one of the models is misspecified, as long as the other model is correctly specified, the estimator will still produce consistent results. This property is particularly advantageous in observational research, where there may be uncertainty about the appropriate specification of either model, because it increases the reliability of causal inferences from the data. I didn’t fully understand this until we simulated the data to test this theory. Hopefully it will become clearer as we run through the simulation. But wait, what metrics should we use for this? Bias and variance!
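To make the “two chances” idea concrete, here is a minimal base-R sketch (my own toy example, not part of the simulation below) of an AIPW-style doubly robust estimate, which combines outcome-model predictions with an IPW correction of their residuals:

```r
# Toy data with one confounder W; true ATE is 0.3 by construction
set.seed(42)
n <- 20000
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * W))
Y <- 1 + 0.3 * A + 0.5 * W + rnorm(n)

ps <- glm(A ~ W, family = "binomial")$fitted.values   # treatment model
om <- lm(Y ~ A + W)                                   # outcome model
m1 <- predict(om, newdata = data.frame(A = 1, W = W)) # predicted Y under A = 1
m0 <- predict(om, newdata = data.frame(A = 0, W = W)) # predicted Y under A = 0

# Doubly robust: consistent if EITHER ps OR m1/m0 is correctly specified
dr_ate <- mean(m1 - m0 + A / ps * (Y - m1) - (1 - A) / (1 - ps) * (Y - m0))
dr_ate
```

Here both models happen to be correct, so the estimate lands near 0.3; the DR property is what protects us when one of them is not.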
What is bias and variance?
Bias and variance are two fundamental concepts in statistics and machine learning that describe different sources of error in predictive models. Bias refers to the systematic error that occurs when a model consistently overestimates or underestimates the true value of a parameter. High bias can lead to underfitting, where the model fails to capture the underlying patterns in the data. Variance, on the other hand, refers to the variability of model predictions for different training datasets. High variance can lead to overfitting, where the model captures noise in the training data instead of the real signal. The trade-off between bias and variance is an important consideration in model selection and evaluation because it affects the overall accuracy and generalizability of predictive models.
The formula for bias is: $$ \text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta $$
Where:
- \(\hat{\theta}\) is the estimator of the parameter \(\theta\)
- \(E[\hat{\theta}]\) is the expected value of the estimator
In pseudo-R code it would look something like this:
predicted_theta <- vector(mode = "numeric", length = 1000)
for (i in 1:1000) {
  # bootstrap resample of the training data
  training_data <- dplyr::slice_sample(original_training_data, n = nrow(original_training_data), replace = TRUE)
  model <- glm(outcome ~ treatment + confounder, data = training_data)
  outcome_hat_1 <- predict(model, newdata = training_data |> mutate(treatment = 1))
  outcome_hat_0 <- predict(model, newdata = training_data |> mutate(treatment = 0))
  predicted_theta[i] <- mean(outcome_hat_1) - mean(outcome_hat_0)
}
bias <- mean(predicted_theta) - theta
In my own words: bias is how close our estimate, on average, is to the actual value.
The formula for variance is: $$ \text{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2] $$
Where:
- \(\hat{\theta}\) is the estimator of the parameter \(\theta\)
- \(E[\hat{\theta}]\) is the expected value of the estimator
In pseudo-R code it would look something like this:
predicted_theta <- vector(mode = "numeric", length = 1000)
for (i in 1:1000) {
  training_data <- dplyr::slice_sample(original_training_data, n = nrow(original_training_data), replace = TRUE)
  model <- glm(outcome ~ treatment + confounder, data = training_data)
  outcome_hat_1 <- predict(model, newdata = training_data |> mutate(treatment = 1))
  outcome_hat_0 <- predict(model, newdata = training_data |> mutate(treatment = 0))
  predicted_theta[i] <- mean(outcome_hat_1) - mean(outcome_hat_0)
}
variance <- mean((predicted_theta - mean(predicted_theta))^2)
# or
variance <- var(predicted_theta)
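Together, these two quantities make up the mean squared error of an estimator, which is the single number the bias-variance trade-off is really about:

$$ \text{MSE}(\hat{\theta}) = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta}) $$

So a small increase in bias can be worth accepting if it buys a large reduction in variance, and vice versa.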
We will use bias and variance to test the doubly robust theory. But first, let’s simulate some data!
Simulate data
library(tidyverse)
set.seed(1)
n <- 10000
W1 <- rnorm(n)
W2 <- rnorm(n)
W3 <- rbinom(n, 1, 0.5)
W4 <- rnorm(n)
# TRUE propensity score model
A <- rbinom(n, 1, plogis(-0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4))
# TRUE outcome model
Y <- rbinom(n, 1, plogis(-1 + 0.2*A + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2))
# Calculate TRUE ATE
logit_Y1 <- -1 + 0.2 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2
logit_Y0 <- -1 + 0 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2
Y1_true <- plogis(logit_Y1)
Y0_true <- plogis(logit_Y0)
true_ATE <- mean(Y1_true - Y0_true)
df <- tibble(W1 = W1, W2 = W2, W3 = W3, W4 = W4, A = A, Y = Y,
true_ATE = true_ATE, Y1_true = Y1_true, Y0_true = Y0_true)
Okay, our true ATE here is 0.0373518. We will see whether a doubly robust method can still estimate it when the outcome or the treatment model is misspecified.
Write a function to estimate the ATE
Let’s look at the WRONG outcome model ❌
model <- glm(Y ~ A + W1 + W2 + W3 + W4, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Y ~ A + W1 + W2 + W3 + W4, family = "binomial")
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.045765   0.041489 -25.206   <2e-16 ***
## A           -0.050142   0.047732  -1.050    0.293
## W1           0.767386   0.026058  29.449   <2e-16 ***
## W2          -0.024726   0.022807  -1.084    0.278
## W3           0.561572   0.045658  12.300   <2e-16 ***
## W4          -0.003209   0.022382  -0.143    0.886
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 12737  on 9999  degrees of freedom
## Residual deviance: 11519  on 9994  degrees of freedom
## AIC: 11531
##
## Number of Fisher Scoring iterations: 4
Ouch. A quick look at the coefficient of A shows -0.0501418, the opposite sign of the real ATE. Okay, let’s run g-computation and see if it produces a similar result.
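For reference, g-computation estimates the ATE by predicting the outcome for everyone under A = 1 and under A = 0 with the fitted outcome model, then averaging the difference over the observed confounder distribution:

$$ \widehat{\text{ATE}} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{E}(Y \mid A=1, W_i) - \hat{E}(Y \mid A=0, W_i)\right] $$

This is exactly what the function below implements.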
g-computation function
g_comp <- function(model, data, ml = FALSE) {
  if (ml) {
    # tidymodels classification fit: pull the predicted probability of Y = 1
    y1 <- predict(model, new_data = data |> mutate(A = as.factor(1)), type = "prob")[,2] |> pull()
    y0 <- predict(model, new_data = data |> mutate(A = as.factor(0)), type = "prob")[,2] |> pull()
  } else {
    y1 <- predict(model, newdata = data |> mutate(A = 1), type = "response")
    y0 <- predict(model, newdata = data |> mutate(A = 0), type = "response")
  }
  return(mean(y1 - y0))
}
g_comp(model,df)
## [1] -0.009823307
Yes, incorrect! What if we use the RIGHT Outcome model?
The right outcome model ✅
model <- glm(Y ~ A + W1 + I(W2^2) + W3 + W1:W2 + W4, family = "binomial")
g_comp(model, df)
## [1] 0.03576854
Wow, look at that! If we specify the outcome model correctly, the estimate is actually VERY close to the real ATE!
The wrong treatment model ❌
What if we now use IPW, but with the wrong treatment model, and see if we can estimate the ATE?
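For reference, one common form of the IPW estimator is the weighted difference below; the code that follows instead fits a weighted marginal glm of Y on A, which targets the same quantity:

$$ \widehat{\text{ATE}}_{IPW} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{A_i Y_i}{\hat{g}(W_i)} - \frac{(1-A_i) Y_i}{1-\hat{g}(W_i)}\right] $$

where \(\hat{g}(W) = \hat{P}(A=1 \mid W)\) is the estimated propensity score.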
ps_model <- glm(A ~ W1 + W2 + W3 + W4, family = "binomial")
ps <- ps_model$fitted.values
ps_final <- pmax(pmin(ps, 0.95), 0.05) # truncate extreme weights
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))
model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model, df)
## [1] -0.0095777
Wow, very wrong indeed! Now let’s look at the right treatment model
The Right Treatment Model ✅
ps_model <- glm(A ~ W1 + I(W2^2) + W3 + W1:W2 + W4, family = "binomial") # -0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4
ps <- ps_model$fitted.values
ps_final <- pmax(pmin(ps, 0.95), 0.05)
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))
model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model, df)
## [1] 0.03874379
Not too shabby! Very close to our real ATE! To be honest, how on earth are we supposed to know in advance what complex equation to specify for a treatment or an outcome model!?
Let’s try ML xgboost
Let’s see if xgboost can work out the outcome model without specifying all these weird interactions and quadratic relationships.
library(tidymodels)
library(doParallel)
library(future)
workers <- parallel::detectCores(logical = FALSE) - 1
plan(multisession, workers = workers)
future::nbrOfWorkers()
# Set up parallel processing
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores - 1) # Leave 1 core free
registerDoParallel(cl)
df_ml <- df |>
select(Y,A,W1,W2,W3,W4) |>
mutate(Y = as.factor(Y),
A = as.factor(A))
# Define model specification
xgb_spec <- boost_tree(
trees = 1000,
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
# sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Create workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Y ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),df_ml),
learn_rate(),
size = 20
)
# Cross-validation and tuning
set.seed(1)
folds <- vfold_cv(df_ml, v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
parallel_over = "everything")
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = df_ml)
# g-comp
g_comp(final_fit, df_ml, T)
## [1] 0.03447109
Wow, nice! ✅ Pretty close to our real ATE without specifying any interaction or quadratic relationship. Please note that this dataset is quite large.
Now let’s see if we can use xgboost to build an accurate treatment model and feed its weights into our naive glm.
# Recreate workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(A ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),df_ml |> select(-Y)),
learn_rate(),
size = 20
)
# Cross-validation and tuning
set.seed(1)
folds <- vfold_cv(df_ml |> select(-Y), v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
parallel_over = "everything")
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = df_ml |> select(-Y))
# calc ps
ps <- predict(final_fit, new_data = df_ml |> select(-Y), type = "prob")[,2] |> pull()
ps_final <- ps
weights <- ifelse(A == 1, 1/ps_final, 1/(1-ps_final))
# glm model
model <- glm(Y ~ A, family = "binomial", weights = weights)
g_comp(model, df_ml)
## [1] 0.04840169
Wow, compared to the misspecified treatment model, this ATE is much closer to the real ATE. Although it’s still quite biased, right? It is some way off the real ATE. But at least we know that ML methods can probably handle this complex relationship.
Now let’s write a function to estimate bias and variance! Because that is our most important question. And then we will look at the TMLE procedure.
bias <- function(predicted_theta, theta) {
return(mean(predicted_theta - theta))
}
variance <- function(predicted_theta) {
return(var(predicted_theta))
}
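A quick sanity check of these helpers on made-up numbers (the helpers are repeated here so the snippet runs standalone):

```r
# Same helpers as above, redefined so this snippet is self-contained
bias <- function(predicted_theta, theta) mean(predicted_theta - theta)
variance <- function(predicted_theta) var(predicted_theta)

est <- c(0.02, 0.04, 0.06)  # three hypothetical ATE estimates
bias(est, 0.03)             # mean(est) - 0.03 = 0.01
variance(est)               # sample variance = 4e-04
```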
TMLE
Since we know that xgboost is able to estimate the correct outcome model, why don’t we just use plain logistic regression here instead? Let’s imagine that somehow we only get the treatment model correct, but not the outcome model. Will TMLE be able to clean this up?
Step 1. Create an outcome model
model_outcome <- glm(Y ~ A + W1 + W2 + W3 + W4, family = "binomial") # wrong
model_outcome_all <- predict(model_outcome, newdata = df, type = "response")
model_outcome_1 <- predict(model_outcome, newdata = df |> mutate(A = 1), type = "response")
model_outcome_0 <- predict(model_outcome, newdata = df |> mutate(A = 0), type = "response")
Step 2. Create a treatment model and clever covariate
model_treatment <- glm(A ~ W1 + I(W2^2) + W3 + W1:W2 + W4, family = "binomial") # -0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4
ps <- model_treatment$fitted.values
ps_final <- ps
# ps_final <- pmax(pmin(ps, 0.95), 0.05)
a_1 <- 1/(predict(model_treatment, df, type = "response"))
a_0 <- -1/(1 - predict(model_treatment, df, type = "response"))
clever_covariate <- ifelse(A == 1, 1/ps_final, -1/(1-ps_final))
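In formula form, the clever covariate built above is:

$$ H(A, W) = \frac{A}{\hat{g}(W)} - \frac{1-A}{1-\hat{g}(W)} $$

so it equals \(1/\hat{g}(W)\) for treated subjects and \(-1/(1-\hat{g}(W))\) for controls, matching the `ifelse` above.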
Step 3. Estimate the fluctuation parameter
epsilon_model <- glm(Y ~ -1 + offset(qlogis(model_outcome$fitted.values)) + clever_covariate, family = "binomial")
summary(epsilon_model)
##
## Call:
## glm(formula = Y ~ -1 + offset(qlogis(model_outcome$fitted.values)) +
##     clever_covariate, family = "binomial")
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## clever_covariate 0.041886   0.009675   4.329  1.5e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 11519  on 10000  degrees of freedom
## Residual deviance: 11497  on  9999  degrees of freedom
## AIC: 11499
##
## Number of Fisher Scoring iterations: 4
epsilon <- epsilon_model$coefficients
Step 4. Update the initial results
updated_outcome_1 <- plogis(qlogis(model_outcome_1) + epsilon*a_1)
updated_outcome_0 <- plogis(qlogis(model_outcome_0) + epsilon*a_0)
Step 5. Calculate ATE
(ate <- mean(updated_outcome_1-updated_outcome_0)) ## [1] 0.04068321
Wow, imagine that: specifying the wrong outcome model but the right treatment model brings our ATE much closer to the real ATE! Remember that the misspecified outcome model alone would have yielded an ATE of -0.009 here. Not too shabby!
Step 6. Estimate the standard error
se <- sqrt(var((Y-model_outcome_all)*clever_covariate+updated_outcome_1-updated_outcome_0-ate)/n)
pval <- 2*(1-pnorm(abs(ate/se)))
print(paste0("ATE: ",round(ate,3), " [95%CI ", round(ate-1.96*se,3),"-",round(ate+1.96*se,3),", p=",round(pval,3),"]"))
## [1] "ATE: 0.041 [95%CI 0.008-0.073, p=0.015]"
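The standard error above is the sample standard deviation of the estimated efficient influence curve divided by \(\sqrt{n}\); term by term it matches the code:

$$ \hat{IC}_i = \left(Y_i - \hat{Q}(A_i, W_i)\right) H(A_i, W_i) + \hat{Q}^*(1, W_i) - \hat{Q}^*(0, W_i) - \widehat{\text{ATE}} $$

where \(\hat{Q}\) is the outcome model, \(\hat{Q}^*\) the updated (targeted) outcome model, and \(H\) the clever covariate. (Note the code plugs the initial fit into \(\hat{Q}(A_i, W_i)\); strictly, the updated \(\hat{Q}^*\) is often used there as well.)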
There you have it! Our final estimate, standard error, and p-value. Thanks to khstats for the step-by-step guidance. Very useful for reproducing the framework.
Compare models
Now let’s resample 1000 times with n = 6000 (after calculating the sample size for our effect with 80% power and 5% alpha) for: 1) a correctly specified logistic regression outcome model; 2) an inverse probability weighting (IPW) approach with a correctly specified treatment model; 3) an incorrectly specified logistic regression outcome model; and 4) a hyperparameter-tuned xgboost outcome model, and see how their biases and variances differ.
My messy code
n_sample <- 1000
df_ate <- tibble(model=character(),bias=numeric(),variance=numeric())
### correct outcome model specification, logistic regression
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T)
model <- glm(Y ~ A + W1 + I(W2^2) + W3 + W1:W2 + W4, data = data, family = "binomial")
ate <- g_comp(model,data)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="logreg_correct_outcome",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
### correct treatment model specification, logistic regression ipw
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T)
ps_model <- glm(A ~ W1 + I(W2^2) + W3 + W1:W2 + W4, data=data, family = "binomial") #-0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4
ps <- ps_model$fitted.values
ps_final <- pmax(pmin(ps, 0.95), 0.05)
weights <- ifelse(data$A == 1, 1/ps_final, 1/(1-ps_final))
model <- glm(Y ~ A, data=data, family = "binomial", weights = weights)
ate <- g_comp(model,data)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="logreg_treatment_outcome",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
### no interaction or quadratic relationship outcome model
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T)
model <- glm(Y ~ A + W1 + W2 + W3 + W4, data = data, family = "binomial")
ate <- g_comp(model,data)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="logreg_wrong_outcome",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
### xgboost
library(tidymodels)
library(future)
plan(multisession, workers = 6)
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T) |> select(Y,A,W1:W4) |> mutate(Y = as.factor(Y), A = as.factor(A))
train <- data # you really don't need to do this, but I was lazy to change the other ML code
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Create workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Y ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train),
learn_rate(),
size = 5
)
# Cross-validation and tuning
folds <- vfold_cv(train, v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
verbose = T)
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = train)
# g-comp
ate <- g_comp(final_fit, train, T)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="xgboost_outcome_model",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
| model | bias | variance |
|---|---|---|
| logreg_correct_outcome | -0.0009778 | 0.0001410 |
| logreg_treatment_outcome | 0.0019444 | 0.0001581 |
| logreg_wrong_outcome | -0.0464683 | 0.0001354 |
| xgboost_outcome_model | -0.0164066 | 0.0000459 |
Wow, look at that! When the outcome or treatment model is correctly specified, logistic regression is still the best, with the lowest bias and low variance. When the outcome model was incorrectly specified, it became much more biased while the variance barely changed. When we used xgboost for the outcome model alone, it was less biased than the misspecified logistic regression outcome model and, interestingly, it had the lowest variance. Very interesting! I think this is useful because it seems a tree-based model can capture quadratic and interaction relationships without us having to specify them. Now, what if we use xgboost for both the outcome and treatment models and then apply the TMLE framework to see if that does better?
Using TMLE procedures with Xgboost
We do the same as above, resampling 1000 times with n = 6000 with replacement, then assess the bias and variance of the ATE. We will use an xgboost model for both the treatment and outcome models, then run the TMLE procedure as shown above. For tuning, we will use a space-filling grid search (grid_space_filling with size 5).
My messy code
library(tidymodels)
library(tidyverse)
library(future)
plan(multisession, workers = 4)
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in n_sample:1) {
# sample
data <- slice_sample(df, n = 6000, replace = T) |> select(Y,A,W1:W4) |> mutate(Y = as.factor(Y), A = as.factor(A))
train <- data
# outcome model
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
# sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Create workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Y ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train),
learn_rate(),
size = 5
)
# Cross-validation and tuning
folds <- vfold_cv(train, v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
# parallel_over = "resamples",
verbose = T)
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = train)
# predict
outcome <- predict(final_fit, new_data = train, type = "prob")[,2] |> pull()
outcome_0 <- predict(final_fit, new_data = train |> mutate(A = as.factor(0)), type = "prob")[,2] |> pull()
outcome_1 <- predict(final_fit, new_data = train |> mutate(A = as.factor(1)), type = "prob")[,2] |> pull()
# -------------------------------------------------------
# treatment model
train_tx <- train |> select(-Y)
xgb_wf_tx <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(A ~ .)
xgb_grid_tx <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train_tx),
learn_rate(),
size = 5
)
# Cross-validation and tuning
folds_tx <- vfold_cv(train_tx, v = 5)
xgb_res_tx <- tune_grid(
xgb_wf_tx,
resamples = folds_tx,
grid = xgb_grid_tx,
control = control_grid(save_pred = TRUE,
verbose = T)
)
# Select best model
best_xgb_tx <- select_best(xgb_res_tx, metric = "roc_auc")
# Finalize and fit
final_xgb_tx <- finalize_workflow(xgb_wf_tx, best_xgb_tx)
final_fit_tx <- fit(final_xgb_tx, data = train_tx)
ps <- predict(final_fit_tx, new_data = train_tx |> select(-A), type = "prob")[,2] |> pull()
ps_final <- ps
a_1 <- 1/(predict(final_fit_tx, new_data = train_tx, type = "prob")[,2] |> pull())
a_0 <- -1/(1 - (predict(final_fit_tx, new_data = train_tx, type = "prob")[,2] |> pull()))
clever_covariate <- ifelse(train_tx$A == 1, 1/ps_final, -1/(1-ps_final))
# step 3
epsilon_model <- glm(train$Y ~ -1 + offset(qlogis(outcome)) + clever_covariate, family = "binomial")
summary(epsilon_model)
epsilon <- epsilon_model$coefficients
#### Step 4. Update Initial Outcomes
updated_outcome_1 <- plogis(qlogis(outcome_1)+epsilon*a_1)
updated_outcome_0 <- plogis(qlogis(outcome_0)+epsilon*a_0)
#### Step 5. Compute ATE
ate <- mean(updated_outcome_1-updated_outcome_0)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="xgboost_tmle_model_size5",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
| model | bias | variance |
|---|---|---|
| logreg_correct_outcome | -0.0009778 | 0.0001410 |
| logreg_treatment_outcome | 0.0019444 | 0.0001581 |
| logreg_wrong_outcome | -0.0464683 | 0.0001354 |
| xgboost_outcome_model | -0.0164066 | 0.0000459 |
| xgboost_tmle_model_size5 | 0.0066270 | 0.0001399 |
Wow, how about that! Less biased than the xgboost outcome model alone! Although the variance appears higher than the pure xgboost outcome model, it is approximately the same as logistic regression. It appears that TMLE can reduce bias. Do you think we can do better? What if we increase the grid size to 20?
My messy code
predicted_ate <- vector(mode = "numeric", length = n_sample)
for (i in 1:n_sample) {
set.seed(i)
data <- slice_sample(df, n = 6000, replace = T) |> select(Y,A,W1:W4) |> mutate(Y = as.factor(Y), A = as.factor(A))
train <- data
# train <- df |> select(Y,A,W1:W4) |> mutate(Y = as.factor(Y), A = as.factor(A))
# outcome model
xgb_spec <- boost_tree(
trees = tune(),
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
# sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# Create workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(Y ~ .)
# Tuning grid
xgb_grid <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train),
learn_rate(),
size = 20
)
# Cross-validation and tuning
folds <- vfold_cv(train, v = 5)
xgb_res <- tune_grid(
xgb_wf,
resamples = folds,
grid = xgb_grid,
control = control_grid(save_pred = TRUE,
# parallel_over = "resamples",
verbose = T)
)
# Select best model
best_xgb <- select_best(xgb_res, metric = "roc_auc")
# Finalize and fit
final_xgb <- finalize_workflow(xgb_wf, best_xgb)
final_fit <- fit(final_xgb, data = train)
# predict
outcome <- predict(final_fit, new_data = train, type = "prob")[,2] |> pull()
outcome_0 <- predict(final_fit, new_data = train |> mutate(A = as.factor(0)), type = "prob")[,2] |> pull()
outcome_1 <- predict(final_fit, new_data = train |> mutate(A = as.factor(1)), type = "prob")[,2] |> pull()
# -------------------------------------------------------
# treatment model
train_tx <- train |> select(-Y)
xgb_wf_tx <- workflow() %>%
add_model(xgb_spec) %>%
add_formula(A ~ .)
xgb_grid_tx <- grid_space_filling(
trees(),
tree_depth(),
min_n(),
loss_reduction(),
finalize(mtry(),train_tx),
learn_rate(),
size = 20
)
# Cross-validation and tuning
folds_tx <- vfold_cv(train_tx, v = 5)
xgb_res_tx <- tune_grid(
xgb_wf_tx,
resamples = folds_tx,
grid = xgb_grid_tx,
control = control_grid(save_pred = TRUE,
verbose = T)
)
# Select best model
best_xgb_tx <- select_best(xgb_res_tx, metric = "roc_auc")
# Finalize and fit
final_xgb_tx <- finalize_workflow(xgb_wf_tx, best_xgb_tx)
final_fit_tx <- fit(final_xgb_tx, data = train_tx)
ps <- predict(final_fit_tx, new_data = train_tx |> select(-A), type = "prob")[,2] |> pull()
ps_final <- ps
# ps_final <- pmax(pmin(ps, 0.95), 0.05)
a_1 <- 1/(predict(final_fit_tx, new_data = train_tx, type = "prob")[,2] |> pull())
a_0 <- -1/(1 - (predict(final_fit_tx, new_data = train_tx, type = "prob")[,2] |> pull()))
clever_covariate <- ifelse(train_tx$A == 1, 1/ps_final, -1/(1-ps_final))
# step 3
epsilon_model <- glm(train$Y ~ -1 + offset(qlogis(outcome)) + clever_covariate, family = "binomial")
summary(epsilon_model)
epsilon <- epsilon_model$coefficients
#### Step 4. Update Initial Outcomes
updated_outcome_1 <- plogis(qlogis(outcome_1)+epsilon*a_1)
updated_outcome_0 <- plogis(qlogis(outcome_0)+epsilon*a_0)
#### Step 5. Compute ATE
ate <- mean(updated_outcome_1-updated_outcome_0)
predicted_ate[i] <- ate
}
df_ate <- df_ate |>
bind_rows(tibble(model="xgboost_tmle_model_size20",bias=bias(predicted_ate,true_ATE),variance=variance(predicted_ate)))
| model | bias | variance |
|---|---|---|
| logreg_correct_outcome | -0.0009778 | 0.0001410 |
| logreg_treatment_outcome | 0.0019444 | 0.0001581 |
| logreg_wrong_outcome | -0.0464683 | 0.0001354 |
| xgboost_outcome_model | -0.0164066 | 0.0000459 |
| xgboost_tmle_model_size5 | 0.0066270 | 0.0001399 |
| xgboost_tmle_model_size20 | 0.0041615 | 0.0001645 |
Look at that! Less biased than with a grid size of 5, but you can see the variance starting to increase. This is really cool!
Is there a traditional statistical method for this?
Yes, apparently there is! The Augmented Inverse Probability Weighting (AIPW) estimator provides a doubly robust method for estimating causal effects that combines propensity score weighting with outcome modeling. AIPW remains consistent as long as either the propensity score model or the outcome model is correctly specified (sound familiar? It’s just like TMLE!), making it more reliable than traditional methods when the model specification is uncertain.
Here’s the AIPW R package, in case we need it in the future.
This double robust method also reminds me of Double Machine Learning (DML), we should compare all these methods in the future!
Opportunities for improvement
- learn to use `future_lapply` with distributed computing
- test different ML models for TMLE
- does it matter if we use all the data (as above), or does the ATE change with a train/test split?
Lessons learned
- learned to bootstrap
- learned what `offset` does in `glm`
- learned the procedure of TMLE
- learned to use `future` for parallel computing