In this article, we integrate a non-standard Python prediction model into the R Tidyverse/Tidymodels framework, relying mainly on the reticulate package.
1. The goal (the challenge)
The goal was to use a powerful custom Python model (nnetsauce's MTS wrapper around cybooster's BoosterRegressor) and bring its results (predictions and prediction intervals) into the R ecosystem, allowing consistent data manipulation with dplyr and high-quality visualization with ggplot2 (and interactive plotly).
The challenge was that this custom Python model is not natively recognized by R’s standard prediction packages, such as modeltime.
2. The integration mechanism
The integration was achieved through the following steps:
- reticulate Bridge: We used the reticulate package to call the Python model seamlessly, passing in the R data (tbl_train), fitting the model (regr$fit()), and running the prediction (regr$predict()).
- Data Conversion: The raw prediction output (predictions and prediction intervals) from Python was carefully extracted and converted into a standard R tibble (forecast_data_tbl).
- Tidymodels Simulation (first attempt): We tried to force this tibble into a modeltime structure using dummy models and calibration, which failed due to strict internal validation checks in plot_modeltime_forecast().
- Tidyverse Final Solution: We abandoned the rigid modeltime visualization features and adopted the core Tidyverse approach:
  - Data Manipulation: We used dplyr and tidyr::pivot_longer() to restructure the prediction data into the long format that is optimal for ggplot2.
  - Visualization: We used pure ggplot2 and plotly to manually draw the lines and ribbons and apply the desired STOXX aesthetic, successfully bypassing all package incompatibility errors.
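The long-format restructuring that tidyr::pivot_longer() performs can be sketched in plain Python with hypothetical rows (the actual pipeline does this in R, further below): each wide row with both an actual value and a prediction becomes two long rows keyed by series name.

```python
# Hypothetical wide-format forecast rows (stand-ins for forecast_data_tbl).
rows = [
    {".index": "2024-01-01", ".value": 100.0, ".prediction": 101.0},
    {".index": "2024-01-02", ".value": 102.0, ".prediction": 101.5},
]

# Pivot to long format: one row per (date, series) pair, mirroring
# tidyr::pivot_longer(cols = c(.value, .prediction), names_to = ".key").
long_rows = [
    {".index": r[".index"], ".key": key, "Y_Value": r[key]}
    for r in rows
    for key in (".value", ".prediction")
]
```

In the long layout, ggplot2 can map `.key` to color and draw both series with a single `geom_line()`.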
3. Key findings
The process revealed a crucial statistical finding:
- Metric vs. Visual Discrepancy: The point prediction quality of the Python model was excellent (low MAPE), but the internally computed prediction intervals (type_pi="gaussian") showed serious defects: they were far too narrow and covered too few of the actual test values.
- Solution: To ensure that the visualization reflected the low MAPE score, we calculated the actual MAPE on the test set and manually constructed a realistic MAPE-based confidence interval to use in the final plot.
Essentially, we have created a robust R-centric pipeline for post-processing, validating, and visualizing forecasts generated by a custom Python machine learning model.
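The MAPE-based interval construction described above can be sketched in plain Python with hypothetical numbers (the full pipeline below does this in R with dplyr): compute the mean absolute percentage error on the test set, then widen each prediction by prediction × MAPE × z.

```python
# Hypothetical test-set values and model predictions.
actuals = [100.0, 102.0, 98.0]
preds   = [101.0, 100.5, 99.0]

# MAPE in decimal form: mean(|actual - prediction| / actual)
mape = sum(abs(a - p) / a for a, p in zip(actuals, preds)) / len(actuals)

# 95% interval: prediction +/- prediction * MAPE * z-score
Z_95 = 1.96
intervals = [(p - p * mape * Z_95, p + p * mape * Z_95) for p in preds]
```

This is a pragmatic, empirical interval: it reflects the observed test-set error rather than the Gaussian assumption that produced the overly narrow intervals.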
library(tidyverse)
library(tidyquant)
library(reticulate)
library(tidymodels)
library(timetk)
library(modeltime)
library(plotly)
# Tell reticulate to use the nnetsauce_env environment
use_python("C:/Users/selcu/AppData/Local/r-miniconda/envs/nnetsauce_env/python.exe", required = TRUE)
# Import Python packages
np <- import("numpy")
ns <- import("nnetsauce")
cyb <- import("cybooster")
sklearn <- import("sklearn")
#iShares US Aerospace & Defense ETF (ITA)
df_ita <-
tq_get("ITA") %>%
select(date, close) %>%
filter(date >= last(date) - months(36))
#Split Data
split <- time_series_split(df_ita,
assess = "15 days",
cumulative = TRUE)
tbl_train <- training(split)
tbl_test <- testing(split)
# 1. import DecisionTreeRegressor from sklearn
skl_tree <- import("sklearn.tree")
# Define the Base Estimator (The core learning component, factored for clarity)
# This uses a Decision Tree with a max depth of 3 as the weak learner for the boosting model.
base_estimator_instance <- skl_tree$DecisionTreeRegressor(max_depth = 3L)
# Define the Booster Object (The main regression engine)
# cyb$BoosterRegressor is the ensemble model that will be fitted.
booster_regressor <- cyb$BoosterRegressor(
base_estimator_instance, # The weak learner used for boosting
n_estimators = 100L # Number of boosting stages (number of trees to build)
)
# Define the MTS (Multi-variate Time Series) Model Wrapper
# ns$MTS wraps the booster model to handle time series features like lags and prediction intervals.
regr <- ns$MTS(
obj = booster_regressor, # The fitted regression model (BoosterRegressor)
# Time Series Parameters:
lags = 20L, # Number of past observations (lags) to use as predictors (X)
type_pi = "gaussian", # Method for computing Prediction Intervals (PI).
# 'gaussian' assumes errors are normally distributed.
# (NOTE: We found this to be too narrow, requiring manual adjustment with MAPE.)
# Execution Control:
show_progress = TRUE # Displays the training progress (useful for monitoring)
)
#Converting the tibble to data frame for the fitting process
df_train <-
tbl_train %>%
as.data.frame() %>%
mutate(date = as.character(date))
df <- df_train[, -1, drop = FALSE]
rownames(df) <- df_train$date
#Fit the model
regr$fit(df)
#Step 1: Extract Python Predictions and Create the Base R Data Table
# --- Define a Python Helper and Retrieve Predictions ---
# (This helper converts the Python prediction object into plain lists.)
reticulate::py_run_string("
def get_full_data_11(preds_obj):
mean_list = preds_obj[0]['close'].tolist()
lower_list = preds_obj[1]['close'].tolist()
upper_list = preds_obj[2]['close'].tolist()
return {'mean': mean_list, 'lower': lower_list, 'upper': upper_list}
")
preds_raw <- regr$predict(h = 11L)
full_data_list <- py$get_full_data_11(preds_raw)
preds_mean_vector <- unlist(full_data_list$mean)
preds_lower_vector <- unlist(full_data_list$lower)
preds_upper_vector <- unlist(full_data_list$upper)
# Align data sets based on prediction length
vector_length <- length(preds_mean_vector)
df_test_final <- tbl_test %>% dplyr::slice(1:vector_length)
# --- Create the Base Forecast Table ---
forecast_data_tbl <- tibble(
.index = df_test_final$date,
.value = df_test_final$close,
.prediction = preds_mean_vector,
.conf_lo = preds_lower_vector, # These are the initially incorrect (too narrow) CIs
.conf_hi = preds_upper_vector, # These are the initially incorrect (too narrow) CIs
.model_desc = "nnetsauce_BoosterRegressor"
)
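The extraction step above assumes regr$predict() returns a tuple of three pandas objects (mean, lower, upper), each exposing a 'close' column. The same logic can be run stand-alone in Python, with numpy arrays standing in for those columns:

```python
import numpy as np

# Mocked prediction object: (mean, lower, upper), each a dict with a
# 'close' array, mimicking the structure get_full_data_11 expects.
preds_obj = (
    {"close": np.array([101.0, 102.0])},
    {"close": np.array([99.0, 100.0])},
    {"close": np.array([103.0, 104.0])},
)

def get_full_data_11(preds_obj):
    # .tolist() converts each array to a plain list, which reticulate
    # then hands back to R as an ordinary R vector.
    mean_list  = preds_obj[0]["close"].tolist()
    lower_list = preds_obj[1]["close"].tolist()
    upper_list = preds_obj[2]["close"].tolist()
    return {"mean": mean_list, "lower": lower_list, "upper": upper_list}

result = get_full_data_11(preds_obj)
```

Returning plain lists (rather than pandas objects) is what makes the R-side `unlist()` calls straightforward.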
#Step 2: Calculate the Actual MAPE and Create Realistic Confidence Intervals
library(dplyr)
library(tibble)
library(tidyr)
# --- Step 2.1: Calculate the Actual MAPE on the Test Set ---
# Create a verification table containing error metrics
check_tbl <- forecast_data_tbl %>%
mutate(
Error = .value - .prediction,
# MAPE formula: abs(Actual - Prediction) / Actual * 100
Absolute_Percentage_Error = abs(Error) / .value * 100
)
# Actual MAPE (Calculated from the Test Set)
actual_mape_test <- mean(check_tbl$Absolute_Percentage_Error, na.rm = TRUE) / 100
# (Converting the percentage to decimal format: e.g., 2.84% -> 0.0284)
print(paste0("Calculated MAPE (decimal format): ", actual_mape_test))
# --- Step 2.2: Create New, Realistic Intervals Based on MAPE ---
Z_SCORE_95 <- 1.96 # Z-Score for 95% Confidence Interval
# Create the New Realistic Prediction Interval Table
forecast_data_tbl_realistic <- check_tbl %>%
mutate(
# Error Margin = Prediction * MAPE * Z-Score
Error_Amount = .prediction * actual_mape_test * Z_SCORE_95,
# New Confidence Intervals
.conf_lo_new = .prediction - Error_Amount,
.conf_hi_new = .prediction + Error_Amount
) %>%
# Rename/replace columns to use the new, wider intervals
mutate(
.conf_lo = .conf_lo_new,
.conf_hi = .conf_hi_new
) %>%
select(-c(Error_Amount, .conf_lo_new, .conf_hi_new, Error, Absolute_Percentage_Error))
#Visualize Confidence Intervals
# Visualization Settings
min_y <- min(forecast_data_tbl_realistic$.conf_lo) * 0.98
max_y <- max(forecast_data_tbl_realistic$.conf_hi) * 1.02
mape_value_for_title <- round(actual_mape_test * 100, 2)
COLOR_PREDICTION <- "red"     # Red (Prediction)
COLOR_ACTUAL <- "black"       # Black (Actual Value)
COLOR_RIBBON <- "dimgrey"     # Grey (Confidence Ribbon)
# Pivot the Base Table to Long Format (required for ggplot)
forecast_tbl_long_final <- forecast_data_tbl_realistic %>%
pivot_longer(cols = c(.value, .prediction),
names_to = ".key",
values_to = "Y_Value") %>%
mutate(
.conf_lo_plot = ifelse(.key == ".prediction", .conf_lo, NA_real_),
.conf_hi_plot = ifelse(.key == ".prediction", .conf_hi, NA_real_)
)
# Create the ggplot Object
p <- ggplot(forecast_tbl_long_final, aes(x = .index, y = Y_Value, color = .key)) +
# Draw the Confidence Interval (Ribbon)
geom_ribbon(aes(ymin = .conf_lo_plot, ymax = .conf_hi_plot),
fill = COLOR_RIBBON,
alpha = 0.6,
color = NA,
data = forecast_tbl_long_final %>% filter(.key == ".prediction")) +
# Draw the Lines
geom_line(linewidth = 2.0) +
# Y-Axis Zoom
coord_cartesian(ylim = c(min_y, max_y)) +
# Colors and Titles
labs(title = "iShares US Aerospace & Defense ETF",
subtitle = paste0("Predictive Intervals based on TEST SET MAPE: ", mape_value_for_title,
"%<br>nnetsauce-Wrapped Booster Regressor (MTS) Model"),
x = "",
y = "",
color = "") +
scale_color_manual(values = c(".value" = COLOR_ACTUAL,
".prediction" = COLOR_PREDICTION),
labels = c(".value" = "Actual Value",
".prediction" = "Prediction")) +
scale_y_continuous(labels = scales::label_currency()) +
scale_x_date(labels = scales::label_date("%b %d"),
date_breaks = "2 days") +
theme_minimal(base_family = "Roboto Slab", base_size = 16) +
theme(plot.title = element_text(face = "bold", size = 16),
plot.subtitle = ggtext::element_markdown(face = "bold"),
plot.background = element_rect(fill = "azure", color = "azure"),
panel.background = element_rect(fill = "snow", color = "snow"),
axis.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 45,
hjust = 1,
vjust = 1),
legend.position = "none")
# Interactive Output
plotly::ggplotly(p)
NOTE: This article was generated using an AI model. The final content and structure were reviewed and approved by the author.