In this article, we integrate a non-standard Python prediction model into the R Tidyverse/Tidymodels framework, relying mainly on the reticulate package.
1. The goal (the challenge)
The goal was to use a powerful custom Python model (nnetsauce's MTS wrapper around cybooster's BoosterRegressor) and bring its results (predictions and prediction intervals) into the R ecosystem, allowing consistent data manipulation with dplyr and high-quality visualization with ggplot2 (and interactive plotly).
The challenge was that this custom Python model is not natively recognized by R’s standard prediction packages, such as modeltime.
2. The integration mechanism
The integration was achieved through the following steps:
- reticulate Bridge: We used the reticulate package to call the Python model seamlessly, passing in the R data (tbl_train), fitting the model (regr$fit()), and running the prediction (regr$predict()).
- Data Conversion: The raw prediction output (predictions and prediction intervals) from Python was carefully extracted and converted into a standard R tibble (forecast_data_tbl).
- Tidymodels Simulation (first attempt): We tried to force this tibble into a modeltime structure using dummy models and calibration, which failed due to strict internal validation checks in plot_modeltime_forecast().
- Tidyverse Final Solution: We abandoned the rigid modeltime visualization features and adopted the core Tidyverse approach:
  - Data Manipulation: We used dplyr and tidyr::pivot_longer() to restructure the prediction data into the long format that is optimal for ggplot2.
  - Visualization: We used pure ggplot2 and plotly to manually draw the lines and ribbons and apply the desired STOXX aesthetic, successfully bypassing all package incompatibility errors.
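The long-format restructuring that tidyr::pivot_longer() performs can be sketched in plain Python with hypothetical rows (the actual pipeline does this in R, further below): each wide row with both an actual value and a prediction becomes two long rows keyed by series name.

```python
# Hypothetical wide-format forecast rows (stand-ins for forecast_data_tbl).
rows = [
    {".index": "2024-01-01", ".value": 100.0, ".prediction": 101.0},
    {".index": "2024-01-02", ".value": 102.0, ".prediction": 101.5},
]

# Pivot to long format: one row per (date, series) pair, mirroring
# tidyr::pivot_longer(cols = c(.value, .prediction), names_to = ".key").
long_rows = [
    {".index": r[".index"], ".key": key, "Y_Value": r[key]}
    for r in rows
    for key in (".value", ".prediction")
]
```

In the long layout, ggplot2 can map `.key` to color and draw both series with a single `geom_line()`.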
3. Key findings
The process revealed a crucial statistical finding:
- Metric vs. Visual Discrepancy: The point prediction quality of the Python model was excellent (low MAPE), but the internally computed prediction intervals (type_pi="gaussian") showed serious defects: they were far too narrow and covered too few of the actual test values.
- Solution: To ensure that the visualization reflected the low MAPE score, we calculated the actual MAPE on the test set and manually constructed a realistic MAPE-based confidence interval to use in the final plot.
Essentially, we have created a robust R-centric pipeline for post-processing, validating, and visualizing forecasts generated by a custom Python machine learning model.
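The MAPE-based interval construction described above can be sketched in plain Python with hypothetical numbers (the full pipeline below does this in R with dplyr): compute the mean absolute percentage error on the test set, then widen each prediction by prediction × MAPE × z.

```python
# Hypothetical test-set values and model predictions.
actuals = [100.0, 102.0, 98.0]
preds   = [101.0, 100.5, 99.0]

# MAPE in decimal form: mean(|actual - prediction| / actual)
mape = sum(abs(a - p) / a for a, p in zip(actuals, preds)) / len(actuals)

# 95% interval: prediction +/- prediction * MAPE * z-score
Z_95 = 1.96
intervals = [(p - p * mape * Z_95, p + p * mape * Z_95) for p in preds]
```

This is a pragmatic, empirical interval: it reflects the observed test-set error rather than the Gaussian assumption that produced the overly narrow intervals.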
library(tidyverse)
library(tidyquant)
library(reticulate)
library(tidymodels)
library(timetk)
library(modeltime)
library(plotly)
# Tell reticulate to use the nnetsauce_env environment
use_python("C:/Users/selcu/AppData/Local/r-miniconda/envs/nnetsauce_env/python.exe", required = TRUE)
# Import Python packages
np <- import("numpy")
ns <- import("nnetsauce")
cyb <- import("cybooster")
sklearn <- import("sklearn")
#iShares US Aerospace & Defense ETF (ITA)
df_ita <-
tq_get("ITA") %>%
select(date, close) %>%
filter(date >= last(date) - months(36))
#Split Data
split <- time_series_split(df_ita,
assess = "15 days",
cumulative = TRUE)
tbl_train <- training(split)
tbl_test <- testing(split)
# 1. import DecisionTreeRegressor from sklearn
skl_tree <- import("sklearn.tree")
# Define the Base Estimator (The core learning component, factored for clarity)
# This uses a Decision Tree with a max depth of 3 as the weak learner for the boosting model.
base_estimator_instance <- skl_tree$DecisionTreeRegressor(max_depth = 3L)
# Define the Booster Object (The main regression engine)
# cyb$BoosterRegressor is the ensemble model that will be fitted.
booster_regressor <- cyb$BoosterRegressor(
base_estimator_instance, # The weak learner used for boosting
n_estimators = 100L # Number of boosting stages (number of trees to build)
)
# Define the MTS (Multi-variate Time Series) Model Wrapper
# ns$MTS wraps the booster model to handle time series features like lags and prediction intervals.
regr <- ns$MTS(
obj = booster_regressor, # The fitted regression model (BoosterRegressor)
# Time Series Parameters:
lags = 20L, # Number of past observations (lags) to use as predictors (X)
type_pi = "gaussian", # Method for computing Prediction Intervals (PI).
# 'gaussian' assumes errors are normally distributed.
# (NOTE: We found this to be too narrow, requiring manual adjustment with MAPE.)
# Execution Control:
show_progress = TRUE # Displays the training progress (useful for monitoring)
)
#Converting the tibble to data frame for the fitting process
df_train <-
tbl_train %>%
as.data.frame() %>%
mutate(date = as.character(date))
df <- df_train[, -1, drop = FALSE]
rownames(df) <- df_train$date
#Fit the model
regr$fit(df)
#Step 1: Extract Python Predictions and Create the Base R Data Table
# --- Define a Python Helper and Retrieve Predictions ---
# (This helper converts the Python prediction object into plain lists.)
reticulate::py_run_string("
def get_full_data_11(preds_obj):
mean_list = preds_obj[0]['close'].tolist()
lower_list = preds_obj[1]['close'].tolist()
upper_list = preds_obj[2]['close'].tolist()
return {'mean': mean_list, 'lower': lower_list, 'upper': upper_list}
")
preds_raw <- regr$predict(h = 11L)
full_data_list <- py$get_full_data_11(preds_raw)
preds_mean_vector <- unlist(full_data_list$mean)
preds_lower_vector <- unlist(full_data_list$lower)
preds_upper_vector <- unlist(full_data_list$upper)
# Align data sets based on prediction length
vector_length <- length(preds_mean_vector)
df_test_final <- tbl_test %>% dplyr::slice(1:vector_length)
# --- Create the Base Forecast Table ---
forecast_data_tbl <- tibble(
.index = df_test_final$date,
.value = df_test_final$close,
.prediction = preds_mean_vector,
.conf_lo = preds_lower_vector, # These are the initially incorrect (too narrow) CIs
.conf_hi = preds_upper_vector, # These are the initially incorrect (too narrow) CIs
.model_desc = "nnetsauce_BoosterRegressor"
)
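The extraction step above assumes regr$predict() returns a tuple of three pandas objects (mean, lower, upper), each exposing a 'close' column. The same logic can be run stand-alone in Python, with numpy arrays standing in for those columns:

```python
import numpy as np

# Mocked prediction object: (mean, lower, upper), each a dict with a
# 'close' array, mimicking the structure get_full_data_11 expects.
preds_obj = (
    {"close": np.array([101.0, 102.0])},
    {"close": np.array([99.0, 100.0])},
    {"close": np.array([103.0, 104.0])},
)

def get_full_data_11(preds_obj):
    # .tolist() converts each array to a plain list, which reticulate
    # then hands back to R as an ordinary R vector.
    mean_list  = preds_obj[0]["close"].tolist()
    lower_list = preds_obj[1]["close"].tolist()
    upper_list = preds_obj[2]["close"].tolist()
    return {"mean": mean_list, "lower": lower_list, "upper": upper_list}

result = get_full_data_11(preds_obj)
```

Returning plain lists (rather than pandas objects) is what makes the R-side `unlist()` calls straightforward.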
#Step 2: Calculate the Actual MAPE and Create Realistic Confidence Intervals
library(dplyr)
library(tibble)
library(tidyr)
# --- Step 2.1: Calculate the Actual MAPE on the Test Set ---
# Create a verification table containing error metrics
check_tbl <- forecast_data_tbl %>%
mutate(
Error = .value - .prediction,
# MAPE formula: abs(Actual - Prediction) / Actual * 100
Absolute_Percentage_Error = abs(Error) / .value * 100
)
# Actual MAPE (Calculated from the Test Set)
actual_mape_test <- mean(check_tbl$Absolute_Percentage_Error, na.rm = TRUE) / 100
# (Converting the percentage to decimal format: e.g., 2.84% -> 0.0284)
print(paste0("Calculated MAPE (decimal format): ", actual_mape_test))
# --- Step 2.2: Create New, Realistic Intervals Based on MAPE ---
Z_SCORE_95 <- 1.96 # Z-Score for 95% Confidence Interval
# Create the New Realistic Prediction Interval Table
forecast_data_tbl_realistic <- check_tbl %>%
mutate(
# Error Margin = Prediction * MAPE * Z-Score
Error_Amount = .prediction * actual_mape_test * Z_SCORE_95,
# New Confidence Intervals
.conf_lo_new = .prediction - Error_Amount,
.conf_hi_new = .prediction + Error_Amount
) %>%
# Rename/replace columns to use the new, wider intervals
mutate(
.conf_lo = .conf_lo_new,
.conf_hi = .conf_hi_new
) %>%
select(-c(Error_Amount, .conf_lo_new, .conf_hi_new, Error, Absolute_Percentage_Error))
#Visualize Confidence Intervals
# Visualization Settings
min_y <- min(forecast_data_tbl_realistic$.conf_lo) * 0.98
max_y <- max(forecast_data_tbl_realistic$.conf_hi) * 1.02
mape_value_for_title <- round(actual_mape_test * 100, 2)
COLOR_PREDICTION <- "red"     # Red (Prediction)
COLOR_ACTUAL <- "black"       # Black (Actual Value)
COLOR_RIBBON <- "dimgrey"     # Grey (Confidence Ribbon)
# Pivot the Base Table to Long Format (required for ggplot)
forecast_tbl_long_final <- forecast_data_tbl_realistic %>%
pivot_longer(cols = c(.value, .prediction),
names_to = ".key",
values_to = "Y_Value") %>%
mutate(
.conf_lo_plot = ifelse(.key == ".prediction", .conf_lo, NA_real_),
.conf_hi_plot = ifelse(.key == ".prediction", .conf_hi, NA_real_)
)
# Create the ggplot Object
p <- ggplot(forecast_tbl_long_final, aes(x = .index, y = Y_Value, color = .key)) +
# Draw the Confidence Interval (Ribbon)
geom_ribbon(aes(ymin = .conf_lo_plot, ymax = .conf_hi_plot),
fill = COLOR_RIBBON,
alpha = 0.6,
color = NA,
data = forecast_tbl_long_final %>% filter(.key == ".prediction")) +
# Draw the Lines
geom_line(linewidth = 2.0) +
# Y-Axis Zoom
coord_cartesian(ylim = c(min_y, max_y)) +
# Colors and Titles
labs(title = "iShares US Aerospace & Defense ETF",
subtitle = paste0("Predictive Intervals based on TEST SET MAPE: ", mape_value_for_title,
"%<br>nnetsauce-Wrapped Booster Regressor (MTS) Model"),
x = "",
y = "",
color = "") +
scale_color_manual(values = c(".value" = COLOR_ACTUAL,
".prediction" = COLOR_PREDICTION),
labels = c(".value" = "Actual Value",
".prediction" = "Prediction")) +
scale_y_continuous(labels = scales::label_currency()) +
scale_x_date(labels = scales::label_date("%b %d"),
date_breaks = "2 days") +
theme_minimal(base_family = "Roboto Slab", base_size = 16) +
theme(plot.title = element_text(face = "bold", size = 16),
plot.subtitle = ggtext::element_markdown(face = "bold"),
plot.background = element_rect(fill = "azure", color = "azure"),
panel.background = element_rect(fill = "snow", color = "snow"),
axis.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 45,
hjust = 1,
vjust = 1),
legend.position = "none")
# Interactive Output
plotly::ggplotly(p)
NOTE: This article was generated using an AI model. The final content and structure were reviewed and approved by the author.