New R Package TheseusPlot: Visualizing the Decomposition of Differences in Rate Metrics | R-bloggers



[This article was first published on HOXO-M Blog, and kindly contributed to R-bloggers.]



1. Overview

In data analysis, when a metric differs between two groups, we sometimes want to investigate whether a particular subgroup drives that difference. For example, if an important metric has decreased compared to the previous year, you may want to perform a more detailed analysis. You could focus on gender among the attributes and investigate whether the deterioration occurred among males, females, or both. However, this type of analysis is challenging when the metric is a rate, because the size of each subgroup's contribution to a rate cannot be calculated as easily as it can for volume metrics.

To tackle this problem, we devised an approach inspired by the story of the Ship of Theseus. This approach gradually replaces the components of one group with those of the other, recalculating the metric at every step. The change in the metric at each step can then be interpreted as the contribution of that subgroup to the overall difference.

For example, suppose the metric was 6.2% in 2024 and fell to 5.2% in 2025. Again, we focus on gender. We replace the male data in the 2024 data set with the male data from 2025 and recalculate the metric. Suppose the metric then falls by 0.8 percentage points to 5.4%. In this case, the contribution of the male group to the change in the metric is -0.8 percentage points. We then replace the female data from 2024 with that from 2025. The data set now consists entirely of 2025 data, and the metric falls by a further 0.2 percentage points to 5.2%. The contribution of the female group is therefore -0.2 percentage points.
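The replacement procedure can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: the helpers `make_group()` and `decompose_rate()` are hypothetical, and the subgroup counts (500 records per gender per year) are made up so that the rates match the 6.2%, 5.4%, and 5.2% in the example.

```r
# Hypothetical helper: build n rows for one subgroup, x of which are "events".
make_group <- function(year, gender, n, x) {
  data.frame(year = year, gender = gender,
             y = rep(c(1, 0), times = c(x, n - x)),
             stringsAsFactors = FALSE)
}

# Made-up counts chosen to reproduce the rates in the text.
data_2024 <- rbind(make_group(2024, "male", 500, 40),
                   make_group(2024, "female", 500, 22))
data_2025 <- rbind(make_group(2025, "male", 500, 32),
                   make_group(2025, "female", 500, 20))

# Replace one subgroup at a time and record the change in the rate.
decompose_rate <- function(d1, d2, by, y = "y") {
  groups <- unique(c(d1[[by]], d2[[by]]))
  current <- d1
  prev_rate <- mean(current[[y]])
  contrib <- numeric(length(groups))
  for (i in seq_along(groups)) {
    g <- groups[i]
    # Swap the rows of subgroup g from the first data set for those of the second.
    current <- rbind(current[current[[by]] != g, ], d2[d2[[by]] == g, ])
    new_rate <- mean(current[[y]])
    contrib[i] <- new_rate - prev_rate
    prev_rate <- new_rate
  }
  setNames(contrib, groups)
}

contrib <- decompose_rate(data_2024, data_2025, by = "gender")
print(contrib)
```

Because each step starts from the data set left by the previous step, the contributions sum to the overall difference between the two groups. The order in which subgroups are replaced is a choice, and in general it can affect the individual values.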

When visualized, the results look like this:

From this plot we can see that the deterioration of the metric is mainly driven by the male group. We call this visualization the "Theseus plot".

The TheseusPlot package is designed to make it easy to generate Theseus plots for different attributes.

2. Installation

You can install the TheseusPlot package from CRAN.

install.packages("TheseusPlot")

You can install the development version from GitHub:

remotes::install_github("hoxo-m/TheseusPlot")

3. Details

3.1 Data preparation

To create Theseus plots, you need two data frames that share common columns.

We use the 2013 New York City flight data from the nycflights13 package as a demo data set. Here we define the rate metric as the share of flights that arrived on time. In December 2013, the on-time arrival rate fell considerably compared to November. We investigate the cause using a Theseus plot.

First we create an on_time column in the data frame to indicate whether each flight arrived on time. We then extract the flights for November and December into separate data frames to form the two comparison groups. The on-time arrival rate was 64% in November and fell to 47% in December.

library(dplyr)
library(nycflights13)

data <- flights |> 
  filter(!is.na(arr_delay)) |>
  mutate(on_time = arr_delay <= 0) |>  # Arrived on time
  left_join(airlines, by = "carrier") |>
  mutate(carrier = name) |>  # Convert carrier abbreviations to full names
  select(year, month, day, origin, dest, carrier, dep_delay, on_time)

data |> head()
#> # A tibble: 6 × 8
#>    year month   day origin dest  carrier                dep_delay on_time
#>   <int> <int> <int> <chr>  <chr> <chr>                      <dbl> <lgl>  
#> 1  2013     1     1 EWR    IAH   United Air Lines Inc.          2 FALSE  
#> 2  2013     1     1 LGA    IAH   United Air Lines Inc.          4 FALSE  
#> 3  2013     1     1 JFK    MIA   American Airlines Inc.         2 FALSE  
#> 4  2013     1     1 JFK    BQN   JetBlue Airways               -1 TRUE   
#> 5  2013     1     1 LGA    ATL   Delta Air Lines Inc.          -6 TRUE   
#> 6  2013     1     1 EWR    ORD   United Air Lines Inc.         -4 FALSE

data_Nov <- data |> filter(month == 11)
data_Dec <- data |> filter(month == 12)

data_Nov |> summarise(on_time_rate = mean(on_time)) |> pull(on_time_rate)
#> [1] 0.6426161
data_Dec |> summarise(on_time_rate = mean(on_time)) |> pull(on_time_rate)
#> [1] 0.4672835

3.2 Basics

Using the two prepared data frames, we first create a ship object. The ship object is an instance of the R6 class ShipOfTheseus, which is designed to create Theseus plots.

library(TheseusPlot)

ship <- create_ship(data_Nov, data_Dec, y = on_time, labels = c("November", "December"))

You can create a Theseus plot by passing a column name to the plot method of the ship object. For example, to create a Theseus plot for the origin airport:

ship$plot(origin)

New York City has three major airports, and Newark Liberty International Airport (EWR) accounted for the largest share of the decline in the on-time arrival rate.

Note that the number of flights at each airport matters, because a larger flight volume is expected to have a greater impact. To make this clear, the Theseus plot shows the data size for each group within each subgroup as a bar chart. Here the number of flights is comparable across the airports, which makes a direct comparison of contributions possible.

In summary, a Theseus plot consists of two components:

  • A waterfall plot that shows how much each subgroup contributed to the change in the metric.
  • A bar chart that shows the sample size for each group within each subgroup.

The ship object also provides a table method for inspecting the exact values used in the Theseus plot.

ship$table(origin)
#> # A tibble: 3 × 8
#>   origin contrib    n1    n2    x1    x2 rate1 rate2
#>   <chr>    <dbl> <int> <int> <int> <int> <dbl> <dbl>
#> 1 EWR    -0.0831  9603  9410  6251  3901 0.651 0.415
#> 2 JFK    -0.0565  8645  8923  5702  4332 0.660 0.485
#> 3 LGA    -0.0358  8723  8687  5379  4393 0.617 0.506
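As a quick sanity check, the printed values can be plugged into base R to see how the columns relate. This sketch assumes, from the column names alone, that `n1`/`n2` are the November and December sample sizes and `x1`/`x2` the corresponding on-time counts. The numbers are copied from the output above, so agreement is only up to the rounding of the printed values.

```r
# Values copied from the ship$table(origin) output shown above.
tab <- data.frame(
  origin  = c("EWR", "JFK", "LGA"),
  contrib = c(-0.0831, -0.0565, -0.0358),
  n1 = c(9603, 8645, 8723), n2 = c(9410, 8923, 8687),
  x1 = c(6251, 5702, 5379), x2 = c(3901, 4332, 4393)
)

# Per-airport rates: on-time count divided by sample size (assumed meaning).
rate1 <- round(tab$x1 / tab$n1, 3)
rate2 <- round(tab$x2 / tab$n2, 3)

# Overall rates for each month, pooled over the three airports.
overall_nov <- sum(tab$x1) / sum(tab$n1)
overall_dec <- sum(tab$x2) / sum(tab$n2)
overall_change <- overall_dec - overall_nov

# Gap between the summed (rounded) contributions and the overall change.
gap <- sum(tab$contrib) - overall_change

print(data.frame(origin = tab$origin, rate1, rate2))
print(c(overall_change = overall_change, summed_contrib = sum(tab$contrib)))
```

The pooled rates reproduce the November and December on-time rates computed earlier, and the contributions add up to approximately the overall change between the two months.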

3.3 Flipping the plot

When there are many subgroups, a Theseus plot can be difficult to read. In such cases you can swap the x and y axes for better readability.

ship$plot_flip(carrier)

When the number of subgroups is large, those with small contributions are automatically grouped together. By default this happens when there are more than 10 subgroups, but the threshold can be adjusted with the n argument.

ship$plot_flip(carrier, n = 5)

From this plot, JetBlue Airways and United Air Lines appear to have made the largest contributions to the decline in the on-time arrival rate.
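The text does not spell out the exact grouping rule, so the following is only a hypothetical sketch of the idea in base R: keep the `n` subgroups with the largest absolute contributions and sum the rest into a single "Other" entry. The function `lump_small()` and the contribution values are made up for illustration and are not part of TheseusPlot.

```r
# Hypothetical helper: keep the n largest contributions (by absolute value)
# and collapse the remaining ones into one "Other" entry.
lump_small <- function(contrib, n = 10, other_label = "Other") {
  if (length(contrib) <= n) return(contrib)
  keep <- order(abs(contrib), decreasing = TRUE)[seq_len(n)]
  keep <- sort(keep)  # preserve the original ordering of the kept subgroups
  c(contrib[keep], setNames(sum(contrib[-keep]), other_label))
}

# Made-up contributions for six subgroups.
contrib <- c(A = -0.050, B = 0.002, C = -0.030, D = -0.001, E = 0.004, F = -0.020)
lumped <- lump_small(contrib, n = 3)
print(lumped)
```

Since the small contributions are summed rather than dropped, the total change is the same before and after grouping.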

3.4 Automatic discretization of continuous values

Theseus plots do not directly support continuous variables. If a continuous column is provided, it is automatically discretized. For example, we can create a Theseus plot for departure delays.

ship$plot_flip(dep_delay)

By default, continuous variables are discretized so that each subgroup has approximately the same sample size, with the number of bins set to 10. You can change these settings by passing the return value of continuous_config() to the continuous argument.

ship$plot_flip(dep_delay, continuous = continuous_config(n = 5))

This result shows that both a decrease in on-time departures and an increase in delayed departures contributed to the decline in the on-time arrival rate.
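Equal-frequency binning of this kind can be sketched in base R with `quantile()` and `cut()`. This illustrates the general technique only; it is not necessarily how TheseusPlot discretizes columns internally, and the helper `equal_freq_bins()` and the simulated delays are hypothetical.

```r
# Hypothetical helper: split x into (up to) n bins of roughly equal size.
equal_freq_bins <- function(x, n = 10) {
  probs <- seq(0, 1, length.out = n + 1)
  # unique() guards against duplicated break points when x has many ties,
  # in which case fewer than n bins are returned.
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# Simulated delays in minutes (made-up data, not the nycflights13 values).
set.seed(42)
delay <- rnorm(1000, mean = 10, sd = 30)

bins <- equal_freq_bins(delay, n = 5)
counts <- table(bins)
print(counts)
```

Real delay data contains many tied values, so bins built this way may be less evenly filled than in the simulated case.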

