Introducing distionary for building and exploring distributions | R-bloggers



After going through rOpenSci peer review, the distionary package is now newly available on CRAN. It allows you to quickly create probability distributions – either from data or from the built-in library – and then examine them in detail.

These distributions form the building blocks for advanced statistical models in the broader probaverse ecosystem, built to free modelers from low-level coding so production pipelines remain human-friendly. At this point, the other probaverse packages are distplyr, allowing you to transform distributions into new forms, and famish, allowing you to fit distributions to data. Developed with risk analysis use cases such as climate and insurance in mind, the same tools translate smoothly to simulation, education, and other applied settings.

This post highlights the top 3 features of this latest version of distionary. Let’s start by loading the package.

library(distionary)

Feature 1: More than just Base R distributions

Of course, all Base R distributions are available in distionary. Here is everyone’s favorite normal distribution.

dst_norm(0, 1)

Normal distribution (continuous)
--Parameters--
mean sd
0 1

plot(dst_norm(0, 1))

And good old Poisson.

dst_pois(3)

Poisson distribution (discrete)
--Parameters--
lambda
3

plot(dst_pois(3))
Poisson distribution probability mass function.

But additional game-changing distributions are also included.

A Null distribution, which always evaluates to NA. When an algorithm you are running encounters a problem, you can return a Null distribution instead of throwing an error. Even downstream evaluation steps won’t fail, because the code still sees a distribution instead of a bare NA or NULL.

# Make a Null distribution.
null <- dst_null()
# Null distributions always evaluate to NA.
eval_quantile(null, at = c(0.25, 0.5, 0.75))

[1] NA NA NA

mean(null)

[1] NA

Empirical distributions, where the data are the distribution. These respect the observed behavior without enforcing a specific form, and are often used as a benchmark for comparison with other models. Here is an example using the ozone concentrations from the airquality dataset that ships with R.

# Empirical distribution of Ozone from the `airquality` dataset.
emp <- dst_empirical(airquality$Ozone, na_action = "drop")
# Inspect
print(emp, n = 5)

Finite distribution (discrete)
--Parameters--
# A tibble: 67 × 2
outcomes probs
<dbl> <dbl>
1 1 0.00862
2 4 0.00862
3 6 0.00862
4 7 0.0259
5 8 0.00862
# ℹ 62 more rows

Compare its cumulative distribution function (CDF) to that of a Gamma distribution fitted to the ozone levels, borrowing the famish package from the probaverse for the task.

# Fit a Gamma distribution to Ozone using the famish package.
library(famish)
gamma <- fit_dst_gamma(airquality$Ozone, na_action = "drop")

# Plot the cumulative distribution functions (CDFs) together.
plot(emp, "cdf", n = 1000, xlab = "Ozone Levels (ppb)")
plot(gamma, "cdf", add = TRUE, col = "red")
legend(
 "bottomright",
 legend = c("Empirical", "Fitted Gamma"),
 col = c("black", "red"),
 lty = 1
)
Comparison of empirical CDF and fitted Gamma CDF for ozone levels.

These textbook distributions become much more useful once they serve as building blocks in a larger system. For example, they can form predictive distributions in a machine learning context, or be linked to other variables. This is what the probaverse aims to make possible.

Feature 2: Friendly to tidy tabular workflows

Load the tidyverse first to activate tidy tabular workflows. And yes, the probaverse is named after the tidyverse because it aims to be a “tidyverse for probability.”

library(tidyverse)

You can safely ignore this next part unless you want to watch me wrangle some financial data.

# Wrangle the stocks data frame using tidyverse.
stocks <- as_tibble(EuStockMarkets) |>
 mutate(across(everything(), \(x) 100 * (1 - x / lag(x)))) |>
 drop_na()

The stocks data I have wrangled is a table of daily percent losses for four major European stock indices. The dates do not matter for this example, so they have been omitted.
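As a sanity check on that across() formula, here is the same computation on a toy price series in base R (dplyr’s lag(x) is just c(NA, head(x, -1)); the numbers are illustrative only):

```r
# Toy closing prices: a 2% drop, then a 2% gain back.
prices <- c(100, 98, 99.96)

# Same formula as the mutate() call above: percent loss relative to
# the previous day's price; the first element is NA (no previous day).
daily_loss_pct <- 100 * (1 - prices / c(NA, head(prices, -1)))

daily_loss_pct  # NA, then +2 (a 2% loss), then -2 (a 2% gain)
```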

stocks

# A tibble: 1,859 × 4
DAX SMI CAC FTSE
<dbl> <dbl> <dbl> <dbl>
1 0.928 -0.620 1.26 -0.679
2 0.441 0.586 1.86 0.488
3 -0.904 -0.328 0.576 -0.907
4 0.178 -0.148 -0.878 -0.579
5 0.467 0.889 0.511 0.720
6 -1.25 -0.676 -1.18 -0.855
7 -0.578 -1.23 -1.32 -0.824
8 0.287 0.358 0.193 -0.0837
9 -0.637 -1.11 -0.0171 0.522
10 -0.118 -0.437 -0.314 -1.41
# ℹ 1,849 more rows

Let’s focus on the DAX stock index first. Fit an empirical distribution as before (note that I’m using a data mask in dst_empirical() this time).

# Fit an empirical distribution to the DAX stock index.
dax <- dst_empirical(DAX, data = stocks)
# Inspect the CDF.
plot(dax, xlab = "Daily Loss (%)")
Empirical CDF of daily losses on the DAX stock index.

You can easily calculate some standard quantiles in tabular form, so that the input sits next to the calculated output: just use the enframe_ prefix instead of eval_, as we did above with the Null distribution.

enframe_quantile(dax, at = c(0.25, 0.5, 0.75), arg_name = "prob")

# A tibble: 3 × 2
prob quantile
<dbl> <dbl>
1 0.25 -0.638
2 0.5 -0.0473
3 0.75 0.468

Or, more to the point here – and playing to the probaverse’s penchant for risk-based work – you can calculate return levels (also called ‘Value at Risk’ in financial applications) for specific return periods. If you don’t know what these are, they’re just fancy names for quantiles.
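Indeed, the correspondence is direct: the return level for a return period T is the quantile at probability 1 - 1/T, i.e. the level exceeded on average once every T periods. A quick base-R sketch on simulated losses (illustrative only, not distionary code):

```r
# Simulated daily losses (%); seed fixed for reproducibility.
set.seed(1)
losses <- rnorm(10000)

# A T-period return level is the (1 - 1/T) quantile of the losses.
return_level <- function(x, T) quantile(x, probs = 1 - 1 / T, names = FALSE)

# The 100-day return level and the 0.99 quantile are the same number.
return_level(losses, 100) == quantile(losses, 0.99, names = FALSE)
#> [1] TRUE
```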

return_periods <- c(5, 50, 100, 200, 500)
enframe_return(
 dax,
 at = return_periods,
 arg_name = "return_period",
 fn_prefix = "daily_loss_pct"
)

# A tibble: 5 × 2
return_period daily_loss_pct
<dbl> <dbl>
1 5 0.621
2 50 2.17
3 100 2.75
4 200 3.08
5 500 3.71

The tabular output becomes even more powerful when embedded in a table of models, as it facilitates comparing models and spotting trends. To demonstrate, build a model for each stock. First, lengthen the data for this task.

# Lengthen the data using tidyverse.
stocks2 <- pivot_longer(
 stocks,
 everything(),
 names_to = "stock",
 values_to = "daily_loss_pct"
)
# Inspect
stocks2

# A tibble: 7,436 × 2
stock daily_loss_pct
<chr> <dbl>
1 DAX 0.928
2 SMI -0.620
3 CAC 1.26
4 FTSE -0.679
5 DAX 0.441
6 SMI 0.586
7 CAC 1.86
8 FTSE 0.488
9 DAX -0.904
10 SMI -0.328
# ℹ 7,426 more rows

Build a model for each stock using a group_by() + summarise() workflow from the tidyverse (please excuse the current need to wrap the distribution in list()). Note that distributions become table entries, here identified by their class.

# Create an Empirical distribution for each stock.
models <- stocks2 |>
 group_by(stock) |>
 summarise(model = list(dst_empirical(daily_loss_pct)))
# Inspect
models

# A tibble: 4 × 2
stock model
<chr> <list>
1 CAC <dist>
2 DAX <dist>
3 FTSE <dist>
4 SMI <dist>

Now you can use a tidy workflow to compute and unnest tables of return levels for each model. In fact, this workflow is so common that I’m considering adding a dedicated verb for it.

return_levels <- models |>
 mutate(
 df = map(
 model,
 enframe_return,
 at = return_periods,
 arg_name = "return_period",
 fn_prefix = "daily_loss_pct"
 )
 ) |>
 unnest(df) |>
 select(!model)
# Inspect
print(return_levels, n = Inf)

# A tibble: 20 × 3
stock return_period daily_loss_pct
<chr> <dbl> <dbl>
1 CAC 5 0.757
2 CAC 50 2.37
3 CAC 100 2.78
4 CAC 200 3.41
5 CAC 500 3.97
6 DAX 5 0.621
7 DAX 50 2.17
8 DAX 100 2.75
9 DAX 200 3.08
10 DAX 500 3.71
11 FTSE 5 0.542
12 FTSE 50 1.58
13 FTSE 100 2.05
14 FTSE 200 2.31
15 FTSE 500 2.87
16 SMI 5 0.552
17 SMI 50 2.03
18 SMI 100 2.52
19 SMI 200 2.91
20 SMI 500 3.55

The result is a tidy data set that is ready for most analyses. For example, you can easily plot a comparison of the return levels of each stock. I create these plots all the time to support risk-informed decision making.

return_levels |>
 mutate(stock = fct_reorder2(stock, return_period, daily_loss_pct)) |>
 ggplot(aes(return_period, daily_loss_pct, colour = stock)) +
 geom_point() +
 geom_line() +
 theme_bw() +
 scale_x_log10(
 "Return Period (days)",
 minor_breaks = c(1:10, 1:10 * 10, 1:10 * 100)
 ) +
 scale_y_continuous("Daily Loss", labels = scales::label_number(suffix = "%")) +
 annotation_logticks(sides = "b") +
 scale_colour_discrete("Stock Index")
Return level chart for daily loss rates of stock indices.

Feature 3: Create the distribution you need

You can create your own distributions with distionary by specifying only a minimal set of properties; all other properties are derived automatically and can be retrieved when needed.
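To give a feel for how such derivations can work (this is a sketch of the general idea, not distionary’s actual internals), moments can be recovered from a bare density using base R’s integrate():

```r
# A density with known moments: Exponential(rate = 2),
# whose true mean is 0.5 and true variance is 0.25.
dens <- function(x) dexp(x, rate = 2)

# Mean as the integral of x * f(x); variance via E[X^2] - E[X]^2.
m1 <- integrate(function(x) x * dens(x), lower = 0, upper = Inf)$value
m2 <- integrate(function(x) x^2 * dens(x), lower = 0, upper = Inf)$value

c(mean = m1, variance = m2 - m1^2)  # ~0.5 and ~0.25
```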

Suppose you need an Inverse Gamma distribution, but it is not built into distionary. Currently, distionary assumes you have at least the density and CDF; you can pick these up from the extraDistr package (functions dinvgamma() and pinvgamma()). Plug them into distionary’s distribution() function and access a variety of properties you didn’t specify, such as the mean, variance, skewness, and hazard function.

# Make an Inverse Gamma distribution (minimal example).
ig <- distribution(
 density = function(x) extraDistr::dinvgamma(x, alpha = 5, beta = 20),
 cdf = function(x) extraDistr::pinvgamma(x, alpha = 5, beta = 20),
 .vtype = "continuous"
)
# Calculate anything.
mean(ig)

[1] 5

variance(ig)

[1] 8.333333

skewness(ig)

[1] 3.464085

plot(ig, "hazard", to = 20, n = 1000, xlab = "Outcome")
Hazard function of an inverse gamma distribution.

You could also consider giving the distribution a .name – it pays off when you juggle multiple models. Adding .parameters provides additional specificity alongside the .name, but the parameters are not yet used for functional purposes.

Here is a more complete implementation of the Inverse Gamma distribution, this time as a function of its two parameters. Note that I also check that the parameters are positive (cheers to the checkmate package).

dst_invgamma <- function(alpha, beta) {
 checkmate::assert_number(alpha, lower = 0)
 checkmate::assert_number(beta, lower = 0)
 distribution(
 density = \(x) extraDistr::dinvgamma(x, alpha = alpha, beta = beta),
 cdf = \(x) extraDistr::pinvgamma(x, alpha = alpha, beta = beta),
 quantile = \(p) extraDistr::qinvgamma(p, alpha = alpha, beta = beta),
 random = \(n) extraDistr::rinvgamma(n, alpha = alpha, beta = beta),
 .name = "Inverse Gamma",
 .parameters = list(alpha = alpha, beta = beta),
 .vtype = "continuous"
 )
}

Now we can create the same Inverse Gamma distribution as before:

ig2 <- dst_invgamma(5, 20)
# Inspect
ig2

Inverse Gamma distribution (continuous)
--Parameters--
alpha beta
5 20

By the way, this feature – the ability to inspect distribution properties even when they were never specified – is great for learning about probability. That’s because you see the many ways distributions can be represented, beyond the usual density or probability mass functions found in textbooks.

This feature also enables extensibility of the probaverse. For example, the probaverse’s distplyr package creates mixture distributions, which have no explicit formula for their quantile function. That isn’t a problem: the distribution can still be defined, and distionary will work out the quantiles.
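The idea behind that is plain numerical inversion of the CDF. As a hedged base-R sketch (not distplyr’s or distionary’s actual code): an equal-weight mixture of N(0, 1) and N(4, 1) has no closed-form quantile function, yet uniroot() recovers quantiles from the CDF alone:

```r
# CDF of an equal-weight mixture of N(0, 1) and N(4, 1).
p_mix <- function(x) 0.5 * pnorm(x) + 0.5 * pnorm(x, mean = 4)

# Quantile function by numerically inverting the CDF.
q_mix <- function(p) {
  uniroot(function(x) p_mix(x) - p, interval = c(-10, 14), tol = 1e-8)$value
}

q_mix(0.5)  # by symmetry about 2, the median is exactly 2
```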

What’s to come?

Currently, the distionary package provides core functionality for defining and evaluating distribution objects. Future goals include:

  • Wider coverage. Going beyond univariate continuous distributions to make mixed discrete/continuous and multivariate use cases first-class.
  • Risk-oriented statistics. Making cost functions, tail expectations, and other decision metrics easier to evaluate.
  • Workflow ergonomics. Relaxing the requirement to provide density/CDF pairs, and adding verbs that streamline the common mutate-map-unnest patterns.

If this resonates with you, join the conversation by opening an issue or contributing.

Special thanks to the rOpenSci reviewers Katrina Brock and Christophe Dutang for insightful comments that have improved this package. Also thanks to BGC Engineering Inc., the R Consortium, and the European Space Agency together with the Polytechnic University of Milan for supporting this project.

