Double descent explained

This is a post about the famous double descent phenomenon in machine learning. I try to construct as simple an example of double descent as possible and explain exactly how it arises and why it does not contradict the bias-variance trade-off.

I was inspired to write this post by an excellent video from Welch Labs whose title suggests that the books about AI have it wrong. Despite the provocative title, the video explains why double descent doesn’t actually mean the books are wrong. A blog post by Alex Shtoff contains detailed Python code for an example similar to the one in the video.

I couldn’t find an equally good post with R code, and there seems to be a lot of misinformation about double descent online, so I thought I’d write my own post. Instead of following Shtoff’s example, I also decided to create something simpler. How simple can a model be and still exhibit the phenomenon of double descent? To answer that, I first need to explain what double descent is.

One of the most important ideas in machine learning is the bias-variance trade-off (they ask about it in every interview). This is the idea that, as a model becomes more and more complex, it will fit the training data better and better (the yellow line in the image below indicates training error), but at some point it will overfit and perform worse on unseen test data (the red line in the image indicates testing error). Since you typically want the model to perform well on unseen test data, this is a problem. It means that you have to be conservative and choose a model that is not too complex.

Conceptually, you should think of a U-shaped curve. You want to select the model at the bottom of the U.
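Schematically, you could draw these two curves with a few lines of R (the functional forms below are arbitrary; only the shapes matter):

# Illustrative only: arbitrary curves with the qualitative shapes described above
complexity <- seq(0, 1, len=100)
train_err <- exp(-4*complexity)                 # keeps falling as the model grows
test_err  <- exp(-4*complexity) + complexity^2  # falls, then rises again: the U-shape
plot(complexity, test_err, type="l", col="red", lwd=2,
     xlab="model complexity", ylab="error", xaxt="n", yaxt="n", las=1)
lines(complexity, train_err, col="orange", lwd=2)
legend("top", legend=c("test error", "training error"),
       col=c("red", "orange"), lwd=2, bty="n")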

As neural networks became popular, people started to notice that there is more to it than this. As you add more and more parameters to your neural network, you will see the U-shaped curve. But beyond a certain point, the test error starts to decrease again.

This is called double descent. As your network gets bigger and bigger, it will stop overfitting. If you make the network large enough, it can actually have a lower test error than the model at the bottom of the U. That’s one reason why AI companies are hoarding GPUs so they can train models with billions of parameters: these models are often simply better than smaller models.

But why? What’s going on?

Conceptually, you can think of the training data in a machine learning problem as an $n \times k$ matrix and an $n$-vector. The rows of the $n \times k$ matrix correspond to data points. The columns correspond to the values of each feature. The $n$-vector $y$ contains the desired output for each data point.

When $n > k$ you are in the traditional situation found in statistics. You have more data points than features. In this case, there is probably no single model that fits your data exactly. You have to compromise by doing some kind of averaging process, which is roughly what linear regression does.

When $k > n$, you have more features than data points. In this case, there will likely be many models that fit your data exactly. Most machine learning algorithms will take some sort of average of these models, or select one of them as the best model.

When $n$ and $k$ are equal, you have exactly as many features as data points. In effect, you have $n$ equations in $n$ unknowns. There will usually be exactly one solution, giving you only one model to choose from.
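To make these three regimes concrete, here is a minimal sketch using ginv from the MASS package, the same pseudoinverse used to fit the model later on (the matrices below are arbitrary random examples):

library(MASS)
set.seed(1)

# n > k: overdetermined, usually no exact fit; ginv gives the least-squares compromise
X_tall <- matrix(rnorm(6*2), 6, 2); y_tall <- rnorm(6)
ginv(X_tall) %*% y_tall

# n = k: a square system with (generically) exactly one solution
X_square <- matrix(rnorm(3*3), 3, 3); y_square <- rnorm(3)
solve(X_square, y_square)

# k > n: underdetermined, many exact fits; ginv picks the minimum-norm one
X_wide <- matrix(rnorm(2*5), 2, 5); y_wide <- rnorm(2)
ginv(X_wide) %*% y_wide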

To put it another way, a machine learning problem with $n$ data points and $k$ features has roughly

\[\left( \begin{matrix} \max(n, k) \\ \min(n, k) \end{matrix} \right)\]

degrees of freedom, because this is the number of full-rank square submatrices of maximal size in a generic $n \times k$ matrix. These degrees of freedom can be ‘averaged’ or exploited in some way to reduce the variance of the fitted model. Many machine learning algorithms work by adding features to an existing data set (this can also be done by hand, in which case it is called feature engineering). As the number of added features increases, there are more opportunities for averaging. These opportunities can sometimes be taken advantage of to produce a better performing model.

Schematically, what you expect to see looks something like this:

# schematic only: a flipped, log-scale binomial coefficient gives the right shape for n = 10
plot.ts(-lchoose(pmax(10, 0:50), pmin(10, 0:50)), xlab="model size", ylab="test error", 
        xaxt="n", yaxt="n", las=1)

This suggests that an easy way to make double descent appear is to create an algorithm that adds randomly generated features to a dataset, which we will do now.

We generate a univariate data set with a cubic trend, supported on $[-1, 1]$.

make_data <- function(n, sigma){
  x <- 2*(runif(n) - 0.5)                       # x uniform on [-1, 1]
  y <- x * (x-0.6) * (x+0.6) + sigma * rnorm(n) # cubic signal plus Gaussian noise
  data.frame(x=x, y=y)
}
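For example, a quick plot of a generated sample against the noiseless cubic it came from (this snippet is just an illustration and is not used later):

dat_demo <- make_data(200, 0.1)
plot(dat_demo, main="Example data from make_data()", las=1)
curve(x*(x-0.6)*(x+0.6), from=-1, to=1, add=TRUE, col="red", lwd=2)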

Our model will be very simple. We generate step functions of the form

\[f_b(x) = \begin{cases} 1 & x > b\\
0 & x \le b \end{cases}\]

and add them to the data as extra features. We use these features to predict $y$ for new values of $x$. The following function adds the features to the dataset, where the values of $b$ are stored in a vector called cutoffs.

model_matrix <- function(dat, cutoffs, s_noise=0){
  n <- nrow(dat)
  N <- length(cutoffs)
  # indicator features 1{x > b} for each cutoff b, plus a little jitter
  extra_features <- outer(dat$x, cutoffs, ">") + s_noise*rnorm(N*n)
  cbind(1, dat$x, extra_features)  # intercept, x, and the step features
}

To avoid adding two identical features, we add a small amount of noise controlled by the parameter s_noise. To include an intercept term, we also add a column of ones, which is equivalent to allowing horizontally mirrored versions of the features.
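As a quick illustration, here is the model matrix for three x values and two cutoffs, with the jitter switched off (a toy example, not used later):

# columns: intercept, x, 1{x > -0.2}, 1{x > 0.5}
toy <- data.frame(x=c(-0.5, 0.1, 0.8), y=c(0, 0, 0))
model_matrix(toy, cutoffs=c(-0.2, 0.5))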

To fit the model, we randomly add $K$ of these features. We use the Moore-Penrose pseudoinverse (the ginv function from the MASS package) of the model matrix to obtain a vector of regression coefficients bb. This corresponds to ordinary linear regression when $n > k$. When $n < k$, it corresponds to selecting the linear model with the smallest coefficient norm among those that fit the data exactly.

library(MASS)

fit_model <- function(dat, K, s_noise=0.01){
  # K is the number of random features to add
  # dat is a univariate data set with columns x, y
  a <- min(dat$x)
  b <- max(dat$x)
  
  cutoffs <- runif(K)*(b-a) + a # random cutoffs

  bb <- ginv(model_matrix(dat, cutoffs, s_noise)) %*% dat$y
  list(bb=bb, cutoffs=cutoffs)
}

predict_model <- function(model, dat){
  # predict from a fitted model for a data set with column x
  model_matrix(dat, model$cutoffs, 0) %*% model$bb
}

Here’s an example showing that this model really can fit the simulated data well.

set.seed(2510)
dat <- make_data(50, 0.1)
model <- fit_model(dat, 20)
plot(dat, main="Training data and fitted model", las=1)
dat$pred <- predict_model(model, dat)
lines(dat$x[order(dat$x)], dat$pred[order(dat$x)], "l", col="red",
      lwd=2)

This shows the fit to the training data, but what we’re really interested in is the fit to unseen test data. We can use the following function to generate many test datasets and calculate the MSE of an already-fitted model.

calc_test_error <- function(model, nrow_test, sigma_test, n_reps){
  
  mse <- rep(0, n_reps)
  for (i in 1:n_reps){
    dat_test <- make_data(nrow_test, sigma_test)
    pred <- predict_model(model, dat_test)
    mse[i] <- mean((pred - dat_test$y)^2)
  }
  mean(mse)
}
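For example, the test error of the model fitted above (just an illustrative call, not part of the main experiment):

calc_test_error(model, nrow_test=100, sigma_test=0.1, n_reps=1000)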

We also want a function to get the variance of each model. Here we generate many sets of training data from the same distribution and calculate the variance of the model’s predictions at each point. A higher variance means the model captures more noise (and less signal) in the data.

calc_variance <- function(n_pts, n_model, sigma, n_reps){
  
  x_test <- seq(-1, 1, len=n_pts)
  fitted <- matrix(0, nr=n_reps, nc=length(x_test))
  
  for (i in 1:n_reps){
    dat <- make_data(n_pts, sigma)
    model <- fit_model(dat, n_model)
    fitted[i,] <- predict_model(model, data.frame(x=x_test)) 
  }
  mean(apply(fitted, 2, var))
}

Now everything is in place. The following code generates a dataset with $10$ points and fits models with different numbers of additional features (from $0$ to $50$). Because the additional features are randomly generated, each model fit is repeated several times to estimate the average error on unseen test data and the average variance.

set.seed(2510)
K <- 50
model_reps <- 10
dat <- make_data(10, 0.1)
err <- vars <- matrix(0, nr=K+1, nc=model_reps)
for (i in 0:K){
  print(i)
  for (j in 1:model_reps){
    model <- fit_model(dat, i)
    err[i+1, j] <- calc_test_error(model, 100, 0.1, 1000)
    vars[i+1, j] <- calc_variance(10, i, 0.1, 1000)
  }
}

It takes a few minutes to run the code. Here is the final plot showing the classic behavior of double descent.

plot_col <- rep("blue", K+1)
plot_col[1:9] <- "forestgreen" # used later
df <- data.frame(vars=rowMeans(log(vars)), 
                 err=rowMeans(log(err)), 
                 plot_col=plot_col)

plot(0:K, df$err, "l", xlab="Number of added features",
     ylab="log(test error)", las=1, main="Double Descent",
     col="blue", lwd=2)
abline(v=8, lty=2, col="red")

The red dotted line shows that the maximum test error occurs with $K=8$ added features. Why $8$? This is because our model matrix has $K+2$ columns and we have $10$ data points. So if $K=8$ we get a square model matrix, which means that there is exactly one possible model that can output the $y$ training values, given the $x$ training values. This is the point where the maximum amount of overfitting occurs.
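As a quick sanity check, the model matrix is indeed square at $K=8$: it has $K + 2 = 10$ columns, and the training set has $10$ rows.

# 10 rows (data points) by 10 columns (intercept, x, and 8 step features)
dim(model_matrix(dat, runif(8, min(dat$x), max(dat$x))))
# [1] 10 10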

Also note that the models with a very large number of additional features had the lowest test error; even lower than the models with only a few extra features. Just like real deep learning!

The following graph shows how the model’s variance changes with the number of additional features added. Note that this plot has the same shape as the double descent plot: the variance peaks at the interpolation threshold and then decreases again as we add more features beyond it.

plot(0:K, df$vars, "l", xlab="Number of added features",
     ylab="log(variance)", las=1, main="Model variance",
     lwd=2, col="darkgreen")
abline(v=8, lty=2, col="red")

This is the key to understanding why double descent does not contradict the traditional idea of the bias-variance trade-off. When we draw the U-shaped curve of the bias-variance trade-off, it is the variance of the model that belongs on the $x$ axis, not ‘model complexity’ or ‘model size’ or some other similar idea.

df$K <- 0:K
df <- df[order(df$vars), ]
plot(df$vars, df$err, col=df$plot_col,
     pch=19, xlab="log(variance)", ylab="log(test error)",
     las=1, cex=1.3)

loess_fit <- loess(df$err ~ df$vars, span = 0.6)
lines(df$vars, predict(loess_fit), col=rgb(0, 0, 0, 0.1), lwd=10)

When we plot the test error against the variance, we still get the well-known U-shaped curve. The green points are the models that are below the $K=8$ threshold. The blue points are the models that are above the threshold. Together they form a nice, familiar pattern.

This simple example illustrates some ideas about deep learning. A lot of research is currently being done into why deep neural networks are so effective in practice. One idea that has been put forward is the so-called lottery ticket hypothesis. It says that a deep neural network initialized with random weights essentially computes a large number of random features, just like the model above. By chance, some of these features are probably useful for predictions. The process of training the network tends to promote these useful features, leading to an effective model. The larger the network, the more random features are created and the more likely it is that some of them will be useful.

(But as usual, the devil is in the details. If we were to increase the dimension of our dataset in the example above, we would quickly discover that our random features are zero almost everywhere. The clever way in which deep neural networks are set up makes this unlikely for them, but that’s a topic for another day.)

Anyway, I hope this post has convinced you that double descent isn’t all that mysterious (it arises from just a few lines of R code) and that the bias-variance trade-off is alive and well!

