1   Guess the correlation

Get the packages

To use torch, you first need to install it. To get the CRAN version, do

install.packages("torch")

Does it work? Here’s a quick test:

library(torch)
torch_tensor(1)
torch_tensor
 1
[ CPUFloatType{1} ]

Now, while torch contains all the core functionality, there is a whole ecosystem built around it.

Notably, torchvision is essential to image-processing tasks. In this example, we don’t use it much – overtly, that is. It’s used more prominently behind the scenes. Let’s get it:

install.packages("torchvision")
library(torchvision)

Finally, there is an evolving package, named torchdatasets , that wraps datasets in a convenient format, rendering them immediately usable from torch. Let’s get this as well, as we’re going to use one of the datasets it provides.

remotes::install_github("mlverse/torchdatasets")
library(torchdatasets)

Get the dataset

“Guess the correlation” is a fun dataset that tasks one – a person, if they feel like, or a program, if we train it – to estimate the (linear) correlation between two variables displayed in a scatterplot.

torchdatasets will download, unpack, and preprocess it for us.

The training set is huge; it has 150000 observations. For instruction purposes, we don’t really need so much data – we’ll restrict ourselves to small subsets, for each of training, validation, and test sets.

train_indices <- 1:10000
val_indices <- 10001:15000
test_indices <- 15001:20000

Now, the following snippet does the following:

  • download and unpack the dataset,

  • do some custom preprocessing on the images (on top of what is already done by default) – more on that soon

  • take just the first 10000 observations and put them in a torch Dataset object named train_ds.

    add_channel_dim <- function(img) img$unsqueeze(1)
    crop_axes <- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)
    
    root <- file.path(tempdir(), "correlation")
    
    train_ds <- guess_the_correlation_dataset(
    # where to unpack
    root = root,
    # additional preprocessing 
    transform = function(img) crop_axes(img) %>% add_channel_dim(),
    # don't take all data, but just the indices we pass in
    indexes = train_indices,
    download = TRUE
    )
    

As we’re at it, let’s do the same for the validation and test sets. We don’t need to download again, as we’re building on the same underlying data. We just pick different observations.

valid_ds <- guess_the_correlation_dataset(
    root = root,
    transform = function(img) crop_axes(img) %>% add_channel_dim(),
    indexes = val_indices,
    download = FALSE
  )

test_ds <- guess_the_correlation_dataset(
    root = root,
    transform = function(img) crop_axes(img) %>% add_channel_dim(),
    indexes = test_indices,
    download = FALSE
  )

Let’s counter-check we got what we wanted. How many items are there in each set?

length(train_ds)
length(valid_ds)
length(test_ds)
[1] 10000
[1] 5000
[1] 5000

And how does a single observation look like? Here is the first one:

train_ds[1]
$x
torch_tensor
(1,.,.) = 
 Columns 1 to 9  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
  0.0000  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{1,130,130} ]

$y
torch_tensor
-0.45781
[ CPUFloatType{} ]

$id
[1] "arjskzyc"

It’s a list of three items, the last of which we’re not interested in for our purposes.

The second, a scalar tensor, is the correlation value, the thing we want the network to learn. The first, x, is the scatterplot: a tensor representing an image of dimensionality 130*130. But wait – what is that 1 in the shape output?

[ CPUFloatType{1,130,130} ]

This really is a three-dimensional tensor! The first dimension holds different channels – or the single channel, if the image has but one. In fact, the reason x came in this format is that we requested it, here:

add_channel_dim <- function(img) img$unsqueeze(1)

train_ds <- guess_the_correlation_dataset(
    # ...
    transform = function(img) crop_axes(img) %>% add_channel_dim(),
    # ...
  )

add_channel_dim() was passed in as a custom transformation, to be applied to every item of the dataset. It calls one of torch’s many tensor operations, unsqueeze(), that adds a singleton dimension at a requested position.

How about the second custom transformation?

crop_axes <- function(img) transform_crop(img, top = 0, left = 21, height = 131, width = 130)

Here, we crop the image, cutting off the axes and labels on the left and bottom. These image regions don’t contribute any distinctive information, and having the images be smaller saves memory.

Work with batches

Now, we’ve done so much work already, but you haven’t actually seen any of the scatterplots yet! The reason we’ve been waiting until now is that we want to show a bunch of them at a time, and for that, we need to know how to handle batches of data.

So let’s create a DataLoader object from the training set. We’ll soon use it to train the model, but right now, we’ll just plot the first batch.

A DataLoader needs to know where to get the data – namely, from the Dataset it gets passed –, as well as how many items should go in a batch. Optionally, it can return data in random order (shuffle = TRUE).

train_dl <- dataloader(train_ds, batch_size = 64, shuffle = TRUE)

Like a Dataset, we can query a DataLoader for its length. For the Dataset, this meant number of items; for a DataLoader , it means number of batches:

length(train_dl)
[1] 157

To access the first batch, we create an iterator from the DataLoader and ask it for the first batch. Even if it weren’t for plotting, you might do this just to check that the dimensions look ok:

batch <- dataloader_make_iter(train_dl) %>% dataloader_next()

dim(batch$x)
dim(batch$y)
[1]  64   1 130 130
[1] 64

And plot! Note how we first remove the channels dimension – as.raster() wouldn’t like it – and then, convert the tensor to R for further processing:

par(mfrow = c(8,8), mar = rep(0, 4))

images <- as.array(batch$x$squeeze(2))

images %>%
  purrr::array_tree(1) %>%
  purrr::map(as.raster) %>%
  purrr::iwalk(~{plot(.x)})

Want to try your skill at guessing these? Here is the corresponding ground truth:

batch$y %>% as.numeric() %>% round(digits = 2)
[1] -0.29  0.58 -0.57  0.56  0.10  0.09  0.21  0.45 -0.24  0.65  0.70  0.40  0.71  0.20  0.07  0.66  0.65 -0.56  0.73
[20] -0.40 -0.18 -0.42 -0.46 -0.45 -0.77  0.09 -0.19  0.40 -0.70 -0.04 -0.16 -0.13 -0.18  0.01  0.25  0.54  0.21  0.28
[39]  0.49  0.86 -0.70  0.51  0.47 -0.46  0.88  0.00  0.24  0.28  0.28 -0.04 -0.74  0.43  0.74  0.01 -0.21  0.66 -0.45
[58] -0.44  0.50 -0.69 -0.65 -0.66 -0.55 -0.53

Now, just as they got their own Dataset objects, test and validation data each need their own DataLoader.

valid_dl <- dataloader(valid_ds, batch_size = 64)
length(valid_dl)
[1] 79
test_dl <- dataloader(test_ds, batch_size = 64)
length(test_dl)
[1] 79

And we’re ready to create the model!

Create the model

Let’s first see what we’re trying to accomplish. Our input data are images; normally this means we’ll work with some kind of convolutional neural network (CNN). In torch, a neural network is a module: a container for more granular modules, which themselves may be built up of yet more fine-grained modules. While in theory, this kind of compositionality is unlimited, in our example there are just two levels: a top-level module representing the model, and submodules that, in other frameworks, would be called layers.

The overall model is created by a call to nn_module(). This instantiates an nn_Module, an R6 class that knows how to act as a neural network. This object can have any number of methods, but two are essential:

  • initialize(), the place to instantiate any submodules; and
  • forward(), the place to define what should happen when this module is called.

In initialize() , we instantiate five submodules – two convolutional layers and two linear ones:

# zooming in on just initialize() - don't run standalone

net <- nn_module(
  
  # ...
  
  initialize = function() {
    
    self$conv1 <- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 <- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 <- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$fc1 <- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 <- nn_linear(in_features = 128, out_features = 1)
    
  },
  
 # ...
)

The convolutional (often abbreviated conv) layers apply a filter (or: kernel) of size 3 x 3. This filter slides over the image and computes local aggregates. In fact, there is not just a single filter, there are:

  • 32 of them in the first conv layer,

  • 64 in the second, and

  • 128 in the third.

The filters are trained to pick up informative spatial features, features that will be able to tell us something about the image.

In addition to the three conv layers, we have two linear ones. These are the prototypical neural network layers that get input from all units in the previous layer, combine individual contributions as they see fit, and send on their own individual results to all units in the next layer. The first linear layer will act on the features received from the last conv layer; it consists of 128 units. The second one is the output layer. It outputs a single numeric value, a value that represents the guess our network is making about the size of the correlation.

Now, while initialize() defines the layers, forward() specifies the order in which to call them – and what to do “in between”:

# zooming in on just forward() - don't run standalone

net <- nn_module(
  
 # ...
  
  forward = function(x) {
    
    x %>% 
      self$conv1() %>%
      nnf_relu() %>%
      nnf_avg_pool2d(2) %>%
      
      self$conv2() %>%
      nnf_relu() %>%
      nnf_avg_pool2d(2) %>%
      
      self$conv3() %>%
      nnf_relu() %>%
      nnf_avg_pool2d(2) %>%
      
      torch_flatten(start_dim = 2) %>%
      self$fc1() %>%
      nnf_relu() %>%
      
      self$fc2()
  }
)

What are these things that happen “in between”? In fact, they are of different types.

Firstly, we have nnf_relu(), called three times: after each of the conv layers and after the first linear layer. This is a so-called activation function – a function that takes the raw results computed by a layer and performs some operation on them. In the case of nnf_relu() (ReLU - Rectified Linear Unit) what it does is leave positive values alone while setting negative ones to 0. You’ll encounter additional activation functions when you continue your torch journey, but ReLU is among the very-most-in-use ones today.

Secondly, we have nnf_avg_pool2d(2) , called after each conv layer. This function downsizes the image, replacing a 2 x 2 patch of pixels by its average. So while we’re going up in the number of channels (from 1 via 32 and 64 to 128), we decrease spatial resolution.

Thirdly, there is torch_flatten(). This one doesn’t compute anything - it just reshapes its inputs, going – in this case – from a four-dimensional structure outputted by the second conv layer to the two-dimensional one expected by the first linear layer.

Now, here is the complete model creation code:

torch_manual_seed(777)

net <- nn_module(
  
  "corr-cnn",
  
  initialize = function() {
    
    self$conv1 <- nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = 3)
    self$conv2 <- nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = 3)
    self$conv3 <- nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = 3)
    
    self$fc1 <- nn_linear(in_features = 14 * 14 * 128, out_features = 128)
    self$fc2 <- nn_linear(in_features = 128, out_features = 1)
    
  },
  
  forward = function(x) {
    
    x %>% 
      self$conv1() %>%
      nnf_relu() %>%
      nnf_avg_pool2d(2) %>%
      
      self$conv2() %>%
      nnf_relu() %>%
      nnf_avg_pool2d(2) %>%
      
      self$conv3() %>%
      nnf_relu() %>%
      nnf_avg_pool2d(2) %>%
      
      torch_flatten(start_dim = 2) %>%
      self$fc1() %>%
      nnf_relu() %>%
      
      self$fc2()
  }
)

model <- net()

Even before training, we can call the model on a batch of data – this immediately tells us if we got all shapes matching up:

model(batch$x)
torch_tensor
0.01 *
-2.8979
 -2.8873
 -2.8699
 -2.9787
 -2.8223
 -3.0255
 -3.1181
 -3.0603
 -3.0520
 -2.8242
 -3.0000
 -2.9150
 -2.9497
 -2.7662
 -2.7980
 -2.9540
 -2.8548
 -2.7927
 -3.0426
 -2.9540
 -2.8846
 -2.8008
 -2.8966
 -2.8358
 -2.9266
 -2.9022
 -2.8667
 -2.8716
 -2.7371
... [the output was truncated (use n=-1 to disable)]
[ CPUFloatType{64,1} ]

Now we get ready for training the model.

Train the network

In neural network libraries, optimizers take care of adapting the network weights. One of the most popular algorithms is Adam. When we create the object, we tell it what it’s supposed to work on – the model’s parameters:

optimizer <- optim_adam(model$parameters)

How does it know which way to change the parameters? We need to decide on a reasonable way to assess model performance. Here we’re trying to predict a numerical value – the size of the correlation. Thus, a measure like mean squared error is adequate. This line (in the upcoming code snippet) computes mean squared error between model predictions and target value:

loss <- nnf_mse_loss(output, b$y$unsqueeze(2))

Overall, what happens for every training batch is the following:

  1. We obtain the current model predictions:

    output <- model(b$x)
    
  2. We calculate the loss, a measure of divergence between model estimate and ground truth:

    loss <- nnf_mse_loss(output, b$y$unsqueeze(2))
    
  3. We have that loss propagate back through the network, causing gradients to be computed for all parameters:

    loss$backward()
    
  4. We ask the optimizer to update the parameters accordingly:

    optimizer$step()
    

Here are the four steps, conveniently packaged into a function. If you’re wondering about the very first line, this is just a formality (a required one, though!) related to how the optimizer does its accounting.

train_batch <- function(b) {
  
  optimizer$zero_grad()
  
  # get predictions
  output <- model(b$x)
  
  # calculate loss
  loss <- nnf_mse_loss(output, b$y$unsqueeze(2))

  # have gradients get calculated        
  loss$backward()
  
  # have gradients get applied
  optimizer$step()
  
  loss$item()
  
}

Whenever we’ll have passed through the training set once, we’ll want to double-check performance on the validation set.

valid_batch <- function(b) {

  output <- model(b$x)
  loss <- nnf_mse_loss(output, b$y$unsqueeze(2))
  loss$item()
  
}

While these functions will be called to iterate over batches, the overall workflow is an iteration over epochs.

Every time we start training, we put the model in training state:

model$train()

… while every time we assess its performance, we change the state to evaluation:

model$eval()

Let’s do ten epochs:

num_epochs <- 10

for (epoch in 1:num_epochs) {

  # don't forget to do this
  model$train()
  
  train_losses <- c()

  for (b in enumerate(train_dl)) {
    
    loss <- train_batch(b)
    train_losses <- c(train_losses, loss)
    
  }

  # don't forget to do this either
  model$eval()
  
  valid_losses <- c()

  for (b in enumerate(valid_dl)) {
    
    loss <- valid_batch(b)
    valid_losses <- c(valid_losses, loss)
    
  }

  cat(sprintf("\nLoss at epoch %d: training: %1.5f, validation: %1.5f\n", epoch, mean(train_losses), mean(valid_losses)))
  
}
Loss at epoch 1: training: 0.05769, validation: 0.01608

Loss at epoch 2: training: 0.01122, validation: 0.00785

Loss at epoch 3: training: 0.00587, validation: 0.00443

Loss at epoch 4: training: 0.00372, validation: 0.00317

Loss at epoch 5: training: 0.00250, validation: 0.00273

Loss at epoch 6: training: 0.00206, validation: 0.00248

Loss at epoch 7: training: 0.00174, validation: 0.00166

Loss at epoch 8: training: 0.00129, validation: 0.00144

Loss at epoch 9: training: 0.00105, validation: 0.00136

Loss at epoch 10: training: 0.00093, validation: 0.00123

As you can see, the network has made good progress – on both training and validation set. How about the test set? And how good of a fit are the inferred correlations?

Evaluate performance

To find out, we keep track of predictions and true correlations, so we can compare them later.

Again, before running the respective code we want to make sure we’re in evaluation mode.

model$eval()

test_batch <- function(b) {
  
  output <- model(b$x)
  loss <- nnf_mse_loss(output, b$y$unsqueeze(2))
  
  preds <<- c(preds, output %>% as.numeric())
  targets <<- c(targets, b$y %>% as.numeric())
  test_losses <<- c(test_losses, loss$item())
  
}

test_losses <- c()
preds <- c()
targets <- c()

for (b in enumerate(test_dl)) {
  test_batch(b)
}

mean(test_losses)
[1] 0.001346638

So no surprises from the loss. Now let’s inspect the predictions – and as we’re dealing with those all the time, why not create a scatterplot?

df <- data.frame(preds = preds, targets = targets)

library(ggplot2)

ggplot(df, aes(x = targets, y = preds)) +
  geom_point(size = 0.1) +
  theme_classic() +
  xlab("true correlations") +
  ylab("model predictions")

Want to guess the correlation …?

So that’s it - you’ve seen the complete workflow end-to-end, from data loading to model evaluation. The next tutorial asks a few what if? questions – e.g., what if I don’t want to predict a numerical output? – and offers some ideas for experimentation.