Datasets and data loaders

Central to data ingestion and preprocessing are datasets and data loaders.

torch comes equipped with a number of datasets, mostly related to image recognition and natural language processing (e.g., mnist_dataset()), that can be iterated over by means of dataloaders:

# ...
ds <- mnist_dataset(
  dir, 
  download = TRUE, 
  transform = function(x) {
    # convert to float, scale the pixel values down to [0, 1),
    # and add a channel dimension up front
    x <- x$to(dtype = torch_float())/256
    x[newaxis,..]
  }
)

dl <- dataloader(ds, batch_size = 32, shuffle = TRUE)

for (b in enumerate(dl)) {
  # ... training logic operating on batch b ...
}

See vignettes/examples/mnist-cnn.R for a complete example.

What if you want to train on a different dataset? In that case, you subclass dataset, an abstract container that needs to know how to iterate over the given data. To that end, your subclass needs to implement .getitem(), which says what should be returned for a single item; the data loader then assembles these items into batches.

In .getitem(), you can implement whatever preprocessing you require. Additionally, you should implement .length(), so users can find out how many items there are in the dataset.
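
In skeleton form, such a dataset might look as follows. (This is just a sketch; all names are invented for illustration, and the worked example below does the same thing for real data.)

# A minimal sketch: a dataset wrapping a matrix of predictors x and a
# vector of class labels y (both names invented for illustration)
toy_dataset <- dataset(
  name = "toy_dataset",
  
  initialize = function(x, y) {
    # store the data as torch tensors
    self$x <- torch_tensor(as.matrix(x))
    self$y <- torch_tensor(y)$to(torch_long())
  },
  
  .getitem = function(index) {
    # return a single (predictor, target) pair
    list(self$x[index, ], self$y[index])
  },
  
  .length = function() {
    self$x$size()[[1]]
  }
)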

While this may sound complicated, it is not at all. The base logic is straightforward – complexity will, naturally, correlate with how involved your preprocessing is. To provide you with a simple but functional prototype, here we show how to create your own dataset to train on Allison Horst's penguins.

A custom dataset

library(palmerpenguins)
library(magrittr)

penguins
#> # A tibble: 344 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           39.1          18.7              181        3750
#>  2 Adelie  Torge…           39.5          17.4              186        3800
#>  3 Adelie  Torge…           40.3          18                195        3250
#>  4 Adelie  Torge…           NA            NA                 NA          NA
#>  5 Adelie  Torge…           36.7          19.3              193        3450
#>  6 Adelie  Torge…           39.3          20.6              190        3650
#>  7 Adelie  Torge…           38.9          17.8              181        3625
#>  8 Adelie  Torge…           39.2          19.6              195        4675
#>  9 Adelie  Torge…           34.1          18.1              193        3475
#> 10 Adelie  Torge…           42            20.2              190        4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

Datasets are R6 classes created using the dataset() constructor. You can pass a name and various member functions. Among those should be initialize(), to create instance variables, .getitem(), to indicate how the data should be returned, and .length(), to say how many items we have.

In addition, any number of helper functions can be defined.

Here, we assume the penguins have already been loaded, and all preprocessing consists of removing rows with NA values, converting factors to their underlying numeric codes, and converting from R data types to torch tensors.

In .getitem(), we essentially decide how this data is going to be used: all variables besides species go into x, the predictor, while species constitutes y, the target. Predictor and target are returned in a list, to be accessed as batch[[1]] and batch[[2]] during training.

penguins_dataset <- dataset(
  
  name = "penguins_dataset",
  
  initialize = function() {
    self$data <- self$prepare_penguin_data()
  },
  
  .getitem = function(index) {
    
    # predictors: all columns except the first (species)
    x <- self$data[index, 2:-1]
    # target: species, as a long (integer) tensor
    y <- self$data[index, 1]$to(torch_long())
    
    list(x, y)
  },
  
  .length = function() {
    self$data$size()[[1]]
  },
  
  prepare_penguin_data = function() {
    
    input <- na.omit(penguins) 
    # conveniently, the categorical data are already factors
    input$species <- as.numeric(input$species)
    input$island <- as.numeric(input$island)
    input$sex <- as.numeric(input$sex)
    
    input <- as.matrix(input)
    torch_tensor(input)
  }
)

Let’s create the dataset, query it for its length, and look at its first item:

tuxes <- penguins_dataset()
tuxes$.length()
#> [1] 333
tuxes$.getitem(1)
#> [[1]]
#> torch_tensor
#>     3.0000
#>    39.1000
#>    18.7000
#>   181.0000
#>  3750.0000
#>     2.0000
#>  2007.0000
#> [ CPUFloatType{7} ]
#> 
#> [[2]]
#> torch_tensor
#> 1
#> [ CPULongType{} ]

To be able to iterate over tuxes, we need a data loader (we override the default batch size of 1):

dl <- tuxes %>% dataloader(batch_size = 8)

Calling .length() on a data loader (as opposed to a dataset) will return the number of batches we have:

dl$.length()
#> [1] 42
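
That is simply the number of items divided by the batch size, rounded up; with 333 items and batches of 8, the final batch holds only the remaining 5 items:

ceiling(tuxes$.length() / 8)
#> [1] 42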

And we can create an iterator to inspect the first batch:

iter <- dl$.iter()
b <- iter$.next()
b
#> [[1]]
#> torch_tensor
#>     3.0000    39.1000    18.7000   181.0000  3750.0000     2.0000  2007.0000
#>     3.0000    39.5000    17.4000   186.0000  3800.0000     1.0000  2007.0000
#>     3.0000    40.3000    18.0000   195.0000  3250.0000     1.0000  2007.0000
#>     3.0000    36.7000    19.3000   193.0000  3450.0000     1.0000  2007.0000
#>     3.0000    39.3000    20.6000   190.0000  3650.0000     2.0000  2007.0000
#>     3.0000    38.9000    17.8000   181.0000  3625.0000     1.0000  2007.0000
#>     3.0000    39.2000    19.6000   195.0000  4675.0000     2.0000  2007.0000
#>     3.0000    41.1000    17.6000   182.0000  3200.0000     1.0000  2007.0000
#> [ CPUFloatType{8,7} ]
#> 
#> [[2]]
#> torch_tensor
#>  1
#>  1
#>  1
#>  1
#>  1
#>  1
#>  1
#>  1
#> [ CPULongType{8} ]

To train a network, we can use enumerate() to iterate over batches.

Training with data loaders

Our example network is very simple. (In reality, we would want to treat island as the categorical variable it is, and either one-hot encode or embed it; see the embedding sketch after the module definition below.)

net <- nn_module(
  "PenguinNet",
  initialize = function() {
    self$fc1 <- nn_linear(7, 32)
    self$fc2 <- nn_linear(32, 3)
  },
  forward = function(x) {
    x %>% 
      self$fc1() %>% 
      nnf_relu() %>% 
      self$fc2() %>% 
      nnf_log_softmax(dim = 1)
  }
)

model <- net()
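
As an aside, here is a sketch of what an embedding-based variant could look like. It is not used below; it assumes that island sits in the first column of the predictor tensor (as produced by .getitem() above), and the embedding size of 2 is an arbitrary choice.

net_with_embedding <- nn_module(
  "PenguinNetEmbed",
  initialize = function() {
    # three islands, embedded into two dimensions (an arbitrary choice)
    self$island_embedding <- nn_embedding(num_embeddings = 3, embedding_dim = 2)
    # six remaining numeric predictors plus two embedding dimensions
    self$fc1 <- nn_linear(8, 32)
    self$fc2 <- nn_linear(32, 3)
  },
  forward = function(x) {
    # split off the island codes and look up their embeddings
    island <- x[, 1]$to(dtype = torch_long())
    numeric <- x[, 2:7]
    embedded <- self$island_embedding(island)
    # concatenate and pass through the fully connected layers; raw scores
    # are returned, to be consumed by a loss such as nnf_cross_entropy()
    torch_cat(list(numeric, embedded), dim = 2) %>%
      self$fc1() %>%
      nnf_relu() %>%
      self$fc2()
  }
)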

We still need an optimizer:

optimizer <- optim_sgd(model$parameters, lr = 0.01)

And we’re ready to train:

for (epoch in 1:10) {
  
  # collect per-batch losses for this epoch
  l <- c()
  
  for (b in enumerate(dl)) {
    # reset gradients accumulated in the previous step
    optimizer$zero_grad()
    # forward pass
    output <- model(b[[1]])
    # compute the loss, backpropagate, and update the weights
    loss <- nnf_nll_loss(output, b[[2]])
    loss$backward()
    optimizer$step()
    l <- c(l, loss$item())
  }
  
  cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}
#> Loss at epoch 1: 335.176875
#> Loss at epoch 2: 2.068251
#> Loss at epoch 3: 2.068251
#> Loss at epoch 4: 2.068251
#> Loss at epoch 5: 2.068251
#> Loss at epoch 6: 2.068251
#> Loss at epoch 7: 2.068251
#> Loss at epoch 8: 2.068251
#> Loss at epoch 9: 2.068251
#> Loss at epoch 10: 2.068251
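
Once training has finished, we can reuse the data loader to inspect predictions. This is just a sketch (it evaluates on the training data, for lack of a held-out set): we disable gradient tracking and take the per-row argmax of the network output as the predicted class.

correct <- 0
total <- 0

with_no_grad({
  for (b in enumerate(dl)) {
    output <- model(b[[1]])
    # predicted class = column index of the maximum score in each row
    pred <- torch_argmax(output, dim = 2)
    correct <- correct + (pred == b[[2]])$sum()$item()
    total <- total + b[[2]]$size()[[1]]
  }
})

correct / total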