Note: This is an R port of the official tutorial available here. All credit goes to Justin Johnson.

R arrays are great, but they cannot utilize GPUs to accelerate their numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately pure R won’t be enough for modern deep learning.

Here we introduce the most fundamental torch concept: the Tensor. A torch Tensor is conceptually similar to an R array: a Tensor is an n-dimensional array, and torch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic tool for scientific computing.
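
For example, here is a minimal sketch of creating and combining Tensors with the torch package (the values are random, so your output will differ):

library(torch)

a <- torch_tensor(matrix(rnorm(6), nrow=2))  # a 2 x 3 Tensor built from an R matrix
b <- torch_randn(2, 3)                       # a 2 x 3 Tensor of standard-normal draws
a$shape                                      # the Tensor's dimensions: 2 3
a + b                                        # element-wise addition
a$mm(b$t())                                  # matrix product: (2 x 3) times (3 x 2)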

Also unlike R arrays, torch Tensors can utilize GPUs to accelerate their numeric computations. To run a torch Tensor on a GPU, you simply need to specify the device on which it should live.
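
For instance, a Tensor created on the CPU can be copied to the GPU with its `$to()` method (a minimal sketch, assuming a CUDA-capable GPU and driver are available):

x_cpu <- torch_randn(2, 2)                          # created on the CPU, the default device
if (cuda_is_available()) {
   x_gpu <- x_cpu$to(device = torch_device("cuda")) # copy the Tensor to the GPU
}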

Here we use torch Tensors to fit a two-layer network to random data. As in the base R example before, we need to manually implement the forward and backward passes through the network:

library(torch)

# Run on the GPU if one is available, otherwise fall back to the CPU
if (cuda_is_available()) {
   device <- torch_device("cuda")
} else {
   device <- torch_device("cpu")
}
   
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N <- 64
D_in <- 1000
H <- 100
D_out <- 10

# Create random input and output data
x <- torch_randn(N, D_in, device=device)
y <- torch_randn(N, D_out, device=device)

# Randomly initialize weights
w1 <- torch_randn(D_in, H, device=device)
w2 <- torch_randn(H, D_out, device=device)

learning_rate <- 1e-6
for (t in seq_len(500)) {
   # Forward pass: compute predicted y
   h <- x$mm(w1)
   h_relu <- h$clamp(min=0)
   y_pred <- h_relu$mm(w2)
   
   # Compute and print loss
   loss <- as.numeric((y_pred - y)$pow(2)$sum())
   if (t %% 100 == 0 || t == 1)
      cat("Step:", t, ":", loss, "\n")
   
   # Backprop to compute gradients of w1 and w2 with respect to loss
   grad_y_pred <- 2.0 * (y_pred - y)        # d(loss)/d(y_pred), since loss = sum((y_pred - y)^2)
   grad_w2 <- h_relu$t()$mm(grad_y_pred)    # d(loss)/d(w2)
   grad_h_relu <- grad_y_pred$mm(w2$t())    # d(loss)/d(h_relu)
   grad_h <- grad_h_relu$clone()
   grad_h[h < 0] <- 0                       # ReLU: no gradient flows where h was clamped to 0
   grad_w1 <- x$t()$mm(grad_h)              # d(loss)/d(w1)
   
   # Update weights using gradient descent
   w1 <- w1 - learning_rate * grad_w1
   w2 <- w2 - learning_rate * grad_w2
}
#> Step: 1 : 31910988 
#> Step: 100 : 1089.979 
#> Step: 200 : 17.10232 
#> Step: 300 : 0.4140661 
#> Step: 400 : 0.01132045 
#> Step: 500 : 0.0005748363

In the next example we will use autograd instead of computing the gradients manually.