Note: This is an R port of the official tutorial available here. All credit goes to Justin Johnson.

In the previous examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it can quickly get very hairy for large, complex networks.

Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd feature in torch provides exactly this functionality. When using autograd, the forward pass of your network will define a computational graph; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it's pretty simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor that has x$requires_grad=TRUE, then x$grad is another Tensor holding the gradient of some scalar value with respect to x.
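
For example, here is a minimal sketch of that machinery on a single tensor (the names `a` and `b` below are purely illustrative and not part of the original tutorial):

library(torch)

# Track operations on `a` so autograd can compute gradients for it
a <- torch_tensor(3, requires_grad = TRUE)

# Forward pass: `b` is a node in the graph, produced by squaring `a`
b <- a$pow(2)

# Backward pass: populates a$grad with the gradient of b with respect to a
b$backward()

# a$grad now holds 2 * a, i.e. a tensor containing 6
a$grad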

Here we use torch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

library(torch)

if (cuda_is_available()) {
   device <- torch_device("cuda")
} else {
   device <- torch_device("cpu")
}
   
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N <- 64
D_in <- 1000
H <- 100
D_out <- 10

# Create random input and output data
# Setting requires_grad=FALSE (the default) indicates that we do not need to 
# compute gradients with respect to these Tensors during the backward pass.
x <- torch_randn(N, D_in, device=device)
y <- torch_randn(N, D_out, device=device)

# Randomly initialize weights
# Setting requires_grad=TRUE indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 <- torch_randn(D_in, H, device=device, requires_grad = TRUE)
w2 <- torch_randn(H, D_out, device=device, requires_grad = TRUE)

learning_rate <- 1e-6
for (t in seq_len(500)) {
   # Forward pass: compute predicted y using operations on Tensors; these
   # are exactly the same operations we used to compute the forward pass using
   # Tensors, but we do not need to keep references to intermediate values since
   # we are not implementing the backward pass by hand.
   y_pred <- x$mm(w1)$clamp(min=0)$mm(w2)
   
   # Compute and print loss using operations on Tensors.
   # Now loss is a Tensor holding a single scalar value.
   loss <- (y_pred - y)$pow(2)$sum()
   if (t %% 100 == 0 || t == 1)
      cat("Step:", t, ":", as.numeric(loss), "\n")
   
   # Use autograd to compute the backward pass. This call will compute the
   # gradient of loss with respect to all Tensors with requires_grad=TRUE.
   # After this call w1$grad and w2$grad will be Tensors holding the gradient
   # of the loss with respect to w1 and w2 respectively.
   loss$backward()
   
   # Manually update weights using gradient descent. Wrap in `with_no_grad`
   # because the weights have requires_grad=TRUE, but we don't want autograd to
   # track these update operations.
   # You could also use optim_sgd to achieve this (see the sketch after the
   # output below).
   with_no_grad({
      
      # operations suffixed with `_` operate in-place on the tensor.
      w1$sub_(learning_rate * w1$grad)
      w2$sub_(learning_rate * w2$grad)
      
      # Manually zero the gradients after updating weights
      w1$grad$zero_()
      w2$grad$zero_()
   })
}
#> Step: 1 : 30390018 
#> Step: 100 : 576.8492 
#> Step: 200 : 3.614999 
#> Step: 300 : 0.04254073 
#> Step: 400 : 0.00102863 
#> Step: 500 : 0.0001272913
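
As the comment in the loop hints, the manual update wrapped in `with_no_grad` can also be delegated to one of torch's built-in optimizers. Here is a minimal sketch of the same training loop using optim_sgd, assuming x, y, w1, w2 and learning_rate are defined as above:

# Create the optimizer once, handing it the tensors it should update
optimizer <- optim_sgd(list(w1, w2), lr = learning_rate)

for (t in seq_len(500)) {
   # Forward pass and loss, exactly as before
   y_pred <- x$mm(w1)$clamp(min=0)$mm(w2)
   loss <- (y_pred - y)$pow(2)$sum()

   # Zero the gradients, backpropagate, and let the optimizer step the weights
   optimizer$zero_grad()
   loss$backward()
   optimizer$step()
}

With no momentum or weight decay, optim_sgd performs exactly the w - learning_rate * grad update we wrote by hand, so the two loops are equivalent.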

In the next example we will learn how to create new autograd functions.