Note: This is an R port of the official tutorial available here. All credit goes to Justin Johnson.

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of some scalar value (typically the loss) with respect to the output Tensors, and computes the gradient of that same scalar value with respect to the input Tensors.

In torch we can easily define our own autograd operator by calling autograd_function() and implementing the forward and backward functions. We can then use our new autograd operator by calling it like a regular function, passing Tensors containing input data.

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:

# We can implement our own custom autograd Functions by subclassing
# autograd_function and implementing the forward and backward passes
# which operate on Tensors.
my_relu <- autograd_function(
   # In the forward pass we receive a Tensor containing the input and return
   # a Tensor containing the output. ctx is a context object that can be used
   # to stash information for backward computation. You can cache arbitrary
   # objects for use in the backward pass using the ctx$save_for_backward method.
   forward = function(ctx, input) {
      ctx$save_for_backward(input = input)
      input$clamp(min = 0)
   },
   # In the backward pass we receive a Tensor containing the gradient of the loss
   # with respect to the output, and we need to compute the gradient of the loss
   # with respect to the input.
   backward = function(ctx, grad_output) {
      v <- ctx$saved_variables
      grad_input <- grad_output$clone()
      grad_input[v$input < 0] <- 0
      list(input = grad_input)
   }
)
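
# A quick sanity check of the custom operator (an illustrative sketch;
# `x_check` is just a throwaway tensor). The forward pass should match the
# built-in nnf_relu, and the gradient should be 1 where the input is positive
# and 0 where it is negative.
x_check <- torch_randn(5, requires_grad = TRUE)
torch_allclose(my_relu(x_check), nnf_relu(x_check))  # TRUE
my_relu(x_check)$sum()$backward()
x_check$grad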

if (cuda_is_available()) {
   device <- torch_device("cuda")
} else {
   device <- torch_device("cpu")
}
   
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N <- 64
D_in <- 1000
H <- 100
D_out <- 10

# Create random input and output data
# Setting requires_grad=FALSE (the default) indicates that we do not need to 
# compute gradients with respect to these Tensors during the backward pass.
x <- torch_randn(N, D_in, device=device)
y <- torch_randn(N, D_out, device=device)

# Randomly initialize weights
# Setting requires_grad=TRUE indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 <- torch_randn(D_in, H, device=device, requires_grad = TRUE)
w2 <- torch_randn(H, D_out, device=device, requires_grad = TRUE)

learning_rate <- 1e-6
for (t in seq_len(500)) {
   # Forward pass: compute predicted y using operations on Tensors; these
   # are exactly the same operations we used to compute the forward pass using
   # Tensors, but we do not need to keep references to intermediate values since
   # we are not implementing the backward pass by hand.
   y_pred <- my_relu(x$mm(w1))$mm(w2)
   
   # Compute and print loss using operations on Tensors.
   # Now loss is a scalar (0-dimensional) Tensor
   loss <- (y_pred - y)$pow(2)$sum()
   if (t %% 100 == 0 || t == 1)
      cat("Step:", t, ":", as.numeric(loss), "\n")
   
   # Use autograd to compute the backward pass. This call will compute the
   # gradient of loss with respect to all Tensors with requires_grad=TRUE.
   # After this call w1$grad and w2$grad will be Tensors holding the gradient
   # of the loss with respect to w1 and w2 respectively.
   loss$backward()
   
   # Manually update weights using gradient descent. Wrap in `with_no_grad`
   # because weights have requires_grad=TRUE, but we don't need to track this
   # in autograd.
   # You can also use optim_sgd to achieve this.
   with_no_grad({
      
      # operations suffixed with `_` operate in-place on the tensor.
      w1$sub_(learning_rate * w1$grad)
      w2$sub_(learning_rate * w2$grad)
      
      # Manually zero the gradients after updating weights
      w1$grad$zero_()
      w2$grad$zero_()
   })
}
#> Step: 1 : 33210170 
#> Step: 100 : 429.2728 
#> Step: 200 : 2.536572 
#> Step: 300 : 0.02151564 
#> Step: 400 : 0.0003944117 
#> Step: 500 : 4.690059e-05
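
As noted in the comment above, the same update can be performed with the optim_sgd optimizer instead of the manual with_no_grad block. A minimal sketch of that variant, reusing w1, w2 and learning_rate from the listing above:

optimizer <- optim_sgd(list(w1, w2), lr = learning_rate)

# inside the training loop, after loss$backward():
optimizer$step()       # updates w1 and w2 in-place using their accumulated gradients
optimizer$zero_grad()  # resets the gradients before the next iteration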

In the next example we will learn how to use the neural network abstractions in torch.