Randomly zero out entire channels (a channel is a 3D feature map,
e.g., the \(j\)-th channel of the \(i\)-th sample in the
batched input is a 3D tensor \(input[i, j]\)) of the input tensor.
Each channel will be zeroed out independently on every forward call with
probability p, using samples from a Bernoulli distribution.
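
The channel-wise behavior can be checked with a short sketch. It assumes this text documents PyTorch's `torch.nn.functional.dropout3d` and that the input has shape (N, C, D, H, W); the tensor sizes, the seed, and the variable names are illustrative choices, not part of the original description.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only so repeated runs give the same mask

# Batched input of shape (N, C, D, H, W): 2 samples with 4 channels each.
x = torch.ones(2, 4, 3, 3, 3)

# training=True so that dropout is actually applied.
y = F.dropout3d(x, p=0.5, training=True)

# Count the nonzero entries of every (sample, channel) feature map.
nonzero_per_channel = (y != 0).flatten(start_dim=2).sum(dim=2)  # shape (N, C)
channel_size = x[0, 0].numel()  # 3 * 3 * 3 elements per channel

# A channel is either dropped as a whole or kept as a whole.
all_or_nothing = (
    (nonzero_per_channel == 0) | (nonzero_per_channel == channel_size)
).all()

print(nonzero_per_channel)
print(bool(all_or_nothing))
```

Which channels are zeroed depends on the random draw, so the first printed tensor varies with the seed. In PyTorch the surviving channels are also rescaled by 1 / (1 - p) during training, which is why the check above tests for nonzero entries rather than for values equal to 1.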