For further details regarding the algorithm we refer to Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2019).
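The key point of the algorithm is that the weight-decay term is applied to the parameters directly rather than being folded into the gradient as an L2 penalty. A minimal plain-R sketch of a single AdamW step, for illustration only (the function name adamw_step and its state arguments are hypothetical, not the package's internal implementation):

# One AdamW step for a parameter vector `theta` with gradient `grad`.
# The optimizer state (m, v, step count t) persists across calls.
adamw_step <- function(theta, grad, m, v, t,
                       lr = 1e-3, beta1 = 0.9, beta2 = 0.999,
                       eps = 1e-8, weight_decay = 1e-2) {
  m <- beta1 * m + (1 - beta1) * grad       # running average of the gradient
  v <- beta2 * v + (1 - beta2) * grad^2     # running average of its square
  m_hat <- m / (1 - beta1^t)                # bias corrections
  v_hat <- v / (1 - beta2^t)
  # decoupled weight decay: applied to theta directly, not via the gradient
  theta <- theta - lr * (m_hat / (sqrt(v_hat) + eps) + weight_decay * theta)
  list(theta = theta, m = m, v = v)
}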
Usage
optim_adamw(
params,
lr = 0.001,
betas = c(0.9, 0.999),
eps = 1e-08,
weight_decay = 0.01,
amsgrad = FALSE
)
Arguments
- params
(iterable): iterable of parameters to optimize or lists defining parameter groups
- lr
(float, optional): learning rate (default: 1e-3)
- betas
(Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps
(float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay
(float, optional): weight decay coefficient (default: 1e-2)
- amsgrad
(boolean, optional): whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: FALSE)
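A minimal usage sketch, assuming the torch package is loaded; the model, data, and loss below are placeholders chosen for illustration:

library(torch)

model <- nn_linear(10, 1)                      # placeholder model
opt <- optim_adamw(model$parameters, lr = 0.001, weight_decay = 0.01)

x <- torch_randn(32, 10)                       # dummy inputs
y <- torch_randn(32, 1)                        # dummy targets

opt$zero_grad()                                # clear accumulated gradients
loss <- nnf_mse_loss(model(x), y)              # forward pass and loss
loss$backward()                                # backpropagate
opt$step()                                     # apply the AdamW update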