optimizers

ASGD

class ASGD(lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0, decay_bias=False)

Lazy version of ASGD.

Supports instantiation without model parameters, exclusion of biases from weight decay where applicable and adding only trainable parameters directly from models (see add_model()). It has been proposed in Acceleration of stochastic approximation by averaging.

Parameters
  • lr – learning rate (default: 1e-2)

  • lambd – decay term (default: 1e-4)

  • alpha – power for eta update (default: 0.75)

  • t0 – point at which to start averaging (default: 1e6)

  • weight_decay – weight decay (L2 penalty) (default: 0)

  • decay_bias (bool) – if True include biases in weight_decay. (default: False)

Example

>>> import torch
>>> from torch import nn
>>> from hearth.optimizers import ASGD
>>> from hearth.grad import freeze, unfreeze
>>>
>>> # no need to tell it about the model
>>> opt = ASGD(lr=.1)
>>> opt.initialized
False

define a model at some point; we'll also freeze a single layer:

>>> model = nn.Sequential(nn.Linear(4, 5),
...                       nn.Linear(5, 6),
...                       nn.Linear(6, 7))
>>> _ = freeze(model[0])
>>> opt.add_model(model)
>>> opt.get_lrs()
[0.1]

later we may unfreeze the first layer of our model and add that to our optimizer with a different learning rate:

>>> _ = unfreeze(model[0])
>>> opt.add_model(model[0], lr=.01)
>>> opt.get_lrs()
[0.1, 0.01]

Adadelta

class Adadelta(lr=1.0, rho=0.9, eps=1e-06, weight_decay=0, decay_bias=False)

Lazy version of Adadelta.

Supports instantiation without model parameters, exclusion of biases from weight decay where applicable and adding only trainable parameters directly from models (see add_model()). It has been proposed in ADADELTA: An Adaptive Learning Rate Method (https://arxiv.org/abs/1212.5701).

Parameters
  • rho – coefficient used for computing a running average of squared gradients (default: 0.9)

  • eps – term added to the denominator to improve numerical stability (default: 1e-6)

  • lr – coefficient that scales delta before it is applied to the parameters (default: 1.0)

  • weight_decay – weight decay (L2 penalty) (default: 0)

  • decay_bias (bool) – if True include biases in weight_decay. (default: False)

Example

>>> import torch
>>> from torch import nn
>>> from hearth.optimizers import Adadelta
>>> from hearth.grad import freeze, unfreeze
>>>
>>> # no need to tell it about the model
>>> opt = Adadelta(lr=.1)
>>> opt.initialized
False

define a model at some point; we'll also freeze a single layer:

>>> model = nn.Sequential(nn.Linear(4, 5),
...                       nn.Linear(5, 6),
...                       nn.Linear(6, 7))
>>> _ = freeze(model[0])
>>> opt.add_model(model)
>>> opt.get_lrs()
[0.1]

later we may unfreeze the first layer of our model and add that to our optimizer with a different learning rate:

>>> _ = unfreeze(model[0])
>>> opt.add_model(model[0], lr=.01)
>>> opt.get_lrs()
[0.1, 0.01]

Adagrad

class Adagrad(lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10, decay_bias=False)

Lazy version of Adagrad.

Supports instantiation without model parameters, exclusion of biases from weight decay where applicable and adding only trainable parameters directly from models (see add_model()). It has been proposed in Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

Parameters
  • lr – learning rate (default: 1e-2)

  • lr_decay – learning rate decay (default: 0)

  • weight_decay – weight decay (L2 penalty) (default: 0)

  • eps – term added to the denominator to improve numerical stability (default: 1e-10)

  • decay_bias (bool) – if True include biases in weight_decay. (default: False)

Example

>>> import torch
>>> from torch import nn
>>> from hearth.optimizers import Adagrad
>>> from hearth.grad import freeze, unfreeze
>>>
>>> # no need to tell it about the model
>>> opt = Adagrad(lr=.1)
>>> opt.initialized
False

define a model at some point; we'll also freeze a single layer:

>>> model = nn.Sequential(nn.Linear(4, 5),
...                       nn.Linear(5, 6),
...                       nn.Linear(6, 7))
>>> _ = freeze(model[0])
>>> opt.add_model(model)
>>> opt.get_lrs()
[0.1]

later we may unfreeze the first layer of our model and add that to our optimizer with a different learning rate:

>>> _ = unfreeze(model[0])
>>> opt.add_model(model[0], lr=.01)
>>> opt.get_lrs()
[0.1, 0.01]

Adam

class Adam(lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, decay_bias=False)

Lazy version of Adam.

Supports instantiation without model parameters, exclusion of biases from weight decay where applicable and adding only trainable parameters directly from models (see add_model()). It has been proposed in Adam: A Method for Stochastic Optimization. The implementation of the L2 penalty follows changes proposed in Decoupled Weight Decay Regularization.

Parameters
  • lr – learning rate (default: 1e-3)

  • betas – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay – weight decay (L2 penalty) (default: 0)

  • amsgrad – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

  • decay_bias (bool) – if True include biases in weight_decay. (default: False)

Example

>>> import torch
>>> from torch import nn
>>> from hearth.optimizers import Adam
>>> from hearth.grad import freeze, unfreeze
>>>
>>> # no need to tell it about the model
>>> opt = Adam(lr=.1)
>>> opt.initialized
False

define a model at some point; we'll also freeze a single layer:

>>> model = nn.Sequential(nn.Linear(4, 5),
...                       nn.Linear(5, 6),
...                       nn.Linear(6, 7))
>>> _ = freeze(model[0])
>>> opt.add_model(model)
>>> opt.get_lrs()
[0.1]

later we may unfreeze the first layer of our model and add that to our optimizer with a different learning rate:

>>> _ = unfreeze(model[0])
>>> opt.add_model(model[0], lr=.01)
>>> opt.get_lrs()
[0.1, 0.01]

AdamW

class AdamW(lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, decay_bias=False)

Lazy version of AdamW.

Supports instantiation without model parameters, exclusion of biases from weight decay where applicable and adding only trainable parameters directly from models (see add_model()). The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

Parameters
  • lr – learning rate (default: 1e-3)

  • betas – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay – weight decay coefficient (default: 1e-2)

  • amsgrad – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

  • decay_bias (bool) – if True include biases in weight_decay. (default: False)

Example

>>> import torch
>>> from torch import nn
>>> from hearth.optimizers import AdamW
>>> from hearth.grad import freeze, unfreeze
>>>
>>> # no need to tell it about the model
>>> opt = AdamW(lr=.1)
>>> opt.initialized
False

define a model at some point; we'll also freeze a single layer:

>>> model = nn.Sequential(nn.Linear(4, 5),
...                       nn.Linear(5, 6),
...                       nn.Linear(6, 7))
>>> _ = freeze(model[0])
>>> opt.add_model(model)
>>> opt.get_lrs()
[0.1, 0.1]

note that because AdamW uses a non-zero weight_decay by default and decay_bias is False, weights and biases are split into separate parameter groups, so each call to add_model() contributes two learning rates. later we may unfreeze the first layer of our model and add that to our optimizer with a different learning rate:

>>> _ = unfreeze(model[0])
>>> opt.add_model(model[0], lr=.01)
>>> opt.get_lrs()
[0.1, 0.1, 0.01, 0.01]
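
As a hedged sketch of the opposite setting (the output below is inferred from the splitting rule documented under add_model(), not independently verified): with decay_bias=True, weight decay applies to biases as well, so no weight/bias split is expected and each call to add_model() should contribute a single parameter group.

>>> # sketch, assuming decay_bias=True means no weight/bias split:
>>> opt2 = AdamW(lr=.1, decay_bias=True)
>>> opt2.add_model(model)
>>> opt2.get_lrs()
[0.1]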

LazyOptimizer

class LazyOptimizer(*, decay_bias=False, **kwargs)

A base wrapper class for torch optimizers that is initialized without model parameters.

Supports instantiation without model parameters, exclusion of biases from weight decay where applicable and adding only trainable parameters directly from models (see add_model()).

add_model(model, **kwargs)

add all trainable parameters from this model to the optimizer

If extra kwargs are provided and the optimizer is already initialized, they will be used to create parameter groups. If decay_bias is False and weight_decay is non-zero, or a non-zero weight_decay value is passed via kwargs, then weights and biases will be split into separate parameter groups.
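
A minimal sketch of this behavior, using the SGD optimizer from this module (the outputs shown are inferred from the examples elsewhere on this page and the splitting rule above, rather than taken from an independent source):

>>> from torch import nn
>>> from hearth.optimizers import SGD
>>>
>>> opt = SGD(lr=.1)  # weight_decay defaults to 0, decay_bias to False
>>> model = nn.Sequential(nn.Linear(4, 5), nn.Linear(5, 6))
>>> opt.add_model(model)
>>> opt.get_lrs()  # a single parameter group, so a single learning rate
[0.1]
>>> # passing a non-zero weight_decay through kwargs creates new groups and,
>>> # per the rule above, splits weights and biases so biases skip the decay:
>>> opt.add_model(nn.Linear(6, 2), lr=.01, weight_decay=.01)
>>> opt.get_lrs()
[0.1, 0.01, 0.01]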

get_lrs()

return the learning rates of all parameter groups in this optimizer as a list.

RMSprop

class RMSprop(lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False, decay_bias=False)

Lazy version of RMSprop.

Supports instantiation without model parameters, exclusion of biases from weight decay where applicable and adding only trainable parameters directly from models (see add_model()). Proposed by G. Hinton in his course.

The centered version first appears in Generating Sequences With Recurrent Neural Networks.

The implementation here takes the square root of the gradient average before adding epsilon (note that TensorFlow interchanges these two operations). The effective learning rate is thus \(\alpha/(\sqrt{v} + \epsilon)\) where \(\alpha\) is the scheduled learning rate and \(v\) is the weighted moving average of the squared gradient.
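
For instance, a quick numeric check of that expression (the value of \(v\) below is purely illustrative):

>>> import math
>>> lr, eps = 0.01, 1e-8  # scheduled learning rate and epsilon
>>> v = 0.04              # illustrative moving average of the squared gradient
>>> round(lr / (math.sqrt(v) + eps), 6)  # effective learning rate
0.05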

Parameters
  • lr – learning rate (default: 1e-2)

  • momentum – momentum factor (default: 0)

  • alpha – smoothing constant (default: 0.99)

  • eps – term added to the denominator to improve numerical stability (default: 1e-8)

  • centered – if True, compute the centered version of RMSprop, in which the gradient is normalized by an estimation of its variance (default: False)

  • weight_decay – weight decay (L2 penalty) (default: 0)

  • decay_bias (bool) – if True include biases in weight_decay. (default: False)

Example

>>> import torch
>>> from torch import nn
>>> from hearth.optimizers import RMSprop
>>> from hearth.grad import freeze, unfreeze
>>>
>>> # no need to tell it about the model
>>> opt = RMSprop(lr=.1)
>>> opt.initialized
False

define a model at some point; we'll also freeze a single layer:

>>> model = nn.Sequential(nn.Linear(4, 5),
...                       nn.Linear(5, 6),
...                       nn.Linear(6, 7))
>>> _ = freeze(model[0])
>>> opt.add_model(model)
>>> opt.get_lrs()
[0.1]

later we may unfreeze the first layer of our model and add that to our optimizer with a different learning rate:

>>> _ = unfreeze(model[0])
>>> opt.add_model(model[0], lr=.01)
>>> opt.get_lrs()
[0.1, 0.01]

SGD

class SGD(lr, momentum=0, dampening=0, weight_decay=0, nesterov=False, decay_bias=False)

Lazy version of SGD.

Supports instantiation without model parameters, exclusion of biases from weight decay where applicable and adding only trainable parameters directly from models (see add_model()). Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters
  • lr (float) – learning rate

  • momentum – momentum factor (default: 0)

  • weight_decay – weight decay (L2 penalty) (default: 0)

  • dampening – dampening for momentum (default: 0)

  • nesterov – enables Nesterov momentum (default: False)

  • decay_bias (bool) – if True include biases in weight_decay. (default: False)

Example

>>> import torch
>>> from torch import nn
>>> from hearth.optimizers import SGD
>>> from hearth.grad import freeze, unfreeze
>>>
>>> # no need to tell it about the model
>>> opt = SGD(lr=.1)
>>> opt.initialized
False

define a model at some point; we'll also freeze a single layer:

>>> model = nn.Sequential(nn.Linear(4, 5),
...                       nn.Linear(5, 6),
...                       nn.Linear(6, 7))
>>> _ = freeze(model[0])
>>> opt.add_model(model)
>>> opt.get_lrs()
[0.1]

later we may unfreeze the first layer of our model and add that to our optimizer with a different learning rate:

>>> _ = unfreeze(model[0])
>>> opt.add_model(model[0], lr=.01)
>>> opt.get_lrs()
[0.1, 0.01]