modules¶
AbsolutePositionalEmbedding¶
- class AbsolutePositionalEmbedding(features, max_len=512, padding_idx=0)¶
Absolute learned positional embeddings a la bert.
- Parameters
features (int) – number of embedding features.
max_len (int) – max sequence length.
padding_idx (int) – used to mask padding timesteps.
Example
>>> import torch
>>> from hearth.modules import AbsolutePositionalEmbedding
>>>
>>> emb = AbsolutePositionalEmbedding(256, max_len=512)
>>> tokens = torch.tensor([[99, 6, 55, 1, 0, 0],
...                        [8, 22, 7, 8, 3, 11]])
>>> out = emb(tokens)
>>> out.shape
torch.Size([2, 6, 256])
>>> (out[tokens == 0] == 0).all()
tensor(True)
- forward(tokens)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool¶
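The behaviour described for AbsolutePositionalEmbedding can be approximated in a few lines. This is a minimal sketch and not hearth's implementation: the class name `PositionalEmbeddingSketch` and its internals are assumptions made for illustration. It looks up one learned vector per position and zeroes the vectors at padding timesteps.

```python
import torch
from torch import nn


class PositionalEmbeddingSketch(nn.Module):
    """Hypothetical stand-in used only for illustration."""

    def __init__(self, features: int, max_len: int = 512, padding_idx: int = 0) -> None:
        super().__init__()
        self.padding_idx = padding_idx
        # one learned vector per position 0..max_len-1
        self.embedding = nn.Embedding(max_len, features)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, timesteps) integer ids
        positions = torch.arange(tokens.shape[1], device=tokens.device)
        out = self.embedding(positions)  # (timesteps, features)
        out = out.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        # zero the embedding wherever the token is padding
        mask = (tokens != self.padding_idx).unsqueeze(-1)  # (batch, timesteps, 1)
        return out * mask


emb = PositionalEmbeddingSketch(256, max_len=512)
tokens = torch.tensor([[99, 6, 55, 1, 0, 0],
                       [8, 22, 7, 8, 3, 11]])
out = emb(tokens)
```

The sketch only uses the token ids to find padding; the position index alone selects the embedding.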
AttentionPool1D¶
- class AttentionPool1D(in_features)¶
Attention pool for (batch first) timeseries data with masking.
- Parameters
in_features (int) – number of input features.
Example
>>> import torch
>>> from hearth.modules import AttentionPool1D
>>>
>>> pool = AttentionPool1D(16)
>>> mask = torch.tensor([[ True, True, True, False, False],
...                      [ True, True, True, True, True]])
>>> inp = torch.rand(2, 5, 16)
>>> out = pool(inp, mask)
>>> out.shape
torch.Size([2, 16])
- attention(x, mask)¶
get attention weights for the given input and mask.
- Parameters
x (Tensor) – (batch, timesteps, features) inputs.
mask (Tensor) – (batch, timesteps) mask where valid timesteps are True and padding timesteps are False.
- Return type
Tensor
- Returns
Tensor (batch, timesteps)
- forward(x, mask)¶
pool timesteps using attention.
- Parameters
x (Tensor) – (batch, timesteps, features) inputs.
mask (Tensor) – (batch, timesteps) mask where valid timesteps are True and padding timesteps are False.
- Return type
Tensor
- Returns
Tensor (batch, features)
- training: bool¶
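One common way to build masked attention pooling with the shapes documented above is sketched here. This is an assumption about the general technique, not hearth's code: the class `AttentionPoolSketch` and its single linear scoring layer are hypothetical. Padding timesteps receive a score of negative infinity, so the softmax gives them zero weight.

```python
import torch
from torch import nn


class AttentionPoolSketch(nn.Module):
    """Hypothetical masked attention pool used only for illustration."""

    def __init__(self, in_features: int) -> None:
        super().__init__()
        # scores each timestep with a single scalar
        self.score = nn.Linear(in_features, 1)

    def attention(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        scores = self.score(x).squeeze(-1)  # (batch, timesteps)
        # padding timesteps get -inf so softmax assigns them zero weight
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        weights = self.attention(x, mask).unsqueeze(-1)  # (batch, timesteps, 1)
        # weighted sum over the time dimension
        return (x * weights).sum(dim=1)  # (batch, features)


pool = AttentionPoolSketch(16)
mask = torch.tensor([[True, True, True, False, False],
                     [True, True, True, True, True]])
inp = torch.rand(2, 5, 16)
weights = pool.attention(inp, mask)
out = pool(inp, mask)
```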
BaseModule¶
- class BaseModule¶
A base class like nn.Module but with a few extra useful bits.
- features include:
- classmethod from_config(config)¶
given a valid config return a new instance of this Module.
- classmethod load(model_dir, strict=True)¶
create a new instance of this class with config and parameters loaded from model_dir.
- This method expects model_dir to have the following files:
state.pt: the state dict for this model.
config.json: the config of this model, necessary to reinstantiate a new model with the from_config() method.
You can use the save() method on an instance of this module to create them.
- Args:
model_dir: directory to load from.
strict: whether to be strict about loading the state dict. Defaults to True.
- freeze()¶
freeze this model in place.
- unfreeze()¶
unfreeze this model in place.
- script()¶
torchscript this model using jit.script.
- Return type
RecursiveScriptModule
- trainable_parameters()¶
yields trainable parameters from this model.
- Return type
Iterator[Parameter]
- config()¶
get the config for this module.
This config can be passed to the from_config() class method to create a new instance of this module.
- save(model_dir)¶
save this model's state and config to a model_dir so it can be re-created later.
- This method will generate the following files in the model_dir:
state.pt: the state dict for this model.
config.json: the config of this model, necessary to reinstantiate a new model with the from_config() method.
- Parameters
model_dir (str) – directory to save stuff in.
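The save/load round trip described for BaseModule (a state.pt plus a config.json in one directory) can be sketched with plain PyTorch. The class `ConfigurableLinear` below is hypothetical and only mirrors the documented method names; how hearth actually records its config is not shown in this reference, so storing the constructor arguments in a dict is an assumption.

```python
import json
import os
import tempfile

import torch
from torch import nn


class ConfigurableLinear(nn.Module):
    """Hypothetical module that remembers its constructor arguments."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self._config = {"in_features": in_features, "out_features": out_features}

    def config(self) -> dict:
        return dict(self._config)

    @classmethod
    def from_config(cls, config: dict) -> "ConfigurableLinear":
        return cls(**config)

    def save(self, model_dir: str) -> None:
        os.makedirs(model_dir, exist_ok=True)
        # state.pt holds the parameters, config.json the constructor arguments
        torch.save(self.state_dict(), os.path.join(model_dir, "state.pt"))
        with open(os.path.join(model_dir, "config.json"), "w") as f:
            json.dump(self.config(), f)

    @classmethod
    def load(cls, model_dir: str, strict: bool = True) -> "ConfigurableLinear":
        # rebuild the architecture from the config, then restore the weights
        with open(os.path.join(model_dir, "config.json")) as f:
            model = cls.from_config(json.load(f))
        state = torch.load(os.path.join(model_dir, "state.pt"), map_location="cpu")
        model.load_state_dict(state, strict=strict)
        return model


with tempfile.TemporaryDirectory() as model_dir:
    original = ConfigurableLinear(4, 2)
    original.save(model_dir)
    saved_files = sorted(os.listdir(model_dir))
    restored = ConfigurableLinear.load(model_dir)
```

Keeping the config separate from the weights is what lets load() reconstruct the module without the caller repeating the constructor arguments.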
- blocks()¶
override this method to define depth-based sections of your network.
Defining blocks() is useful when you're doing finetuning or want to use depth-based learning rates. It's how methods like bottleneck() and unbottle() know how to iterate over your model. When overridden, this method should yield logical depth-based sections of your model, which should themselves be nn.Modules, in depth-based order from input to output. How you section your network or group things together is up to you. If not overridden, this method will just yield self.
- Return type
Iterator[Module]
- reverse_blocks()¶
yields blocks from this module's blocks() method in reverse order (from output to input).
- Return type
Iterator[Module]
- bottleneck(n)¶
- unbottle()¶
unfreeze (in place) the next deepest block that is wholly or partially frozen.
- Returns
the unfrozen block if there was anything to unbottle, otherwise None.
- Return type
Union[None, nn.Module]
- training: bool¶
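To make the interplay of blocks(), reverse_blocks() and unbottle() concrete, here is a minimal sketch on a plain nn.Module. Everything in it is hypothetical: `TinyNet`, the free functions, and the reading of "next deepest" as "closest to the output that is still frozen" are assumptions, not hearth's implementation.

```python
from typing import Iterator, Optional

import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical three-stage network used only for illustration."""

    def __init__(self) -> None:
        super().__init__()
        self.embed = nn.Linear(8, 16)
        self.body = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
        self.head = nn.Linear(16, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(self.embed(x)))

    def blocks(self) -> Iterator[nn.Module]:
        # depth-based sections, ordered from input to output
        yield self.embed
        yield self.body
        yield self.head


def reverse_blocks(model: TinyNet) -> Iterator[nn.Module]:
    # same sections, ordered from output to input
    yield from reversed(list(model.blocks()))


def unbottle(model: TinyNet) -> Optional[nn.Module]:
    # assumption: walk from the output side and unfreeze the first block
    # that still has at least one frozen parameter
    for block in reverse_blocks(model):
        params = list(block.parameters())
        if any(not p.requires_grad for p in params):
            for p in params:
                p.requires_grad_(True)
            return block
    return None


model = TinyNet()
for p in model.parameters():
    p.requires_grad_(False)  # freeze everything first

first = unbottle(model)   # unfreezes the head
second = unbottle(model)  # then the body
```

Calling the function repeatedly during finetuning unfreezes one more section each time, until it returns None.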
Bertish¶
- class Bertish(features=256, layers=4, vocab_size=30522, attn_heads=4, boom_scale=4, max_len=512, padding_idx=0, layer_norm_eps=1e-12, dropout=0.1, attn_dropout=0.1, activation='gelu', drop_head=False, pre_norm=False)¶
Bert style base transformer model, supports loading huggingface transformers weights via load_transformers_bert_state_dict method.
- Parameters
features (int) – number of base features. Defaults to 256.
layers (int) – number of transformer encoder layers. Defaults to 4.
vocab_size (int) – size of vocabulary. Defaults to 30522.
attn_heads (int) – number of attention heads. Defaults to 4.
boom_scale (int) – scale for intermediate size in feedforward network. Defaults to 4 (commonly used in bert architectures).
max_len (int) – max sequence length (needed for positional embeddings). Defaults to 512.
padding_idx (int) – padding idx. Defaults to 0.
layer_norm_eps (float) – epsilon for layer norms used throughout the network. Defaults to 1e-12.
dropout (float) – dropout rate for non-attention parts of network. Defaults to 0.1.
attn_dropout (float) – dropout rate for attention. Defaults to 0.1.
activation (str) – named activation for feedforward part of network. Defaults to ‘gelu’.
drop_head (bool) – If true drop attention from entire attention heads as dropout strategy, otherwise drop timesteps from different heads. Defaults to False.
pre_norm – If true use pre-normalization strategy, which may require less careful lr scheduling etc… Defaults to False as in bert-based models.
- blocks()¶
override this method to define depth-based sections of your network.
Defining blocks() is useful when you're doing finetuning or want to use depth-based learning rates. It's how methods like bottleneck() and unbottle() know how to iterate over your model. When overridden, this method should yield logical depth-based sections of your model, which should themselves be nn.Modules, in depth-based order from input to output. How you section your network or group things together is up to you. If not overridden, this method will just yield self.
- forward(tokens)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- load_transformers_bert_state_dict(sd)¶
load statedict from a hugging face transformers model
- training: bool¶
Boom¶
- class Boom(in_features, scale=4, activation='gelu', dropout=0.1)¶
feedforward part of a standard transformer network, sometimes affectionately referred to as the bOOm layer (since it expands and contracts).
- Parameters
in_features (int) – number of input features.
scale (int) – scale for intermediate size (the OO in bOOm). Defaults to 4 (commonly used in bert architectures).
activation (str) – named activation. Defaults to ‘gelu’.
dropout (float) – Dropout rate for intermediate activation. Defaults to 0.1.
Example
>>> import torch
>>> from hearth.modules import Boom
>>>
>>> layer = Boom(16, scale=4, activation='gelu', dropout=0.1)
>>> inp = torch.rand(10, 16)
>>> layer(inp).shape
torch.Size([10, 16])
- forward(x)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool¶
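The expand-then-contract shape of the bOOm layer can be sketched as follows. This is an illustrative assumption about the usual transformer feedforward layout (linear, activation, dropout, linear), not hearth's source; the class `BoomSketch` and its attribute names are hypothetical, and the activation is hard-coded to GELU instead of being looked up by name.

```python
import torch
from torch import nn


class BoomSketch(nn.Module):
    """Hypothetical expand/contract feedforward block for illustration."""

    def __init__(self, in_features: int, scale: int = 4, dropout: float = 0.1) -> None:
        super().__init__()
        hidden = in_features * scale  # the wide "OO" in the middle
        self.expand = nn.Linear(in_features, hidden)
        self.activation = nn.GELU()
        self.dropout = nn.Dropout(dropout)
        self.contract = nn.Linear(hidden, in_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # expand, apply the nonlinearity and dropout, then project back
        return self.contract(self.dropout(self.activation(self.expand(x))))


layer = BoomSketch(16, scale=4)
out = layer(torch.rand(10, 16))
```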
LayerNormSimple¶
- class LayerNormSimple(normalized_shape, eps=1e-05)¶
Layer norm without learnable bias and gain.
This basically just wraps the standard torch LayerNorm so it has no elementwise affine by default.
- Parameters
normalized_shape (Union[int, List[int], Size]) – input shape from an expected input. If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size.
eps (float) – a value added to the denominator for numerical stability. Default: 1e-5
Example
>>> import torch
>>> from hearth.modules import LayerNormSimple
>>>
>>> layer = LayerNormSimple(5)  # 5 feats
>>> layer
LayerNormSimple(5, eps=1e-05)
>>> x = torch.tensor([[-4.2721, -2.8159, -1.2351, 0.2388, 4.5915],
...                   [-0.9092, -3.9666, -1.4216, -4.7373, 2.0403],
...                   [-1.2210, 4.4796, -1.2772, -2.8781, 4.1868]])
>>> layer(x)
tensor([[-1.1730, -0.6950, -0.1761, 0.3077, 1.7365],
        [ 0.3694, -0.9000, 0.1566, -1.2200, 1.5940],
        [-0.6139, 1.2486, -0.6323, -1.1554, 1.1530]])
- forward(x)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool¶
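Since LayerNormSimple is described as a thin wrapper around the standard torch LayerNorm with no elementwise affine, the idea can be sketched by subclassing nn.LayerNorm. The class name `LayerNormNoAffine` is hypothetical, and whether hearth subclasses or composes is an assumption.

```python
import torch
from torch import nn


class LayerNormNoAffine(nn.LayerNorm):
    """Hypothetical layer norm without learnable gain or bias."""

    def __init__(self, normalized_shape, eps: float = 1e-5) -> None:
        # elementwise_affine=False removes the weight and bias parameters
        super().__init__(normalized_shape, eps=eps, elementwise_affine=False)


layer = LayerNormNoAffine(5)
x = torch.tensor([[-4.2721, -2.8159, -1.2351, 0.2388, 4.5915]])
out = layer(x)
n_params = sum(p.numel() for p in layer.parameters())
```

With no learnable parameters, the layer is a pure function of its input, which is why the documented example can print fixed values.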
MaskedLayerNorm¶
- class MaskedLayerNorm(features, eps=1e-05, elementwise_affine=True)¶
layer norm that supports masked input.
Only calculates stats and normalizes non-padding timesteps.
Inputs are expected to be of shape (B, T, …) and mask is expected to be a boolean mask of shape (B, T) where valid timesteps are True and invalid (padding) timesteps are False.
Example
>>> import torch
>>> from torch import nn
>>> from hearth.modules import MaskedLayerNorm
>>>
>>> layer = MaskedLayerNorm(16)
>>> mask = torch.tensor([[ True, True, True, False, False],
...                      [ True, True, True, True, True]])
>>> inp = torch.rand(2, 5, 16)
>>> out = layer(inp, mask)
>>> out.shape
torch.Size([2, 5, 16])
>>> (out == 0.0).all(-1)
tensor([[False, False, False, True, True],
        [False, False, False, False, False]])
- forward(x, mask)¶
perform layer norm on valid timesteps in x as specified by mask.
- Parameters
x (Tensor) – a tensor of shape (B, T, …)
mask (Tensor) – boolean mask of shape (B, T) where valid timesteps are True and invalid (padding) timesteps are False.
- Return type
Tensor
- normalized_shape: Tuple[int, ...]¶
- eps: float¶
- elementwise_affine: bool¶
ReZero¶
- class ReZero(block, dropout=0.0)¶
Implements ReZero residual connection around a block with dropout (as in transformer implementation).
Example
>>> import torch
>>> from torch import nn
>>> from hearth.modules import ReZero
>>>
>>> transformation = nn.Sequential(nn.Linear(10, 10),
...                                nn.ReLU(),
...                                nn.Dropout(.1),
...                                nn.Linear(10, 10)
...                                )
>>> re_zero = ReZero(transformation, dropout=.1)
>>>
>>> x = torch.normal(0, 1, size=(5, 10))
>>> y = re_zero(x)
>>> y.shape
torch.Size([5, 10])
Since re_zero's weight parameter has not actually been trained, y is going to be equal to x and nothing from the transformation will be added to the input… as training goes on this should change.
>>> (y == x).all()
tensor(True)
- forward(x)¶
forward for ReZero.
- Return type
Tensor
- training: bool¶
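The behaviour shown in the ReZero example (output equal to input before training) follows from the ReZero formulation y = x + alpha * block(x) with alpha initialised to zero. The sketch below is an assumption about how that could be written, not hearth's code; `ReZeroSketch` and the placement of dropout on the block output are hypothetical.

```python
import torch
from torch import nn


class ReZeroSketch(nn.Module):
    """Hypothetical ReZero residual wrapper used only for illustration."""

    def __init__(self, block: nn.Module, dropout: float = 0.0) -> None:
        super().__init__()
        self.block = block
        self.dropout = nn.Dropout(dropout)
        # learnable residual weight, initialised to zero
        self.weight = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # with weight == 0 the block contributes nothing and y == x
        return x + self.dropout(self.block(x)) * self.weight


block = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 10))
re_zero = ReZeroSketch(block, dropout=0.1)
x = torch.normal(0, 1, size=(5, 10))
y = re_zero(x)
```

Starting as the identity means gradients flow through the skip path from the first step, and the block's contribution grows only as the weight is learned.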
Residual¶
- class Residual(block)¶
wraps a block in a residual connection \(y = block(x) + x\).
- Parameters
block (Module) – the module to wrap.
Example
>>> import torch
>>> from torch import nn
>>> from hearth.modules import Residual
>>> _ = torch.manual_seed(0)
>>>
>>> res = Residual(nn.Linear(4, 4))
>>>
>>> x = torch.rand(2, 4)  # (batch, feats)
>>> res(x)
tensor([[ 0.6371, 1.5493, 0.0031, -0.0379],
        [ 0.3584, 0.8512, 0.5208, -0.7607]], grad_fn=<AddBackward0>)
- forward(x)¶
forward pass for the Residual wrapper.
- Return type
Tensor
- training: bool¶
SelfAttention¶
- class SelfAttention(in_features, out_features=None, n_heads=1, bias=True, dropout=0.1, drop_head=False)¶
Scaled self attention layer for timeseries.
- Parameters
in_features (int) – number of input features.
out_features (Optional[int]) – number of output features, if None defaults to in_features. Defaults to None.
n_heads (int) – number of attention heads, must evenly divide out_features. Defaults to 1.
bias (bool) – include bias in all layers. Defaults to True.
dropout – attention dropout rate. Defaults to 0.1.
drop_head (bool) – If true drop attention from entire attention heads as dropout strategy, otherwise drop timesteps from different heads. Defaults to False.
Example
>>> import torch
>>> from torch import nn
>>> from hearth.modules import SelfAttention
>>>
>>> layer = SelfAttention(16, 32, n_heads=4)
>>> mask = torch.tensor([[ True, True, True, False, False],
...                      [ True, True, True, True, True]])
>>> inp = torch.rand(2, 5, 16)
>>> out = layer(inp, mask)
>>> out.shape
torch.Size([2, 5, 32])
>>> (out == 0.0).all(-1)
tensor([[False, False, False, True, True],
        [False, False, False, False, False]])
- forward(x, mask)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool¶
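For readers who want to see how a (batch, timesteps) mask typically enters scaled self attention, here is a single-head sketch. It is a simplified assumption, not hearth's layer: `SingleHeadSelfAttentionSketch` is hypothetical, it omits multiple heads and dropout, and it zeroes padded output rows to match the documented example.

```python
import math

import torch
from torch import nn


class SingleHeadSelfAttentionSketch(nn.Module):
    """Hypothetical single-head masked self attention for illustration."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.query = nn.Linear(in_features, out_features)
        self.key = nn.Linear(in_features, out_features)
        self.value = nn.Linear(in_features, out_features)
        self.scale = math.sqrt(out_features)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / self.scale  # (B, T, T)
        # padding keys get -inf so softmax assigns them zero weight
        scores = scores.masked_fill(~mask.unsqueeze(1), float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v  # (B, T, out_features)
        # zero the rows that belong to padding queries
        return out * mask.unsqueeze(-1)


layer = SingleHeadSelfAttentionSketch(16, 32)
mask = torch.tensor([[True, True, True, False, False],
                     [True, True, True, True, True]])
inp = torch.rand(2, 5, 16)
out = layer(inp, mask)
```

The sketch assumes every sequence has at least one valid timestep; a fully padded row would make the softmax undefined.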
TimeMasked¶
- class TimeMasked(layer)¶
Call the wrapped layer such that only valid timesteps are passed to the underlying layer.
Inputs are expected to be of shape (B, T, …) and mask is expected to be a boolean mask of shape (B, T) where valid timesteps are True and invalid (padding) timesteps are False. The underlying layer is expected to accept a single input of the masked shape.
Example
>>> import torch
>>> from torch import nn
>>> from hearth.modules import TimeMasked
>>>
>>> layer = TimeMasked(nn.Sequential(nn.Linear(8, 12), nn.LayerNorm(12)))
>>> mask = torch.tensor([[ True, True, True, False, False],
...                      [ True, True, True, True, True]])
>>> inp = torch.rand(2, 5, 8)
>>> out = layer(inp, mask)
>>> out.shape
torch.Size([2, 5, 12])
>>> (out == 0.0).all(-1)
tensor([[False, False, False, True, True],
        [False, False, False, False, False]])
- blocks()¶
override this method to define depth-based sections of your network.
Defining blocks() is useful when you're doing finetuning or want to use depth-based learning rates. It's how methods like bottleneck() and unbottle() know how to iterate over your model. When overridden, this method should yield logical depth-based sections of your model, which should themselves be nn.Modules, in depth-based order from input to output. How you section your network or group things together is up to you. If not overridden, this method will just yield self.
- forward(x, mask)¶
call the underlying layer using only the valid timesteps given by mask, padding invalid timesteps with 0.
- Parameters
x (Tensor) – a tensor of shape (B, T, …)
mask (Tensor) – boolean mask of shape (B, T) where valid timesteps are True and invalid (padding) timesteps are False.
- Return type
Tensor
- training: bool¶
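The "select valid timesteps, run the layer, scatter back into zeros" pattern described for TimeMasked can be sketched with boolean indexing. This is an assumption about one reasonable way to do it, not hearth's implementation; `TimeMaskedSketch` is a hypothetical name.

```python
import torch
from torch import nn


class TimeMaskedSketch(nn.Module):
    """Hypothetical wrapper that only feeds valid timesteps to a layer."""

    def __init__(self, layer: nn.Module) -> None:
        super().__init__()
        self.layer = layer

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x[mask] flattens (B, T, ...) down to (n_valid, ...)
        valid = self.layer(x[mask])
        # allocate zeros for the full (B, T, ...) output, then fill valid slots
        out = valid.new_zeros(mask.shape + valid.shape[1:])
        out[mask] = valid
        return out


layer = TimeMaskedSketch(nn.Sequential(nn.Linear(8, 12), nn.LayerNorm(12)))
mask = torch.tensor([[True, True, True, False, False],
                     [True, True, True, True, True]])
inp = torch.rand(2, 5, 8)
out = layer(inp, mask)
```

Because padding rows never reach the wrapped layer, they cannot affect its statistics or waste computation.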
TransformerEmbedding¶
- class TransformerEmbedding(vocab_size, features, max_len=512, dropout=0.1, layer_norm_eps=1e-12, padding_idx=0)¶
simple version of embedding used in transformer models.
Note
this embedding does not use token type embeddings as used in some bert models. If using with a pretrained model it’s recommended that you first add these to word embeddings.
- Parameters
vocab_size (int) – vocabulary size for word embeddings.
features (int) – number of features in embedding space.
max_len (int) – maximum sequence length for positional embeddings. Defaults to 512.
dropout (float) – dropout rate. Defaults to 0.1.
layer_norm_eps (float) – epsilon for layer normalization. Defaults to 1e-12.
padding_idx (int) – index for non-valid padding timesteps. Defaults to 0.
Example
>>> import torch
>>> from torch import nn
>>> from hearth.modules import TransformerEmbedding
>>>
>>> emb = TransformerEmbedding(1000, 256, padding_idx=0)
>>> tokens = torch.tensor([[99, 6, 55, 1, 0, 0],
...                        [8, 22, 7, 8, 3, 11]])
>>> out = emb(tokens)
>>> out.shape
torch.Size([2, 6, 256])
>>> (out == 0.0).all(-1)
tensor([[False, False, False, False, True, True],
        [False, False, False, False, False, False]])
>>> emb.build_mask(tokens)
tensor([[ True, True, True, True, False, False],
        [ True, True, True, True, True, True]])
- build_mask(tokens)¶
get a mask where all valid timesteps are True
- Return type
Tensor
- forward(tokens)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool¶
TransformerEncoder¶
- class TransformerEncoder(features, layers, n_heads, dropout=0.1, attn_dropout=0.1, activation='gelu', boom_scale=4, drop_head=False, pre_norm=False, layer_norm_eps=1e-12)¶
a stack of transformer encoder layers
- Parameters
features (int) – number of features.
layers (int) – number of layers.
n_heads (int) – number of attention heads.
dropout (float) – general dropout rate used in feedforward part of network and between residual connections. Defaults to 0.1.
attn_dropout (float) – dropout for self attention. Defaults to 0.1.
activation (str) – named activation for feedforward part of network. Defaults to ‘gelu’.
boom_scale (int) – scale for intermediate size in feedforward network. Defaults to 4 (commonly used in bert architectures).
drop_head – If true drop attention from entire attention heads as dropout strategy, otherwise drop timesteps from different heads. Defaults to False.
pre_norm (bool) – If true use pre-normalization strategy, which may require less careful lr scheduling etc… Defaults to False as in bert-based models.
layer_norm_eps (float) – epsilon for layer norms used throughout the network. Defaults to 1e-12.
Example
>>> import torch
>>> from hearth.modules import TransformerEncoder
>>>
>>> model = TransformerEncoder(16, layers=3, n_heads=4, boom_scale=4)
>>> mask = torch.tensor([[ True, True, True, False, False],
...                      [ True, True, True, True, True]])
>>> inp = torch.rand(2, 5, 16)
>>> model(inp, mask).shape
torch.Size([2, 5, 16])
>>> model.depth()
3
- blocks()¶
override this method to define depth-based sections of your network.
Defining blocks() is useful when you're doing finetuning or want to use depth-based learning rates. It's how methods like bottleneck() and unbottle() know how to iterate over your model. When overridden, this method should yield logical depth-based sections of your model, which should themselves be nn.Modules, in depth-based order from input to output. How you section your network or group things together is up to you. If not overridden, this method will just yield self.
- forward(x, mask)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool¶
TransformerEncoderBlock¶
- class TransformerEncoderBlock(features, n_heads, dropout=0.1, attn_dropout=0.1, activation='gelu', boom_scale=4, drop_head=False, pre_norm=False, layer_norm_eps=1e-12)¶
a single transformer encoder block.
- Parameters
features (int) – number of features.
n_heads (int) – number of attention heads.
dropout (float) – general dropout rate used in feedforward part of network and between residual connections. Defaults to 0.1.
attn_dropout (float) – dropout for self attention. Defaults to 0.1.
activation (str) – named activation for feedforward part of network. Defaults to ‘gelu’.
boom_scale (int) – scale for intermediate size in feedforward network. Defaults to 4 (commonly used in bert architectures).
drop_head – If true drop attention from entire attention heads as dropout strategy, otherwise drop timesteps from different heads. Defaults to False.
pre_norm (bool) – If true use pre-normalization strategy, which may require less careful lr scheduling etc… Defaults to False as in bert-based models.
layer_norm_eps (float) – epsilon for layer norms used throughout the network. Defaults to 1e-12.
Example
>>> import torch
>>> from hearth.modules import TransformerEncoderBlock
>>>
>>> layer = TransformerEncoderBlock(16, n_heads=4, boom_scale=4)
>>> mask = torch.tensor([[ True, True, True, False, False],
...                      [ True, True, True, True, True]])
>>> inp = torch.rand(2, 5, 16)
>>> layer(inp, mask).shape
torch.Size([2, 5, 16])
with pre-norm scheme:
>>> layer = TransformerEncoderBlock(16, n_heads=4, boom_scale=4, pre_norm=True)
>>> layer(inp, mask).shape
torch.Size([2, 5, 16])
- forward(x, mask)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type
Tensor
- training: bool¶
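The pre_norm flag refers to where layer normalization sits relative to the residual connection. The sketch below shows the two conventional orderings on a single sublayer; the helper functions `post_norm_step` and `pre_norm_step` are hypothetical names, and how hearth wires them inside TransformerEncoderBlock is not shown in this reference.

```python
import torch
from torch import nn


def post_norm_step(x: torch.Tensor, sublayer: nn.Module, norm: nn.Module) -> torch.Tensor:
    # post-norm (as in bert): normalize after adding the residual
    return norm(x + sublayer(x))


def pre_norm_step(x: torch.Tensor, sublayer: nn.Module, norm: nn.Module) -> torch.Tensor:
    # pre-norm: normalize the sublayer input, leave the residual path untouched
    return x + sublayer(norm(x))


features = 16
sublayer = nn.Linear(features, features)  # stand-in for attention or feedforward
norm = nn.LayerNorm(features)
x = torch.rand(2, 5, features)

post = post_norm_step(x, sublayer, norm)
pre = pre_norm_step(x, sublayer, norm)
```

In the pre-norm ordering the input reaches the output through an unnormalized skip path, which is the usual explanation for its reduced sensitivity to learning rate schedules.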