modules

AbsolutePositionalEmbedding

class AbsolutePositionalEmbedding(features, max_len=512, padding_idx=0)

Absolute learned positional embeddings à la BERT.

Parameters
  • features (int) – number of embedding features.

  • max_len (int) – max sequence length.

  • padding_idx (int) – used to mask padding timesteps.

Example

>>> import torch
>>> from hearth.modules import AbsolutePositionalEmbedding
>>>
>>> emb = AbsolutePositionalEmbedding(256, max_len=512)
>>> tokens = torch.tensor([[99, 6, 55, 1, 0, 0],
...                        [8, 22, 7, 8, 3, 11]])
>>> out = emb(tokens)
>>> out.shape
torch.Size([2, 6, 256])
>>> (out[tokens == 0] == 0).all()
tensor(True)
forward(tokens)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool

AttentionPool1D

class AttentionPool1D(in_features)

Attention pool for (batch first) timeseries data with masking.

Parameters

in_features (int) – number of input features.

Example

>>> import torch
>>> from hearth.modules import AttentionPool1D
>>>
>>> pool = AttentionPool1D(16)
>>> mask = torch.tensor([[ True,  True,  True, False, False],
...                      [ True,  True,  True,  True,  True]])
>>> inp = torch.rand(2, 5, 16)
>>> out = pool(inp, mask)
>>> out.shape
torch.Size([2, 16])
attention(x, mask)

get attention weights for the given input and mask.

Parameters
  • x (Tensor) – (batch, timesteps, features) inputs

  • mask (Tensor) – (batch, timesteps) mask where valid timesteps are True and padding timesteps are False.

Return type

Tensor

Returns

Tensor (batch, timesteps)
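
Example

A short usage sketch continuing the class example above (pool, inp and mask as defined there); the output shape follows from the documented return of (batch, timesteps):

>>> weights = pool.attention(inp, mask)
>>> weights.shape
torch.Size([2, 5])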

forward(x, mask)

pool timesteps using attention.

Parameters
  • x (Tensor) – (batch, timesteps, features) inputs

  • mask (Tensor) – (batch, timesteps) mask where valid timesteps are True and padding timesteps are False.

Return type

Tensor

Returns

Tensor (batch, features)

training: bool

BaseModule

class BaseModule

A base class like nn.Module but with a few extra useful bits.

Features include (see the sketch after this list):
  • auto config generation based on init args.

  • standardized loading and saving.

  • torchscripting via the script() method.

  • whole model freezing via the freeze() method.

  • and more…
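
Example

A minimal sketch of a BaseModule subclass (not from the original docs); it assumes the config is generated automatically from the __init__ arguments, as described above:

>>> from torch import nn
>>> from hearth.modules import BaseModule
>>>
>>> class TinyModel(BaseModule):
...     def __init__(self, in_features: int, out_features: int):
...         super().__init__()
...         self.linear = nn.Linear(in_features, out_features)
...
...     def forward(self, x):
...         return self.linear(x)
>>>
>>> model = TinyModel(4, 2)
>>> cfg = model.config()                   # captured from the __init__ args
>>> restored = TinyModel.from_config(cfg)  # rebuild a fresh instance from the config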

classmethod from_config(config)

given a valid config return a new instance of this Module.

classmethod load(model_dir, strict=True)

create a new instance of this module with config and parameters loaded from model_dir.

This method expects model_dir to have the following files:
  • state.pt: the state dict for this model

  • config.json: the config of this model, necessary to reinstantiate a new model with the from_config() method.

You can use the save() method on an instance of this module to create them.

Parameters
  • model_dir (str) – directory to load from.

  • strict (bool) – whether to be strict about loading the state dict. Defaults to True.

freeze()

freeze this model in place.

unfreeze()

unfreeze this model in place.

script()

torchscript this model using jit.script.

Return type

RecursiveScriptModule

trainable_parameters()

yields trainable parameters from this model.

Return type

Iterator[Parameter]

config()

get the config for this module.

This config can be passed to the from_config() class method to create a new instance of this module.

save(model_dir)

save this model's state and config to model_dir so it can be re-created later (see the sketch below).

This method will generate the following files in the model_dir:
  • state.pt: the state dict for this model

  • config.json: the config of this model, necessary to reinstantiate a new model with the from_config() method.

Parameters

model_dir (str) – directory to save stuff in.
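
Example

A hedged sketch of a save/load round trip, continuing the TinyModel sketch above and assuming 'my_model_dir' is a writable directory:

>>> model.save('my_model_dir')                 # writes state.pt and config.json
>>> restored = TinyModel.load('my_model_dir')  # rebuilds from config.json, then loads state.pt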

blocks()

Override this method to define depth-based sections of your network.

Defining blocks() is useful when you're doing finetuning or want to use depth-based learning rates. It's how methods like bottleneck() and unbottle() know how to iterate over your model. When overridden, this method should yield logical depth-based sections of your model, which should themselves be nn.Modules, in depth-based order from input to output. How you section your network or group things together is up to you. If not overridden, this method will just yield self.

Return type

Iterator[Module]
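
Example

A hedged sketch of overriding blocks() in a BaseModule subclass (the class and layer names here are illustrative); depth() is assumed to count the yielded blocks:

>>> from torch import nn
>>> from hearth.modules import BaseModule
>>>
>>> class TwoBlockNet(BaseModule):
...     def __init__(self, features: int):
...         super().__init__()
...         self.encoder = nn.Linear(features, features)
...         self.head = nn.Linear(features, 1)
...
...     def blocks(self):
...         # yield logical sections in depth order, from input to output
...         yield self.encoder
...         yield self.head
...
...     def forward(self, x):
...         return self.head(self.encoder(x))
>>>
>>> TwoBlockNet(8).depth()
2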

reverse_blocks()

yields blocks from this module's blocks() method in reverse order (from output to input).

Return type

Iterator[Module]

depth()

the block depth of this module (as defined in blocks()).

Return type

int

bottleneck(n)
unbottle()

unfreeze (in place) the next deepest block that is wholly or partially frozen.

Returns

the unfrozen block if there was anything to unbottle, otherwise None.

Return type

Union[None, nn.Module]
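
Example

A hedged sketch of gradual unfreezing with freeze() and unbottle(), continuing the TwoBlockNet sketch above; it assumes unbottle() works back from the output end toward the input, one block per call:

>>> model = TwoBlockNet(8)
>>> model.freeze()              # freeze everything
>>> block = model.unbottle()    # unfreeze the next deepest frozen block
>>> block is None
False
>>> sum(1 for _ in model.trainable_parameters()) > 0
True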

training: bool

Bertish

class Bertish(features=256, layers=4, vocab_size=30522, attn_heads=4, boom_scale=4, max_len=512, padding_idx=0, layer_norm_eps=1e-12, dropout=0.1, attn_dropout=0.1, activation='gelu', drop_head=False, pre_norm=False)

BERT-style base transformer model; supports loading huggingface transformers weights via the load_transformers_bert_state_dict() method.

Parameters
  • features (int) – number of base features. Defaults to 256.

  • layers (int) – number of transformer encoder layers. Defaults to 4.

  • vocab_size (int) – size of vocabulary. Defaults to 30522.

  • attn_heads (int) – number of attention heads. Defaults to 4.

  • boom_scale (int) – scale for intermediate size in feedforward network. Defaults to 4 (commonly used in bert architectures).

  • max_len (int) – max sequence length (needed for positional embeddings). Defaults to 512.

  • padding_idx (int) – padding idx. Defaults to 0.

  • layer_norm_eps (float) – epsilon for layer norms used throughout the network. Defaults to 1e-12.

  • dropout (float) – dropout rate for non-attention parts of network. Defaults to 0.1.

  • attn_dropout (float) – dropout rate for attention. Defaults to 0.1.

  • activation (str) – named activation for feedforward part of network. Defaults to ‘gelu’.

  • drop_head (bool) – If True, drop attention from entire attention heads as a dropout strategy; otherwise drop timesteps from different heads. Defaults to False.

  • pre_norm (bool) – If True, use the pre-normalization strategy, which may require less careful lr scheduling etc. Defaults to False as in bert-based models.
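
Example

A minimal usage sketch (not from the original docs); it assumes the model returns per-timestep features, i.e. an output of shape (batch, timesteps, features):

>>> import torch
>>> from hearth.modules import Bertish
>>>
>>> model = Bertish(features=256, layers=4)
>>> tokens = torch.tensor([[99, 6, 55, 1, 0, 0],
...                        [8, 22, 7, 8, 3, 11]])
>>> model(tokens).shape
torch.Size([2, 6, 256])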

blocks()

Override this method to define depth-based sections of your network.

Defining blocks() is useful when you're doing finetuning or want to use depth-based learning rates. It's how methods like bottleneck() and unbottle() know how to iterate over your model. When overridden, this method should yield logical depth-based sections of your model, which should themselves be nn.Modules, in depth-based order from input to output. How you section your network or group things together is up to you. If not overridden, this method will just yield self.

forward(tokens)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

load_transformers_bert_state_dict(sd)

load a state dict from a huggingface transformers model.
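
Example

A hedged sketch (requires the transformers package); it assumes the Bertish configuration must match the dimensions of the pretrained checkpoint, as for bert-base-uncased below:

>>> from transformers import BertModel
>>> hf = BertModel.from_pretrained('bert-base-uncased')
>>> model = Bertish(features=768, layers=12, attn_heads=12)
>>> model.load_transformers_bert_state_dict(hf.state_dict())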

training: bool

Boom

class Boom(in_features, scale=4, activation='gelu', dropout=0.1)

Feedforward part of the standard transformer network, sometimes affectionately referred to as the bOOm layer (since it expands and contracts).

Parameters
  • in_features (int) – number of input features.

  • scale (int) – scale for intermediate size (the OO in bOOm). Defaults to 4 (commonly used in bert architectures).

  • activation (str) – named activation. Defaults to ‘gelu’.

  • dropout (float) – Dropout rate for intermediate activation. Defaults to 0.1.

Example

>>> import torch
>>> from hearth.modules import Boom
>>>
>>> layer = Boom(16, scale=4, activation='gelu', dropout=0.1)
>>> inp = torch.rand(10, 16)
>>> layer(inp).shape
torch.Size([10, 16])
forward(x)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool

LayerNormSimple

class LayerNormSimple(normalized_shape, eps=1e-05)

Layer norm without learnable bias and gain.

This basically just wraps the standard torch LayerNorm so it has no elementwise affine by default.

Parameters
  • normalized_shape (Union[int, List[int], Size]) – input shape from an expected input. If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension which is expected to be of that specific size.

  • eps (float) – a value added to the denominator for numerical stability. Default: 1e-5

Reference:

Xu et al: Understanding and Improving Layer Normalization

Example

>>> import torch
>>> from hearth.modules import LayerNormSimple
>>>
>>> layer = LayerNormSimple(5) # 5 feats
>>> layer
LayerNormSimple(5, eps=1e-05)
>>> x = torch.tensor([[-4.2721, -2.8159, -1.2351,  0.2388,  4.5915],
...                  [-0.9092, -3.9666, -1.4216, -4.7373,  2.0403],
...                  [-1.2210,  4.4796, -1.2772, -2.8781,  4.1868]])
>>> layer(x)
tensor([[-1.1730, -0.6950, -0.1761,  0.3077,  1.7365],
        [ 0.3694, -0.9000,  0.1566, -1.2200,  1.5940],
        [-0.6139,  1.2486, -0.6323, -1.1554,  1.1530]])
forward(x)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool

MaskedLayerNorm

class MaskedLayerNorm(features, eps=1e-05, elementwise_affine=True)

layer norm that supports masked input.

Only calculates stats over and normalizes non-padding timesteps.

Inputs are expected to be of shape (B, T, …) and mask is expected to be a boolean mask of shape (B, T) where valid timesteps are True and invalid (padding) timesteps are False.

Example

>>> import torch
>>> from hearth.modules import MaskedLayerNorm
>>>
>>> layer = MaskedLayerNorm(16)
>>> mask = torch.tensor([[ True,  True,  True, False, False],
...                      [ True,  True,  True,  True,  True]])
>>> inp = torch.rand(2, 5, 16)
>>> out = layer(inp, mask)
>>> out.shape
torch.Size([2, 5, 16])
>>> (out == 0.0).all(-1)
tensor([[False, False, False,  True,  True],
        [False, False, False, False, False]])
forward(x, mask)

perform layer norm on valid timesteps in x as specified by mask.

Parameters
  • x (Tensor) – a tensor of shape (B, T, …)

  • mask (Tensor) – boolean mask of shape (B, T) where valid timesteps are True and invalid (padding) timesteps are False.

Return type

Tensor

normalized_shape: Tuple[int, ...]
eps: float
elementwise_affine: bool

ReZero

class ReZero(block, dropout=0.0)

Implements a ReZero residual connection around a block, with dropout (as in the transformer implementation).

Reference:

Bachlechner et al: ReZero is All You Need: Fast Convergence at Large Depth

Example

>>> import torch
>>> from torch import nn
>>> from hearth.modules import ReZero
>>>
>>> transformation = nn.Sequential(nn.Linear(10, 10),
...                                nn.ReLU(),
...                                nn.Dropout(.1),
...                                nn.Linear(10, 10)
...                                )
>>> re_zero = ReZero(transformation, dropout=.1)
>>>
>>> x = torch.normal(0, 1, size=(5, 10))
>>> y = re_zero(x)
>>> y.shape
torch.Size([5, 10])

since re_zero's weight parameter has not actually been trained yet, y is going to be equal to x and nothing from the transformation will be added to the input… as training goes on this should change.

>>> (y == x).all()
tensor(True)
forward(x)

forward for ReZero.

Return type

Tensor

training: bool

Residual

class Residual(block)

wraps a block in a residual connection: y = block(x) + x.

Parameters

block (Module) – the module to wrap.

Example

>>> import torch
>>> from torch import nn
>>> from hearth.modules import Residual
>>> _ = torch.manual_seed(0)
>>>
>>> res = Residual(nn.Linear(4, 4))
>>>
>>> x = torch.rand(2, 4) # (batch, feats)
>>> res(x)
tensor([[ 0.6371,  1.5493,  0.0031, -0.0379],
        [ 0.3584,  0.8512,  0.5208, -0.7607]], grad_fn=<AddBackward0>)
forward(x)

forward pass for the Residual wrapper.

Return type

Tensor

training: bool

SelfAttention

class SelfAttention(in_features, out_features=None, n_heads=1, bias=True, dropout=0.1, drop_head=False)

Scaled self attention layer for timeseries.

Parameters
  • in_features (int) – number of input features.

  • out_features (Optional[int]) – number of output features, if None defaults to in_features. Defaults to None.

  • n_heads (int) – number of attention heads, must evenly divide out_features. Defaults to 1.

  • bias (bool) – include bias in all layers. Defaults to True.

  • dropout (float) – attention dropout rate. Defaults to 0.1.

  • drop_head (bool) – If True, drop attention from entire attention heads as a dropout strategy; otherwise drop timesteps from different heads. Defaults to False.

Example

>>> import torch
>>> from hearth.modules import SelfAttention
>>>
>>> layer = SelfAttention(16, 32, n_heads=4)
>>> mask = torch.tensor([[ True,  True,  True, False, False],
...                      [ True,  True,  True,  True,  True]])
>>> inp = torch.rand(2, 5, 16)
>>> out = layer(inp, mask)
>>> out.shape
torch.Size([2, 5, 32])
>>> (out == 0.0).all(-1)
tensor([[False, False, False,  True,  True],
        [False, False, False, False, False]])
forward(x, mask)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool

TimeMasked

class TimeMasked(layer)

Call the wrapped layer such that only valid timesteps are passed to the underlying layer.

Inputs are expected to be of shape (B, T, …) and mask is expected to be a boolean mask of shape (B, T) where valid timesteps are True and invalid (padding) timesteps are False. The underlying layer is expected to accept a single input of the masked shape.

Example

>>> import torch
>>> from torch import nn
>>> from hearth.modules import TimeMasked
>>>
>>> layer = TimeMasked(nn.Sequential(nn.Linear(8, 12), nn.LayerNorm(12)))
>>> mask = torch.tensor([[ True,  True,  True, False, False],
...                      [ True,  True,  True,  True,  True]])
>>> inp = torch.rand(2, 5, 8)
>>> out = layer(inp, mask)
>>> out.shape
torch.Size([2, 5, 12])
>>> (out == 0.0).all(-1)
tensor([[False, False, False,  True,  True],
        [False, False, False, False, False]])
blocks()

Override this method to define depth-based sections of your network.

Defining blocks() is useful when you're doing finetuning or want to use depth-based learning rates. It's how methods like bottleneck() and unbottle() know how to iterate over your model. When overridden, this method should yield logical depth-based sections of your model, which should themselves be nn.Modules, in depth-based order from input to output. How you section your network or group things together is up to you. If not overridden, this method will just yield self.

forward(x, mask)

call the underlying layer using only the valid timesteps from mask, padding invalid timesteps with 0.

Parameters
  • x (Tensor) – a tensor of shape (B, T, …)

  • mask (Tensor) – boolean mask of shape (B, T) where valid timesteps are True and invalid (padding) timesteps are False.

Return type

Tensor

training: bool

TransformerEmbedding

class TransformerEmbedding(vocab_size, features, max_len=512, dropout=0.1, layer_norm_eps=1e-12, padding_idx=0)

simple version of the embedding used in transformer models.

Note

this embedding does not use token type embeddings as used in some BERT models. If using with a pretrained model, it's recommended that you first add these to the word embeddings.

Parameters
  • vocab_size (int) – vocabulary size for word embeddings.

  • features (int) – number of features in embedding space.

  • max_len (int) – maximum sequence length for positional embeddings. Defaults to 512.

  • dropout (float) – dropout rate. Defaults to 0.1.

  • layer_norm_eps (float) – epsilon for layer normalization. Defaults to 1e-12.

  • padding_idx (int) – index for non-valid padding timesteps. Defaults to 0.

Example

>>> import torch
>>> from hearth.modules import TransformerEmbedding
>>>
>>> emb = TransformerEmbedding(1000, 256, padding_idx=0)
>>> tokens = torch.tensor([[99, 6, 55, 1, 0, 0],
...                        [8, 22, 7, 8, 3, 11]])
>>> out = emb(tokens)
>>> out.shape
torch.Size([2, 6, 256])
>>> (out == 0.0).all(-1)
tensor([[False, False, False, False,  True,  True],
        [False, False, False, False, False, False]])
>>> emb.build_mask(tokens)
tensor([[ True,  True,  True,  True, False, False],
        [ True,  True,  True,  True,  True,  True]])
build_mask(tokens)

get a mask where all valid timesteps are True

Return type

Tensor

forward(tokens)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool

TransformerEncoder

class TransformerEncoder(features, layers, n_heads, dropout=0.1, attn_dropout=0.1, activation='gelu', boom_scale=4, drop_head=False, pre_norm=False, layer_norm_eps=1e-12)

a stack of transformer encoder layers.

Parameters
  • features (int) – number of features.

  • layers (int) – number of layers.

  • n_heads (int) – number of attention heads.

  • dropout (float) – general dropout rate used in feedforward part of network and between residual connections. Defaults to 0.1.

  • attn_dropout (float) – dropout for self attention. Defaults to 0.1.

  • activation (str) – named activation for feedforward part of network. Defaults to ‘gelu’.

  • boom_scale (int) – scale for intermediate size in feedforward network. Defaults to 4 (commonly used in bert architectures).

  • drop_head (bool) – If True, drop attention from entire attention heads as a dropout strategy; otherwise drop timesteps from different heads. Defaults to False.

  • pre_norm (bool) – If True, use the pre-normalization strategy, which may require less careful lr scheduling etc. Defaults to False as in bert-based models.

  • layer_norm_eps (float) – epsilon for layer norms used throughout the network. Defaults to 1e-12.

Example

>>> import torch
>>> from hearth.modules import TransformerEncoder
>>>
>>> model = TransformerEncoder(16, layers=3, n_heads=4, boom_scale=4)
>>> mask = torch.tensor([[ True,  True,  True, False, False],
...                      [ True,  True,  True,  True,  True]])
>>> inp = torch.rand(2, 5, 16)
>>> model(inp, mask).shape
torch.Size([2, 5, 16])
>>> model.depth()
3
blocks()

Override this method to define depth-based sections of your network.

Defining blocks() is useful when you're doing finetuning or want to use depth-based learning rates. It's how methods like bottleneck() and unbottle() know how to iterate over your model. When overridden, this method should yield logical depth-based sections of your model, which should themselves be nn.Modules, in depth-based order from input to output. How you section your network or group things together is up to you. If not overridden, this method will just yield self.

forward(x, mask)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool

TransformerEncoderBlock

class TransformerEncoderBlock(features, n_heads, dropout=0.1, attn_dropout=0.1, activation='gelu', boom_scale=4, drop_head=False, pre_norm=False, layer_norm_eps=1e-12)

a single transformer encoder block.

Parameters
  • features (int) – number of features.

  • n_heads (int) – number of attention heads.

  • dropout (float) – general dropout rate used in feedforward part of network and between residual connections. Defaults to 0.1.

  • attn_dropout (float) – dropout for self attention. Defaults to 0.1.

  • activation (str) – named activation for feedforward part of network. Defaults to ‘gelu’.

  • boom_scale (int) – scale for intermediate size in feedforward network. Defaults to 4 (commonly used in bert architectures).

  • drop_head (bool) – If True, drop attention from entire attention heads as a dropout strategy; otherwise drop timesteps from different heads. Defaults to False.

  • pre_norm (bool) – If True, use the pre-normalization strategy, which may require less careful lr scheduling etc. Defaults to False as in bert-based models.

  • layer_norm_eps (float) – epsilon for layer norms used throughout the network. Defaults to 1e-12.

Example

>>> import torch
>>> from hearth.modules import TransformerEncoderBlock
>>>
>>> layer = TransformerEncoderBlock(16, n_heads=4, boom_scale=4)
>>> mask = torch.tensor([[ True,  True,  True, False, False],
...                      [ True,  True,  True,  True,  True]])
>>> inp = torch.rand(2, 5, 16)
>>> layer(inp, mask).shape
torch.Size([2, 5, 16])

with pre-norm scheme:

>>> layer = TransformerEncoderBlock(16, n_heads=4, boom_scale=4, pre_norm=True)
>>> layer(inp, mask).shape
torch.Size([2, 5, 16])
forward(x, mask)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type

Tensor

training: bool