samplers

BatchSubsequenceSampler

class BatchSubsequenceSampler(dataset, batch_size, sequence_length, drop_last=False)

a batch sampler that generates nested, ordered subsequence indexes aligned within each batch.

output indexes from this sampler will be lists of shape [batch_size, sequence_length].

Note

Internally this uses SubsequenceSampler to generate batches and will always drop the short subsequence if the length of the dataset is not divisible by sequence_length.
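
To make the Note concrete, here is the arithmetic behind the batch count in the example below (plain Python, following the behavior described above):

>>> 113 // 5      # full subsequences of length 5; the trailing 3 indexes are dropped
22
>>> -(-22 // 4)   # ceil(22 / 4): number of batches at batch_size=4
6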

Parameters
  • dataset (Sized) – something we can call len() on that represents the size of the dataset being batched. It can be the dataset itself, a tensor, list, range etc…

  • batch_size (int) – the desired batch size

  • sequence_length (int) – desired sequence length

  • drop_last (bool) – if True, drop the last batch if it contains fewer than batch_size sequences. Defaults to False.

Example

>>> import torch
>>> from hearth.data.samplers import BatchSubsequenceSampler
>>> _ = torch.manual_seed(0)
>>> sampler = BatchSubsequenceSampler(range(113), batch_size=4, sequence_length=5)
>>> len(sampler)
6

batch indexes are generated for sequential subsequences of shape (batch_size, sequence_length); the last batch may contain fewer sequences, but sequence lengths will never change.

>>> for batch_idxes in sampler:
...    print(batch_idxes)
[[20, 21, 22, 23, 24], [10, 11, 12, 13, 14], [55, 56, 57, 58, 59], [30, 31, 32, 33, 34]]
[[5, 6, 7, 8, 9], [60, 61, 62, 63, 64], [80, 81, 82, 83, 84], [15, 16, 17, 18, 19]]
[[0, 1, 2, 3, 4], [90, 91, 92, 93, 94], [103, 104, 105, 106, 107], [40, 41, 42, 43, 44]]
[[85, 86, 87, 88, 89], [50, 51, 52, 53, 54], [35, 36, 37, 38, 39], [75, 76, 77, 78, 79]]
[[45, 46, 47, 48, 49], [65, 66, 67, 68, 69], [70, 71, 72, 73, 74], [108, 109, 110, 111, 112]]
[[98, 99, 100, 101, 102], [25, 26, 27, 28, 29]]
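
with drop_last=True the short final batch of 2 sequences above would be dropped instead; a minimal sketch (assuming len() accounts for drop_last, as with torch's BatchSampler):

>>> sampler = BatchSubsequenceSampler(range(113), batch_size=4, sequence_length=5, drop_last=True)
>>> len(sampler)
5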
classmethod build_dataloader(dataset, batch_size, sequence_length, drop_last=False, **kwargs)

creates a new DataLoader using BatchSubsequenceSampler.

Note

extra keyword arguments will be passed to torch.utils.data.DataLoader.

Parameters
  • dataset (Sized) – the actual dataset you’d like to use; this dataset should support nested multi-indexing. torch.utils.data.TensorDataset works just fine.

  • batch_size (int) – the desired batch size for the sequences

  • sequence_length (int) – desired sequence length for subsequences

  • drop_last (bool) – if True, drop the last batch if it contains fewer than batch_size sequences. Defaults to False.

Return type

DataLoader
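
Example

a minimal usage sketch (the toy tensors and expected shapes here are illustrative assumptions and rely on torch's default collation):

>>> import torch
>>> from torch.utils.data import TensorDataset
>>> from hearth.data.samplers import BatchSubsequenceSampler
>>>
>>> # a toy dataset of 113 timesteps, each with 8 features and a scalar target
>>> ds = TensorDataset(torch.randn(113, 8), torch.randn(113))
>>> loader = BatchSubsequenceSampler.build_dataloader(ds, batch_size=4, sequence_length=5)
>>> x, y = next(iter(loader))
>>> # with default collation we expect x.shape == (4, 5, 8) and y.shape == (4, 5);
>>> # extra keyword arguments (e.g. num_workers=2) would be forwarded to DataLoader.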


SubsequenceSampler

class SubsequenceSampler(dataset, batch_size, drop_shortest=False)

a batch sampler that keeps sequences aligned within batches but shuffles batches.

Often when training on a dataset that represents a single full sequence, we may want to window that sequence into multiple subsequences and shuffle those. Additionally, when the length of the dataset is not divisible by batch_size, we choose a different short subsequence on each iteration, which means batches do not always start and end at the same positions. The shortest batch will always be last, to comply with the behavior expected from other batchers.

Parameters
  • dataset (Sized) – something we can call len() on that represents the size of the dataset being batched. It can be the dataset itself, a tensor, list, range etc…

  • batch_size (int) – the desired batch size

  • drop_shortest (bool) – if True, drop the shortest batch on each iteration if the size of the dataset is not divisible by batch_size. The short batch will be chosen randomly on each iteration (providing a little extra noise and ensuring we don’t see exactly the same subsequences on each iteration). Defaults to False.

Example

>>> import torch
>>> from hearth.data.samplers import SubsequenceSampler
>>> _ = torch.manual_seed(0)
>>>
>>> # SubsequenceSampler only needs the length of the dataset,
>>> # so it could be a range object or a Dataset, a tensor etc...
>>> sampler = SubsequenceSampler(range(15), batch_size=4)
>>> sampler
SubsequenceSampler(dataset=range(0, 15), batch_size=4, drop_shortest=False)

by default we will get one short batch of 3, since batch_size=4 and drop_shortest=False:

>>> for batch in sampler:
...     print(batch)
[11, 12, 13, 14]
[7, 8, 9, 10]
[3, 4, 5, 6]
[0, 1, 2]

between runs we shuffle and choose a new short batch:

>>> for batch in sampler:
...     print(batch)
[8, 9, 10, 11]
[0, 1, 2, 3]
[4, 5, 6, 7]
[12, 13, 14]

when drop_shortest=True the randomly chosen short batch will be dropped:

>>> for batch in SubsequenceSampler(range(15), batch_size=4, drop_shortest=True):
...     print(batch)
[11, 12, 13, 14]
[0, 1, 2, 3]
[7, 8, 9, 10]

and again this will be different for each iteration:

>>> for batch in SubsequenceSampler(range(15), batch_size=4, drop_shortest=True):
...     print(batch)
[11, 12, 13, 14]
[7, 8, 9, 10]
[3, 4, 5, 6]

when the number of examples in the dataset is divisible by batch_size, behavior is a little more deterministic, since the indexes in each batch will always be the same:

>>> for batch in SubsequenceSampler(range(12), batch_size=4):
...     print(batch)
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9, 10, 11]

but batches will still be shuffled between iterations:

>>> for batch in SubsequenceSampler(range(12), batch_size=4):
...     print(batch)
[8, 9, 10, 11]
[0, 1, 2, 3]
[4, 5, 6, 7]
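
SubsequenceSampler follows the standard torch batch-sampler protocol, so it can also be plugged into a DataLoader directly via the batch_sampler argument; a sketch under that assumption (this page only documents a dataloader helper for BatchSubsequenceSampler):

>>> import torch
>>> from torch.utils.data import DataLoader, TensorDataset
>>> from hearth.data.samplers import SubsequenceSampler
>>>
>>> ds = TensorDataset(torch.arange(15))
>>> loader = DataLoader(ds, batch_sampler=SubsequenceSampler(ds, batch_size=4))
>>> # batch contents shuffle between runs, but the short batch of 3 is always last:
>>> [batch.shape[0] for (batch,) in loader]
[4, 4, 4, 3]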