tokenizers

Tokenizer

class Tokenizer(vocab, **kwargs)

Base class for tokenizers implementing a few useful abstractions for serialization and batch/non-batch usage.

Not meant to be used directly. All child tokenizers should override the tokenize method at minimum.

tokenize(s)
Return type

List[int]
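
Since child classes only need to supply tokenize, a subclass can stay small. The sketch below is illustrative only: the WhitespaceTokenizer class and its oov_id parameter are hypothetical, and it assumes the base class stores the vocab passed to __init__ on self.vocab and that Tokenizer has been imported from this library (the page's module name is tokenizers).

    from typing import Dict, List

    class WhitespaceTokenizer(Tokenizer):
        """Hypothetical tokenizer: split on whitespace, look ids up in the vocab."""

        def __init__(self, vocab: Dict[str, int], oov_id: int = 0, **kwargs):
            super().__init__(vocab, **kwargs)
            self.oov_id = oov_id  # id returned for words missing from the vocab

        def tokenize(self, s: str) -> List[int]:
            # The only required override: string in, list of token ids out.
            # Assumes the base __init__ exposes the vocab as self.vocab.
            return [self.vocab.get(w, self.oov_id) for w in s.split()]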

property vocab_size: int
Return type

int

config()

Get a jsonable config dict for this Tokenizer.

Used in combination with load and save for tokenizer serialization.

classmethod from_config(config)

Load a config dictionary into a new instance of this tokenizer class.

save(path)

Save this tokenizer’s config as a JSON file at path.

classmethod load(path)

Load a new instance of this Tokenizer class using the config found at path.

Return type

Tokenizer
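
Taken together, config/from_config and save/load give two round trips: one through an in-memory dict and one through a JSON file on disk. A minimal usage sketch, assuming tok is any concrete Tokenizer instance and the file path is writable:

    # In-memory round trip via the jsonable config dict.
    cfg = tok.config()
    restored = type(tok).from_config(cfg)

    # On-disk round trip: save() writes the config as JSON, load() reads it
    # back and returns a new instance of the same tokenizer class.
    tok.save("tokenizer.json")
    reloaded = type(tok).load("tokenizer.json")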

tokenize_batch(batch)

Tokenize a batch of strings.

Return type

List[List[int]]
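
A quick usage sketch, assuming tok is any concrete Tokenizer instance; presumably this is equivalent to mapping tokenize over the batch, though that is an assumption rather than something stated here:

    single = tok.tokenize("one document")                       # List[int]
    batch = tok.tokenize_batch(["one document", "another"])     # List[List[int]]
    # Presumably batch[0] == tok.tokenize("one document")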


WordPieceTokenizer

class WordPieceTokenizer(vocab, subword_prefix='##', bos='[CLS]', eos='[SEP]', oov='OOV', lower=True)

Bert-style word piece tokenizer.

Parameters
  • vocab (Dict[str, int]) – Vocab dictionary mapping text to index.

  • subword_prefix (str) – Defines how subwords are marked in the vocab. Defaults to ‘##’ as in Bert.

  • bos (str) – Token to prepend to the beginning of the sequence. Defaults to ‘[CLS]’ as in Bert.

  • eos (str) – Token to append to the end of the sequence. Defaults to ‘[SEP]’ as in Bert.

  • oov (str) – Token used for out-of-vocabulary input. With word piece tokenization this happens very rarely and usually indicates out-of-vocabulary characters rather than whole words, but for completeness both an OOV and a ##OOV token are added to the vocabulary, replacing the [unused0] and [unused1] tokens, if no tokens matching this value are found in the vocab. Note that this differs slightly from most Bert tokenizer implementations, which do not acknowledge OOV. Defaults to ‘OOV’; consider changing this if the tokenizer is not lowercase.

  • lower (bool) – If True, lowercase all input. Defaults to True.

clean(s)
config()

Get a jsonable config dict for this Tokenizer.

Used in combination with load and save for tokenizer serialization.

subwords(s, _prefix='')
split(s)
tokenize(s)
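
An illustrative end-to-end sketch with a toy vocab, assuming WordPieceTokenizer has been imported from this library. The vocabulary, the ids, and the exact WordPiece split shown in the comments are assumptions for demonstration based on standard greedy WordPiece behavior, not output captured from this library:

    vocab = {
        "[CLS]": 0, "[SEP]": 1, "OOV": 2, "##OOV": 3,
        "hello": 4, "un": 5, "##afford": 6, "##able": 7,
    }
    tok = WordPieceTokenizer(vocab)

    ids = tok.tokenize("Hello unaffordable")
    # With lower=True the expected pieces are roughly:
    #   [CLS] hello un ##afford ##able [SEP]  ->  [0, 4, 5, 6, 7, 1]
    print(ids)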