tokenizers¶
Tokenizer¶
- class Tokenizer(vocab, **kwargs)¶
Base class for tokenizers implementing a few useful abstractions for serialization and batch/non-batch usage.
Not meant to be used directly. All child tokenizers should override the tokenize method at minimum.
- tokenize(s)¶
- Return type
List[int]
- property vocab_size: int¶
- Return type
int
- config()¶
Get a JSON-able config dict for this Tokenizer.
Used in combination with load and save for tokenizer serialization.
- classmethod from_config(config)¶
Load a config dictionary into a new instance of this tokenizer class.
- save(path)¶
Save this tokenizer’s config as a JSON file at path.
- classmethod load(path)¶
Load a new instance of this Tokenizer class using the config found at path.
- Return type
Tokenizer
- tokenize_batch(batch)¶
Tokenize a batch of strings.
- Return type
List[List[int]]
WordPieceTokenizer¶
- class WordPieceTokenizer(vocab, subword_prefix='##', bos='[CLS]', eos='[SEP]', oov='OOV', lower=True)¶
Bert style word piece tokenizer.
- Parameters
vocab (Dict[str, int]) – Vocab dictionary mapping text to index.
subword_prefix (str) – Defines how subwords are marked in the vocab. Defaults to ‘##’ as in Bert.
bos (str) – Token to prepend to the beginning of the sequence. Defaults to ‘[CLS]’ as in Bert.
eos (str) – Token to append to the end of the sequence. Defaults to ‘[SEP]’ as in Bert.
oov (str) – Token for handling out-of-vocabulary input. With word piece tokenization this tends to happen very rarely, and usually indicates unknown characters rather than whole words, but for completeness both an OOV and a ##OOV token are added to the vocabulary (replacing the [unused0] and [unused1] tokens) if no OOV tokens matching this value are found in the vocab. Note that this differs slightly from most Bert tokenizer implementations, which do not acknowledge OOV. Defaults to ‘OOV’; consider changing this if the tokenizer is not lowercase.
lower (bool) – If True, lowercase everything. Defaults to True.
- clean(s)¶
- config()¶
Get a JSON-able config dict for this Tokenizer.
Used in combination with load and save for tokenizer serialization.
- subwords(s, _prefix='')¶
- split(s)¶
- tokenize(s)¶
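Bert-style word piece splitting is conventionally a greedy, longest-match-first scan: take the longest prefix of the word found in the vocab, mark every non-initial piece with the subword prefix, and fall back to the OOV token character by character. A sketch of that scheme under those assumptions (a standalone re-implementation for illustration, not the library's actual subwords method):

```python
from typing import Dict, List


def wordpiece_subwords(word: str, vocab: Dict[str, int],
                       prefix: str = "##", oov: str = "OOV") -> List[str]:
    """Greedy longest-match-first word piece split (illustrative sketch)."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # non-initial pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matched: emit OOV (prefixed mid-word) for one character,
            # mirroring the docs' note that OOV usually covers characters.
            match = prefix + oov if start > 0 else oov
            end = start + 1
        pieces.append(match)
        start = end
    return pieces
```

With a toy vocab containing `un`, `##aff`, and `##able`, the word “unaffable” splits into those three pieces; a full tokenizer would then map pieces to ids and wrap the sequence in the bos/eos tokens.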