virtex.data.tokenizers


class virtex.data.tokenizers.SentencePieceBPETokenizer(model_path: str)

Bases: object

A tokenizer based on SentencePiece with a BPE sub-routine. It encodes caption strings into lists of tokens.

Parameters

model_path – Path to the .model file trained by SentencePiece.

get_vocab_size() → int

Return the number of tokens in the vocabulary (including special tokens).

token_to_id(token: str) → int

Get the integer ID of a string token (the ID of <unk> if the token does not exist).

id_to_token(token_id: int) → str

Get the string token of an integer ID (<unk> if the ID does not exist).
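The <unk> fallback described for token_to_id and id_to_token can be illustrated with a minimal sketch. ToyVocab below is a hypothetical stand-in written for this example, not the virtex implementation; placing <unk> at ID 0 is an assumption, and the real ID depends on how the SentencePiece model was trained.

```python
from typing import Dict, List


class ToyVocab:
    """Hypothetical stand-in illustrating the documented <unk> fallback."""

    def __init__(self, tokens: List[str]) -> None:
        # Assumption: <unk> takes ID 0; a trained model may place it elsewhere.
        self._id_to_token: List[str] = ["<unk>"] + tokens
        self._token_to_id: Dict[str, int] = {
            tok: idx for idx, tok in enumerate(self._id_to_token)
        }

    def get_vocab_size(self) -> int:
        # The count includes the special <unk> token.
        return len(self._id_to_token)

    def token_to_id(self, token: str) -> int:
        # Strings missing from the vocabulary map to the ID of <unk>.
        return self._token_to_id.get(token, self._token_to_id["<unk>"])

    def id_to_token(self, token_id: int) -> str:
        # IDs outside the vocabulary map to the <unk> string.
        if 0 <= token_id < len(self._id_to_token):
            return self._id_to_token[token_id]
        return "<unk>"


vocab = ToyVocab(["a", "dog", "runs"])
print(vocab.get_vocab_size())                               # 4
print(vocab.token_to_id("dog"), vocab.token_to_id("cat"))   # 2 0
print(vocab.id_to_token(3), vocab.id_to_token(99))          # runs <unk>
```

Because neither lookup raises on a miss, callers that need to detect unknown tokens would have to compare the result against the <unk> ID or string explicitly.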

encode(text: str) → List[int]

Convert a text string to a list of integer token IDs.

decode(token_ids: List[int]) → str

Convert a sequence of token IDs to a text string.
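As a rough illustration of how encode and decode pair up, the sketch below runs a caption through both methods. The round_trip helper, the CaptionTokenizer protocol, and the WhitespaceTokenizer class are all hypothetical names introduced for this example; the stand-in splits on whitespace rather than applying BPE, so it only mirrors the method signatures documented above, not the real sub-word behaviour.

```python
from typing import List, Protocol


class CaptionTokenizer(Protocol):
    """The two documented methods this sketch relies on."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, token_ids: List[int]) -> str: ...


def round_trip(tokenizer: CaptionTokenizer, caption: str) -> str:
    """Encode a caption to token IDs, then decode the IDs back to text."""
    token_ids = tokenizer.encode(caption)
    return tokenizer.decode(token_ids)


class WhitespaceTokenizer:
    """Hypothetical stand-in: one token per whitespace-separated word."""

    def __init__(self, words: List[str]) -> None:
        # Assumption: ID 0 is reserved for <unk>.
        self._words: List[str] = ["<unk>"] + words

    def encode(self, text: str) -> List[int]:
        # Words outside the vocabulary fall back to the <unk> ID (0).
        return [
            self._words.index(w) if w in self._words else 0
            for w in text.split()
        ]

    def decode(self, token_ids: List[int]) -> str:
        return " ".join(self._words[i] for i in token_ids)


toy = WhitespaceTokenizer(["a", "dog", "runs"])
print(toy.encode("a dog runs"))        # [1, 2, 3]
print(round_trip(toy, "a dog runs"))   # a dog runs
print(round_trip(toy, "a cat runs"))   # a <unk> runs
```

Since round_trip only uses encode and decode, an instance of SentencePieceBPETokenizer built from a trained .model file should be usable in place of the stand-in, although the decoded text may not match the input exactly when the model normalizes text or maps pieces to <unk>.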