How machines read
The most fundamental step of almost every NLP model is a process called tokenization. This week in Let’s Talk Text we’ll discover why a tokenization algorithm called byte-pair encoding (BPE) is used by nearly every large-language model. Often called the dark horse of large language models, it is a crucial building block foundational to the success of models such as BERT, GPT-3, and XLNet. Before we get to BPE, let’s start at the beginning.
Tokenization is the process of representing raw text in smaller units called tokens. These tokens are often then represented as numbers, which are essentially IDs mapping back to the token. This collection of tokens corresponds to a length of the text and often gets further encoded to be represented by a vector.
Fundamentally, tokenization is a hard problem and comes with many challenges. How you tokenize text containing punctuation, contractions, or symbols will inevitably impact the performance of the model. Throughout the history of NLP, there have been many ways to construct these tokens, each with its pros and cons. The standard approaches are word-based, character-based, and subword tokenization.
Word tokenization
This is the standard way to tokenize the text and it’s the most intuitive. You just split up your text based on whitespace. Models such as Word2Vec and GloVe rely on word-level tokenization. The biggest drawback with this approach is that any words that didn’t show up in the training data are unknown to the model and represented with a <UNK>. We may have the word tokenization in our vocabulary, but we’re out of luck if we encounter tkenization.
Character tokenization
How about instead of creating tokens by words, we use characters? That way when we encounter a word outside of our vocabulary, we don’t need to represent it with an <UNK>, but we can use our understanding of the characters in the word to represent it. One of the first context-aware embedding methods ELMo did just that. Unfortunately, the large number of tokens required to represent a sequence of text combined with the fact that a character by itself carries no meaning results in suboptimal performance.
Subword tokenization
Finally, subword tokenization--the best of both worlds. The word nauseant may not be in our vocabulary, but nausea likely is, and therefore we have some understanding that nauseant is an agent that induces nausea. We get to have a reasonable vocabulary size and try to process unseen words. There are several types of subword tokenization that are used by the best performing language models:
CharBPETokenizer: The original BPE
ByteLevelBPETokenizer: The byte-level version of the BPE
SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece
Interestingly, however, some research claims that there’s never been a direct evaluation of the impact of different subword tokenization methods on language model pre-training.
BPE algorithm
So how does it work? Let’s go over the steps of the method.
Use basic tokenization (on spaces, newlines, ...) to separate words from each other
Treat each character of the word as a token
Compute the frequency of all n-grams
Recursively merge tokens prioritizing those with the highest score
A parameter of the tokenization algorithm is the size of the vocabulary. We’ll continue to do merges of our training data until we reach this stopping criteria.
Example tokenization
Consider the two simple sentences:
The cat in the hat
Go to bat for me
First iteration
T h e | c a t | i n | t h e | h a t
G o | t o | b a t | f o r | m e
th → 2
he → 2
ca → 1
at → 3
in → 1
ha → 1
go → 1
to → 1
ba → 1
fo → 1
or → 1
me → 1
Merge at together
Second iteration
T h e | c at | i n | t h e | h at
G o | t o | b at | f o r | m e
th → 2
he → 2
cat → 1
in → 1
hat → 1
go → 1
to → 1
bat → 1
fo → 1
or → 1
me → 1
Merge th together
Th e | c at | i n | th e | h at
G o | t o | b at | f o r | m e
Computational considerations
Current state-of-the-art language models are trained on an incredibly large amount of text. GPT-3 was trained on 45TB of text data from various sources and has a vocabulary size of ~50k tokens. While BPE is a greedy algorithm, which is designed for speed by choosing the local best solution at each iteration, running the algorithm on such a large scale of data takes a lot of time. HuggingFace has made great efforts to create bindings to run BPE with Rust resulting in 1GB of text taking less than 20 seconds. Even using this super-optimized implementation with one worker on 45TB of text data would take:
45 * 1,000 * 20 / 60 / 60 = 250 hours
Source, Source, Source, Source