A Guide to Byte Pair Encoding from First Principles
We pay for AI usage in ‘tokens’ and measure context windows in ‘tokens,’ but what exactly are we counting?
Since machines process integers, not English, we need a reliable way to translate human nuance into rigid numbers without losing meaning. This is the story of Byte Pair Encoding (BPE).
It is the reason you can chat with AI in English, switch to Japanese, paste some Python code, and drop a 🤠 emoji without the system crashing. It gives computers the flexibility to read characters (when needed) but the efficiency to read words (most of the time).
I’ll explain it from first principles, demonstrating exactly why naive tokenization methods break down—and why BPE is the elegant solution we rely on today.
How Big Should a “Token” Be?
If we want to teach a computer the word “Apple”, we have two obvious approaches:
We could turn every single letter into a number (A=1, B=2,…).
- ✅ Pro: We only need a tiny vocabulary for languages like English.
- ❌ Con: Too slow. The model has to process 5 separate tokens just to grasp the single concept of “Apple.” It’s like reading a book one letter at a time.
We could assign a unique number to every single word in the English language.
- ✅ Pro: It’s fast. “Apple” is just one number.
- ❌ Con: Too rigid. Language is infinite. What happens when the model sees a new word like “iPhone17” or a typo? If it’s not in the dictionary, the model breaks or sees an <Unknown> token.
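To make the trade-off concrete, here is a minimal Python sketch of both naive approaches; the toy word vocabulary (`word_to_id`) is purely illustrative.

```python
# Approach 1: character-level -- tiny vocabulary, but many tokens per concept.
char_tokens = [ord(c) for c in "Apple"]
print(char_tokens)  # [65, 112, 112, 108, 101] -> 5 tokens for one word

# Approach 2: word-level -- one token per word, but brittle for unseen words.
word_to_id = {"apple": 0, "the": 1, "cat": 2}  # toy vocabulary, for illustration only
print(word_to_id.get("apple"))     # 0 -> fast, a single token
print(word_to_id.get("iPhone17"))  # None -> out of vocabulary: <Unknown> territory
```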
The Solution: Byte Pair Encoding (BPE)
BPE is the middle ground. It keeps common words whole (like “the” or “apple”) but breaks rare or complex words into smaller, reusable chunks (like “ing”, “pre”, or “ed”).
This gives us the best of both worlds:
- Efficiency: Common words are single tokens.
- Flexibility: It can build any word—even ones it has never seen—out of sub-word pieces. “uninstagrammable” becomes
['un', 'instagram', 'mable'].
Now let’s walk through how BPE does this, step by step.

Part 1: The Training Phase
Before the AI can read, we have to build its vocabulary. This “Training” phase isn’t the complex neural network training you read about in the news; it’s effectively a statistical compression algorithm.
Why do we need Pre-tokenization?
Before analyzing texts, we split them into logical units. If we blindly treated every string as unique, we would face two major headaches:
1. Handling Punctuation: We don’t want “Hello” and “Hello,” to be two different tokens. That wastes vocabulary space. So, we force a split between words and punctuation.
2. Handling Spaces: If we just split by space, we lose the spaces. We wouldn’t know if the original text was “Hello world” or “Helloworld.” To fix this, modern tokenizers glue the space to the start of the next word.
This ensures the process is reversible. We can reconstruct the sentence perfectly by joining the tokens.
- Raw Text: “Hello, world!”
- Prepped:
["Hello", ",", " world", "!"]
How to represent every character in the world?
To a computer, a letter isn’t a shape; it’s a sequence of 1 to 4 bytes (UTF-8). This allows us to represent over 1.1 million possible symbols—from English letters to Kanji to Emojis—using a base vocabulary of just 256 byte values.
h -> [104]
Space -> [32]
🤠 -> [240, 159, 164, 160] (4 bytes)
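You can check these byte sequences yourself with Python’s built-in UTF-8 encoder:

```python
print(list("h".encode("utf-8")))    # [104]
print(list(" ".encode("utf-8")))    # [32]
print(list("🤠".encode("utf-8")))   # [240, 159, 164, 160]
```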
How BPE Builds a Vocabulary
The BPE algorithm is elegant because it is purely statistical. It doesn’t know grammar; it just counts frequency. The process follows a simple loop:
- Count every pair of adjacent tokens in the dataset.
- Find the most frequent pair.
- Merge that pair into a new, single token.
- Repeat until you reach your target vocabulary size.
Let’s see this in action with a toy dataset: “hug pug hug pug”.
Step 1: The Raw Input (Byte Level)
To the computer, our phrase starts as a sequence of raw UTF-8 bytes. It doesn’t see words yet; it just sees a stream of integers.
Here is the initial state for [“hug”, “ pug”, “ hug”, “ pug”]:
[
[104, 117, 103], # "hug"
[32, 112, 117, 103], # " pug" (notice the 32 for space)
[32, 104, 117, 103], # " hug"
[32, 112, 117, 103] # " pug"
]
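Producing this initial state takes one line of Python:

```python
pretokens = ["hug", " pug", " hug", " pug"]
byte_sequences = [list(word.encode("utf-8")) for word in pretokens]
print(byte_sequences)
# [[104, 117, 103], [32, 112, 117, 103], [32, 104, 117, 103], [32, 112, 117, 103]]
```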
Step 2: Counting Pairs
Now, BPE scans this list for adjacent pairs. It asks: “Which two numbers appear side by side the most?”
Is it 104 (h) followed by 117 (u) or 117 (u) followed by 103 (g)?
Let’s look at the counts:
| Pair (Chars) | Pair (Bytes) | Frequency |
|---|---|---|
| ('u', 'g') | (117, 103) | 4 |
| ('p', 'u') | (112, 117) | 2 |
| ('h', 'u') | (104, 117) | 2 |
The Winner: u + g (4 occurrences).
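Continuing the snippet above, counting adjacent pairs is a few lines with `collections.Counter`:

```python
from collections import Counter

# Count every adjacent pair of IDs inside each pre-token.
pair_counts = Counter()
for seq in byte_sequences:
    for pair in zip(seq, seq[1:]):
        pair_counts[pair] += 1

print(pair_counts[(117, 103)])  # 4 -> ('u', 'g')
print(pair_counts[(112, 117)])  # 2 -> ('p', 'u')
print(pair_counts[(104, 117)])  # 2 -> ('h', 'u')
print(max(pair_counts, key=pair_counts.get))  # (117, 103) -- the winner
```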
Step 3: Merging the Winner
The pair u (117) and g (103) appears most often (4 times). We merge them into a new single token.
But what ID does it get?
Since our “base vocabulary” consists of all possible single bytes (0–255), the next available integer is 256. So, we assign our new token the ID 256.
- Rule: (117, 103) → 256
The Updated Dataset: Every instance of u, g is now replaced by 256.
[
[104, 256], # 'h', 'ug' (Token 256)
[32, 112, 256], # ' ', 'p', 'ug'
[32, 104, 256], # ' ', 'h', 'ug'
[32, 112, 256] # ' ', 'p', 'ug'
]
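A small helper function makes the merge step concrete (the name `merge` is just an illustrative choice):

```python
def merge(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

byte_sequences = [merge(seq, (117, 103), 256) for seq in byte_sequences]
print(byte_sequences)
# [[104, 256], [32, 112, 256], [32, 104, 256], [32, 112, 256]]
```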
Step 4: Repeat (Until We Say Stop)
We update our text and count again. Now we might see h (104) next to our new token ug (256). We merge those into 257 (“hug”).
But when do we stop? This is a choice we make. The “Target Vocabulary Size” is a hyperparameter we define before training starts (e.g., GPT-4 uses ~100k tokens).
Stop early? You get a smaller vocabulary, but the model has to use more tokens to represent a sentence.
Stop late? You get a massive vocabulary with specific words (like “uninstagrammable”), but the model’s embedding matrix becomes huge and memory-intensive.
The loop continues until that target count is reached.
In our toy example above, the vocabulary size is naturally limited. Eventually, we run out of pairs because the dataset is so small. We simply can’t create more tokens than there are unique patterns in the text.
However, when training on massive datasets (like Wikipedia or GitHub), there is practically always another frequent pair left to merge; we would never reach a “natural end” at a reasonable size. That is why we must set a Target Vocabulary Size (a hard stop) to prevent the dictionary from growing too large.
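Putting the pieces together, here is a deliberately naive sketch of the whole training loop, reusing the `merge` helper from above. The function name `train_bpe` and its tie-breaking behaviour (Python’s `max` keeps the first pair it encounters among equally frequent ones) are illustrative choices, not a reference implementation; real tokenizers define an explicit tie-breaking rule.

```python
from collections import Counter

def train_bpe(pretokens, target_vocab_size):
    """Toy BPE trainer: returns (merges, vocab)."""
    vocab = {i: bytes([i]) for i in range(256)}   # base vocabulary: all single bytes
    sequences = [list(w.encode("utf-8")) for w in pretokens]
    merges = []

    while len(vocab) < target_vocab_size:
        # 1. Count every adjacent pair.
        counts = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
        if not counts:
            break                                 # tiny datasets simply run out of pairs
        # 2. Pick the most frequent pair (ties: first one seen).
        best = max(counts, key=counts.get)
        # 3. Merge it into a new token and record the rule.
        new_id = len(vocab)
        merges.append(best)
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        sequences = [merge(seq, best, new_id) for seq in sequences]
    return merges, vocab

merges, vocab = train_bpe(["hug", " pug", " hug", " pug"], target_vocab_size=259)
print(merges[0], vocab[256])  # (117, 103) b'ug' -- later merges depend on the tie-breaking rule
```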
The Final Output
At the end of training, we discard the text and keep only two things:
- The Merges (Rules): An ordered list of what to combine.
- The Vocabulary: The dictionary mapping IDs to text.
Example:
Merges as an ordered list of pairs:
[
(117, 103), # 1st merge: u + g -> ug
(104, 256), # 2nd merge: h + ug -> hug
(112, 256) # 3rd merge: p + ug -> pug
]
Vocabulary:
{
...
256: "ug",
257: "hug",
258: "pug"
}
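Notice that the vocabulary is fully determined by the merge list: start from the 256 base bytes and replay each merge in order. A minimal sketch:

```python
merges = [(117, 103), (104, 256), (112, 256)]   # ug, hug, pug

# Rebuild the vocabulary by replaying the merges over the 256 base bytes.
vocab = {i: bytes([i]) for i in range(256)}
for new_id, (left, right) in enumerate(merges, start=256):
    vocab[new_id] = vocab[left] + vocab[right]

print(vocab[256], vocab[257], vocab[258])   # b'ug' b'hug' b'pug'
```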
Why Merge Order Matters
The order acts as a tie-breaker to prevent inconsistent splits. Consider the sequence “the”. Suppose we have two learned rules:
- Merge t + h -> th
- Merge h + e -> he
If we apply Rule 1 first, we get the tokens th, e. If we apply Rule 2 first, we get the tokens t, he.
These are totally different ways to split the word! With strict order, BPE guarantees we always pick the one that was most frequent in the training data (e.g., if th appeared more often than he, we always prioritize th).
Part 2: The Inference Phase
This is what happens at runtime (e.g., when you type a prompt into ChatGPT). The model takes the “Rules” we learned and applies them to new text.
Encoding: Text → Numbers
Let’s say you type the word “thug”. The model has never seen this word before! But it knows the pieces.
Prep: It breaks the word into bytes: t, h, u, g.
Check Rules: It walks through its rule book in the order the merges were learned.
- “Do I have a merge for t+h?” -> No.
- “Do I have a merge for u+g?” -> Yes! We learned ug earlier.
- “Do I have a merge for h+ug?” -> Yes! That was our second rule.
Merge: It applies those rules in order: t, h, u, g → t, h, ug → t, hug.
Final IDs: It swaps those pieces for their ID numbers: [116, 257].
The Result: A clean, compact list of numbers that captures the meaning without needing an infinite dictionary.
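In code, encoding is simply “replay the learned merges, in training order, over the new bytes.” Here is a sketch reusing the `merge` helper and the `merges` list from Part 1; production tokenizers use faster data structures, but the logic is the same.

```python
def encode(text, merges):
    """Apply the learned merges, in training order, to new text."""
    ids = list(text.encode("utf-8"))
    for new_id, pair in enumerate(merges, start=256):
        ids = merge(ids, pair, new_id)
    return ids

print(encode("thug", merges))   # [116, 257] -> 't', 'hug'
```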
Decoding: Numbers → Text
When the AI responds, it just does the reverse. It spits out numbers, looks them up in the vocabulary, and glues the text back together for you to read.
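A decoding sketch, reusing the `vocab` dictionary from the earlier snippet:

```python
def decode(ids, vocab):
    """Look each ID up, concatenate the bytes, and decode them as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8")

print(decode([116, 257], vocab))   # "thug"
```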
Resources & Code
This post is the first in my series of notes from working through CS336: Language Modeling from Scratch.
If you want to see how this works in actual Python code, I’ve implemented a BPE tokenizer from scratch on GitHub. I included 4 iterations of the solution, moving from a brute-force approach to progressively more optimized versions:
- Brute Force Solution: the simplest implementation, for understanding the logic.
- Optimized Solution: improved performance for larger datasets.
- Multiprocessing Solution: uses multiple CPU cores to speed up training.
- Memory Optimized Solution: designed to handle massive datasets without exhausting RAM.