MLWhiz: Recs|ML|GenAI

What is an LLM? Tokens, Embeddings, and the Big Picture

GenAI Fundamentals Series Part 3: From raw text to next-token prediction. The complete mental model

Jun 14, 2026
∙ Paid

Hey, Rahul here! 👋 Each week, I publish long-form ML+AI posts covering ML, AI, and System design for MLwhiz. Paid subscribers also get how-to guides with full code walkthroughs. I publish occasional extra articles.

This is the third part of the GenAI Fundamentals series. Each post picks one building block of modern LLMs and explains it from first principles, with code.

LLM pipeline: text to tokens to vectors to probabilities

Have you ever noticed how many screenshots float around LinkedIn of simple problems LLMs get wrong? “How many r’s in strawberry?” Wrong. “Reverse the word lollipop.” Wrong. “What’s 9.11 vs 9.9, which is bigger?” Confidently wrong.

People share these as “gotcha” moments. Proof that LLMs are dumb. Proof that we’re overhyping AI.

But here’s what most of those posts miss: these failures aren’t random. They’re predictable. And once you understand how an LLM actually processes text, you can tell in advance which tasks will trip it up and why.

The strawberry problem? The model never sees individual letters. It sees tokens, chunks like “str”, “aw”, “berry”. By the time any “reasoning” starts, the letters are already gone.

When I started working with LLMs in production, I had the same confusion everyone does. I couldn’t answer basic questions: Why doesn’t the model just process one character at a time? If it did, the strawberry problem would be trivial. What happens between typing a prompt and getting a response? Why does the model sometimes just... stop mid-sentence?

Once I understood the pipeline end-to-end, everything clicked. And those LinkedIn gotcha screenshots? They stopped being surprising. You look at one and think, “yeah, obviously it fails at that.”

This post covers the full picture: how text becomes tokens, how tokens become vectors, what happens inside the model, and how vectors become text again. So, let’s start.


1. What is an LLM, really?

Here’s the shortest accurate description:

An LLM is a function that takes a list of integers and outputs a probability distribution over which integer comes next.

That’s the entire interface. The integers represent tokens (subwords). The output is one probability per possible token in the vocabulary. The model picks one, appends it to the list, and runs the function again. ChatGPT, Claude, Llama, Gemini: they all do exactly this. Everything else, the chat formatting, the system prompts, is built on top of this one operation.

In PyTorch pseudo-code:

logits = model(input_ids)           # list of ints → raw scores
probs = softmax(logits[-1])         # last position → probabilities over vocab
next_token = sample(probs)          # pick one token

Three lines. That’s the core loop. The rest of this post explains what each piece does and where the numbers come from.
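If you want to run that loop without downloading a model, here is a minimal sketch using only the standard library. The `toy_model` function is a made-up stand-in (it simply favours "previous token + 1"), and the five-token vocabulary is an assumption for illustration. Only the shape of the loop mirrors what a real LLM does.

```python
import math
import random

VOCAB_SIZE = 5  # toy vocabulary; real models use tens of thousands of tokens


def toy_model(input_ids):
    """Hypothetical stand-in for an LLM: one row of raw scores per position."""
    # Made-up rule: give a high score to (previous token + 1) mod VOCAB_SIZE.
    return [
        [3.0 if tok == (prev + 1) % VOCAB_SIZE else 0.0 for tok in range(VOCAB_SIZE)]
        for prev in input_ids
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate(input_ids, steps, rng):
    ids = list(input_ids)
    for _ in range(steps):
        logits = toy_model(ids)        # list of ints -> raw scores
        probs = softmax(logits[-1])    # last position -> probabilities over vocab
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs)[0]  # pick one
        ids.append(next_token)         # append it and run the function again
    return ids


out = generate([0, 1], steps=4, rng=random.Random(0))
print(out)
```

Swap `toy_model` for a real network and the integers for real token IDs, and this is the generation loop.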

The LLM Pipeline: text enters as a prompt, gets tokenized to integers, embedded into vectors, transformed through 32 layers, projected by the LM head to probabilities, and sampled to produce the next token

2. Tokenization: how text becomes numbers

A neural network processes numbers. Text is not numbers. So the first step is converting text into a sequence of integers. This conversion is tokenization, and it’s one of those things that sounds boring until you realize it’s responsible for half the weird behavior you’ve seen from LLMs.

The question is: what should each integer represent?

You have three options.

A. Characters. Each character gets its own ID. “cat” becomes three IDs. The vocabulary stays small. The problem is length: a 4,000-word document is around 20,000 characters, so that’s 20,000 positions for the attention mechanism to chew through, and attention cost grows quadratically with sequence length.

B. Words. Each whole word gets an ID. The vocabulary balloons to 500,000+ for English alone, and you still can’t handle typos (“caat”), brand-new words (“ChatGPT” didn’t exist in older vocabularies), or variants (“running”, “ran”, “runs” all become unrelated entries). Every unseen word is a dead end.

C. Subwords. Common words stay whole, rare words split into known pieces. “unfamiliarize” becomes [“un”, “familiar”, “ize”]. The vocabulary lands in a comfortable middle (32K to 256K). This is what every modern LLM uses.
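To make the three options concrete, here is a toy comparison. The vocabularies are invented for this example, and the subword splitter is a simple greedy longest-match, not BPE (that comes later). It is only meant to show how each scheme behaves on a typo and on a rare word.

```python
text = "the cat sat"

# A. Characters: tiny vocabulary, one position per character.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]
print(len(char_vocab), len(char_ids))  # 7 11

# B. Words: short sequences, but anything unseen collapses to <UNK>.
word_vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3}


def word_encode(sentence):
    return [word_vocab.get(w, word_vocab["<UNK>"]) for w in sentence.split()]


print(word_encode("the caat sat"))  # [1, 0, 3]  (the typo is lost)

# C. Subwords: a hypothetical piece list; rare words split into known pieces.
pieces = {"un", "familiar", "ize", "cat", "c", "a", "t"}


def greedy_split(word):
    """Greedy longest-match split (a stand-in for a real subword tokenizer)."""
    out, i = [], 0
    while i < len(word):
        # Longest known piece starting at position i; fall back to the raw character.
        match = max((p for p in pieces if word.startswith(p, i)), key=len, default=word[i])
        out.append(match)
        i += len(match)
    return out


print(greedy_split("unfamiliarize"))  # ['un', 'familiar', 'ize']
print(greedy_split("caat"))           # ['c', 'a', 'a', 't']
```

The typo costs the word-level scheme all information about the word, while the subword scheme just spends a few more tokens on it.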

The rest of this section is about how subword tokenizers actually get built. But first, two concepts on which everything depends.

Unicode and bytes

Unicode is one giant table that assigns a number to every character in every writing system. “A” is 65. “é” is 233. The Chinese character “中” is 20,013. The emoji “😀” is 128,512. About 150,000 characters are defined today, and the table keeps growing. Each number is called a code point.

Bytes are how those numbers actually get stored on a computer. A byte is 8 bits, so it holds a value from 0 to 255. Everything on your machine, including text, is ultimately a stream of these byte values.

The catch is that Unicode has ~150,000 code points, but a byte only goes up to 255. So you can’t fit “中” (code point 20,013) into a single byte. You need a scheme for packing big code points into sequences of small bytes. That scheme is UTF-8, the encoding that essentially all text uses today.

UTF-8 is variable-length: common characters get fewer bytes, rare ones get more.

  • Plain English (the original ASCII set, code points 0-127) takes 1 byte. “A” is just byte 65.

  • Accented Latin, Greek, Cyrillic, Hebrew, Arabic take 2 bytes.

  • Most Chinese, Japanese, and Korean characters take 3 bytes.

  • Emoji and rarer symbols take 4 bytes.

So “中” is not stored as the single number 20,013. UTF-8 packs that code point into three bytes: 228, 184, 173. If you wrote “中” to a file, those are the three values actually on disk. The string “A中” would be four bytes total: [65, 228, 184, 173].
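You can check these numbers yourself with nothing but built-in Python. This is a small sketch: `ord` gives the Unicode code point and `str.encode("utf-8")` gives the bytes that would land on disk. The characters are written as escapes so the example does not depend on how this page was encoded.

```python
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, é, 中, 😀

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  code point {ord(ch):>7}  {len(encoded)} byte(s)  {list(encoded)}")

# The two-character string "A中" is four bytes in UTF-8.
two_chars = list("A\u4e2d".encode("utf-8"))
print(two_chars)  # [65, 228, 184, 173]
```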

You might wonder: if “A中” is just [65, 228, 184, 173], how does the computer know that 65 is one character but 228, 184, 173 are three bytes forming a single character? Why not read it as four separate characters? Because UTF-8 is self-describing. The leading bits of each byte announce its role: a byte starting with 0 is a standalone character, a byte starting with 110, 1110, or 11110 is the start of a 2-, 3-, or 4-byte character, and a byte starting with 10 is a continuation. Here, 65 is 01000001 (standalone, “A”), and 228 is 11100100 (start of a 3-byte character, “read the next two bytes with me”), while 184 and 173 both start with 10 (continuations). The decoder can always find character boundaries just by looking at the top bits.
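Here is that top-bits rule as a short sketch. The function name `byte_role` and the role labels are mine, not part of any library. It classifies each byte by its leading bits and then recovers the character boundaries.

```python
def byte_role(b):
    """Classify a UTF-8 byte by its leading bits."""
    if b >> 7 == 0b0:
        return "standalone"
    if b >> 6 == 0b10:
        return "continuation"
    if b >> 5 == 0b110:
        return "start of 2-byte character"
    if b >> 4 == 0b1110:
        return "start of 3-byte character"
    if b >> 3 == 0b11110:
        return "start of 4-byte character"
    return "invalid"


data = "A\u4e2d".encode("utf-8")  # "A中" -> bytes 65, 228, 184, 173
for b in data:
    print(f"{b:3d}  {b:08b}  {byte_role(b)}")


def char_starts(raw):
    """Indices where a new character begins: every byte that is not a continuation."""
    return [i for i, b in enumerate(raw) if byte_role(b) != "continuation"]


print(char_starts(data))  # [0, 1]
```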

This is the punchline for tokenization: no matter what language or symbol you throw at it, every piece of text is, at the lowest level, a sequence of byte values between 0 and 255. There are only ever 256 possible building blocks.

Hold onto this distinction (characters versus bytes), because the single most important design choice in a tokenizer is which of these it starts from.


BPE: the core algorithm

Byte Pair Encoding (BPE) started life as a data-compression trick in 1994. Sennrich et al. adapted it for NLP in 2016, and it has been the dominant tokenization algorithm ever since. GPT-2, GPT-4, Llama, Qwen, Mistral, DeepSeek: nearly all of them run BPE.

BPE algorithm: start with byte-level splits, find the most frequent pair, merge it into a new token, repeat until target vocabulary size

The algorithm is short. I’ll walk through it, because once you see it the rest falls into place.

Step 1: Start with a base set of atomic units. Say there are 256 of them (I’ll explain what they are in the next section). Every piece of text is now a sequence of these units.

Step 2: Scan the entire training corpus. Find the pair of adjacent units that appears most often. Maybe “e” followed by “r” shows up 50 million times.

Step 3: Merge that pair into one new unit, “er”. Add it to the vocabulary (now 257). Replace every “e” + “r” in the corpus with “er”.

Step 4: Repeat. The next most frequent pair might be “t” + “h” → “th”, then “th” + “e” → “the”. Merge, add, repeat.

Run 128,000 merges and you end up with ~128,256 tokens (the 256 base units plus 128K learned merges). That’s right in the range where a lot of modern models have landed.
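The four steps fit in a few lines of Python. This is a teaching sketch under simplifying assumptions: it trains on one short string instead of a corpus, it skips the pre-splitting and special tokens that production tokenizers add, and it breaks ties by whichever pair it saw first. The function names are mine.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) in seq with the merged unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merges. Returns (ordered merge list, final sequence)."""
    seq = [bytes([b]) for b in text.encode("utf-8")]  # Step 1: base units
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))            # Step 2: count adjacent pairs
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break                                     # nothing repeats any more
        merges.append((left, right))                  # Step 3: record the new unit
        seq = merge_pair(seq, left, right)            # ...and rewrite the corpus
    return merges, seq                                # Step 4 is the loop itself


merges, seq = train_bpe("aaabdaaabac", num_merges=10)
print(merges)
print(seq)
print("vocabulary size:", 256 + len(merges))
```

On this tiny input the loop stops after three merges because no pair repeats after that. A real run keeps going until it hits the target vocabulary size.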

Merge order matters. Early merges grab the most common letter combos (“th”, “in”, “er”), which combine into “the”, “ing”, “tion”, and eventually whole common words. Rare words stay split into pieces. The ordered list of merges is saved with the model and replayed at inference time.
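Here is a sketch of what “replayed at inference time” means. The merge list below is hand-written for illustration, not taken from any real tokenizer, and real libraries use faster rank-based lookups than this naive loop. The point is that the same merges applied in a different order give a different tokenization.

```python
def apply_merges(text, merges):
    """Encode text by replaying an ordered merge list over its UTF-8 bytes."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:          # strictly in the order they were learned
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq


# Hypothetical merge list, earliest merge first.
merges = [(b"t", b"h"), (b"th", b"e"), (b"i", b"n"), (b"in", b"g")]

print(apply_merges("the thing", merges))
# [b'the', b' ', b'th', b'ing']

print(apply_merges("the thing", merges[::-1]))
# [b'th', b'e', b' ', b'th', b'in', b'g']
```

In the reversed run, “th” + “e” is tried before “th” exists, so “the” never forms.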


Characters or bytes? The choice that actually matters

BPE merges pairs, but Step 1 has to start from something. This is the real decision, and it’s where the two main flavors of BPE split apart.

Character-level BPE starts from Unicode characters. The catch: there are ~150,000 of them. You either spend vocabulary slots on thousands of rare characters, or you hit characters you’ve never seen at inference time. Older tokenizers had a special <UNK> (“unknown”) token for exactly this, and it caused real failures on emoji, rare scripts, and even unusual typos.

Byte-level BPE starts from the 256 byte values instead. This is GPT-2’s key trick, and it’s why the base vocabulary is 256. Since every possible text in every language is just a sequence of bytes, there is no such thing as an unknown token. A brand-new emoji is simply four known byte-tokens stitched together. Chinese, Arabic, source code, corrupted text: all representable. Rare text just splits into more tokens. Nothing ever breaks.
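A quick sketch of the difference, before any merges are applied. The character vocabulary is built from a toy “training” string and `UNK = 0` is my own convention for the example. The byte-level path needs no vocabulary at all.

```python
train_text = "hello world"

# Character-level base: only characters seen in training get an ID.
UNK = 0
char_vocab = {ch: i + 1 for i, ch in enumerate(sorted(set(train_text)))}


def char_encode(s):
    return [char_vocab.get(ch, UNK) for ch in s]


def byte_encode(s):
    # Byte-level base: every string maps to values in 0..255, seen before or not.
    return list(s.encode("utf-8"))


new_text = "hello \U0001F984"  # "hello" plus a unicorn emoji never seen in training

char_ids = char_encode(new_text)
byte_ids = byte_encode(new_text)

print(char_ids)   # the last ID is 0: the emoji collapsed to <UNK>
print(byte_ids)   # ten values, all below 256
print(bytes(byte_ids).decode("utf-8") == new_text)  # True: nothing was lost
```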

This is why essentially all modern tokenizers are byte-level. When you saw “256 base units” in the algorithm above, those 256 are the byte values.


SentencePiece and tiktoken

BPE is the algorithm. SentencePiece and tiktoken are libraries that run it. Moving from one library to the other is not moving away from BPE. This trips up a lot of people who read that some model “switched from SentencePiece to tiktoken” and assume the algorithm changed. It didn’t. Both run BPE.

SentencePiece (Kudo & Richardson, 2018) is Google’s tokenizer library. Its real contribution is handling raw text without language-specific rules. Most tokenizers first split text on spaces, then run BPE per word. That breaks for Chinese, Japanese, and Thai, which don’t put spaces between words. SentencePiece skips the space-splitting step. It treats the space itself as a character (the “▁” symbol) and runs the merge algorithm on the raw stream. The early Llama models and Mistral used SentencePiece with a 32,000-token vocabulary, and the Gemma family uses it with a much larger one.

SentencePiece is still widely used. The Gemma models use it, for example. Here it is, tokenizing a sentence:

# SentencePiece example
import sentencepiece as spm

# "tokenizer.model" is a placeholder path to a trained SentencePiece model file
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

text = "Hello, how are you?"
tokens = sp.EncodeAsPieces(text)
print(tokens)
# e.g. ['▁Hello', ',', '▁how', '▁are', '▁you', '?'] (exact pieces depend on the model)
# The ▁ marks the start of a word (space replaced with ▁)

Notice the “▁” prefix. SentencePiece uses this to mark where spaces were in the original text. When you decode, it converts “▁” back to a space. This is how the model knows that “Hello” starts a new word.
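The “▁” convention itself is easy to mimic in plain Python. This sketch does not call SentencePiece. It only imitates the space handling, including the library's default of adding a marker in front of the first word. The function names are mine, and it assumes the input text does not already contain “▁”.

```python
MARK = "\u2581"  # "▁", the visible stand-in for a space


def mark_spaces(text):
    """Replace spaces with the marker, adding one in front of the first word."""
    return MARK + text.replace(" ", MARK)


def pieces_to_text(pieces):
    """Join pieces, turn markers back into spaces, and drop the leading one."""
    return "".join(pieces).replace(MARK, " ")[1:]


print(mark_spaces("Hello, how are you?"))
# ▁Hello,▁how▁are▁you?

pieces = [MARK + "Hello", ",", MARK + "how", MARK + "are", MARK + "you", "?"]
print(pieces_to_text(pieces))
# Hello, how are you?
```

Because the space travels inside the pieces, decoding is just concatenation. No language-specific rules are needed to decide where spaces go.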


tiktoken: OpenAI’s fast byte-level BPE

tiktoken is OpenAI’s tokenizer library. It runs byte-level BPE, it’s written in Rust (with Python bindings), and it’s significantly faster than older Python implementations. It’s the tokenizer behind GPT-3.5, GPT-4, and GPT-4o, and many newer open models (the later Llama releases, Qwen, DeepSeek) adopted tiktoken-style tokenizers too.

tiktoken ships with several pre-trained encodings:

  • cl100k_base: used by GPT-4 and GPT-3.5 Turbo. 100,277 tokens.

  • o200k_base: used by GPT-4o. 200,019 tokens.

  • gpt2: the original GPT-2 encoding. 50,257 tokens.

A lot of newer models moved from SentencePiece to tiktoken-style tokenizers, and it’s worth being clear that this isn’t “BPE stopped working.” Both run BPE. The migration is mostly about speed: tiktoken’s Rust encoder is much faster at turning text into tokens, which matters when you’re tokenizing trillions of training tokens and every inference request.

Let’s see tiktoken in action:

import tiktoken

# GPT-4's tokenizer
enc = tiktoken.get_encoding("cl100k_base")

# Simple English
text = "The quick brown fox jumps over the lazy dog"
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Count: {len(tokens)} tokens")
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# Tokens: [791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679]
# Count: 9 tokens
# Decoded: ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog']

# Now try something interesting
code = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n-1) + fibonacci(n-2)"
code_tokens = enc.encode(code)
print(f"\nCode: {code}")
print(f"Count: {len(code_tokens)} tokens")
print(f"Decoded: {[enc.decode([t]) for t in code_tokens]}")
# Common keywords such as "def" and "return" are single tokens
# Runs of indentation spaces are merged into a single token too

# The strawberry problem
word = "strawberry"
word_tokens = enc.encode(word)
print(f"\n'{word}' = {[enc.decode([t]) for t in word_tokens]}")
# 'strawberry' -> ['str', 'aw', 'berry'] (3 tokens)
# The model literally never sees the individual letters

9 tokens for 9 English words. That’s efficient. Code tokenizes well, too, because the BPE merges were trained on a corpus that included a lot of code. But look at “strawberry”: it splits as “str” + “aw” + “berry”, not into individual letters. This is why the model can’t count letters. It never sees them.
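To see why that matters, here is a tiny sketch of what the model is actually handed. The three IDs below are placeholders I made up (real IDs depend on the tokenizer). The pieces are the ones from the split above.

```python
# Hypothetical IDs for illustration only; real IDs depend on the tokenizer.
vocab = {101: "str", 102: "aw", 103: "berry"}

# What the model receives: three opaque integers. Nothing here says "r".
model_input = [101, 102, 103]
print(model_input)

# Counting letters needs the spelling behind each ID. That table belongs to
# the tokenizer and is not part of the sequence the model is given.
spelled = "".join(vocab[i] for i in model_input)
r_count = spelled.count("r")
print(spelled, r_count)  # strawberry 3
```

A model can still learn facts about how common tokens are spelled from its training data, but it has to memorize them. It cannot read them off its input.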


Who uses what: tokenizers across the major models

Here’s how tokenization evolved across the major model families. This is worth studying because the tokenizer choice tells you a lot about a model’s design priorities.

The “Algorithm” column tells you the flavor of BPE and the library used. Every model below runs BPE; what changes is whether it’s byte-level and which library implements it.

Tokenizer comparison across major models: vocab size, algorithm, and year

A few patterns jump out:

2019-2023: Small vocabulary, English-first. GPT-2 and GPT-3 used 50K tokens. Llama 1/2 and Mistral, all released in 2023, still used 32K. These vocabularies were heavily English-biased. The same content in Hindi could need several times more tokens than in English.

2023: The 100K jump. GPT-4 doubled the vocabulary to 100K. This was the first big move toward multilingual efficiency: more tokens dedicated to non-English scripts means fewer tokens per sentence in those languages.

2024-2025: The 128K-256K era. Everyone expanded. Llama 3 and 4 went to 128K then 202K. Qwen settled on 152K. Gemma pushed to 256K, then 262K. The reasoning is consistent: bigger vocab = fewer tokens per text = faster inference, lower cost, longer effective context. The tradeoff is a bigger embedding table, but at 8B+ model sizes, that table is a small fraction of total parameters.
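A back-of-the-envelope check on that tradeoff. The embedding table has one row per token and one column per hidden dimension. The hidden size of 4,096 and the round 8B total are my assumptions for the sketch, not figures from any specific model card, and models that keep a separate output matrix pay roughly this cost twice.

```python
def embedding_params(vocab_size, hidden_size):
    """Number of weights in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 4096              # assumption: a typical width for an ~8B model
TOTAL_PARAMS = 8_000_000_000    # assumption: round figure for an "8B" model

for vocab_size in (32_000, 128_256, 256_000):
    n = embedding_params(vocab_size, HIDDEN_SIZE)
    print(f"vocab {vocab_size:>7,}: {n:>13,} params = {n / TOTAL_PARAMS:.1%} of the model")
```

Under these assumptions the table grows from under 2% of the model at 32K to a low double-digit share at 256K. That is noticeable, but it is usually cheaper than spending extra tokens on every request.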

The library choice is mostly about speed, not capability. SentencePiece (Gemma, older Llama, Mistral) and tiktoken (GPT, newer Llama, Qwen) both run BPE. The migration toward tiktoken in newer models is largely because its Rust implementation encodes and decodes faster. The algorithm underneath is the same.

Qwen’s CJK advantage. Qwen’s 152K vocabulary was built with heavy Chinese, Japanese, and Korean coverage from the start. That’s why Qwen models tend to do well on CJK benchmarks relative to their size: the tokenizer compresses CJK text efficiently, giving the model more room per context window.


The rest of this post is for paid subscribers. So far, we’ve covered what an LLM is at its simplest and the full tokenization story: Unicode, bytes, UTF-8, BPE, byte-level vs character-level, SentencePiece vs tiktoken, and the cross-model tokenizer table.

Behind the paywall: how token IDs become vectors (the embedding table and what those vectors mean), how the model turns vectors back into words (the LM head, softmax, weight tying with a worked numeric example, and temperature), the full end-to-end pipeline with real numbers, autoregressive generation and the KV cache, and why tokenization quietly decides what a model can and can’t do.

© 2026 Rahul Agarwal