Pretraining 101: Data, Scale, and the Loss Function

GenAI Fundamentals Series Part 4: How a pile of random weights becomes a base model. Next-token cross-entropy, trillion-token data pipelines, and compute budgets

Jun 26, 2026


Hey, Rahul here! 👋 Each week, I publish long-form ML+AI posts covering ML, AI, and System design for MLwhiz. Paid subscribers also get how-to guides with full code walkthroughs. I publish occasional extra articles.

This is part of the GenAI Fundamentals series. Each post picks one building block of modern LLMs and explains it from first principles, with code.

In the last post, we ended on a tidy one-liner: an LLM is just a function, token IDs in, next-token probabilities out. It’s a clean, satisfying way to put it, but I quietly skipped the single most important word in that sentence.

Trained.

When you first build that function, every weight in it is random: the embeddings, the transformer layers, the output head, all of it meaningless static. Feed a freshly initialized model the prompt “The capital of France is” and it won’t say “ Paris.” It’ll pick a token essentially at random, maybe “ marmalade,” maybe “ 7,” maybe “ the.” Every token in the vocabulary is roughly equally likely, because the model has never seen a single sentence in its life.
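To make “roughly equally likely” concrete, here is a small sketch in plain Python. It does not build a real model. It assumes an untrained network produces small random logits (a standard deviation of 0.02 is my assumption, as are the 50,000-token vocabulary and the arbitrary “correct” token index), then checks how close the resulting probabilities are to uniform. It also prints the negative log-probability of the correct token, which is the cross-entropy loss covered later in the post.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
vocab_size = 50_000  # hypothetical vocabulary size

# Stand-in for an untrained model: small random logits (assumed std 0.02).
logits = [random.gauss(0.0, 0.02) for _ in range(vocab_size)]
probs = softmax(logits)

uniform = 1.0 / vocab_size
print(f"uniform probability : {uniform:.2e}")
print(f"min / max predicted : {min(probs):.2e} / {max(probs):.2e}")

# Loss if the correct token were index 123 (an arbitrary choice).
loss = -math.log(probs[123])
print(f"loss at init        : {loss:.3f}")
print(f"ln(vocab_size)      : {math.log(vocab_size):.3f}")
```

Under these assumptions the last two numbers land close together. That gives a handy sanity check: a model that knows nothing should start with a loss near the natural log of its vocabulary size.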

So here’s the question this whole post answers: how do we get from a pile of random numbers to a model that completes “The capital of France is” with “ Paris” — and, along the way, picks up grammar, facts, a bit of arithmetic, and enough code to autocomplete your Python?

The answer is pretraining. And the beautiful thing is that it comes down to one loss function, computed over a few trillion tokens, a few trillion times. There are no human labels and no clever supervision anywhere in it — the model just guesses the next token, checks the answer, nudges its weights, and does it all over again.

By the end of this post you’ll understand four things, which also happen to be the four things that decide whether a pretrained model is any good:

The loss function: what cross-entropy actually measures, walked through one training step.
The data: where 15 trillion tokens come from and why the pipeline matters more than the model code.
The compute: the simple FLOP math behind a training run, and why these runs cost tens of millions of dollars.
The scaling laws: why Chinchilla showed that, for a fixed compute budget, a smaller model trained on more data can beat a bigger model trained on less.

And finally, what you actually get when the run finishes — and why it isn’t ChatGPT.

Follow along in code. There’s a companion notebook that pretrains a tiny GPT from scratch on a real public-domain book (pulled from Project Gutenberg) — it runs on a laptop CPU in a few minutes, and every concept below maps to a cell. Grab it on Kaggle.

Let’s dive in.

1. From Random Noise to a Base Model

Pretraining is the first, longest, and most expensive phase of building an LLM. It has exactly one job: take a model full of random weights and teach it to predict the next token across a giant, generic pile of text — a single objective, repeated at a scale that’s genuinely hard to picture.

The clever part — the thing that makes the whole modern LLM era possible — is that this learning is self-supervised. Here’s why that matters.

In a normal supervised setup, you need labeled examples: a photo and a human-written “cat,” a transaction and a human-written “fraud.” Labels are expensive: a human has to make each one, which puts a hard ceiling on how much data you can learn from.

Pretraining sidesteps the ceiling entirely. The trick: any piece of text is already its own answer key. Take a sentence:

“The cat sat on the mat.”

Hide the last word. Now you have a training example for free: the input is “The cat sat on the,” and the correct answer is “mat.” No human had to label it; the text supplied the answer itself. And you don’t just get one example per sentence — you get one for every position:

Given “The” → predict “cat”
Given “The cat” → predict “sat”
Given “The cat sat” → predict “on”
Given “The cat sat on” → predict “the”
Given “The cat sat on the” → predict “mat”

One short sentence becomes five labeled examples, a single web page becomes thousands, and the entire internet becomes a near-infinite supply of free, self-labeling training data.

Self-supervised labeling: one sentence becomes five free training examples
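The five examples above can be generated mechanically. Here is a minimal sketch, assuming whitespace splitting as a stand-in for a real tokenizer (the function name next_token_examples is mine, not from the notebook):

```python
def next_token_examples(tokens):
    """Return one (context, target) pair per position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace splitting stands in for a real tokenizer here.
sentence = "The cat sat on the mat"
tokens = sentence.split()

examples = next_token_examples(tokens)
for context, target in examples:
    print(f"Given {' '.join(context)!r} -> predict {target!r}")

# The same pairs fall out of a one-position shift of the sequence:
inputs, targets = tokens[:-1], tokens[1:]
print(list(zip(inputs, targets)))
```

The shifted version is the form training code usually works with. A causal model scores every position of the input in one pass, and position i is graded against targets[i], so nobody has to materialize each growing context separately.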

That’s the whole idea. Everything else in this post is detail on top of it: what loss we use to measure a wrong guess, where the text comes from, how much compute it takes, and how big the model should be.

One last bit of framing before we open up the machinery. Pretraining is only the first of three stages people constantly mix up — pretraining, post-training, and fine-tuning. They sound interchangeable, but they’re not. Let’s put them side by side first, then spend the rest of the post inside pretraining.

2. Pretraining, Post-Training, and Fine-Tuning: Who Does What

If you take one thing from this section: pretraining, post-training, and fine-tuning are three different jobs, done by different people, at wildly different costs. They are not synonyms, and mixing them up is the most common confusion I see when people talk about “training” a model. Here’s the whole lineage, start to finish:

From random weights to your specialist: pretraining, post-training, and fine-tuning

Pretraining is what this whole post is about. Start from random weights, predict the next token over trillions of tokens of generic text, and end up with a base model. It’s self-supervised, so there are no human labels; it can cost tens of millions of dollars and months of compute; and only a handful of labs can afford to do it. This single stage is where nearly all of the model’s raw knowledge comes from.

Post-training takes that base model and teaches it behavior. A base model can complete text but won’t reliably answer a question or follow an instruction (I’ll show you exactly why at the end of the post). Post-training fixes that in two moves: instruction tuning — also called supervised fine-tuning — where you show it lots of instruction → good-response pairs, and preference tuning — RLHF, DPO, or GRPO — where you teach it which answers people actually prefer. The data is human-curated, there’s far less of it, and it’s comparatively cheap. The output is the instruct (chat) model you actually talk to, and model creators do this before they ship.

Fine-tuning is the part you do. You take an existing model (base or instruct) and adapt it to your own domain or task — whether that’s your support tickets, your medical notes, or your house code style. It’s supervised on your own labeled data, and these days it’s usually parameter-efficient (LoRA/PEFT), so it runs in hours on one or two GPUs for a handful of dollars. The output is a specialist.

One honest note on terminology: “post-training” and “fine-tuning” overlap, because instruction tuning literally is a kind of fine-tuning. The distinction that actually matters in practice is who and why: post-training is the model creator turning a raw base model into a general-purpose assistant; fine-tuning is you adapting a finished model to a narrow job. The rest of this post lives entirely in that first box — pretraining — but it helps to know what comes after.

Pretraining vs post-training vs fine-tuning: three stages, three different jobs

That first stage does the bulk of the learning and burns almost all of the money; the later two are comparatively cheap. Keep that imbalance in mind — it explains a lot about why the industry looks the way it does. Now let’s open up that first box.

3. The Loss Function: Cross-Entropy, One Step at a Time

We keep saying “nudge the weights when the guess is wrong.” Time to make that precise, because the entire training run is just this one step, repeated over and over across trillions of tokens.
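Here is a minimal sketch of the guess, check, nudge cycle in plain Python. It is a deliberate simplification: the “weights” are just four logits for one fixed prompt, and the gradient step is applied to them directly, whereas a real model pushes the same gradient back through every layer with backpropagation. The toy vocabulary, the learning rate of 1.0, and the helper names are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Loss = -log(probability assigned to the correct token)."""
    return -math.log(probs[target])

def train_step(logits, target, lr):
    """One step: guess, check, nudge. Returns (new_logits, loss)."""
    probs = softmax(logits)              # guess
    loss = cross_entropy(probs, target)  # check
    # Gradient of the loss w.r.t. each logit: (prob - 1) for the
    # correct token, (prob - 0) for every other token.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - lr * g for x, g in zip(logits, grads)]  # nudge
    return new_logits, loss

vocab = [" Paris", " marmalade", " 7", " the"]  # toy vocabulary
target = vocab.index(" Paris")
logits = [0.0, 0.0, 0.0, 0.0]                   # untrained: all tokens tie

losses = []
for step in range(5):
    logits, loss = train_step(logits, target, lr=1.0)
    losses.append(loss)
    print(f"step {step}: loss = {loss:.4f}")
```

The first loss is ln(4), since all four tokens start tied, and each step lowers it as probability mass shifts toward “ Paris.” The update rule is worth noticing: the correct token’s score goes up in proportion to how much probability it was missing, and every wrong token’s score goes down in proportion to how much it was given.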
