The Transformer, Demystified — Let's Actually Build One

GenAI Series Part 2: Implementing a Transformer

Jun 05, 2026

Hey, Rahul here! 👋 Each week, I publish long-form posts covering ML, AI, and system design for MLwhiz. Paid subscribers also get how-to guides with full code walkthroughs, and I occasionally publish extra articles.

Over the coming weeks, I’ll be writing more about GenAI, including topics like pre-training and post-training. This post is the second of the foundational pieces that set up that series.

Transformers run most of modern NLP, but they’re still surprisingly hard to internalize from a diagram alone. In my last post, I walked through how they work — the encoder, decoder, and the data flow between them.

This post is where we stop reading and start building: an end-to-end English-to-German translator in PyTorch, written from scratch with a Transformer at its core. Because the fastest way to actually understand something is to implement it.

Task Description

We want to create a translator that uses transformers to convert English to German. So, if we look at it as a black box, our network takes as input an English sentence and returns a German sentence.

Data Preprocessing

To train our English-German translation model, we need parallel sentence pairs: English sentences matched with their German translations.

Fortunately, there is a pretty standard way to get these with the OPUS-100 dataset (English-German subset), a curated multilingual translation corpus we can access via HuggingFace datasets.
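Here is a minimal sketch of what fetching those pairs could look like. A few things in it are my assumptions rather than anything fixed by this post: the Hub dataset id `Helsinki-NLP/opus-100`, the config name `de-en`, and the record layout (a `translation` dict keyed by language code). The helper names `load_opus100_en_de` and `to_sentence_pairs` are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def load_opus100_en_de(split: str = "train"):
    """Load the English-German subset of OPUS-100 from the Hugging Face Hub.

    Assumption: the dataset id and config name below match what the Hub
    currently hosts. Needs `pip install datasets` and network access.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("Helsinki-NLP/opus-100", "de-en", split=split)


def to_sentence_pairs(
    records: Iterable[Dict[str, Dict[str, str]]],
    src: str = "en",
    tgt: str = "de",
) -> List[Tuple[str, str]]:
    """Turn records into (source, target) string pairs, skipping empty sides.

    Assumption: each record looks like {"translation": {"en": ..., "de": ...}}.
    """
    pairs = []
    for record in records:
        translation = record["translation"]
        source = translation[src].strip()
        target = translation[tgt].strip()
        if source and target:
            pairs.append((source, target))
    return pairs


# A tiny in-memory sample in the assumed record layout, so the helper can be
# tried without downloading anything.
sample = [
    {"translation": {"en": "Good morning.", "de": "Guten Morgen."}},
    {"translation": {"en": "  Thank you. ", "de": "Danke."}},
    {"translation": {"en": "", "de": "Leer."}},  # dropped: empty source side
]
pairs = to_sentence_pairs(sample)
```

With the real data, the same helper would be called as `to_sentence_pairs(load_opus100_en_de("train"))`. Dropping pairs with an empty side is just a cheap sanity filter I added here, not a required preprocessing step.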

Also, before we get into the code, let us understand what the model needs as input and output during training. We will need two matrices as input to our network:
