HSTU: How Meta Built a Trillion-Parameter Recommender That Actually Scales
The architecture, the math, the code, and why every RecSys team is suddenly building one
Hey, Rahul here! 👋 Each week, I publish long-form ML+AI posts covering ML, AI, and System design for MLwhiz. Paid subscribers also get how-to guides with full code walkthroughs. I publish occasional extra articles. If you’d like to become a paid subscriber, here’s a button for that:
This is Part 9c of the RecSys for MLEs series. In Part 9a, we built GRU4Rec and SASRec from scratch on the Steam Games dataset and got our hands dirty with sequential recommenders. In Part 9b, we covered Semantic IDs and TIGER, Google’s clever approach to generative retrieval, where item IDs become a learned token vocabulary. Now we close the arc with the architecture that’s actually running at Meta scale: HSTU.
Late 2022. You’re an ML engineer on a recommendation team, and you’ve just shipped a beefed-up SASRec model. The metrics look great. Your director walks over and says: “What if we just scaled this thing? More layers, bigger embeddings, a hundred billion parameters. Like the LLM folks are doing.”
So you do. And for a while, the model gets better. Then it stops getting better. Then you throw more compute at it and nothing. No improvement. A bigger, slower, more expensive model that performs about the same.
Frustrating, because over in NLP-land the scaling laws are clean: double the compute, get a predictably better model. GPT-3 had proven that. LLaMA was about to prove it again. DLRMs (Deep Learning Recommendation Models)? They kept plateauing. Something was fundamentally broken, and nobody could quite articulate what.
Meta’s answer was HSTU (Hierarchical Sequential Transduction Units).
Here’s what we’ll work through in this post:
The three structural issues that quietly sabotage standard Transformers when you point them at recommendation data
What HSTU actually consumes: the fused (item, action) input format, plus the actual training data schema (impression table + history table)
Inside the HSTU block: three sub-layers, with the math, the code, the intuition, and the diagrams
M-FALCON: the caching optimization that lets you score 10,000 candidates without recomputing 8,000 history events 10,000 times
By the end, you should understand exactly why HSTU works, how it works, and why every big tech RecSys team is suddenly building one.
1. Three reasons standard Transformers fail for RecSys
Three structural mismatches between language data and recommendation data make standard Transformers fail at scale. These are fundamental, not edge cases you can patch with a clever hack.
If you’d like a refresher on how standard Transformer attention works before going further, that post will get you up to speed. From here on, I’ll assume you’re comfortable with Q, K, V, softmax, and self-attention.
Problem 01: non-stationary vocabularies
In NLP, your vocabulary is fixed. “The” is always token 42. “Recommendation” is always token 18,973. The model trains with a stable set of possible next tokens, and that set doesn’t really change between training and serving.
In recommendations, your “vocabulary” is the entire item catalog, and that catalog is constantly changing. New videos go live every second on Reels. A trending creator didn’t exist yesterday. The model you trained last Tuesday has never seen most of the items it’s being asked to score on a Friday.
Softmax was designed for a fixed vocabulary with stable class boundaries. It’s defined over a set of mutually exclusive classes that sum to 1. When the set of possible “next tokens” is a moving target, when items appear and disappear from the catalog every minute, the softmax assumption starts to silently misbehave. The probability mass keeps getting redistributed among the items the model happens to have seen, which is not the same set as the items it needs to recommend.
This is a much bigger deal than it sounds. Almost every metric you care about (CTR, watch time, retention) depends on the model surfacing new items that the user will love. If your normalization assumes a stable world, you’re starting from a broken assumption.





