MLWhiz | AI Unwrapped

Your Ranking Model Is Right. Your Recommendations Are Wrong

RecSys Series Part 8: How diversity, freshness, and business constraints turn a ranked list into a product-ready feed

Rahul Agarwal
Apr 11, 2026
∙ Paid

Here’s something they don’t teach you in ML courses: a perfectly relevant recommendation list is usually a terrible one.

You spend months training a ranking model. Features, architectures, multi-task objectives — the works. Then the product team walks in: “Can you make sure we don’t show 5 horror movies in a row? And boost new releases? Oh, and reserve slot 3 for promoted content.”

Each request costs you relevance. The question isn’t whether to spend — it’s how much.

Think of it as a budget. Your ranking model gives you relevance scores for every item. Re-ranking is the art of spending that relevance wisely — trading some accuracy for diversity, freshness, fairness, and business value.

This is Part 8 of the RecSys for MLEs series. We’ve covered the fundamentals, the evolution from CF to deep learning, the 3-stage funnel where I first introduced re-ranking as the “business layer,” two-tower retrieval, vector search, the ranking layer, and the cold start problem.

Today, we’re opening up that final layer. Here’s what we’ll cover:

  • The Set Problem → Why sorting by relevance produces bad recommendations

  • Diversity → From dedup rules to Determinantal Point Processes (YouTube’s production system)

  • Calibration → Matching your recommendations to the user’s taste distribution

  • Freshness → Getting new content into the feed without wrecking relevance

  • Business Constraints → The product rules that shape the final feed

  • Multi-Objective Re-Ranking → Combining everything: scalarization, constraints, and 2D layouts

  • The Practitioner’s Playbook → When to use what, and the pitfalls that trip everyone up

Let’s dive in!


1. Why Re-Ranking Exists — The Set Problem

Your ranking model scores items independently. Item A gets 0.92. Item B gets 0.89. Item C gets 0.87. Sort descending. Done.

Except it’s not done. Because when you look at your top-10 list, items A, B, and C are all psychological thrillers from the same director. Items D through G are also thrillers. The model did exactly what you asked — it found the most relevant items. But the resulting set is terrible.

This is what I call the set problem: optimizing each item independently doesn’t optimize the set.

RecSys Pipeline: Retrieval → Ranking → Re-Ranking → Serving

Here’s how to think about it. Ranking answers: “How relevant is this item to this user?” Re-ranking answers a harder question: “What’s the best collection of items to show this user?”

The input to re-ranking is typically 100-500 scored items from your ranker. The output is the final 10-50 items in their display order. And the constraints are everything your ranking model doesn’t know about: diversity requirements, content freshness, promotional obligations, fairness targets, and a dozen product-specific rules.

I remember a team meeting where someone pulled up our top-10 list for a test user: ten nearly identical sci-fi action movies. “The model is working perfectly,” someone said. Technically correct — and completely useless. The top-10 wasn’t a recommendation; it was a redundancy report.

Netflix does this at massive scale — 15,000+ shows, nearly 300 million users, and a homepage that needs to feel both personally relevant and excitingly diverse. Their page construction system doesn’t just rank shows; it considers the composition of each row and the relationships between rows.

Here’s the key mental model I want you to hold for this entire post: re-ranking is spending a relevance budget. Your ranking model gives you a relevance score for each item. That score is currency. Every diversity constraint, every freshness boost, every business rule costs some of that relevance. The art is deciding how much to spend on each.
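One way to make the budget concrete is to measure how much total relevance a re-ranked list gives up compared with the pure top-k. Here is a minimal sketch. The function name `relevance_spent`, the example scores, and the `diversified` list are all made up for illustration, and it assumes scores are positive.

```python
import numpy as np

def relevance_spent(scores, final_list):
    """Fraction of the best achievable relevance given up by showing `final_list`.

    scores: ranking model scores for all candidates (assumed positive)
    final_list: indices of the items actually shown
    """
    scores = np.asarray(scores, dtype=float)
    k = len(final_list)
    # Best achievable total for k slots: the k highest scores.
    best_total = np.sort(scores)[::-1][:k].sum()
    shown_total = scores[list(final_list)].sum()
    return (best_total - shown_total) / best_total

scores = np.array([0.92, 0.89, 0.87, 0.80, 0.75, 0.60])
pure_top3 = [0, 1, 2]      # sort by score, take the top 3
diversified = [0, 3, 5]    # hypothetical re-ranked output

print(relevance_spent(scores, pure_top3))
print(relevance_spent(scores, diversified))
```

The pure top-3 spends nothing, while the hypothetical diversified list spends roughly 13% of the available relevance. Tracking a number like this per constraint is one way to see where the budget is going.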

Let’s look at the algorithms that make this possible.


2. Diversity — From Rules to Determinantal Point Processes

Diversity is the most visible re-ranking objective. When a user sees 10 items from the same genre, something has clearly gone wrong. But “add diversity” is easy to say and surprisingly hard to get right. Here are three levels of sophistication:

Level 1: Rule-Based Dedup

The simplest approach is just writing rules:

  • “No more than 2 items from the same category in the top 5”

  • “No two items from the same creator in a row”

  • “At least 1 item from ‘trending’ in top 3”

Before YouTube deployed their DPP system (covered later in this section), they used exactly these kinds of heuristics: fuzzy deduplication (removing items too similar to ones already selected) and sliding window constraints (at most n out of every m items from the same type).

Rules are fast, interpretable, and easy to debug. But they’re also brittle. They can’t capture nuanced notions of similarity — “these are both thrillers” is a rule; “these have similar emotional arcs” is not. And they compose badly: stack 5 rules on top of each other and you’ll find they frequently conflict.
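To show how little code a sliding window constraint needs, here is a minimal sketch of the “at most n out of every m items” rule. The function name `apply_window_cap`, its parameters, and the sample items are hypothetical, and deferred items are simply set aside rather than re-inserted later, which a real system would likely handle differently.

```python
def apply_window_cap(ranked_items, get_type, max_per_window=2, window=5):
    """Walk the list in relevance order and skip any item that would put more
    than `max_per_window` items of its type into `window` consecutive slots."""
    selected, deferred = [], []
    for item in ranked_items:
        # The new item shares a window with the previous (window - 1) picks.
        recent = selected[-(window - 1):] if window > 1 else []
        same_type = sum(1 for s in recent if get_type(s) == get_type(item))
        if same_type < max_per_window:
            selected.append(item)
        else:
            deferred.append(item)
    return selected, deferred

# Items are (id, genre) pairs, already sorted by relevance.
ranked = [("A", "thriller"), ("B", "thriller"), ("C", "thriller"),
          ("D", "comedy"), ("E", "thriller"), ("F", "drama")]

selected, deferred = apply_window_cap(
    ranked, get_type=lambda item: item[1], max_per_window=2, window=3
)
print([item[0] for item in selected])  # ['A', 'B', 'D', 'E', 'F']
print([item[0] for item in deferred])  # ['C']
```

The third thriller is held back because it would make three in a row, but the fourth one gets through once a comedy has broken up the window. This is also where the brittleness shows: add a second rule and you have to decide which one wins when both want the same slot.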

Level 2: Maximal Marginal Relevance (MMR)

MMR is the first real algorithmic approach to diversity. It was originally proposed for document retrieval, but it maps perfectly to recommendations.

The idea is beautifully simple. Instead of selecting items by relevance alone, you select greedily: at each step, pick the item that best balances relevance with dissimilarity to items you’ve already selected.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr_rerank(relevance_scores, item_embeddings, lambda_param=0.5, top_k=10):
    """
    Maximal Marginal Relevance re-ranking.

    Args:
        relevance_scores: array of shape (N,), ranking model scores
        item_embeddings: array of shape (N, d), item feature vectors
        lambda_param: trade-off between relevance (1.0) and diversity (0.0)
        top_k: number of items to select

    Returns:
        selected: list of indices in selection order
    """
    n_items = len(relevance_scores)
    sim_matrix = cosine_similarity(item_embeddings)

    selected = []
    candidates = list(range(n_items))

    # Never try to select more items than there are candidates
    for _ in range(min(top_k, n_items)):
        best_score = -np.inf
        best_idx = None

        for idx in candidates:
            # Relevance term
            rel = relevance_scores[idx]

            # Max similarity to any already-selected item
            if selected:
                max_sim = max(sim_matrix[idx][s] for s in selected)
            else:
                max_sim = 0

            # MMR score: balance relevance vs. novelty
            score = lambda_param * rel - (1 - lambda_param) * max_sim

            if score > best_score:
                best_score = score
                best_idx = idx

        selected.append(best_idx)
        candidates.remove(best_idx)

    return selected

The lambda_param is your knob. At λ=1.0, MMR is pure relevance (no diversity). At λ=0.0, it’s pure diversity (ignores relevance). In practice, values between 0.5 and 0.7 work well.

As written, each selection step costs O(Nk): every remaining candidate is compared against up to k already-selected items. That is fast for a few hundred candidates. But MMR has a fundamental limitation: it’s myopic. At each step, it only compares the candidate to items already selected. It never evaluates the global quality of the final set.
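If the per-step cost matters, one common trick is to cache each candidate’s maximum similarity to the selected set and update it only against the newest pick. Here is a minimal sketch of that variant. The name `mmr_rerank_cached` is my own, and it computes cosine similarity with plain NumPy instead of scikit-learn so that it never builds the full N × N matrix.

```python
import numpy as np

def mmr_rerank_cached(relevance_scores, item_embeddings, lambda_param=0.5, top_k=10):
    """Greedy MMR with a running max-similarity cache (O(N * d) work per step)."""
    rel = np.asarray(relevance_scores, dtype=float)
    emb = np.asarray(item_embeddings, dtype=float)
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)

    n_items = len(rel)
    max_sim = np.zeros(n_items)               # 0 while nothing is selected
    available = np.ones(n_items, dtype=bool)
    selected = []

    for _ in range(min(top_k, n_items)):
        scores = lambda_param * rel - (1 - lambda_param) * max_sim
        scores[~available] = -np.inf          # never re-pick an item
        best = int(np.argmax(scores))
        selected.append(best)
        available[best] = False

        # Only similarities to the newest pick can change the running max.
        sims = unit @ unit[best]
        max_sim = sims if len(selected) == 1 else np.maximum(max_sim, sims)

    return selected

rel = [0.90, 0.85, 0.80]
emb = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]   # items 0 and 1 are near-duplicates
print(mmr_rerank_cached(rel, emb, lambda_param=0.5, top_k=2))  # [0, 2]
print(mmr_rerank_cached(rel, emb, lambda_param=1.0, top_k=2))  # [0, 1]
```

With λ=0.5 the near-duplicate is skipped in favor of the less relevant but different item. With λ=1.0 the output collapses back to a plain sort by score. The myopia is still there: the cache makes each step cheaper, not smarter.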

Level 3: Determinantal Point Processes (DPP)

This is where things get interesting.

A DPP is a probabilistic model that assigns higher probability to subsets of items that are both high-quality AND diverse. Unlike MMR’s pairwise comparisons, a DPP evaluates the entire subset at once.

Here’s the intuition: Imagine each item as an arrow in a high-dimensional space. The arrow’s length represents quality (the ranking model’s score). The arrow’s direction represents the item’s characteristics (its embedding). A DPP selects the set of arrows that spans the maximum volume — you want arrows that are both long (high quality) AND point in different directions (diverse).

DPP Volume Visualization: Quality × Diversity

Mathematically, we define a kernel matrix L where each entry captures both quality and similarity:

L[i,j] = q_i × q_j × similarity(i,j)

where q_i is item i’s quality score (from your ranker) and similarity(i,j) is the cosine similarity between item embeddings. The probability of selecting a subset S is proportional to det(L_S), the determinant of the submatrix formed by those items. Geometrically, that determinant is the squared volume of the parallelepiped spanned by the item vectors, where each vector is the item’s unit-length embedding scaled by its quality score.

That’s abstract. Let me walk through it with three movies.

DPP Kernel in Action: Three Movies, Pick Two
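Here is a minimal sketch of a three-item, pick-two comparison of this kind. The titles, quality scores, and 2-d embeddings are made-up values for illustration, not numbers from any production system, and `subset_score` is a hypothetical helper name.

```python
import numpy as np
from itertools import combinations

# Hypothetical movies: two similar thrillers and one comedy.
titles = ["Thriller A", "Thriller B", "Comedy C"]
quality = np.array([0.90, 0.85, 0.70])
embeddings = np.array([[1.0, 0.0], [0.95, 0.31], [0.0, 1.0]])

# Cosine similarity from unit-normalized embeddings.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T

# Kernel: L[i, j] = q_i * q_j * similarity(i, j)
L = np.outer(quality, quality) * similarity

def subset_score(L, subset):
    """Unnormalized DPP probability of a subset: det of the submatrix L_S."""
    idx = list(subset)
    return np.linalg.det(L[np.ix_(idx, idx)])

scores = {pair: subset_score(L, pair) for pair in combinations(range(3), 2)}
for pair, score in scores.items():
    print([titles[i] for i in pair], round(float(score), 4))

best_pair = max(scores, key=scores.get)
print("DPP pick:", [titles[i] for i in best_pair])
```

Sorting by quality alone would pick the two thrillers. The determinant prefers Thriller A plus Comedy C, because the two thriller vectors point in nearly the same direction and span almost no area. Enumerating every subset like this only works for toy examples, since the number of subsets grows combinatorially with the candidate pool.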