<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[MLWhiz | AI Unwrapped]]></title><description><![CDATA[Making ML careers accessible and GenAI and MLOps understandable. 🔧 No-fluff guides and real-world insights to help you build, deploy, and advance in machine learning and Generative AI ecosystem]]></description><link>https://www.mlwhiz.com</link><image><url>https://substackcdn.com/image/fetch/$s_!7oxF!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F520dd9e0-e3f8-4f7d-8e84-92dd46ac33ff_354x354.png</url><title>MLWhiz | AI Unwrapped</title><link>https://www.mlwhiz.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 28 Apr 2026 23:22:01 GMT</lastBuildDate><atom:link href="https://www.mlwhiz.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Rahul Agarwal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[mlwhiz@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[mlwhiz@substack.com]]></itunes:email><itunes:name><![CDATA[Rahul Agarwal]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rahul Agarwal]]></itunes:author><googleplay:owner><![CDATA[mlwhiz@substack.com]]></googleplay:owner><googleplay:email><![CDATA[mlwhiz@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rahul Agarwal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Claude Code vs. Your ML Career: A 2026 Reality Check]]></title><description><![CDATA[The ML job market didn't die. It split into two &#8212; here's how to land on the right side.]]></description><link>https://www.mlwhiz.com/p/will-claude-code-take-your-ml-job</link><guid isPermaLink="false">https://www.mlwhiz.com/p/will-claude-code-take-your-ml-job</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Tue, 28 Apr 2026 22:18:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!F1la!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a70a431-9858-4e27-ba05-099cc33a75f1_1200x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been getting a lot of DMs lately, and they&#8217;re almost all some version of the same question: <em>&#8220;Man, are all the jobs going to be gone? What does the future actually look like for someone like me?&#8221;</em></p><p>Sometimes it&#8217;s from a junior who gets ghosted after four rounds. Sometimes it&#8217;s from a friend from my Meta days who just got told their team is &#8220;not backfilling.&#8221; Sometimes it&#8217;s a student wondering if the whole field is collapsing before they even get their degrees.</p><p>And honestly, I get it. The takes out there are both scary and uninformed. They range from <em>&#8220;AI will replace all engineers in 12 months&#8221;</em> on one side. <em>&#8220;AI just fails on simple tasks like counting r&#8217;s in strawberry,&#8221;</em> on the other. 
After 15 months of watching this play out at Roku, interviewing candidates, and using these tools every single day, I would say that <em>both are wrong.</em> Here&#8217;s what I actually think is going on:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9JS4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa29c399c-9663-49b0-af13-83a329d122ba_1320x600.png" alt=""></figure></div><p><em><strong>IMO, the 2026 ML job market is not dying. It has split into two. People in the first half are in real trouble. The other half may have the best market ML engineers have ever had. The whole game right now is figuring out which half you&#8217;re in, and making the moves that land you on the right side of that split.</strong></em></p><p>A while back, I wrote <em><a href="https://www.mlwhiz.com/p/why-llms-wont-replace-programmers">Why LLMs Won&#8217;t Replace Programmers</a></em>, and followed it up with a <a href="https://www.mlwhiz.com/p/the-real-world-guide-to-machine-learning">2025 careers guide</a>. Both still hold up &#8212; go read them if you haven&#8217;t. This post is about the stuff that&#8217;s <em>changed</em> over the last 15 months, and what I&#8217;d actually do about it if I were you right now. Let&#8217;s get into it.</p><div><hr></div><h2>What I&#8217;m Actually Seeing</h2><p>Let me paint you the picture I&#8217;ve been watching play out.</p><p>A Stanford study by Brynjolfsson and team using real ADP payroll data found something jarring: <em><strong>devs aged 22&#8211;25 are down about 20% in employment since late 2022. Devs aged 30+ in the same kinds of roles? Up 6&#8211;12% over the same window.</strong></em></p><p>Can&#8217;t really say for sure, but at least this is my lived experience. I am getting more calls than usual, as you can see below.
Most of these came in the last 2 weeks:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kX90!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb314798-eca3-4ee5-b7e1-bd73b5e2235d_1576x1634.png" alt=""></figure></div><p>Pair that with what CEOs have been saying on earnings calls, which I don&#8217;t think people have fully absorbed yet:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!qrVx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19940b35-3310-449a-b189-90fb340febc5_1052x264.png" alt="CEO statements on AI-generated code, 2025" title="CEO statements on AI-generated code, 2025"></figure></div><p>Honestly, it&#8217;s pretty much in your face once you gather the courage to see it. <em><strong>The agents have gotten really good at the exact things juniors spent their first three years doing</strong></em> &#8212; boilerplate services, implementing someone else&#8217;s paper, fine-tuning on clean data, plumbing around a senior&#8217;s design.</p><p>So companies have simply stopped hiring juniors and are using agents to give seniors the kind of leverage they used to get from a small team. I see it happening across the orgs and folks I talk to.</p><p><em><strong>That said, the productivity claims flying around from CEOs and vendors are wildly overstated in my opinion.</strong></em> I mean, some days Claude Code seems to double my output.
Other days, I end up spending like 45 minutes fighting it when it wanders into a dark corner and takes an unexpected turn while coding, on something I could&#8217;ve done myself in 5 minutes.</p><p>I have seen enough folks fudging benchmarks and lying in LinkedIn threads that I don&#8217;t take those claims at face value. And honestly, you really need to work with these tools for at least a good few weeks to get the actual picture: their capabilities and the impact they could bring.</p><div><hr></div><h2>What Agents Took. What They Can&#8217;t Touch.</h2><p>So, with that out of the way, the simplest way I&#8217;ve found to think about your career in 2026 is this &#8594; some skills just got way cheaper because agents got really good at them. Others got way more valuable because agents still can&#8217;t do them &#8212; and somebody has to, and that somebody needs to be you.</p><p>Look at what you actually worked on last week. Which column does most of it belong in?</p>
      <p>
          <a href="https://www.mlwhiz.com/p/will-claude-code-take-your-ml-job">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[The Most Complete Guide to PyTorch for Data Scientists]]></title><description><![CDATA[Pytorch is OG]]></description><link>https://www.mlwhiz.com/p/pytorch_guide</link><guid isPermaLink="false">https://www.mlwhiz.com/p/pytorch_guide</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Mon, 27 Apr 2026 00:00:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AJst!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06047a8f-c4f0-4e72-a492-a82df82e196e_1920x1280.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!AJst!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06047a8f-c4f0-4e72-a492-a82df82e196e_1920x1280.png" alt="The Most Complete Guide to PyTorch for Data Scientists" title="The Most Complete Guide to PyTorch for Data Scientists"></figure></div><p><em><strong>PyTorch</strong></em> has sort of become one of the de facto standards for creating Neural Networks now, and I love its interface. Yet, it is somehow a little difficult for beginners to get a hold of.</p><p>I remember picking PyTorch up only after some extensive experimentation a couple of years back. To tell you the truth, it took me a lot of time to pick it up, but am I glad that I moved from <strong><a href="https://towardsdatascience.com/moving-from-keras-to-pytorch-f0d4fff4ce79">Keras to PyTorch</a></strong>. With its high customizability and pythonic syntax, PyTorch is just a joy to work with, and I would recommend it to anyone who wants to do some heavy lifting with Deep Learning.</p><p>So, in this PyTorch guide, <em><strong>I will try to ease some of the pain with PyTorch for starters</strong></em> and go through some of the most important classes and modules that you will require while creating any Neural Network with PyTorch.</p><p>But that is not to say that this is aimed at beginners only, as <em><strong>I will also cover the high customizability PyTorch provides, including custom Layers, Datasets, Dataloaders, and Loss functions</strong></em>.</p><p>So let&#8217;s get some coffee &#9749;&#65039; and start it up.</p><div><hr></div><h2><strong>Tensors</strong></h2><p>Tensors are the basic building blocks in PyTorch. Put very simply, they are NumPy arrays, but on GPU. In this part, I will list down some of the most common operations we can use while working with Tensors.
This is by no means an exhaustive list of operations you can do with Tensors, but it is helpful to understand what tensors are before going on to the more exciting parts.</p><h3><strong>1. Create a Tensor</strong></h3><p>We can create a PyTorch tensor in multiple ways. This includes converting to a tensor from a NumPy array. Below is just a small gist with some examples to start with, but you can do a whole lot of <strong><a href="https://pytorch.org/docs/stable/tensors.html">more things</a></strong> with tensors, just like you can do with NumPy arrays.</p><pre><code><code>import torch
import numpy as np

# Using torch.Tensor
t = torch.Tensor([[1,2,3],[3,4,5]])
print(f"Created Tensor Using torch.Tensor:\n{t}")

# Using torch.randn
t = torch.randn(3, 5)
print(f"Created Tensor Using torch.randn:\n{t}")

# using torch.[ones|zeros](*size)
t = torch.ones(3, 5)
print(f"Created Tensor Using torch.ones:\n{t}")
t = torch.zeros(3, 5)
print(f"Created Tensor Using torch.zeros:\n{t}")

# using torch.randint - a tensor of size 4,5 with entries between 0 and 10 (10 excluded)
t = torch.randint(low = 0,high = 10,size = (4,5))
print(f"Created Tensor Using torch.randint:\n{t}")

# Using from_numpy to convert from Numpy Array to Tensor
a = np.array([[1,2,3],[3,4,5]])
t = torch.from_numpy(a)
print(f"Convert to Tensor From Numpy Array:\n{t}")

# Using .numpy() to convert from Tensor to Numpy array
t = t.numpy()
print(f"Convert to Numpy Array From Tensor:\n{t}")
</code></code></pre><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!yFko!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7df02c37-6da2-43cc-b2c1-4e84534dab60_1216x561.png" alt="MLWhiz: Data Science, Machine Learning, Artificial Intelligence" title="MLWhiz: Data Science, Machine Learning, Artificial Intelligence"></figure></div><h3><strong>2. Tensor Operations</strong></h3><p>Again, there are a lot of operations you can do on these tensors. The full list of functions can be found <strong><a href="https://pytorch.org/docs/stable/torch.html?highlight=mm#math-operations">here</a></strong>.</p><pre><code><code>A = torch.randn(3,4)
W = torch.randn(4,2)
# Multiply Matrix A and W
t = A.mm(W)
print(f"Created Tensor t by Multiplying A and W:\n{t}")
# Transpose Tensor t
t = t.t()
print(f"Transpose of Tensor t:\n{t}")
# Square each element of t
t = t**2
print(f"Square each element of Tensor t:\n{t}")
# return the size of a tensor
print(f"Size of Tensor t using .size():\n{t.size()}")
</code></code></pre><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!r_xE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43c973b3-9646-49f1-860d-059a2ead7930_1216x264.png" alt="MLWhiz: Data Science, Machine Learning, Artificial Intelligence" title="MLWhiz: Data Science, Machine Learning, Artificial Intelligence"></figure></div><p><strong>Note:</strong> What are PyTorch Variables? In previous versions of PyTorch, Tensors and Variables used to be different and provided different functionality, but now the Variable API is <strong><a href="https://pytorch.org/docs/stable/autograd.html#variable-deprecated">deprecated</a></strong>, and all methods for Variables work with Tensors. So, if you don&#8217;t know about them, it&#8217;s fine as they&#8217;re not needed, and if you do know them, you can forget about them.</p><div><hr></div><h2><strong>The nn.Module</strong></h2><p>Here comes the fun part, as we are now going to talk about some of the most used constructs in PyTorch for creating deep learning projects. nn.Module lets you create your Deep Learning models as a class. You can inherit from <code>nn.Module</code> to define any model as a class. Every model class necessarily contains an <code>__init__</code> procedure block and a block for the <code>forward</code> pass.</p><ul><li><p>In the <code>__init__</code> part, the user can define all the layers the network is going to have but doesn&#8217;t yet define how those layers would be connected to each other.</p></li><li><p>In the <code>forward</code> pass block, the user defines how data flows from one layer to another inside the network.</p></li></ul><p>So, put simply, any network we define will look like:</p><pre><code><code>import torch.nn as nn

class myNeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Define all Layers Here
        self.lin1 = nn.Linear(784, 30)
        self.lin2 = nn.Linear(30, 10)
    def forward(self, x):
        # Connect the layer Outputs here to define the forward pass
        x = self.lin1(x)
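        # (in a real network you would usually apply a nonlinearity here, e.g. x = torch.relu(x))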
        x = self.lin2(x)
        return x
</code></code></pre><p>Here we have defined a very simple network that takes an input of size 784 and passes it through two linear layers in a sequential manner. But the thing to note is that we can define any sort of calculation while defining the forward pass, and that makes PyTorch highly customizable for research purposes. For example, in our crazy experimentation mode, we might use the below network, where we arbitrarily attach our layers. Here we add the original input to the output of the second linear layer (a skip connection) and send the result back through the first linear layer again (I honestly don&#8217;t know what that will do).</p><pre><code><code>class myCrazyNeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Define all Layers Here
        self.lin1 = nn.Linear(784, 30)
        self.lin2 = nn.Linear(30, 784)
        self.lin3 = nn.Linear(30, 10)

    def forward(self, x):
        # Connect the layer Outputs here to define the forward pass
        x_lin1 = self.lin1(x)
        x_lin2 = x + self.lin2(x_lin1)
        x_lin2 = self.lin1(x_lin2)
        x = self.lin3(x_lin2)
        return x
</code></code></pre><p>We can also check if the neural network forward pass works. I usually do that by first creating some random input and just passing that through the network I have created.</p><pre><code><code>x = torch.randn((100,784))
model = myCrazyNeuralNet()
model(x).size()
--------------------------
torch.Size([100, 10])
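
# Shape check: 784 -> 30 (lin1) -> 784 (lin2 plus the skip connection)
# -> 30 (lin1 again) -> 10 (lin3), so each of the 100 inputs ends up as a
# 10-dimensional output, hence torch.Size([100, 10]).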
</code></code></pre><div><hr></div><h2><strong>A word about Layers</strong></h2><p>Pytorch is pretty powerful, and you can actually create any new experimental layer by yourself using <code>nn.Module</code>. For example, rather than using the predefined Linear Layer <code>nn.Linear</code> from Pytorch above, we could have created our <strong>custom linear layer</strong>.</p><pre><code><code>class myCustomLinearLayer(nn.Module):
    def __init__(self,in_size,out_size):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(in_size, out_size))
        self.bias = nn.Parameter(torch.zeros(out_size))
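        # (plain randn/zeros keeps this demo simple; PyTorch's built-in nn.Linear uses a scaled initialization instead)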
    def forward(self, x):
        return x.mm(self.weights) + self.bias
</code></code></pre><p>You can see how we wrap our weights tensor in nn.Parameter. This is done so that the tensor is considered a model parameter. From the PyTorch <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html#parameter">docs</a></strong>:</p><blockquote><p><em>Parameters are <strong><a href="https://pytorch.org/docs/stable/tensors.html#torch.Tensor"><code>Tensor</code></a></strong> subclasses, that have a very special property when used with Module - when they&#8217;re assigned as Module attributes they are automatically added to the list of its parameters, and will appear in </em><code>parameters()</code><em> iterator</em></p></blockquote><p>As you will later see, the <code>model.parameters()</code> iterator will be an input to the optimizer. But more on that later.</p><p>For now, we can use this custom layer in any PyTorch network, just like any other layer.</p><pre><code><code>class myCustomNeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Define all Layers Here
        self.lin1 = myCustomLinearLayer(784,10)

    def forward(self, x):
        # Connect the layer Outputs here to define the forward pass
        x = self.lin1(x)
        return x
x = torch.randn((100,784))
model = myCustomNeuralNet()
model(x).size()
------------------------------------------
torch.Size([100, 10])
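
# Because the weights and bias are wrapped in nn.Parameter, they get registered
# automatically and show up in model.parameters(), the same iterator we will
# later hand to the optimizer: 784*10 weights + 10 biases = 7850 values.
print(sum(p.numel() for p in model.parameters()))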
</code></code></pre><p>But then again, Pytorch would not be so widely used if it didn&#8217;t provide a lot of ready-made layers used very frequently in a wide variety of Neural Network architectures. Some examples are: <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear">nn.Linear</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d">nn.Conv2d</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d">nn.MaxPool2d</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU">nn.ReLU</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html#torch.nn.BatchNorm2d">nn.BatchNorm2d</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout">nn.Dropout</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html#torch.nn.Embedding">nn.Embedding</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU">nn.GRU</a></strong> / <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM">nn.LSTM</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html#torch.nn.Softmax">nn.Softmax</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.LogSoftmax.html#torch.nn.LogSoftmax">nn.LogSoftmax</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention">nn.MultiheadAttention</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html#torch.nn.TransformerEncoder">nn.TransformerEncoder</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html#torch.nn.TransformerDecoder">nn.TransformerDecoder</a></strong></p><p>I have linked all the layers to their documentation, where you can read all about them. But to show how I usually try to understand a layer by reading the docs, let&#8217;s look at a very simple convolutional layer here.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Lie5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35a99de-56c3-4811-981f-58eb15c6c82f_1462x515.png" alt="MLWhiz: Data Science, Machine Learning, Artificial Intelligence" title="MLWhiz: Data Science, Machine Learning, Artificial Intelligence"></figure></div><p>So, a Conv2d Layer needs as input an Image of height H and width W, with <code>Cin</code> channels. Now, for the first layer in a convnet, the number of <code>in_channels</code> would be 3 (RGB), and the number of <code>out_channels</code> can be defined by the user. The <code>kernel_size</code> mostly used is 3x3, and the <code>stride</code> normally used is 1.</p><p>To check a new layer which I don&#8217;t know much about, I usually try to see the input as well as the output for the layer. Below, I would first initialize the layer:</p><pre><code><code>conv_layer = nn.Conv2d(in_channels = 3, out_channels = 64, kernel_size = (3,3), stride = 1, padding=1)
</code></code></pre><p>And then pass some random input through it. Here 100 is the batch size.</p><pre><code><code>x = torch.randn((100,3,24,24))
conv_layer(x).size()
--------------------------------
torch.Size([100, 64, 24, 24])
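
# The output spatial size follows (W + 2*padding - kernel_size)/stride + 1
# = (24 + 2*1 - 3)/1 + 1 = 24, so the 24x24 maps are preserved and we get
# the 64 output channels we asked for.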
</code></code></pre><p>So, we get the output from the convolution operation as required, and I have sufficient information on how to use this layer in any Neural Network I design.</p><div><hr></div><h2><strong>Datasets and DataLoaders</strong></h2><p>How would we pass data to our Neural nets while training or while testing? We can definitely pass tensors as we have done above, but Pytorch also provides us with pre-built Datasets to make it easier for us to pass data to our neural nets. You can check out the complete list of datasets provided at <strong><a href="https://pytorch.org/docs/stable/torchvision/datasets.html">torchvision.datasets</a></strong> and <strong><a href="https://pytorch.org/text/datasets.html">torchtext.datasets</a></strong> . But, to give a concrete example for datasets, let&#8217;s say we had to pass images to an Image Neural net using a folder which has images in this structure:</p><pre><code><code>data
    train
        sailboat
        kayak
        .
        .
</code></code></pre><p>We can use torchvision.datasets.ImageFolder dataset to get an example image like below:</p><pre><code><code>from torchvision import transforms
from torchvision.datasets import ImageFolder
traindir = "data/train/"
t = transforms.Compose([
    transforms.Resize(size=256),
    transforms.CenterCrop(size=224),
    transforms.ToTensor()])
train_dataset = ImageFolder(root=traindir,transform=t)
print("Num Images in Dataset:", len(train_dataset))
print("Example Image and Label:", train_dataset[2])
</code></code></pre><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!7Po2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b21897-4f40-4571-b0cb-29db6b14ac9d_989x513.png" alt="MLWhiz: Data Science, Machine Learning, Artificial Intelligence" title="MLWhiz: Data Science, Machine Learning, Artificial Intelligence"></figure></div><p>This dataset has 847 images, and we can get an image and its label using an index. Now we can pass images one by one to any image neural network using a for loop:</p><pre><code><code>for i in range(0,len(train_dataset)):
    image, label = train_dataset[i]
    # most image models expect a batch dimension, so in practice you may need image.unsqueeze(0) here
    pred = model(image)
</code></code></pre><p><em><strong>But that is not optimal. We want to do batching.</strong></em> We can actually write some more code to append images and labels in a batch and then pass it to the Neural network. But Pytorch provides us with a utility iterator torch.utils.data.DataLoader to do precisely that. Now we can simply wrap our train_dataset in the Dataloader, and we will get batches instead of individual examples.</p><pre><code><code>train_dataloader = DataLoader(train_dataset,batch_size = 64, shuffle=True, num_workers=10)
</code></code></pre><p>We can simply iterate with batches using:</p><pre><code><code>for image_batch, label_batch in train_dataloader:
    print(image_batch.size(),label_batch.size())
    break
-------------------------------------------------
torch.Size([64, 3, 224, 224]) torch.Size([64])
</code></code></pre><p>So actually, the whole process of using datasets and Dataloaders becomes:</p><pre><code><code>t = transforms.Compose([
        transforms.Resize(size=256),
        transforms.CenterCrop(size=224),
        transforms.ToTensor()])

train_dataset = torchvision.datasets.ImageFolder(root=traindir,transform=t)
train_dataloader = DataLoader(train_dataset,batch_size = 64, shuffle=True, num_workers=10)

for image_batch, label_batch in train_dataloader:
    pred = myImageNeuralNet(image_batch)
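</code></code></pre><p>As a quick sanity check, the <code>ImageFolder</code> dataset also exposes the class list and the label mapping it builds from the sub-folder names. A small check, assuming the <code>train_dataset</code> created above:</p><pre><code><code># ImageFolder infers the classes from the sub-folder names
print(len(train_dataset))          # number of images found
print(train_dataset.classes)       # list of class names
print(train_dataset.class_to_idx)  # mapping from class name to label index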
</code></code></pre><p>You can look at this particular example in action in my previous blogpost on Image classification using Deep Learning <strong><a href="https://towardsdatascience.com/end-to-end-pipeline-for-setting-up-multiclass-image-classification-for-data-scientists-2e051081d41c">here</a></strong>.</p><p>This is great, and Pytorch does provide a lot of functionality out of the box. But the main power of Pytorch comes with its immense customization. We can also create our own custom datasets if the datasets provided by PyTorch don&#8217;t fit our use case.</p><div><hr></div><h3><strong>Understanding Custom Datasets</strong></h3><p>To write our custom datasets, we can make use of the abstract class <code>torch.utils.data.Dataset</code> provided by Pytorch. We need to inherit from this <code>Dataset</code> class and define two methods to create a custom Dataset:</p><ul><li><p><code>__len__</code>: a function that returns the size of the dataset. This one is pretty simple to write in most cases.</p></li><li><p><code>__getitem__</code>: a function that takes an index <code>i</code> as input and returns the sample at index <code>i</code>.</p></li></ul><p>For example, we can create a simple custom dataset that returns an image and a label from a folder. Notice that most of the work happens in <code>__init__</code>, where we use <code>glob</code> to get the image paths and do some general preprocessing.</p><pre><code><code>from glob import glob
from PIL import Image
from torch.utils.data import Dataset

class customImageFolderDataset(Dataset):
    """Custom Image Loader dataset."""
    def __init__(self, root, transform=None):
        """
        Args:
            root (string): Path to the images organized in a particular folder structure.
            transform: Any Pytorch transform to be applied
        """
        # Get all image paths from a directory
        self.image_paths = glob(f"{root}/*/*")
        # Get the labels from the image paths
        self.labels = [x.split("/")[-2] for x in self.image_paths]
        # Create a dictionary mapping each label to an index from 0 to len(classes).
        # sorted() keeps the label-to-index mapping deterministic across runs.
        self.label_to_idx = {x: i for i, x in enumerate(sorted(set(self.labels)))}
        self.transform = transform

    def __len__(self):
        # return length of dataset
        return len(self.image_paths)

    def __getitem__(self, idx):
        # open and send one image and label
        img_name = self.image_paths[idx]
        label = self.labels[idx]
        image = Image.open(img_name)
        if self.transform:
            image = self.transform(image)
        return image,self.label_to_idx[label]
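</code></code></pre><p>Indexing into this dataset calls <code>__getitem__</code> and opens just that one image. A minimal check, assuming the <code>traindir</code> and the transform <code>t</code> defined earlier:</p><pre><code><code>dataset = customImageFolderDataset(root=traindir, transform=t)
print(len(dataset))        # calls __len__
image, label = dataset[0]  # calls __getitem__ and loads only this image
print(image.shape, label)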
</code></code></pre><p>Also, note that we open our images one at a time in the <code>__getitem__</code> method and not while initializing. We avoid doing this in <code>__init__</code> because we don&#8217;t want to load all our images into memory; we only load the ones we actually need.</p><p>We can now use this dataset with the utility <code>DataLoader</code> just like before. It works just like the dataset provided by PyTorch, only without some of its utility functions.</p><pre><code><code>t = transforms.Compose([
        transforms.Resize(size=256),
        transforms.CenterCrop(size=224),
        transforms.ToTensor()])

train_dataset = customImageFolderDataset(root=traindir,transform=t)
train_dataloader = DataLoader(train_dataset,batch_size = 64, shuffle=True, num_workers=10)

for image_batch, label_batch in train_dataloader:
    pred = myImageNeuralNet(image_batch)
</code></code></pre><div><hr></div><h3><strong>Understanding Custom DataLoaders</strong></h3><p><strong>This particular section is a little advanced and can be skipped on a first pass through this post, as it will not be needed in a lot of situations.</strong> But I am adding it here for completeness.</p><p>So let&#8217;s say you are looking to provide batches to a network that processes text input, and the network can take sequences of any length as long as the length stays constant within a batch. For example, we can have a BiLSTM network that can process sequences of any length. It&#8217;s alright if you don&#8217;t understand the layers used in it right now; just know that it can process sequences with variable sizes.</p><pre><code><code>class BiLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden_size = 64
        drp = 0.1
        max_features, embed_size = 10000,300
        self.embedding = nn.Embedding(max_features, embed_size)
        self.lstm = nn.LSTM(embed_size, self.hidden_size, bidirectional=True, batch_first=True)
        self.linear = nn.Linear(self.hidden_size*4 , 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(drp)
        self.out = nn.Linear(64, 1)


    def forward(self, x):
        # embedding lookup: (batch_size, seq_len) becomes (batch_size, seq_len, embed_size)
        h_embedding = self.embedding(x)

        h_lstm, _ = self.lstm(h_embedding)
        avg_pool = torch.mean(h_lstm, 1)
        max_pool, _ = torch.max(h_lstm, 1)
        conc = torch.cat(( avg_pool, max_pool), 1)
        conc = self.relu(self.linear(conc))
        conc = self.dropout(conc)
        out = self.out(conc)
        return out
</code></code></pre><p>This network expects its input to be of shape (<code>batch_size</code>, <code>seq_length</code>) and works with any <code>seq_length</code>. We can check this by passing our model two random batches with different sequence lengths (10 and 25).</p><pre><code><code>model = BiLSTM()
input_batch_1 = torch.randint(low=0, high=10000, size=(100, 10))
input_batch_2 = torch.randint(low=0, high=10000, size=(100, 25))
print(model(input_batch_1).size())
print(model(input_batch_2).size())
------------------------------------------------------------------
torch.Size([100, 1])
torch.Size([100, 1])
</code></code></pre><p>Now, we want to provide tight batches to this model, such that every sequence in a batch is padded only up to the max sequence length in that batch, to minimize padding. This has the added benefit of making the neural net run faster. It was, in fact, one of the methods used in the winning submission of the Quora Insincere challenge on Kaggle, where running time was of utmost importance.</p><p>So, how do we do this? Let&#8217;s write a very simple custom dataset class first.</p><pre><code><code>class CustomTextDataset(Dataset):
    '''
    Simple Dataset initializes with X and y vectors
    We start by sorting our X and y vectors by sequence lengths
    '''
    def __init__(self,X,y=None):
        self.data = list(zip(X,y))
        # Sort by length of first element in tuple
        self.data = sorted(self.data, key=lambda x: len(x[0]))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
</code></code></pre><p>Also, let&#8217;s generate some random data which we will use with this custom Dataset.</p><pre><code><code>import numpy as np
train_data_size = 1024
sizes = np.random.randint(low=50,high=300,size=(train_data_size,))
X = [np.random.randint(0,10000, (sizes[i])) for i in range(train_data_size)]
y = np.random.rand(train_data_size).round()
#checking one example in dataset
print((X[0],y[0]))
</code></code></pre><p><em>Example of one random sequence and label. Each integer in the sequence corresponds to a word in the sentence.</em></p><p>We can use the custom dataset now using:</p><pre><code><code>train_dataset = CustomTextDataset(X,y)
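</code></code></pre><p>Since the dataset sorts everything by sequence length in <code>__init__</code>, the first item is now the shortest sequence and the last item the longest. A quick check:</p><pre><code><code>print(len(train_dataset[0][0]))   # shortest sequence
print(len(train_dataset[-1][0]))  # longest sequence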
</code></code></pre><p>If we now try to use the Dataloader on this dataset with <code>batch_size</code>&gt;1, we will get an error. Why is that?</p><pre><code><code>train_dataloader = DataLoader(train_dataset,batch_size = 64, shuffle=False, num_workers=10)
for xb,yb in train_dataloader:
    print(xb.size(),yb.size())
</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w6zO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w6zO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png 424w, https://substackcdn.com/image/fetch/$s_!w6zO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png 848w, https://substackcdn.com/image/fetch/$s_!w6zO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png 1272w, https://substackcdn.com/image/fetch/$s_!w6zO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w6zO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png" width="1069" height="29" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:29,&quot;width&quot;:1069,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;MLWhiz: Data Science, Machine Learning, Artificial Intelligence&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="MLWhiz: Data Science, Machine Learning, Artificial Intelligence" title="MLWhiz: Data Science, Machine Learning, Artificial Intelligence" srcset="https://substackcdn.com/image/fetch/$s_!w6zO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png 424w, https://substackcdn.com/image/fetch/$s_!w6zO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png 848w, https://substackcdn.com/image/fetch/$s_!w6zO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png 1272w, https://substackcdn.com/image/fetch/$s_!w6zO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f6c4f5-874e-418a-8ae0-cc9170e3fe8c_1069x29.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This happens because the sequences have different lengths, and our data loader expects our sequences of the same length. 
Remember that in the previous image example, we resized all images to size 224 using the transforms, so we didn&#8217;t face this error.</p><p><em><strong>So, how do we iterate through this dataset so that each batch has sequences with the same length, but different batches may have different sequence lengths?</strong></em></p><p>We can use the <code>collate_fn</code> parameter of the DataLoader, which lets us define how to stack sequences in a particular batch. To use this, we need to define a function that takes a batch as input and returns (<code>x_batch</code>, <code>y_batch</code>), with the sequences padded to the max sequence length in that batch. The operations used in the function below are simple NumPy operations. Also, the function is properly commented so you can understand what is happening.</p><pre><code><code>def collate_text(batch):
    # get text sequences in batch
    data = [item[0] for item in batch]
    # get labels in batch
    target = [item[1] for item in batch]
    # get max_seq_length in batch
    max_seq_len = max([len(x) for x in data])
    # pad text sequences based on max_seq_len
    data = [np.pad(p, (0, max_seq_len - len(p)), 'constant') for p in data]
    # convert data and target to tensor
    data = torch.LongTensor(data)
    target = torch.LongTensor(target)
    return [data, target]
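</code></code></pre><p>If you prefer to stay inside PyTorch for the padding itself, the same idea can be written with <code>torch.nn.utils.rnn.pad_sequence</code>. A small alternative sketch (the name <code>collate_text_v2</code> is just an illustrative choice):</p><pre><code><code>from torch.nn.utils.rnn import pad_sequence

def collate_text_v2(batch):
    # same idea as collate_text, but using PyTorch's built-in padding helper
    data = [torch.as_tensor(item[0], dtype=torch.long) for item in batch]
    target = torch.tensor([item[1] for item in batch], dtype=torch.long)
    # pad_sequence pads every sequence in the batch up to the longest one
    data = pad_sequence(data, batch_first=True, padding_value=0)
    return [data, target]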
</code></code></pre><p>We can now use this <code>collate_fn</code> with our Dataloader as:</p><pre><code><code>train_dataloader = DataLoader(train_dataset,batch_size = 64, shuffle=False, num_workers=10,collate_fn = collate_text)

for xb,yb in train_dataloader:
    print(xb.size(),yb.size())
</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GtRA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GtRA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png 424w, https://substackcdn.com/image/fetch/$s_!GtRA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png 848w, https://substackcdn.com/image/fetch/$s_!GtRA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png 1272w, https://substackcdn.com/image/fetch/$s_!GtRA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GtRA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png" width="1224" height="451" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1224,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;See that the batches have different sequence lengths now&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="See that the batches have different sequence lengths now" title="See that the batches have different sequence lengths now" srcset="https://substackcdn.com/image/fetch/$s_!GtRA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png 424w, https://substackcdn.com/image/fetch/$s_!GtRA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png 848w, https://substackcdn.com/image/fetch/$s_!GtRA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png 1272w, https://substackcdn.com/image/fetch/$s_!GtRA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45587a34-cb22-4a58-a2be-775fe91d0603_1224x451.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It will work this time as we have provided a custom <code>collate_fn</code>. And see that the batches have different sequence lengths now. Thus we would be able to train our BiLSTM using variable input sizes just like we wanted.</p><div><hr></div><h2><strong>Training a Neural Network</strong></h2><p>We know how to create a neural network using <code>nn.Module</code>. But how to train it? Any neural network that has to be trained will have a training loop that will look something similar to below:</p><pre><code><code>num_epochs = 5
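
# The loop below assumes that the model, the train/valid dataloaders, an optimizer
# and a loss criterion have already been created. One possible setup
# (illustrative choices only):
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_criterion = nn.CrossEntropyLoss()
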
for epoch in range(num_epochs):
    # Set model to train mode
    model.train()
    for x_batch, y_batch in train_dataloader:
        # Clear gradients
        optimizer.zero_grad()
        # Forward pass - Predicted outputs
        pred = model(x_batch)
        # Find Loss and backpropagation of gradients
        loss = loss_criterion(pred, y_batch)
        loss.backward()
        # Update the parameters
        optimizer.step()
    # Set model to eval mode and disable gradient tracking for validation
    model.eval()
    with torch.no_grad():
        for x_batch, y_batch in valid_dataloader:
            pred = model(x_batch)
            val_loss = loss_criterion(pred, y_batch)
</code></code></pre><p>In the above code, we are running five epochs and in each epoch:</p><ol><li><p>We iterate through the dataset using a data loader.</p></li><li><p>In each iteration, we do a forward pass using <code>model(x_batch)</code>.</p></li><li><p>We calculate the loss using a <code>loss_criterion</code>.</p></li><li><p>We back-propagate that loss using the <code>loss.backward()</code> call. We don&#8217;t have to worry about the calculation of the gradients at all, as this simple call does it all for us.</p></li><li><p>We take an optimizer step to change the weights in the whole network using <code>optimizer.step()</code>. This is where the weights of the network get modified using the gradients calculated in the <code>loss.backward()</code> call.</p></li><li><p>We go through the validation data loader to check the validation score/metrics. Before doing validation, we set the model to eval mode using <code>model.eval()</code> and wrap the loop in <code>torch.no_grad()</code> so no gradients are tracked. Please note we don&#8217;t back-propagate losses during validation.</p></li></ol><p>Till now, we have talked about how to use <code>nn.Module</code> to create networks and how to use Custom Datasets and Dataloaders with Pytorch. So let&#8217;s talk about the various options available for Loss Functions and Optimizers.</p><div><hr></div><h2><strong>Loss functions</strong></h2><p>Pytorch provides us with a variety of <strong><a href="https://pytorch.org/docs/stable/nn.html#loss-functions">loss functions</a></strong> for our most common tasks, like Classification and Regression. Some of the most used examples are <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss">nn.CrossEntropyLoss</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss">nn.NLLLoss</a></strong>, <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss">nn.KLDivLoss</a></strong> and <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss">nn.MSELoss</a></strong>.
You can read the documentation of each loss function, but to explain how to use these loss functions, I will go through the example of <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html#torch.nn.NLLLoss">nn.NLLLoss</a></strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qpgd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qpgd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png 424w, https://substackcdn.com/image/fetch/$s_!qpgd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png 848w, https://substackcdn.com/image/fetch/$s_!qpgd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png 1272w, https://substackcdn.com/image/fetch/$s_!qpgd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qpgd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png" width="1456" height="888" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;MLWhiz: Data Science, Machine Learning, Artificial Intelligence&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="MLWhiz: Data Science, Machine Learning, Artificial Intelligence" title="MLWhiz: Data Science, Machine Learning, Artificial Intelligence" srcset="https://substackcdn.com/image/fetch/$s_!qpgd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png 424w, https://substackcdn.com/image/fetch/$s_!qpgd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png 848w, https://substackcdn.com/image/fetch/$s_!qpgd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qpgd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bc9067-c381-4d9b-bcc2-990f344be0cd_1488x908.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The documentation for NLLLoss is pretty succinct. This loss function is used for multiclass classification, and based on the documentation:</p><ul><li><p>The expected input needs to be of size (<code>batch_size</code> x <code>Num_Classes</code>) &#8212; these are the predictions from the Neural Network we have created.</p></li><li><p>We need to have the log-probabilities of each class in the input &#8212; to get log-probabilities from a Neural Network, we can add a <code>LogSoftmax</code> layer as the last layer of our network.</p></li><li><p>The target needs to be a tensor of classes with class numbers in the range [0, C-1], where C is the number of classes.</p></li></ul><p>So, we can try to use this loss function for a simple classification network. Please note the LogSoftmax layer after the final linear layer. If you don&#8217;t want to use this LogSoftmax layer, you could just use <strong><a href="https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss"><code>nn.CrossEntropyLoss</code></a></strong>.</p><pre><code><code>class myClassificationNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Define all Layers Here
        self.lin = nn.Linear(784, 10)
        self.logsoftmax = nn.LogSoftmax(dim=1)
    def forward(self, x):
        # Connect the layer Outputs here to define the forward pass
        x = self.lin(x)
        x = self.logsoftmax(x)
        return x
</code></code></pre><p>Let&#8217;s define a random input to pass to our network to test it:</p><pre><code><code># some random input:

X = torch.randn(100,784)
y = torch.randint(low = 0,high = 10,size = (100,))
</code></code></pre><p>And pass it through the model to get predictions:</p><pre><code><code>model = myClassificationNet()
preds = model(X)
</code></code></pre><p>We can now get the loss as:</p><pre><code><code>criterion = nn.NLLLoss()
loss = criterion(preds,y)
loss
------------------------------------------
tensor(2.4852, grad_fn=&lt;NllLossBackward&gt;)
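</code></code></pre><p>The <code>nn.CrossEntropyLoss</code> mentioned above combines <code>LogSoftmax</code> and <code>NLLLoss</code> in a single step, so applied to the raw linear outputs it gives the same value. A small check, assuming the model, <code>X</code> and <code>y</code> from above:</p><pre><code><code>logits = model.lin(X)  # raw outputs, before the LogSoftmax layer
ce_criterion = nn.CrossEntropyLoss()
print(ce_criterion(logits, y))  # matches the NLLLoss value above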
</code></code></pre><div><hr></div><h3><strong>Custom Loss Function</strong></h3>
      <p>
          <a href="https://www.mlwhiz.com/p/pytorch_guide">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[I Use Claude Code Every Day. Here's the Setup That Actually Matters]]></title><description><![CDATA[CLAUDE.md, Skills 2.0, permission modes, channels &#8212; an opinionated guide for beginners and the mildly curious]]></description><link>https://www.mlwhiz.com/p/i-use-claude-code-every-day-heres</link><guid isPermaLink="false">https://www.mlwhiz.com/p/i-use-claude-code-every-day-heres</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Thu, 23 Apr 2026 02:06:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zpGL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zpGL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zpGL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zpGL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zpGL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zpGL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zpGL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg" width="1280" height="737" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:737,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Claude Code makes it easy to trigger a code check now with this simple  command | ZDNET&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Claude Code makes it easy to trigger a code check now with this simple  command | ZDNET" title="Claude Code makes it easy to trigger a code check now with this simple  command | ZDNET" 
srcset="https://substackcdn.com/image/fetch/$s_!zpGL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zpGL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zpGL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zpGL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda038624-aeab-4f92-81b5-c1ed9ab38219_1280x737.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let me admit something upfront &#8594; there is always a small, mildly annoying blocker before picking up a new tool. </p><p>That little &lt;<em>do I really want to learn yet another thing?&gt;</em> feeling &#8212; even when half of LinkedIn is screaming that you must. </p><p>I felt it with Claude Code. I had it installed on my machine for two full weeks before I actually sat down to use it. And when I finally did, the first hour was a lot of reading docs, clicking &#8220;yes&#8221; to prompts I did not understand, and wondering if this was genuinely going to pay off or if I had just given another AI tool access to my filesystem.</p><p>Here is the thing about Claude Code, though. The hello-world is easy. </p><p>You <code>npm install</code>, you type <code>claude</code>, you ask it to fix a bug, it fixes the bug. You feel clever. Then you look at the docs and you feel lost. Are you making the most out of it?</p><p>This is so confusing. Which of this actually matters? And which of it can you safely ignore for now?</p><p>I&#8217;ve been using Claude Code daily for months since that slow start, and in the first week I made every setup mistake I could. 
I clicked &#8220;yes&#8221; to permission prompts a thousand times before I realized <code>acceptEdits</code> and <code>dangerously-skip-permissions</code> existed. Racked up a $200 API bill before I discovered the Max plan can be used to login as well. Ended up starting forty-something terminal claude sessions which got lost as I didn&#8217;t know <code>claude -c</code> was a thing. </p><p>This post is the shortcut I wish someone had handed me in week one &#8212; an opinionated setup that gets you from &#8220;I installed it&#8221; to &#8220;I actually use this daily&#8221; in a weekend. The 20% of the surface area that delivers 80% of the value. Plus honest opinions on which features are worth your time and which ones you can skip. </p><div class="pullquote"><p>The <a href="https://code.claude.com/docs/en/overview">official docs</a> are great if you want the full feature dump; this is the opposite of that.</p></div><p>Let&#8217;s dive in.</p><div><hr></div><h2>1. What Claude Code Actually Is (And What It Isn&#8217;t)</h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GLPn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GLPn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png 424w, https://substackcdn.com/image/fetch/$s_!GLPn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png 848w, https://substackcdn.com/image/fetch/$s_!GLPn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png 1272w, https://substackcdn.com/image/fetch/$s_!GLPn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GLPn!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png" width="1200" height="175.54945054945054" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:213,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Claude Code agent loop: you prompt the CLI, the CLI talks to Claude with your CLAUDE.md and context, Claude calls tools that read and write your files, observations come back, the loop repeats until the task is done&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="The Claude Code agent loop: you 
prompt the CLI, the CLI talks to Claude with your CLAUDE.md and context, Claude calls tools that read and write your files, observations come back, the loop repeats until the task is done" title="The Claude Code agent loop: you prompt the CLI, the CLI talks to Claude with your CLAUDE.md and context, Claude calls tools that read and write your files, observations come back, the loop repeats until the task is done" srcset="https://substackcdn.com/image/fetch/$s_!GLPn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png 424w, https://substackcdn.com/image/fetch/$s_!GLPn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png 848w, https://substackcdn.com/image/fetch/$s_!GLPn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png 1272w, https://substackcdn.com/image/fetch/$s_!GLPn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfe35ba1-1c58-44c2-9d3e-d16d776998ee_2958x433.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Before we install anything, let&#8217;s get the mental model right. This is the single most common reason people bounce off Claude Code in the first hour.</p><p>Claude Code is <strong>not</strong> a VS Code extension that autocompletes your code. It is <strong>not</strong> a chat sidebar. It is <strong>not</strong> Cursor or Copilot. Yes, it has a VS Code integration, but the integration is a thin window over a CLI tool that runs in your terminal.</p><p>What Claude Code actually is: an <strong>agentic CLI</strong>. You run <code>claude</code> in your terminal, you tell it what you want in plain English, and it then reads files, writes files, runs bash commands, executes your tests, browses the web, calls APIs, and keeps iterating until the task is done. You watch it work, you can interrupt at any point, <em><strong>you can steer.</strong></em></p><p>Think of it less like autocomplete and more like pair-programming with a junior developer who types fast, never gets tired, sometimes goes off the rails, and occasionally needs a hard &#8220;no, do not do that.&#8221;</p><p><em><strong>That loop above &#8212; shown in the diagram at the start of this section &#8212; is the whole product</strong></em>. Read, think, act, observe, repeat. Your job is to give it good context up front and steer when it drifts.</p><p>If this sounds familiar, it should &#8212; I wrote about <a href="https://www.mlwhiz.com/p/genai-series-my-tryst-with-ai-assisted">my first dance with vibe coding</a> using Claude Pro a while back. Claude Code is what happens when that same loop moves out of a chat window and into your actual terminal, with access to your actual files and your actual tests.</p><p>And it is not a small thing. As of February 2026, roughly <strong>4% of all public commits on GitHub</strong> were authored by Claude Code &#8212; about 135,000 commits per day. Anthropic itself says <strong>90% of their internal code is now AI-written</strong>. ServiceNow has 29,000 daily users on it. This is not a Twitter fad. 
<em><strong>This is real and this is here to stay.</strong></em></p><p>The question is no longer &#8220;is this useful.&#8221; The question is how you go from &#8220;I installed it&#8221; to &#8220;I actually use it well.&#8221; Which is the rest of this post.</p><div class="pullquote"><p><strong>Anyone still calling this &#8220;autocomplete&#8221; has either not used it, or has not been paying attention.</strong></p></div><div><hr></div><h2>2. Install and Get Logged In</h2><p>Installation is genuinely a one-liner now.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">npm install -g @anthropic-ai/claude-code</code></pre></div><p>You need Node 18 or newer. That is the whole prerequisite list.</p><p>Open a terminal and Run <code>claude</code> in any directory. The first time, it walks you through login. You get two choices: a <strong>claude.ai account</strong> (Pro or Max plan) or an <strong>API key</strong> from the Anthropic Console. Pick the first one if you are a human writing code daily. Pick the second one only if you have a specific reason &#8212; Bedrock, Vertex, headless CI, Corporate use and that kind of thing.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8FrP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8FrP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png 424w, https://substackcdn.com/image/fetch/$s_!8FrP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png 848w, https://substackcdn.com/image/fetch/$s_!8FrP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png 1272w, https://substackcdn.com/image/fetch/$s_!8FrP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8FrP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png" width="1456" height="545" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:545,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88901,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/194608696?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8FrP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png 424w, https://substackcdn.com/image/fetch/$s_!8FrP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png 848w, https://substackcdn.com/image/fetch/$s_!8FrP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png 1272w, https://substackcdn.com/image/fetch/$s_!8FrP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc16f3e4-b9f9-4a45-b426-c2068d12456f_1684x630.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Pricing reality check:</strong> Max plan is $100 a month for individuals, $200 a month for teams. A typical 30-to-60-minute Claude Code session costs roughly $0.50 to $3.00 on the API. Do that math for yourself. 
If you are coding even a few hours a day, Max is the cheaper option, and it gives you Opus access without rate-limit anxiety.</p><p>If you want my full breakdown of how the AI subscriptions compare across coding, research, and general use, <a href="https://www.mlwhiz.com/p/which-ai-subscription-is-actually">I broke it down here</a>. </p><p>Once you are in, run <code>/status</code> to see which settings are active and how many tokens you have used up. We will come back to that command when things get weird.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7PAA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7PAA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png 424w, https://substackcdn.com/image/fetch/$s_!7PAA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png 848w, https://substackcdn.com/image/fetch/$s_!7PAA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!7PAA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7PAA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png" width="1456" height="826" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:826,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56688,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/194608696?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7PAA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png 424w, https://substackcdn.com/image/fetch/$s_!7PAA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png 848w, 
https://substackcdn.com/image/fetch/$s_!7PAA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!7PAA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b62c4f3-a6f1-4839-8883-08999c133ef5_1776x1008.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>3. CLAUDE.md &#8212; The One File You Have to Write</h2><p>If you only learn one thing from this entire post, learn this.</p><p><code>CLAUDE.md</code> is a markdown file at the root of your project. Every time you start Claude Code in that directory, the contents of this file get loaded into the conversation automatically. Think of this file as your project&#8217;s onboarding document &#8212; except the onboardee shows up every single session and reads it cover to cover.</p><p>This is where you tell Claude things like &#8594; what the project does, which command runs the tests, what the build tool is, conventions the team follows, paths it should never touch, and the gotchas.</p><p>Here is a real-world <code>CLAUDE.md</code> for a Python data science project, slightly trimmed:</p>
      <p>
          <a href="https://www.mlwhiz.com/p/i-use-claude-code-every-day-heres">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MLWhiz Weekly AI/ML Newsletter # 4]]></title><description><![CDATA[The week AI buyers became AI owners &#8212; and a lot of people decided they&#8217;re done with agents.]]></description><link>https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-4</link><guid isPermaLink="false">https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-4</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Tue, 21 Apr 2026 22:02:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZEwF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZEwF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZEwF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png 424w, https://substackcdn.com/image/fetch/$s_!ZEwF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png 848w, https://substackcdn.com/image/fetch/$s_!ZEwF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png 1272w, https://substackcdn.com/image/fetch/$s_!ZEwF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZEwF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png" width="1410" height="804" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99270,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/194950920?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZEwF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZEwF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png 848w, https://substackcdn.com/image/fetch/$s_!ZEwF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png 1272w, https://substackcdn.com/image/fetch/$s_!ZEwF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc173bef8-6812-466c-83b1-8a7f8e5045dd_1410x804.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>&#127942; Story of the Week: OpenAI Just Bought 10% of Cerebras</h2><p>For two years, I&#8217;ve been hearing the same question in every infrastructure conversation: how does the Nvidia monopoly end?</p><p>This week, we got the answer. And it&#8217;s not what anyone predicted.</p><p>It wasn&#8217;t a DOJ antitrust case. It was a procurement contract &#8212; but a procurement contract structured like nothing we&#8217;ve ever seen in this industry.</p><p><a href="https://techcrunch.com/2026/04/18/ai-chip-startup-cerebras-files-for-ipo/">Cerebras filed its S-1 on Friday</a>. But buried inside was a deal between OpenAI and Cerebras that is pretty hard to believe.</p><p>Here&#8217;s what OpenAI committed to:</p><ul><li><p><strong>$20+ billion</strong> in chip spending through 2028</p></li><li><p><strong>750 MW</strong> of capacity, with an option to expand to <strong>2 GW</strong></p></li><li><p>A <strong>$1 billion loan</strong> to OpenAI from Cerebras at 6% interest</p></li></ul><p>In exchange, OpenAI got about <strong>10% of Cerebras</strong> post-IPO.</p><p>Read that again. The customer got equity in the supplier. The customer also got a billion-dollar loan from the supplier.</p><p><em><strong>OpenAI is now Cerebras&#8217;s biggest customer, biggest creditor, and one of its biggest shareholders</strong></em>. People are calling it &#8220;<em><strong>circular&#8230;</strong></em></p>
      <p>
          <a href="https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-4">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[From RNNs to Transformers: Building Sequential Recommenders (Part 1)]]></title><description><![CDATA[RecSys Series Part 9a: Implementing GRU4Rec and SASRec on Steam Games &#8212; with production deployment patterns]]></description><link>https://www.mlwhiz.com/p/rnns-to-transformers-sequential-recommenders</link><guid isPermaLink="false">https://www.mlwhiz.com/p/rnns-to-transformers-sequential-recommenders</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Sat, 18 Apr 2026 10:56:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dabI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c923b5c-fcfc-4cbc-8ba3-6de7f243c698_1415x993.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every tech revolution follows the same pattern. First, we solve the problem one way. Then, we realize we&#8217;ve been solving the <em>wrong</em> problem.</p><p><strong>Natural language processing:</strong> we spent a decade on classification (sentiment, NER, QA as pick-the-right-answer). </p><p><strong>Then GPT said:</strong> generation subsumes classification. Just generate the output.</p><p>Recommendation systems are having their moment right now. We spent years building the pipeline that would retrieve 10K candidates with Two-Tower, score 1K with a ranker, re-rank the top 100. <em><strong>But what if the recommender could just generate the next item directly?</strong></em> That&#8217;s where this series is headed.</p><p>But you can&#8217;t understand the generative revolution without understanding what came before it. </p><p>In this two-part post, we&#8217;ll trace the full evolution: <em><strong>how RNNs first cracked sequential recommendation, how Transformers took over, and ultimately how generative models are rewriting the rules entirely.</strong></em></p><p>This is <strong>Part 1</strong> &#8212; covering </p><ul><li><p>GRU4Rec (2016), </p></li><li><p>SASRec (2018), </p></li><li><p>the BERT4Rec controversy, and </p></li><li><p>production deployment patterns. </p></li></ul><p>Part 2 will cover Semantic IDs, TIGER, HSTU, and who&#8217;s deploying generative recommenders in production today.</p><p>Let&#8217;s dive in!</p><div><hr></div><h2>1. 
The Sequential Problem: Why Order Matters</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dWff!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dWff!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png 424w, https://substackcdn.com/image/fetch/$s_!dWff!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png 848w, https://substackcdn.com/image/fetch/$s_!dWff!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png 1272w, https://substackcdn.com/image/fetch/$s_!dWff!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dWff!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png" width="1456" height="399" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:399,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Static CF vs Sequential Recommendation&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Static CF vs Sequential Recommendation" title="Static CF vs Sequential Recommendation" srcset="https://substackcdn.com/image/fetch/$s_!dWff!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png 424w, https://substackcdn.com/image/fetch/$s_!dWff!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png 848w, https://substackcdn.com/image/fetch/$s_!dWff!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png 1272w, https://substackcdn.com/image/fetch/$s_!dWff!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69d139ac-6af5-447f-a16b-e33431f33cac_2076x569.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let&#8217;s say that you just finished watching <em>Inception</em>. Netflix recommends <em>Interstellar</em>. You watch it. Next up: <em>Arrival</em>.</p><p>As you can see this watching order is not random. This is not just &#8220;you like sci-fi&#8221; and so are watching sci-fi movies. </p><p>There&#8217;s a <strong>trajectory here</strong>. The recommender is following your path through Christopher Nolan&#8217;s mind-bending sci-fi catalog &#8212; what you watched <em>second might</em> change what you should see <em>third</em>.</p><p>In <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">the first post of this series, we covered collaborative filtering</a>, which treats user history as an unordered matrix &#8212; a bag of items. So, if you watched <em>[Inception, Interstellar, Arrival]</em>, traditional CF treats that the same as <em>[Arrival, Inception, Interstellar]</em>. But the order you watched them in tells you something completely different about what to recommend next.</p><p>Sequential models fixed that. 
They learned to predict not just &#8220;what you might like&#8221; but &#8220;what comes next.&#8221;</p><h3>Formalising the Problem</h3><p>Given a sequence of items a user has interacted with:</p><p style="text-align: center;"><strong>[i&#8321;, i&#8322;, i&#8323;, ..., i&#8345;]</strong></p><p>Predict the next item: <strong>i&#8345;&#8330;&#8321;</strong></p><p>This formulation applies across domains: </p><ul><li><p>E-commerce: product browsing &#8594; purchase prediction </p></li><li><p>Streaming: watch history &#8594; next video </p></li><li><p>Music: listening sequence &#8594; next song </p></li><li><p>News: reading pattern &#8594; next article</p></li></ul><h3>The Benchmark: Steam Games Dataset</h3><p>For this post, we&#8217;ll use the <strong>Steam Games</strong> dataset &#8212; a rich gaming interaction dataset from UCSD&#8217;s repository for building our models: </p><ul><li><p><strong>67,287 users</strong> (raw) &#8594; <strong>56,808 users</strong> (after 5-core filtering) </p></li><li><p><strong>32,133 games</strong> (raw) &#8594; <strong>6,382 games</strong> (after 5-core filtering) </p></li><li><p><strong>2,235,453 interactions</strong> (playtime &gt; 1 hour) </p></li><li><p>Average sequence length: 39.4 games (median: 26) </p></li></ul><p><strong>5-core filtering</strong> is a technique that removes all users and items with fewer than 5 interactions, applied iteratively until every remaining user and item has at least 5. It&#8217;s a standard preprocessing step in RecSys research to eliminate extreme cold-start cases (users who tried one game, games nobody played) that add noise without enough signal to learn from.</p><p>Now let&#8217;s see how different architectures tackle sequential prediction.</p><div><hr></div><h2>2. GRU4Rec &#8212; When RNNs Met Recommendations (2016)</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hbkG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hbkG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png 424w, https://substackcdn.com/image/fetch/$s_!hbkG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png 848w, https://substackcdn.com/image/fetch/$s_!hbkG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png 1272w, https://substackcdn.com/image/fetch/$s_!hbkG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hbkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png" width="306" height="931.3727506426735" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2368,&quot;width&quot;:778,&quot;resizeWidth&quot;:306,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;GRU4Rec Architecture&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GRU4Rec Architecture" title="GRU4Rec Architecture" srcset="https://substackcdn.com/image/fetch/$s_!hbkG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png 424w, https://substackcdn.com/image/fetch/$s_!hbkG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png 848w, https://substackcdn.com/image/fetch/$s_!hbkG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png 1272w, https://substackcdn.com/image/fetch/$s_!hbkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88f6a116-966a-457b-b886-a84994ca6ad0_778x2368.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In 2016, Gravity R&amp;D published &#8220;<a href="https://arxiv.org/abs/1511.06939">Session-based Recommendations with Recurrent Neural Networks</a>&#8221; at ICLR. First major work applying RNNs to sequential recommendation. It dominated the field for nearly two years.</p>
      <p>
          <a href="https://www.mlwhiz.com/p/rnns-to-transformers-sequential-recommenders">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MLWhiz Weekly AI/ML Newsletter # 3]]></title><description><![CDATA[Here is what happened this week.]]></description><link>https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-3</link><guid isPermaLink="false">https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-3</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Mon, 13 Apr 2026 04:35:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5E41!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5E41!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5E41!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png 424w, https://substackcdn.com/image/fetch/$s_!5E41!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png 848w, https://substackcdn.com/image/fetch/$s_!5E41!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png 1272w, https://substackcdn.com/image/fetch/$s_!5E41!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5E41!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png" width="1410" height="804" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:99237,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/193923335?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5E41!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png 424w, 
https://substackcdn.com/image/fetch/$s_!5E41!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png 848w, https://substackcdn.com/image/fetch/$s_!5E41!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png 1272w, https://substackcdn.com/image/fetch/$s_!5E41!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3172ddc0-460d-49b7-8a6e-236ee4004188_1410x804.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>&#127942; Story of the Week: The AI Stack Fragments &#8212; From Silicon to Society</h2><p>This was the week the AI industry&#8217;s single-vendor era officially ended. Not in one dramatic announcement, but through a cascade of moves at every layer of the stack that collectively redrew the competitive map.</p><p><strong>Start at the bottom:</strong> Intel surged 4.2% on Monday after confirming its participation in Elon Musk&#8217;s Terafab project &#8212; the first serious attempt at a domestic US AI chip fab. The same day, reports confirmed that <a href="https://blog.mean.ceo/new-ai-model-releases-news-april-2026/">DeepSeek is building V4 entirely on Huawei Ascend 950PR chips</a>, fully decoupling from Nvidia. And Broadcom jumped 6.1% on expanded TPU deals with both Google and Anthropic. In a single session, the market priced in four distinct AI chip supply chains: Nvidia/TSMC (incumbent), Google/Broadcom TPUs, Intel/Terafab (domestic US), and Huawei/Ascend (China). The telling number: Nvidia fell 1.6% while the rest of the AI chip ecosystem rallied. SEMI data confirmed the investment cycle is real &#8212; global chip equipm&#8230;</p>
      <p>
          <a href="https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-3">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Your Ranking Model Is Right. Your Recommendations Are Wrong]]></title><description><![CDATA[RecSys Series Part 8: How diversity, freshness, and business constraints turn a ranked list into a product-ready feed]]></description><link>https://www.mlwhiz.com/p/reranking-recsys-diversity-freshness</link><guid isPermaLink="false">https://www.mlwhiz.com/p/reranking-recsys-diversity-freshness</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Sat, 11 Apr 2026 22:10:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oaLi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a4f8c65-5956-4c62-9f94-9fd7b7ce3894_2747x2099.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here&#8217;s something they don&#8217;t teach you in ML courses: </p><p>A perfectly relevant recommendation list is usually a terrible one.</p><p>You spend months training a ranking model. Features, architectures, multi-task objectives &#8212; the works. Then the product team walks in: &#8220;Can you make sure we don&#8217;t show 5 horror movies in a row? And boost new releases? Oh, and reserve slot 3 for promoted content.&#8221;</p><p>Each request costs you relevance. The question isn&#8217;t <em>whether</em> to spend &#8212; it&#8217;s how much.</p><p>Think of it as a budget. <a href="https://www.mlwhiz.com/p/from-candidates-to-clicks-the-engineering">Your ranking model gives you relevance scores</a> for every item. Re-ranking is the art of spending that relevance wisely &#8212; trading some accuracy for diversity, freshness, fairness, and business value.</p><div class="pullquote"><p>A perfectly relevant recommendation list is usually a terrible one.</p></div><p>This is Part 8 of the RecSys for MLEs series. We&#8217;ve covered <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">the fundamentals</a>, <a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">the evolution from CF to deep learning</a>, <a href="https://www.mlwhiz.com/p/the-recommendation-engine-under-the">the 3-stage funnel</a> where I first introduced re-ranking as the &#8220;business layer,&#8221; <a href="https://www.mlwhiz.com/p/building-youtube-scale-recommendation">two-tower retrieval</a>, <a href="https://www.mlwhiz.com/p/vector-search-at-scale-the-missing">vector search</a>, <a href="https://www.mlwhiz.com/p/from-candidates-to-clicks-the-engineering">the ranking layer</a>, and <a href="https://www.mlwhiz.com/p/cold-start-problem-recsys-modern-approaches">the cold start problem</a>.</p><p>Today, we&#8217;re opening up that final layer. 
Here&#8217;s what we&#8217;ll cover:</p><ul><li><p><strong>The Set Problem</strong> &#8594; Why sorting by relevance produces bad recommendations</p></li><li><p><strong>Diversity</strong> &#8594; From dedup rules to Determinantal Point Processes (YouTube&#8217;s production system)</p></li><li><p><strong>Calibration</strong> &#8594; Matching your recommendations to the user&#8217;s taste distribution</p></li><li><p><strong>Freshness</strong> &#8594; Getting new content into the feed without wrecking relevance</p></li><li><p><strong>Business Constraints</strong> &#8594; The product rules that shape the final feed</p></li><li><p><strong>Multi-Objective Re-Ranking</strong> &#8594; Combining everything: scalarization, constraints, and 2D layouts</p></li><li><p><strong>The Practitioner&#8217;s Playbook</strong> &#8594; When to use what, and the pitfalls that trip everyone up</p></li></ul><p>Let&#8217;s dive in!</p><div><hr></div><h2>1. Why Re-Ranking Exists &#8212; The Set Problem</h2><p>Your ranking model scores items independently. Item A gets 0.92. Item B gets 0.89. Item C gets 0.87. Sort descending. Done.</p><p>Except it&#8217;s not done. Because when you look at your top-10 list, items A, B, and C are all psychological thrillers from the same director. Items D through G are also thrillers. The model did exactly what you asked &#8212; it found the most relevant items. But the resulting <em>set</em> is terrible.</p><div class="pullquote"><p>This is what I call the <strong>set problem</strong>: optimizing each item independently doesn&#8217;t optimize the set.</p></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MRsw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MRsw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png 424w, https://substackcdn.com/image/fetch/$s_!MRsw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png 848w, https://substackcdn.com/image/fetch/$s_!MRsw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png 1272w, https://substackcdn.com/image/fetch/$s_!MRsw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MRsw!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png" width="1200" height="262.9120879120879" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:319,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;RecSys Pipeline: Retrieval &#8594; Ranking &#8594; Re-Ranking &#8594; Serving&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="RecSys Pipeline: Retrieval &#8594; Ranking &#8594; Re-Ranking &#8594; Serving" title="RecSys Pipeline: Retrieval &#8594; Ranking &#8594; Re-Ranking &#8594; Serving" srcset="https://substackcdn.com/image/fetch/$s_!MRsw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png 424w, https://substackcdn.com/image/fetch/$s_!MRsw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png 848w, https://substackcdn.com/image/fetch/$s_!MRsw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png 1272w, https://substackcdn.com/image/fetch/$s_!MRsw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95d35455-cbcf-4dea-bce1-a721c95fa282_2847x623.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here&#8217;s how to think about it. Ranking answers: &#8220;How relevant is this item to this user?&#8221; Re-ranking answers a harder question: &#8220;What&#8217;s the best <em>collection</em> of items to show this user?&#8221;</p><p>The input to re-ranking is typically 100-500 scored items from your ranker. The output is the final 10-50 items in their display order. And the constraints are everything your ranking model doesn&#8217;t know about: diversity requirements, content freshness, promotional obligations, fairness targets, and a dozen product-specific rules.</p><p>I remember a team meeting where someone pulled up our top-10 list for a test user: ten nearly identical sci-fi action movies. &#8220;The model is working perfectly,&#8221; someone said. Technically correct &#8212; and completely useless. The top-10 wasn&#8217;t a recommendation; it was a redundancy report.</p><p>Netflix does this at massive scale &#8212; 15,000+ shows, nearly 300 million users, and a homepage that needs to feel both personally relevant and excitingly diverse. Their page construction system doesn&#8217;t just rank shows; it considers the <em>composition</em> of each row and the relationships <em>between</em> rows.</p><p>Here&#8217;s the key mental model I want you to hold for this entire post: <strong>re-ranking is spending a relevance budget.</strong> Your ranking model gives you a relevance score for each item. That score is currency. Every diversity constraint, every freshness boost, every business rule <em>costs</em> some of that relevance. 
The art is deciding how much to spend on each.</p><p>Let&#8217;s look at the algorithms that make this possible.</p><div><hr></div><h2>2. Diversity &#8212; From Rules to Determinantal Point Processes</h2><p>Diversity is the most visible re-ranking objective. When a user sees 10 items from the same genre, something has clearly gone wrong. But &#8220;add diversity&#8221; is easy to say and surprisingly hard to get right. There are three levels of sophistication:</p><h3>Level 1: Rule-Based Dedup</h3><p>The simplest approach is just writing rules:</p><ul><li><p>&#8220;<em>No more than 2 items from the same category in the top 5</em>&#8221;</p></li><li><p>&#8220;<em>No two items from the same creator in a row</em>&#8221;</p></li><li><p>&#8220;<em>At least 1 item from &#8216;trending&#8217; in top 3</em>&#8221;</p></li></ul><p>Before YouTube deployed their DPP system (we will talk about this), they used exactly these kinds of heuristics: <strong>fuzzy deduplication</strong> (removing items too similar to ones already selected) and <strong>sliding window constraints</strong> (at most n out of every m items from the same type).</p><p>Rules are fast, interpretable, and easy to debug. But they&#8217;re also brittle. They can&#8217;t capture nuanced notions of similarity &#8212; &#8220;these are both thrillers&#8221; is a rule; &#8220;these have similar emotional arcs&#8221; is not. And they compose badly: stack 5 rules on top of each other and you&#8217;ll find they frequently conflict.</p><h3>Level 2: Maximal Marginal Relevance (MMR)</h3><p><strong>MMR</strong> is the first real algorithmic approach to diversity. It was originally proposed for document retrieval, but it maps perfectly to recommendations.</p><p>The idea is beautifully simple. Instead of selecting items by relevance alone, you select greedily: at each step, pick the item that best balances relevance with <em>dissimilarity</em> to items you&#8217;ve already selected.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr_rerank(relevance_scores, item_embeddings, lambda_param=0.5, top_k=10):
    """
    Maximal Marginal Relevance re-ranking.

    Args:
        relevance_scores: array of shape (N,) &#8212; ranking model scores
        item_embeddings: array of shape (N, d) &#8212; item feature vectors
        lambda_param: trade-off between relevance (1.0) and diversity (0.0)
        top_k: number of items to select

    Returns:
        selected: list of indices in selection order
    """
    n_items = len(relevance_scores)
    sim_matrix = cosine_similarity(item_embeddings)

    selected = []
    candidates = list(range(n_items))

    for _ in range(min(top_k, n_items)):
        best_score = -np.inf
        best_idx = None

        for idx in candidates:
            # Relevance term
            rel = relevance_scores[idx]

            # Max similarity to any already-selected item
            if selected:
                max_sim = max(sim_matrix[idx][s] for s in selected)
            else:
                max_sim = 0

            # MMR score: balance relevance vs. novelty
            score = lambda_param * rel - (1 - lambda_param) * max_sim

            if score &gt; best_score:
                best_score = score
                best_idx = idx

        selected.append(best_idx)
        candidates.remove(best_idx)

    return selected</code></pre></div><p>The <code>lambda_param</code> is your knob. At &#955;=1.0, MMR is pure relevance (no diversity). At &#955;=0.0, it&#8217;s pure diversity (ignores relevance). In practice, values between 0.5 and 0.7 work well.</p><p>MMR&#8217;s complexity is O(Nk) per selection, which is fast. But it has a fundamental limitation: it&#8217;s <strong>myopic</strong>. At each step, it only compares the candidate to items already selected. It never evaluates the global quality of the final set.</p><h3>Level 3: Determinantal Point Processes (DPP)</h3><p>This is where things get interesting.</p><p>A <strong>DPP</strong> is a probabilistic model that assigns higher probability to subsets of items that are both high-quality AND diverse. Unlike MMR&#8217;s pairwise comparisons, a DPP evaluates the <em>entire subset at once</em>.</p><p><strong>Here&#8217;s the intuition:</strong> Imagine each item as an arrow in a high-dimensional space. The arrow&#8217;s length represents quality (the ranking model&#8217;s score). The arrow&#8217;s direction represents the item&#8217;s characteristics (its <a href="https://www.mlwhiz.com/p/vector-search-at-scale-the-missing">embedding</a>). A DPP selects the set of arrows that spans the maximum volume &#8212; you want arrows that are both long (high quality) AND point in different directions (diverse).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!49pS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!49pS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png 424w, https://substackcdn.com/image/fetch/$s_!49pS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png 848w, https://substackcdn.com/image/fetch/$s_!49pS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png 1272w, https://substackcdn.com/image/fetch/$s_!49pS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!49pS!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png" width="1200" height="325.54945054945057" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:395,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;DPP Volume Visualization: Quality &#215; 
Diversity&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="DPP Volume Visualization: Quality &#215; Diversity" title="DPP Volume Visualization: Quality &#215; Diversity" srcset="https://substackcdn.com/image/fetch/$s_!49pS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png 424w, https://substackcdn.com/image/fetch/$s_!49pS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png 848w, https://substackcdn.com/image/fetch/$s_!49pS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png 1272w, https://substackcdn.com/image/fetch/$s_!49pS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80ea4eda-d247-49bb-9679-ba9c9476add9_2552x693.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Mathematically, we define a kernel matrix <strong>L</strong> where each entry captures both quality and similarity:</p><p><code>            L[i,j] = q_i &#215; q_j &#215; similarity(i,j)</code></p><p>where <code>q_i</code> is item i&#8217;s quality score (from your ranker) and <code>similarity(i,j)</code> is the cosine similarity between item embeddings. The probability of selecting a subset S is proportional to <code>det(L_S)</code> &#8212; the determinant of the submatrix formed by those items, which is exactly the volume of the parallelogram those item vectors span.</p><p>That&#8217;s abstract. 
Let me walk through it with three movies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ke1R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ke1R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png 424w, https://substackcdn.com/image/fetch/$s_!ke1R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png 848w, https://substackcdn.com/image/fetch/$s_!ke1R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png 1272w, https://substackcdn.com/image/fetch/$s_!ke1R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ke1R!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png" width="1200" height="1398.4810126582279" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1381,&quot;width&quot;:1185,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;DPP Kernel in Action: Three Movies, Pick Two&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="DPP Kernel in Action: Three Movies, Pick Two" title="DPP Kernel in Action: Three Movies, Pick Two" srcset="https://substackcdn.com/image/fetch/$s_!ke1R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png 424w, https://substackcdn.com/image/fetch/$s_!ke1R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png 848w, https://substackcdn.com/image/fetch/$s_!ke1R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png 1272w, https://substackcdn.com/image/fetch/$s_!ke1R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a9ceb93-d123-4723-8ec1-a280630196ed_1185x1381.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>
      <p>
          <a href="https://www.mlwhiz.com/p/reranking-recsys-diversity-freshness">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[3 Modern Approaches to Solving Cold Start in RecSys]]></title><description><![CDATA[Contextual bandits, meta-learning, and LLMs &#8212; how Spotify, TikTok, and YouTube handle new users and items. The practitioner's guide to cold start.]]></description><link>https://www.mlwhiz.com/p/cold-start-problem-recsys-modern-approaches</link><guid isPermaLink="false">https://www.mlwhiz.com/p/cold-start-problem-recsys-modern-approaches</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Wed, 25 Mar 2026 03:14:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0e54ac48-da66-4c0e-8cb5-3097859655c0_1808x1379.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bz_I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bz_I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png 424w, https://substackcdn.com/image/fetch/$s_!Bz_I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png 848w, https://substackcdn.com/image/fetch/$s_!Bz_I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png 1272w, https://substackcdn.com/image/fetch/$s_!Bz_I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bz_I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png" width="1456" height="1111" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1111,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The Cold Start Problem: three types of cold start (New User, New Item, New System) mapped to three modern solutions (Contextual Bandits, Meta-Learning, LLMs) leading to personalized recommendations&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Cold Start Problem: three types of cold start (New User, New Item, New System) mapped to three modern solutions (Contextual Bandits, Meta-Learning, LLMs) leading to personalized recommendations" title="The Cold Start Problem: three types of cold start (New User, New Item, New System) mapped to three modern solutions (Contextual 
Bandits, Meta-Learning, LLMs) leading to personalized recommendations" srcset="https://substackcdn.com/image/fetch/$s_!Bz_I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png 424w, https://substackcdn.com/image/fetch/$s_!Bz_I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png 848w, https://substackcdn.com/image/fetch/$s_!Bz_I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png 1272w, https://substackcdn.com/image/fetch/$s_!Bz_I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44523b7f-79aa-4e11-96c6-c4e9ce4fd331_1808x1379.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A user signs up for your streaming platform. They&#8217;ve never watched anything. They&#8217;ve never rated anything. They&#8217;ve never even scrolled. And your recommendation engine &#8212; the same engine that serves 200 million personalized feeds per day &#8212; stares at this blank profile and essentially says: &#8220;I have no idea who you are.&#8221;</p><p>This is the <strong>Cold Start Problem</strong>, and I&#8217;ve been fighting it for the better part of four years &#8212; at Meta, where new creators needed to find their audience from day one, and in the streaming world, where every new user expects a personalized experience the moment they log in. It&#8217;s the problem that&#8217;s been discussed on HackerNews since 2010, has a 400-page book written about it (Andrew Chen&#8217;s <em>The Cold Start Problem</em>), and STILL doesn&#8217;t have a clean answer.</p><p>This is the next installment in <a href="https://www.mlwhiz.com/t/recsys">my RecSys series</a>. 
We&#8217;ve covered the <a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">algorithmic evolution</a> of recommendation systems, <a href="https://www.mlwhiz.com/p/building-youtube-scale-recommendation">built two-tower retrieval from scratch</a>, <a href="https://www.mlwhiz.com/p/from-candidates-to-clicks-the-engineering">dissected the ranking laye</a>r &#8212; all assuming we have user data to work with. Today we drop that assumption entirely.</p><p>Here&#8217;s what we&#8217;ll cover:</p><ul><li><p><strong>The 3 types of cold start</strong> &#8212; they&#8217;re different problems with different solutions</p></li><li><p><strong>Classical approaches</strong> &#8212; the baselines everyone ships first, and where they hit a ceiling</p></li><li><p><strong>3 modern frontiers</strong>: contextual bandits, meta-learning (MAML, prototypical networks, CMML), and LLMs (feature extraction, reasoning, data generation)</p></li><li><p><strong>How Spotify, TikTok, and YouTube actually solve this</strong> in production &#8212; with specific engineering details</p></li><li><p><strong>A decision framework</strong> &#8212; so you know which approach fits your system, your data, and your budget</p></li></ul><p>This is meant to be the definitive practitioner&#8217;s guide. Let&#8217;s dive in!</p><div><hr></div><h2>The Three Faces of Cold Start</h2><p>Before we jump into solutions, let&#8217;s be precise about what we&#8217;re solving. &#8220;Cold start&#8221; isn&#8217;t one problem &#8212; it&#8217;s three distinct problems, and confusing them is one of the most common mistakes I see engineers make.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TRap!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TRap!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png 424w, https://substackcdn.com/image/fetch/$s_!TRap!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png 848w, https://substackcdn.com/image/fetch/$s_!TRap!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png 1272w, https://substackcdn.com/image/fetch/$s_!TRap!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TRap!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png" width="560" height="714.2307692307693" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1857,&quot;width&quot;:1456,&quot;resizeWidth&quot;:560,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The three types of cold start: New User, New Item, and New System &#8212; with examples and severity ratings&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The three types of cold start: New User, New Item, and New System &#8212; with examples and severity ratings" title="The three types of cold start: New User, New Item, and New System &#8212; with examples and severity ratings" srcset="https://substackcdn.com/image/fetch/$s_!TRap!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png 424w, https://substackcdn.com/image/fetch/$s_!TRap!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png 848w, https://substackcdn.com/image/fetch/$s_!TRap!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png 1272w, https://substackcdn.com/image/fetch/$s_!TRap!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b7ff65-de91-4e1a-909a-9d582cd457cb_1478x1885.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>New User Cold Start</h3><p>A user signs up. Zero watch history. Zero ratings. Zero clicks. Your collaborative filtering model &#8212; the one that works beautifully for your 50 million existing users &#8212; is completely blind. 
It relies on the user-item interaction matrix <strong>R</strong> where entry <strong>R(u, i)</strong> represents user <strong>u</strong>&#8216;s interaction with item <strong>i</strong>. For a new user <strong>u_new</strong>, the entire row <strong>R(u_new, :)</strong> is empty &#8212; a zero vector.</p><p>This means every technique that depends on finding similar users (nearest-neighbor CF), decomposing the interaction matrix (matrix factorization), or learning user embeddings from behavior (deep learning approaches) &#8212; the entire <a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">algorithmic evolution</a> we covered in Part 2 of this series &#8212; has literally nothing to work with. The new user is a point with no coordinates in preference space.</p><p>Mathematically, collaborative filtering predicts a rating as:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">r&#770;(u, i) = &#956; + q_i^T &#183; p_u
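
# Worked toy example (illustrative numbers, not from the post):
#   mu = 3.4 (global mean rating), q_i^T . p_u = 0.7, so r_hat(u, i) = 4.1
#   For a brand-new user, p_u has never been learned (effectively zero), so
#   q_i^T . p_u = 0 and every item collapses to r_hat = mu, i.e. no personalization.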
</code></pre></div><p>where <strong>p_u</strong> is the user&#8217;s latent factor vector and <strong>q_i</strong> is the item&#8217;s latent factor vector (<a href="https://ieeexplore.ieee.org/document/5197422">Koren et al., 2009, </a><em><a href="https://ieeexplore.ieee.org/document/5197422">&#8220;Matrix Factorization Techniques for Recommender Systems&#8221;</a></em><a href="https://ieeexplore.ieee.org/document/5197422">, IEEE Computer</a>). For a new user, <strong>p_u</strong> is undefined &#8212; there are no interactions to learn it from. You can initialize it randomly, but then your predictions are random noise.</p><h3>New Item Cold Start</h3><p>A video gets uploaded. A product gets listed. A song gets released. No one has interacted with it yet. Even if your model is phenomenal at scoring items with behavioral data, this item has zero behavioral signal &#8212; no clicks, no watches, no purchases. Its column <strong>R(:, i_new)</strong> in the interaction matrix is all zeros.</p><p>This creates a vicious cycle that I&#8217;ve seen destroy content platforms: the item is invisible to your <a href="https://www.mlwhiz.com/p/the-recommendation-engine-under-the">retrieval-ranking pipeline</a> because it has no engagement data. Because it&#8217;s invisible, it gets no exposure. Because it gets no exposure, it accumulates no engagement data. The item is trapped in a black hole of non-existence.</p><p>This isn&#8217;t an abstract concern &#8212; it directly affects creator retention on any content platform. If a creator uploads a show and it gets zero impressions for a week, that creator doesn&#8217;t come back. And losing the creator means losing not just that show, but everything they would have made in the future.</p><h3>New System Cold Start</h3><p>You&#8217;re launching a recommendation system from scratch. No users with behavioral data, no items with engagement history, no interaction matrix at all. <strong>R</strong> is entirely empty. This is the rarest variant, but it&#8217;s also the one that every startup and every new product line faces.</p><p>Here&#8217;s the uncomfortable truth that most blog posts skip: in production, the <strong>new item</strong> problem is often harder and more damaging than the new user problem. New users at least have <em>some</em> context you can exploit (device, location, time). New items have nothing but their own metadata. <em><strong>And the business cost of item-side cold start &#8212; creator churn, catalog invisibility, content deserts &#8212; compounds far faster than user-side cold start.</strong></em></p><p>There&#8217;s also a regime between cold and warm that&#8217;s arguably even more important in practice: <strong>warm start</strong> &#8212; when you have 1-5 interactions. Not zero, but not enough for your models to be confident. This is where your system spends most of its time for the long tail of users and items, and it&#8217;s where the modern approaches we&#8217;ll cover really shine.</p><div><hr></div><h2>Classical Solutions (And Why They&#8217;re Not Enough)</h2><p>Every recommendation system starts here. These are the baseline approaches &#8212; they work, they ship fast, and they&#8217;re better than showing nothing. 
But they all hit a ceiling, and understanding exactly <em>where</em> that ceiling is tells you when to invest in the modern approaches.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qsKy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qsKy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png 424w, https://substackcdn.com/image/fetch/$s_!qsKy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png 848w, https://substackcdn.com/image/fetch/$s_!qsKy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png 1272w, https://substackcdn.com/image/fetch/$s_!qsKy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qsKy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png" width="728" height="924.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1849,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Classical cold-start approaches decision tree: choose based on scenario type, available demographics, item metadata, and friction tolerance&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="Classical cold-start approaches decision tree: choose based on scenario type, available demographics, item metadata, and friction tolerance" title="Classical cold-start approaches decision tree: choose based on scenario type, available demographics, item metadata, and friction tolerance" srcset="https://substackcdn.com/image/fetch/$s_!qsKy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png 424w, https://substackcdn.com/image/fetch/$s_!qsKy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png 848w, https://substackcdn.com/image/fetch/$s_!qsKy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qsKy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f226471-e695-45bb-bae6-f1b4d61c5648_1806x2294.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Popularity-Based Ranking</h3><p><strong>The simplest possible move:</strong> show new users whatever is trending right now. It&#8217;s the &#8220;most popular dish&#8221; approach &#8212; safe, zero personalization required, trivial to implement. You&#8217;re essentially replacing the personalized score <strong>r&#770;(u, i)</strong> with the global popularity score:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">score(i) = &#931;_u R(u, i)  or  score(i) = count(clicks on i in last 24h)
</code></pre></div><p>The obvious problem: you&#8217;ll never discover that this specific user hates action movies and loves documentaries. Everyone gets the same feed, and you learn nothing about individual preferences. It also creates a rich-get-richer feedback loop &#8212; popular items get shown more, get more clicks, become more popular. This is the Matthew Effect in recommendation systems(rich get richer), and it&#8217;s brutal for new content.</p><p>That said, popularity-based ranking has one underappreciated strength: it surfaces items that are <em>currently relevant</em>. A user might not know they want to watch the Oscar-nominated film that just released, but a time-decayed popularity score will surface it naturally.</p><h3>Content-Based Fallback</h3><p>Instead of using behavioral signals (which don&#8217;t exist for cold-start entities), use the item&#8217;s features directly. A movie&#8217;s genre, director, cast, plot keywords, runtime, year, language &#8212; these are all available from day one, before anyone watches it.</p><p>The basic approach: represent each item as a feature vector <strong>f_i</strong> (using TF-IDF, one-hot encoding, or pretrained embeddings), represent the user as a weighted average of the features of items they&#8217;ve interacted with:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">p_u = (1/|I_u|) &#931;_{i &#8712; I_u} w_i &#183; f_i

where I_u is the set of items the user has interacted with
and w_i is the weight for item i (1 in the basic case, or a time-decay factor)
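
example (an illustrative assumption, not from the post): exponential time decay
  w_i = exp(-lambda * age_in_days_of_interaction_i), with lambda = 0.05 per day
  an interaction from 1 day ago gets w ~ 0.95; one from 60 days ago gets w ~ 0.05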
</code></pre></div><p>Then score new items by cosine similarity:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">score(u, i) = cos(p_u, f_i) = (p_u &#183; f_i) / (||p_u|| &#183; ||f_i||)
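
# Equivalent NumPy sketch (assumes p_u and f_i are 1-D arrays of equal length):
#   score = float(np.dot(p_u, f_i) / (np.linalg.norm(p_u) * np.linalg.norm(f_i)))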
</code></pre></div><p>This is exactly how <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">content-based and collaborative filtering</a> differ at their core. Content-based doesn&#8217;t need the interaction matrix &#8212; it needs item features and at least <em>some</em> user signal.</p><p>The catch: this works far better for new items than for new users. A new item has features you can compute similarity against. A new user with zero interactions has no profile vector <strong>p_u</strong> at all &#8212; you can&#8217;t compute an average of an empty set. You&#8217;d need at least one click to get started.</p><h3>Demographic Heuristics</h3><p>Use whatever you get at signup: geo-location, device type, language, age bracket, operating system. A user signing up from Tokyo on an iPhone at 11 PM likely has different preferences than someone from Texas on a smart TV at 2 PM on a Saturday.</p><p>Formally, you&#8217;re replacing the missing behavioral profile with a demographic profile:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">p_u = g(demographics_u)
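
# One possible g, sketched with scikit-learn (an illustrative assumption, not the
# post's implementation): a shallow regressor trained on warm users to map
# signup-time demographics into the embedding space learned from behavior.
#   from sklearn.neural_network import MLPRegressor
#   g = MLPRegressor(hidden_layer_sizes=(128,), max_iter=200)
#   g.fit(warm_user_demographics, warm_user_embeddings)    # X: [n_warm, n_demo], y: [n_warm, 64]
#   p_u = g.predict(new_user_demographics.reshape(1, -1))  # new user's estimated embedding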
</code></pre></div><p>where <strong>g</strong> is a learned function (often a simple lookup table or a shallow neural network) that maps demographic features to the same embedding space as your warm users. You train <strong>g</strong> on your existing warm users &#8212; learning, for example, that users aged 25-34 in urban Japan tend to prefer anime and J-drama.</p><p>The obvious limitation: demographics are a coarse proxy for user preferences. Not every 30-year-old in Tokyo likes the same content. You&#8217;re fighting the <strong>stereotype problem</strong> &#8212; making assumptions about individuals based on group statistics. But in the zero-data regime, coarse is better than random.</p><h3>Onboarding Surveys</h3><p>&#8220;Choose 3 genres you like.&#8221; &#8220;Rate these 5 movies.&#8221; &#8220;Pick your favorite artists.&#8221; Direct, explicit preference signal that bypasses the cold-start chicken-and-egg entirely.</p><p>The catch? Every additional question increases signup friction and hurts conversion. Research from the <a href="https://baymard.com/blog/checkout-flow-average-form-fields">Baymard Institute</a> shows that each additional step beyond 3-4 in a signup flow increases abandonment significantly &#8212; and streaming onboarding is no exception. And users lie &#8212; or more precisely, they pick aspirationally rather than truthfully. (A user saying &#8221;Yes, I definitely want to watch cerebral documentaries about climate change&#8221; might proceed to binge <em>Love Island</em> for 6 hours.)</p><p>There&#8217;s a rich literature on <em>optimal</em> onboarding question selection. <a href="https://dl.acm.org/doi/10.1145/1935826.1935910">Golbandi et al. (2011, </a><em><a href="https://dl.acm.org/doi/10.1145/1935826.1935910">&#8220;Adaptive Bootstrapping of Recommender Systems Using Decision Trees&#8221;</a></em><a href="https://dl.acm.org/doi/10.1145/1935826.1935910">, WSDM)</a> showed you can use decision trees to pick the maximally informative items to show in an onboarding survey &#8212; items where the user&#8217;s response tells you the most about their latent preferences.</p><h3>Hybrid Switching</h3><p>The textbook answer: start with content-based or popularity-based recommendations, gradually switch to collaborative filtering as behavioral data accumulates. Formally:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">r&#770;(u, i) = &#945;(u) &#183; r&#770;_PB(u, i) + (1 - &#945;(u)) &#183; r&#770;_CF(u, i)
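
# One common functional form for the blending weight (illustrative, not prescriptive):
#   alpha(u) = exp(-n_interactions(u) / tau), e.g. tau = 10
#   n = 0  gives alpha = 1.00 (pure popularity/content)
#   n = 10 gives alpha = 0.37
#   n = 30 gives alpha = 0.05 (almost entirely collaborative filtering)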
</code></pre></div><p>where <strong>&#945;(u)</strong> is a function of how much data you have for user <strong>u</strong> &#8212; close to 1 for cold users (trust popularity-based), close to 0 for warm users (trust collaborative filtering).</p><p>Sounds clean in a blog post but it could be incredibly messy in production. Here&#8217;s why:</p><ul><li><p><strong>The blending weight &#945;(u) needs a functional form.</strong> Is it a step function (hard cutover at 10 interactions)? Sigmoid? Linear ramp? Each choice creates different user experiences, and there&#8217;s no universal right answer &#8212; you have to tune it per domain.</p></li><li><p><strong>Score calibration is a nightmare.</strong> Content-based or popularity scores and CF scores live on completely different scales and distributions. Naively adding them produces garbage &#8212; you need score normalization (min-max? z-score? rank-based?) that itself requires careful calibration.</p></li><li><p><strong>The transition can jar users.</strong> A user at &#945;=0.6 today and &#945;=0.3 tomorrow might see a completely different feed. Without smoothing, users experience sudden recommendation &#8220;personality shifts&#8221; that erode trust.</p></li><li><p><strong>You&#8217;re running two serving pipelines.</strong> Two models to train, two feature stores to maintain, two sets of latency budgets. The operational complexity of production recommendation systems is one of the most underappreciated challenges &#8212; and hybrid switching doubles it.</p></li></ul><h3>The Ceiling</h3><p>All of these classical approaches are basically guessing. Educated guessing, sure &#8212; but still guessing. They don&#8217;t actively <em>try</em> to learn about the user. They wait passively for data to trickle in and hope the user sticks around long enough. That&#8217;s fine for day one, maybe week one. But if your cold-start strategy is still &#8220;show popular stuff and pray&#8221; after a month, you&#8217;re leaving massive value on the table.</p><p>Here&#8217;s a simple Python sketch of the popularity + content-based fallback:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cold_start_recommend(user, items, item_features, popular_items, n=10):
    """Simple cold-start fallback: content-based if we have ANY
    signal, popularity otherwise."""

    if user.interactions:  # even 1 click gives us something
        # Build user profile from interacted item features
        profile = np.mean(
            [item_features[i] for i in user.interactions], axis=0
        )
        scores = cosine_similarity([profile], item_features)[0]
        top_items = np.argsort(scores)[::-1][:n]
        return [items[i] for i in top_items]

    # Zero interactions &#8594; fall back to popularity
    return popular_items[:n]
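
# Toy usage sketch (illustrative data; the items, features, and popularity
# ordering below are made up for demonstration, not from the post).
if __name__ == "__main__":
    from types import SimpleNamespace

    items = ["doc_a", "doc_b", "doc_c", "doc_d"]
    item_features = np.array([
        [1.0, 0.0, 0.0],   # crude genre one-hots, aligned row-by-row with `items`
        [0.9, 0.1, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
    ])
    popular_items = ["doc_c", "doc_a", "doc_d", "doc_b"]  # pre-sorted by popularity

    warm_user = SimpleNamespace(interactions=[0])   # one click, on doc_a
    cold_user = SimpleNamespace(interactions=[])    # zero interactions

    print(cold_start_recommend(warm_user, items, item_features, popular_items, n=2))
    # ['doc_a', 'doc_b']: content similar to the single click
    print(cold_start_recommend(cold_user, items, item_features, popular_items, n=2))
    # ['doc_c', 'doc_a']: pure popularity fallback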
</code></pre></div><p>This is fine for day one. But the three modern approaches that follow are what separate a good recommendation system from a great one.</p><blockquote><p><em>The rest of this post covers the three modern frontiers &#8212; contextual bandits, meta-learning, and LLMs &#8212; plus how Spotify, TikTok, and YouTube solve cold start in production, and a decision framework for choosing the right approach. Subscribe to continue reading.</em></p></blockquote>
      <p>
          <a href="https://www.mlwhiz.com/p/cold-start-problem-recsys-modern-approaches">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MLWhiz Weekly AI/ML Newsletter # 2]]></title><description><![CDATA[Here is what happened this week.]]></description><link>https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-2</link><guid isPermaLink="false">https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-2</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Mon, 23 Mar 2026 23:44:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Kda4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kda4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kda4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png 424w, https://substackcdn.com/image/fetch/$s_!Kda4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png 848w, https://substackcdn.com/image/fetch/$s_!Kda4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png 1272w, https://substackcdn.com/image/fetch/$s_!Kda4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kda4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png" width="1410" height="804" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72777,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/191924238?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kda4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png 424w, 
https://substackcdn.com/image/fetch/$s_!Kda4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png 848w, https://substackcdn.com/image/fetch/$s_!Kda4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png 1272w, https://substackcdn.com/image/fetch/$s_!Kda4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faab0015c-bd3a-4889-87dc-75a55ec262e4_1410x804.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>&#127942; Story of the Week: The Agent Platform War Has Officially Started</h2><p>For years, the AI industry has been in a model quality race. GPT vs Claude vs Gemini, benchmark after benchmark, parameter count after parameter count. This week, that race ended &#8212; and a new one began.</p><p>On Sunday, OpenAI&#8217;s CEO of Applications Fidji Simo announced what insiders are calling &#8220;Code Red&#8221;: the consolidation of ChatGPT, the Codex coding platform, and the Atlas browser into a single desktop superapp built around agentic task handling. The catalyst? Internal data showing <a href="https://www.cnbc.com/2026/03/19/openai-desktop-super-app-chatgpt-browser-codex.html">Anthropic&#8217;s enterprise market share climbing to 40%</a> while OpenAI&#8217;s fell to roughly 27%. Simo told employees they could no longer afford &#8220;side quests&#8221; &#8212; a direct shot at Sora, which briefly hit #1 in the App Store before usage flatlined.</p><p>But this isn&#8217;t just an OpenAI crisis story &#8212; it&#8217;s an industry-wide convergence. Within the same week, Meta shipped <a href="https://www.cnbc.com/2026/03/18/metas-manus-launches-desktop-app-to-bring-its-ai-agent-onto-personal-devices.html">&#8220;My Computer&#8221;</a>, a desktop agent from its $2B Manus acquisition, already integrated into Meta Ads Manager and WhatsApp Business. 
Anthropic&#8217;s Claude Dispatch &#8212; phone-to-desktop task routing &#8212; went live on Pro at $20/month. And OpenClaw crossed 210,000 GitHub stars, spawning ByteDance&#8217;s <a href="https://github.com/volcengine/OpenViking">OpenViking</a> context database (17.7K stars in one week) with persistent agent memory that cuts token costs by 95%.</p><p>The technical bet is about agentic continuity &#8212; maintaining a single context across research, coding, browsing, and execution without losing state. OpenAI&#8217;s superapp maintains context across modalities. Anthropic&#8217;s Dispatch takes the minimal approach: phone as remote control, confirmation for every action. Meta/Manus goes local-first with OS integration. OpenClaw is the open-source wild card, now with more GitHub stars than React or Linux.</p><p><em><strong>The deeper signal is that GPT-5.4, Claude Opus 4.6, and Gemini 3.1 are all &#8220;good enough.&#8221;</strong></em> The differentiation now lives above the model layer: <em><strong>who owns the surface where work happens?</strong></em> OpenAI has the consumers, Anthropic has the developers, Meta has the advertisers, and OpenClaw has the open-source community. The agent platform war will determine who keeps all of them.</p><p>For practitioners, the message is to <em><strong>design for agentic workflows from day one.</strong></em> The standalone chatbot era is ending. The winners will be those whose agents integrate most seamlessly into existing work contexts &#8212; and the next 90 days will determine which platform becomes the default.</p><div><hr></div><h2>&#129302; Models That Dropped This Week</h2><p><strong>GPT-5.4 Mini and Nano (OpenAI, March 17)</strong> &#8212; OpenAI extended the GPT-5 family downward with Mini and Nano variants targeting cost-sensitive and edge deployment. Intensifies competition with Mistral Small 4 and open-weight alternatives in the &#8220;small frontier model&#8221; category. (<a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano">source</a>)</p><p><strong>Xiaomi MiMo-V2-Pro (Xiaomi, March 22)</strong> &#8212; The mystery &#8220;Hunter Alpha&#8221; model climbing leaderboards since March 11 turned out to be <strong>from a phone maker, not an AI lab</strong>. A 1T-parameter MoE (42B active) with 1M context window, ranking 3rd on ClawEval behind only Claude Opus 4.6. Beats Claude Sonnet 4.6 at coding at 67% lower cost. Xiaomi committed $8.7B in AI spending over three years. (<a href="https://www.technology.org/2026/03/19/whos-that-ai-the-mystery-model-everyone-blamed-on-deepseek-turned-out-to-be-xiaomi/">source</a>)</p><div><hr></div><h2>&#129504; Papers That Matter</h2><p><strong>Deploying Semantic ID-based Generative Retrieval for Large-Scale Podcast Discovery at Spotify (GLIDE)</strong> &#8212; Spotify shipped the strongest published production result for generative recommendation systems. GLIDE uses semantic IDs &#8212; compact learned representations that let an LLM &#8220;generate&#8221; recommendations by outputting semantic tokens &#8212; combining instruction-following with collaborative filtering. The system handles natural language queries while delivering personalized results across a 10M+ podcast catalog.</p><p>The production numbers are exceptional: <em><strong>5.4% increase in non-habitual podcast streaming and 14.3% improvement in new-show discovery in real A/B tests</strong></em>. 
What makes this architecturally important is that it unifies search and recommendation &#8212; users get personalization from collaborative filtering plus the flexibility of arbitrary natural language queries. A companion paper, NEO, extends the approach to consolidate search, recommendation, and reasoning in a single model. (<a href="https://arxiv.org/abs/2603.17540">paper</a>)</p><p><strong>SuperKMeans: A Super Fast K-means for Indexing Vector Embeddings</strong> &#8212; An unglamorous systems paper that will actually ship. K-means clustering is the backbone of ANN vector search indexes (IVF, FAISS, ScaNN), and standard k-means on high-dimensional embeddings is slow, bottlenecking index build time for production systems. SuperKMeans reliably prunes dimensions that don&#8217;t affect cluster assignment during each iteration, computing only the distances that matter.</p><p>The results: up to 7x faster than FAISS and Scikit-Learn on CPUs, and up to 4x faster than cuVS on GPUs, with no degradation in downstream search accuracy. It&#8217;s a drop-in replacement for k-means in existing IVF/FAISS pipelines. If you&#8217;ve ever had to choose between fresh embeddings and affordable index rebuilds, this paper directly addresses that tradeoff. A 7x clustering speedup means going from daily to near-real-time index updates without proportional compute cost. (<a href="https://arxiv.org/abs/2603.20009">paper</a>)</p><p><strong>ERank: Fusing SFT and RL for Effective Text Reranking</strong> &#8212; LLM rerankers face a fundamental tradeoff: pointwise scoring is efficient but misses global ranking signals; listwise ranking captures order but is expensive at inference. ERank solves this with a two-stage approach: train the model to output fine-grained integer scores (0-10) via SFT, then refine with RL using listwise-derived rewards for global ranking awareness &#8212; keeping pointwise efficiency at inference.</p><p>A 4B ERank model outperforms many 7B rerankers, and the 32B variant sets SOTA on the BRIGHT benchmark with nDCG@10 of 40.2, surpassing the Rank-R1-32B listwise reranker while being more inference-efficient. Reranking is the highest-leverage stage in production search and recommendation pipelines. ERank delivers listwise quality at pointwise cost &#8212; exactly what production systems need. (<a href="https://arxiv.org/abs/2509.00520">paper</a>)</p><div><hr></div><h2>&#128221; Some Good Reads</h2><p><strong>&#8220;A Visual Guide to Attention Variants in Modern LLMs&#8221; (Sebastian Raschka)</strong> &#8212; A comprehensive visual walkthrough of every major attention variant in current open-weight architectures &#8212; from multi-head through grouped-query, multi-latent (DeepSeek), sliding window, differential, and native sparse attention. Covers hybrid patterns like Qwen3.5&#8217;s Gated DeltaNet + full attention in a 3:1 ratio. Accompanies a new LLM Architecture Gallery with 45+ visual model cards. The single best reference for understanding design choices behind Llama, DeepSeek, Gemma, and Qwen. (<a href="https://magazine.sebastianraschka.com/p/visual-attention-variants">read it</a>)</p><p><strong>How Uber Uses AI for Development (Pragmatic Engineer)</strong> &#8212; The most detailed look yet at agentic coding inside a major tech company. 84% of Uber devs are agentic coding users, 65-72% of code is AI-generated in IDEs, and 11% of PRs are opened by AI agents. Claude Code usage nearly doubled from 32% to 63% in two months. 
Uber built five internal tools including Minion (background agent platform) and Autocover (5,000+ auto-generated unit tests per month). The catch: AI costs are up 6x since 2024. (<a href="https://newsletter.pragmaticengineer.com/p/how-uber-uses-ai-for-development">read it</a>)</p><p><strong>Meta&#8217;s Ranking Engineer Agent (REA)</strong> &#8212; Meta Engineering unveiled an AI agent that autonomously optimizes advertisement ranking algorithms at scale. Not code completion &#8212; an agent iterating on ranking functions, running experiments, and improving Meta&#8217;s core revenue engine without human intervention. The shift from &#8220;AI assists engineers&#8221; to &#8220;AI is the engineer&#8221; for high-stakes production systems. (<a href="https://engineering.fb.com/2026/03/17/developer-tools/ranking-engineer-agent-rea-autonomous-ai-system-accelerating-meta-ads-ranking-innovation/">read it</a>)</p><div><hr></div><h2>&#128161; What This Week Was Really About</h2><p>Three forces collided this week, and the intersection defines where AI goes next.</p><p><strong>The model race ended; the platform race began:</strong> When Xiaomi &#8212; a phone maker &#8212; can build a trillion-parameter model that ranks 3rd on ClawEval and beats Claude Sonnet at coding for 67% less, the message is inescapable: frontier model capability is commoditizing. The competition has moved to the layer above &#8212; who controls the agentic workflow surface where work actually happens. </p><p><strong>Physical constraints are catching up with digital ambition:</strong> The Strait of Hormuz crisis represents the most consequential geopolitical moment for AI infrastructure in 2026. The AI infrastructure buildout assumed stable, cheap energy. That assumption is gone.</p><p><strong>Regulation and accountability arrived simultaneously on multiple fronts: </strong>The era of building AI without regulatory, legal, and geopolitical constraints is definitively over.</p><p>The practitioners who thrive in the next 90 days will be those building efficient, hardware-portable, constraint-aware systems. The &#8220;just scale it&#8221; era ended this week. Efficiency is the new moat.</p><div><hr></div><h2>&#9889; Quick Hits</h2><ul><li><p><strong>Nvidia&#8217;s $1T projection and Vera Rubin launch</strong> &#8212; Jensen Huang opened GTC 2026 projecting at least $1 trillion in chip orders through 2027 and began shipping Vera Rubin, claiming 3.5x faster training and 5x faster inference over Blackwell. OpenAI, Anthropic, and Meta confirmed as customers. (<a href="https://techcrunch.com/2026/03/16/jensen-just-put-nvidias-blackwell-and-vera-rubin-sales-projections-into-the-1-trillion-stratosphere/">TechCrunch</a>)</p></li><li><p><strong>Yann LeCun&#8217;s AMI Labs raises $1.03B</strong> &#8212; The largest AI seed round in European history, betting on world models and JEPA instead of LLM scaling. A strategic contrarian bet that investors are hedging as the LLM scaling regime shows diminishing returns. (<a href="https://techcrunch.com/2026/03/18/yann-lecun-ami-labs-seed-funding">TechCrunch</a>)</p></li><li><p><strong>Musk announces TERAFAB</strong> &#8212; A $20-25B joint Tesla/SpaceX/xAI chip fab in Austin targeting 2nm. Tesla has zero semiconductor manufacturing experience, and leading-edge fabs typically take 5-7 years. 
(<a href="https://www.bloomberg.com/news/articles/2026-03-22/elon-musk-says-tesla-xai-spacex-terafab-to-start-in-austin">Bloomberg</a>)</p></li><li><p><strong>Block lays off 4,000, stock jumps 25%</strong> &#8212; Jack Dorsey cited AI efficiency; critics call it &#8220;AI-washing&#8221; of post-overhiring corrections. Goldman estimates AI eliminates only 5K-10K jobs/month across all US sectors, far below the rhetoric. Sets a precedent where &#8220;AI&#8221; becomes the socially acceptable framing for any layoff. (<a href="https://www.bloomberg.com/news/articles/2026-03-01/jack-dorsey-s-4-000-job-cuts-at-block-arouse-suspicions-of-ai-washing">Bloomberg</a>)</p></li><li><p><strong>ICML rejects 2% of papers for LLM-written reviews</strong> &#8212; First major conference to enforce anti-AI review policies at scale. The tension between AI productivity and institutional norms is hitting academic publishing. (<a href="https://icml.cc/2026/reviewing-guidelines">ICML</a>)</p></li><li><p><strong>Karpathy&#8217;s AutoResearch goes viral</strong> &#8212; A 630-line script letting AI agents run hundreds of ML experiments overnight hit 22,983 GitHub stars in 3 days. Shopify&#8217;s CEO reported a 19% performance gain after 37 overnight experiments. The automation of AI research itself is becoming tangible. (<a href="https://github.com/karpathy/autoresearch">GitHub</a>)</p></li></ul>]]></content:encoded></item><item><title><![CDATA[From Candidates to Clicks: The Engineering Anatomy of Ranking]]></title><description><![CDATA[How modern recommendation systems go from 1,000 candidates to the one item you actually tap]]></description><link>https://www.mlwhiz.com/p/from-candidates-to-clicks-the-engineering</link><guid isPermaLink="false">https://www.mlwhiz.com/p/from-candidates-to-clicks-the-engineering</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Sat, 14 Mar 2026 04:12:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aHLZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is Part 6 of the RecSys for MLEs series. We're now in the final mile: the Ranking layer.In my previous posts in this series, we talked about the topics below. Do take a look at them:</p><ul><li><p><a href="https://www.mlwhiz.com/p/the-recommendation-engine-under-the">High-level architecture of recommendation systems</a></p></li><li><p><a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">The history of recommendation systems</a></p></li><li><p><a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">The fundamentals of Recsys</a></p></li><li><p><a href="https://www.mlwhiz.com/p/building-youtube-scale-recommendation">Two-tower retrieval</a></p></li><li><p><a href="https://www.mlwhiz.com/p/vector-search-at-scale-the-missing">Vector Search at Scale</a></p></li></ul><div><hr></div><blockquote><p>A new engineer on our team once asked: <em>&#8220;Why can&#8217;t we just sort by the Two-Tower scores? We already have them?&#8221;</em></p><p>It&#8217;s a deceptively simple question &#8212; the kind that sounds naive until you realize most experienced engineers can&#8217;t fully answer it either. The Two-Tower model is blind by design. At query time, it never sees a specific user and a specific item in the same room. The ranking model does. 
And that one architectural decision cascades into an entirely different class of problem.</p><p>That&#8217;s what this piece is about.</p></blockquote><p>Here's what we'll cover in this detailed breakdown of Ranking:</p><ul><li><p><strong>The Ranking Problem</strong> &#8594; Why 1,000 candidates still requires a completely different model</p></li><li><p><strong>The Feature Space</strong> &#8594; Dense, sparse, and cross features that power ranking</p></li><li><p><strong>The Pre-Neural Era</strong> &#8594; Logistic Regression and GBDTs: why they dominated for a decade</p></li><li><p><strong>Wide &amp; Deep</strong> &#8594; Google's 2016 paper that changed everything</p></li><li><p><strong>Deep &amp; Cross Network (DCN v2)</strong> &#8594; Automatic feature crossing at scale</p></li><li><p><strong>DLRM</strong> &#8594; Meta's production architecture powering Facebook and Instagram</p></li><li><p><strong>Multi-Task Ranking</strong> &#8594; Why optimizing for one signal is always wrong</p></li><li><p><strong>The Future &#8594; </strong>Lets have generative rankers</p></li><li><p><strong>Best Practices</strong> &#8594; Configuration, trade-offs, and pitfalls</p></li></ul><div><hr></div><h2>1. The Ranking Problem</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aHLZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aHLZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png 424w, https://substackcdn.com/image/fetch/$s_!aHLZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png 848w, https://substackcdn.com/image/fetch/$s_!aHLZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png 1272w, https://substackcdn.com/image/fetch/$s_!aHLZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aHLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png" width="1456" height="662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:556132,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/189719388?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aHLZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png 424w, https://substackcdn.com/image/fetch/$s_!aHLZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png 848w, https://substackcdn.com/image/fetch/$s_!aHLZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png 1272w, https://substackcdn.com/image/fetch/$s_!aHLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc91eb1f5-faf0-432b-b276-f4dbdc69bfd3_1478x672.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Before we even start, let's be precise about what we're solving.</p><p>After retrieval (Two-Tower + FAISS), you have somewhere between 100 and 2,000 candidates. These are items that are <em><strong>semantically</strong> relevant</em> to the user. Now you need to sort them. 
And sorting them badly is expensive &#8594; in ad systems, a 1% improvement in ranking quality can mean tens of millions of dollars(certainly billions for Meta). On a social feed, it's the difference between a user opening the app or deleting it and an opportunity loss forever. </p><p>The key insight is this: <strong>retrieval optimizes for recall; ranking optimizes for precision.</strong></p><p>The entire two-stage architecture comes down to a simple budget constraint. Retrieval has to score roughly 100 million items in under 50ms &#8212; that&#8217;s <strong>0.0000005ms per item</strong>, which is why we use approximate nearest neighbor search (FAISS) and willingly accept false positives assuming that Ranking will clean them up. </p><p>Ranking, by contrast, only sees ~1,000 survivors and has a 100ms budget &#8212; <strong>0.1ms per item</strong>. Run the math and that&#8217;s a <strong>200,000x larger per-item budget</strong>. </p><p>This budget difference is why ranking models look nothing like retrieval models. We can use:</p><ul><li><p>Dense cross-attention over all feature pairs</p></li><li><p>Deep neural networks with hundreds of millions of parameters</p></li><li><p>Expensive feature crosses and second-order interactions</p></li><li><p>Rich user history features (last 100 interactions)</p></li></ul><p>None of this is possible at retrieval scale.</p><div><hr></div><h2>2. The Feature Space</h2><p>Before we even look at the models, we need to understand the features we can use in such systems. Ranking models consume a mix of three feature types, and getting this right is half the battle.</p><h3>A. Dense Features (Continuous)</h3><p><strong>Dense features</strong> are your traditional ML inputs &#8212; numbers and low-cardinality categories you can encode directly:</p><ul><li><p><strong>User features</strong> &#8212; age, account tenure, average session length. Static signals about who the user is.</p></li><li><p><strong>Item features</strong> &#8212; average CTR, content age, creator follower count. Signals about the item&#8217;s quality and reach.</p></li><li><p><strong>Context features</strong> &#8212; hour of day, day of week, device type. The <em>when</em> and <em>where</em> of the request. Low-cardinality enough to one-hot encode or pass as-is.</p></li><li><p><strong>Interaction features</strong> &#8212; user-category affinity, user-creator affinity, user&#8217;s recent CTR. These are the most valuable: they describe the <em>relationship</em> between this user and this type of content, not just each in isolation.</p></li></ul><p>The last group is worth calling out. Pure user features and pure item features are static &#8212; they don&#8217;t change based on who&#8217;s seeing what. Interaction features do. That&#8217;s what makes them disproportionately predictive.</p><h3>B. Sparse Features (High-Cardinality IDs)</h3><p>This is where ranking models diverge sharply from traditional ML. You have features like <code>user_id</code> (500M unique values), <code>item_id</code> (1B unique values), <code>creator_id</code> (50M values). You can't one-hot encode these.</p><p>The solution? <strong>Embedding lookup tables.</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
import torch.nn as nn

# Each sparse feature gets its own embedding table
embedding_tables = {
    "user_id":    nn.Embedding(num_embeddings=500_000_000, embedding_dim=64),
    "item_id":    nn.Embedding(num_embeddings=1_000_000_000, embedding_dim=64),
    "creator_id": nn.Embedding(num_embeddings=50_000_000, embedding_dim=32),
    "category_id":nn.Embedding(num_embeddings=500, embedding_dim=16),
}
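
# Back-of-the-envelope size of the two big tables above (float32 weights only):
#   user_id: 500_000_000 rows * 64 dims * 4 bytes ~= 128 GB
#   item_id: 1_000_000_000 rows * 64 dims * 4 bytes ~= 256 GB
# You would not instantiate these naively on one machine as written; in practice the
# tables are typically sharded across devices, but the lookup below works the same way.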

# At inference time: just look up the embedding vector for this user/item pair
user_embedding   = embedding_tables["user_id"](torch.tensor([user_id]))  # shape: [1, 64]
item_embedding   = embedding_tables["item_id"](torch.tensor([item_id]))  # shape: [1, 64]</code></pre></div><p>In practice, these embedding tables are the <strong>largest part of the model</strong> -- often hundreds of GB. This is why ranking models are memory-bound, not compute-bound.</p><h3>C. Cross Features</h3><p>Some signals are only meaningful in combination. A user who usually watches cooking videos at 6pm might click on cooking content at 6pm but <em>not</em> at 2am. The feature <code>hour_of_day=18 AND category=cooking</code> is a cross feature.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Manual cross feature (the old way)
user_category_hour = f"{user_top_category}_{hour_of_day_bucket}"
# e.g., "cooking_evening" -&gt; one-hot encoded

# Neural approach: learn the crosses automatically (we'll see this with DCN)</code></pre></div><p>Historically, feature engineering consumed 70% of an ML team's time. Deep learning's main promise was automating this. But as we'll see, the truth is nuanced.</p><div><hr></div><h2>3. The Pre-Neural Era: Why It Still Matters</h2><h3>Logistic Regression &#8594; The Original Ranker</h3><p>For nearly a decade (2005-2015), the dominant ranking model at companies like Google, Yahoo, and early Facebook was <strong>Logistic Regression </strong>and honestly it still is in a lot of model stages.</p><p>You represent each (user, item, context) tuple as a sparse binary vector &#8594; one-hot encoded features &#8594; and learn a weight for each. Click prediction becomes:</p><pre><code><code>P(click) = sigmoid(w1*user_london + w2*item_sports + w3*evening + ...)</code></code></pre><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Simple example: 1M training examples, sparse features
# user_region (50 values), item_category (100 values), hour_bucket (6 values)

# In practice this is millions of features wide (all possible values)
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(training_data[["user_region", "item_category", "hour_bucket"]])   # sparse matrix; ~10K+ columns at real scale
y = training_data["clicked"]

model = LogisticRegression(solver='liblinear')
model.fit(X, y)</code></pre></div><p><strong>Why it worked:</strong> </p><ul><li><p>It was blazingly fast: it trains in minutes and can be served in microseconds.</p></li><li><p>Perfectly interpretable. The weight for "user in London" tells you exactly how much being in London affects CTR.</p></li><li><p>Scales linearly with data.</p></li></ul><p><strong>Why it doesn&#8217;t work:</strong><br>The model is linear. It cannot learn that "Sports + Evening + Mobile" together drives 3x higher CTR than each feature alone. To capture this, engineers had to <em>manually</em> create cross features:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># "Feature Engineering Hell" -- an actual practice at early Facebook/Google

cross_features = []
for f1 in dense_features:
    for f2 in dense_features:
        cross_features.append(f"{f1}_AND_{f2}")   # O(n^2) feature explosion

# At 10,000 base features: 100,000,000 potential cross features
# Most are useless. Figuring out which ones aren't -&gt; that's the job.</code></pre></div><p>At Yahoo, teams had dedicated "feature engineers" whose entire job was designing these cross features. This didn't scale, and the sprawling feature-engineering and feature-selection pipelines it produced were a headache to maintain.</p><h3>GBDTs &#8594; The Kaggle King</h3><p>Gradient Boosted Decision Trees (XGBoost, LightGBM) solved the interaction problem somewhat elegantly: each tree learns a decision rule over a combination of features. You don't need to specify <code>sports_AND_evening</code> manually &#8594; the tree discovers it.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import lightgbm as lgb
import pandas as pd

# Features are now dense numerical values, not one-hot encodings
X_train = pd.DataFrame({
    "user_age": user_ages,
    "user_avg_ctr": user_ctrs,
    "item_category_id": item_categories,   # LightGBM handles categoricals natively
    "hour_of_day": hours,
    "item_age_hours": item_ages,
})

y_train = clicks

dtrain = lgb.Dataset(X_train, label=y_train, categorical_feature=["item_category_id"])

params = {
    "objective": "binary",
    "metric": "auc",
    "num_leaves": 127,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
}

model = lgb.train(params, dtrain, num_boost_round=500)
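
# The trained booster doubles as a ranking scorer and a quick feature-selection signal;
# scoring X_train here is just to keep the sketch self-contained (in production you
# would score the candidate set for a single request instead).
p_click = model.predict(X_train)
importance = dict(zip(X_train.columns, model.feature_importance(importance_type="gain")))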

# A single tree's decision path might look like:
# if item_category == "sports" AND hour_of_day &gt; 17 AND user_age &lt; 35: CTR += 0.03</code></pre></div><p><strong>GBDTs dominated leaderboards</strong> for years, and they still power ranking at many companies as a baseline or for feature selection.</p><p><strong>Where they struggle:</strong></p><ul><li><p><strong>High-cardinality sparse features:</strong> You cannot feed 500M unique User IDs into LightGBM. </p></li><li><p><strong>Online learning:</strong> Updating a GBDT incrementally is hard. Neural networks handle streaming updates gracefully.</p></li><li><p><strong>Multi-modal features:</strong> Raw text, image embeddings, sequences &#8594; GBDT can't consume these natively.</p></li></ul><p>And, this is exactly where the Wide &amp; Deep network stepped in. </p><div><hr></div><h2>4. Wide &amp; Deep: Google's Breakthrough (2016)</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qg17!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qg17!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png 424w, https://substackcdn.com/image/fetch/$s_!qg17!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png 848w, https://substackcdn.com/image/fetch/$s_!qg17!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png 1272w, https://substackcdn.com/image/fetch/$s_!qg17!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qg17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png" width="1456" height="1095" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1095,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:878494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/189719388?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qg17!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png 424w, 
https://substackcdn.com/image/fetch/$s_!qg17!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png 848w, https://substackcdn.com/image/fetch/$s_!qg17!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png 1272w, https://substackcdn.com/image/fetch/$s_!qg17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F515787a7-a41a-47f5-bccf-3b22d7e1b500_1588x1194.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In 2016, the Google Play team published <a href="https://arxiv.org/abs/1606.07792">"Wide &amp; Deep Learning for Recommender Systems"</a>. It's one of the most influential papers in industrial ML, and it directly addressed the LR vs. GBDT trade-off.</p><p>The key insight: <strong>memorization and generalization are both important, and you need a different architecture for each.</strong></p><ul><li><p><strong>Wide component</strong>: A linear model on raw and crossed features. Good at <em>memorizing</em> specific patterns ("users who installed app A also install app B").</p></li><li><p><strong>Deep component</strong>: A deep neural network on embeddings of sparse features. Good at <em>generalizing</em> to unseen (user, item) pairs.</p></li></ul><p>You train both jointly and combine their outputs.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    def __init__(self,
                 num_dense_features: int,
                 sparse_feature_dims: dict,  # {feature_name: (vocab_size, embed_dim)}
                 deep_hidden_dims: list = [256, 128, 64]):
        super().__init__()

        # Wide component: linear model on dense + cross features
        self.wide = nn.Linear(num_dense_features, 1)
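        # (In the original paper the wide input is the raw features plus hand-crafted
        #  cross features; in this sketch those crosses would simply be concatenated
        #  onto dense_features before this linear layer.)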

        # Embedding tables for sparse features
        self.embeddings = nn.ModuleDict({
            name: nn.Embedding(vocab_size, embed_dim)
            for name, (vocab_size, embed_dim) in sparse_feature_dims.items()
        })

        # Deep component: MLP on concatenated embeddings + dense features
        total_embed_dim = sum(d for _, d in sparse_feature_dims.values())
        deep_input_dim = num_dense_features + total_embed_dim
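        # e.g., with the instantiation below: 10 dense + (64 + 64 + 16) embedding dims = 154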

        layers = []
        in_dim = deep_input_dim
        for h_dim in deep_hidden_dims:
            layers.extend([
                nn.Linear(in_dim, h_dim),
                nn.ReLU(),
                nn.BatchNorm1d(h_dim),
                nn.Dropout(0.1),
            ])
            in_dim = h_dim
        self.deep = nn.Sequential(*layers)
        self.deep_output = nn.Linear(in_dim, 1)

    def forward(self, dense_features, sparse_features):
        # Wide path
        wide_output = self.wide(dense_features)  # [B, 1]

        # Deep path: look up embeddings for all sparse features
        embed_list = [
            self.embeddings[name](ids)
            for name, ids in sparse_features.items()
        ]
        deep_input = torch.cat([dense_features] + embed_list, dim=1)  # [B, D]
        deep_output = self.deep_output(self.deep(deep_input))          # [B, 1]

        # Joint output
        logit = wide_output + deep_output
        return torch.sigmoid(logit).squeeze(1)


# Example instantiation
model = WideAndDeep(
    num_dense_features=10,
    sparse_feature_dims={
        "user_id":    (500_000, 64),
        "item_id":    (1_000_000, 64),
        "category_id":(500, 16),
    }
)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Output: Parameters: 96,089,804</code></pre></div><p>Let's look at training this on a synthetic dataset to see it in action:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Generate synthetic ranking data
N = 100_000
torch.manual_seed(42)

dense   = torch.randn(N, 10)
user_ids = torch.randint(0, 500_000,   (N,))
item_ids = torch.randint(0, 1_000_000, (N,))
cat_ids  = torch.randint(0, 500,       (N,))

# Ground truth: users click more on items they have affinity for
# (simple synthetic rule)
labels = ((dense[:, 0] + dense[:, 2] &gt; 0) &amp; (cat_ids &lt; 250)).float()
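# This yields roughly 25% positives: the dense sum is symmetric around zero (positive half
# the time) and cat_ids are uniform over 500 values (below 250 half the time), independently.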

dataset = TensorDataset(dense, user_ids, item_ids, cat_ids, labels)
loader  = DataLoader(dataset, batch_size=2048, shuffle=True)

model     = WideAndDeep(10, {"user_id": (500_000,64), "item_id":(1_000_000,64), "category_id":(500,16)})
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()
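# Note: BCELoss pairs with the sigmoid applied inside the model; returning raw logits and
# using nn.BCEWithLogitsLoss would be the more numerically stable choice at scale.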

for epoch in range(5):
    total_loss = 0
    for dense_b, uid_b, iid_b, cid_b, y_b in loader:
        pred = model(dense_b, {"user_id": uid_b, "item_id": iid_b, "category_id": cid_b})
        loss = criterion(pred, y_b)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} | Loss: {total_loss/len(loader):.4f}")</code></pre></div><p>Output:</p><pre><code><code>Epoch 1 | Loss: 0.6891
Epoch 2 | Loss: 0.6204
Epoch 3 | Loss: 0.5877
Epoch 4 | Loss: 0.5701
Epoch 5 | Loss: 0.5589</code></code></pre><p>Wide &amp; Deep became the template that every major tech company adapted. YouTube, Spotify, Airbnb, and Twitter all published variants of this architecture within two years of the paper.</p><p>But there was still a manual bottleneck: the <strong>cross features in the Wide component still had to be designed by hand</strong>.</p><div><hr></div><h2>5. DCN v2: Automatic Feature Crossing</h2>
      <p>
          <a href="https://www.mlwhiz.com/p/from-candidates-to-clicks-the-engineering">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[MLWhiz Weekly AI/ML Newsletter # 1]]></title><description><![CDATA[Here is what happened this week.]]></description><link>https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-1</link><guid isPermaLink="false">https://www.mlwhiz.com/p/mlwhiz-weekly-aiml-newsletter-1</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Wed, 11 Mar 2026 04:30:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SBkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SBkB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SBkB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png 424w, https://substackcdn.com/image/fetch/$s_!SBkB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png 848w, https://substackcdn.com/image/fetch/$s_!SBkB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png 1272w, https://substackcdn.com/image/fetch/$s_!SBkB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SBkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png" width="1410" height="804" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113146,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/190555056?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SBkB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png 424w, 
https://substackcdn.com/image/fetch/$s_!SBkB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png 848w, https://substackcdn.com/image/fetch/$s_!SBkB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png 1272w, https://substackcdn.com/image/fetch/$s_!SBkB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceb525ef-2c8d-4942-a2d6-4c5960524ddd_1410x804.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>&#127942; Story of the Week: The AI Governance War Just Got Real</h2><p>This was the week AI governance stopped being an abstract policy debate and started showing up as lost contracts, executive resignations, and a company being blacklisted by the US government.</p><p>The sequence of events reads like a thriller. The Pentagon demanded that AI labs agree to &#8220;<em><strong>all lawful use</strong></em>&#8221; of their models &#8212;including mass domestic surveillance and fully autonomous lethal weapons. Anthropic refused, citing specific ethical red lines around human oversight of lethal force and mass surveillance of Americans without judicial oversight. The Pentagon&#8217;s response: terminate a $200M contract, formally designate Anthropic a &#8220;supply chain risk&#8221; &#8212; <em><strong>a designation never previously applied to an American company</strong></em> &#8212; and hand the contract to OpenAI. By mid-week, federal agencies (Treasury, State, HHS) were actively phasing out Claude and switching to Grok and Codex, which had accepted the terms. The GSA removed Anthropic from <a href="http://USAi.gov">USAi.gov</a> entirely.</p><p>Then the story got stranger. OpenAI took the deal, but quietly walked back parts of it after backlash &#8212; adding two sentences about human oversight of lethal force. On March 7, Caitlin Kalinowski, OpenAI&#8217;s head of robotics and consumer hardware, resigned over the Pentagon deal. 
She named what nobody in executive suites wanted to name: <em><strong>surveillance without judicial oversight, lethal autonomy without human authorization</strong></em>. Meanwhile, Anthropic&#8217;s own tech was still running Iranian war strikes &#8212; inside Palantir&#8217;s systems, which apparently had already integrated Claude before the ban. The irony was sharp: <em>the lab that held the line against military AI was more embedded in active combat than the one that accepted the contract.</em></p><p>What this week exposed is that &#8220;AI safety&#8221; and &#8220;AI ethics&#8221; are no longer differentiators you can just put in a marketing brief. They&#8217;re now business risks. <em><strong>If you hold the line, you lose government revenue and get blacklisted. If you don&#8217;t, you lose your own people</strong></em>. There&#8217;s no clean version of this story, and both OpenAI and Anthropic came out of the week looking different than they went in. The principle at stake &#8212; whether AI companies can negotiate safety terms with the US military &#8212; will define which labs can scale in government markets for the next decade, and at what moral cost. Watch Anthropic&#8217;s legal challenge closely. It&#8217;s the most important case in AI governance since... well, ever.</p><p>&#128279; <a href="https://www.latimes.com/business/story/2026-03-06/anthropic-vows-legal-fight-against-pentagon-sanction-in-ai-feud">LA Times &#8212; Anthropic Vows Legal Fight</a></p><p>&#128279; <a href="https://www.forbes.com/sites/mikestunson/2026/03/07/openais-robotics-chief-leaving-tech-company-after-its-deal-with-pentagon/">Forbes &#8212; OpenAI&#8217;s Robotics Chief Resigns Over Pentagon Deal</a></p><p>&#128279; <a href="https://www.cbsnews.com/news/pentagon-anthropic-supply-chain-risk-feud-ai-guardrails/">CBS News &#8212; Pentagon-Anthropic Feud</a></p><p>&#128279; <a href="https://www.nextgov.com/acquisition/2026/03/agencies-begin-shed-anthropic-contracts-following-trumps-directive/411823/">Nextgov &#8212; Agencies Begin Shedding Anthropic Contracts</a></p><div><hr></div><h2>&#129302; Models That Dropped This Week</h2><p><strong>GPT-5.4 and GPT-5.4 Pro</strong> (OpenAI, March 5&#8211;6) &#8212; The headline capability is native computer use built into the base model, not bolted on. On OSWorld-Verified it hit 75.0%, above human-level (72.4%) and up from 47.3% for GPT-5.2. <strong>The full package</strong>: 1M token context, coding ability from GPT-5.3-Codex folded in (57.7% on SWE-Bench Pro), tool search that cuts token usage by 47% in tests, and a thinking mode that shows you its plan upfront. ARC-AGI-2 went from 54.2% (GPT-5.2 Pro) to 83.3% (GPT-5.4 Pro) &#8212; a 29-point jump in one generation. Artificial Analysis gives it score 57 on their Intelligence Index (up from 51). <strong>Cost</strong>: $2.50/$15 per 1M in/out tokens vs. $1.75/$14 for GPT-5.2. &#128279; <a href="https://openai.com/index/introducing-gpt-5-4">OpenAI announcement</a></p><p><strong>Gemini 3.1 Flash-Lite</strong> (Google DeepMind, March 3) &#8212; The sharpest answer to the high-volume inference use case. <em><strong>$0.25/M input tokens, 2.5&#215; faster TTFT</strong></em> than Gemini 2.5 Flash, 86.9% on GPQA Diamond (strong for this price tier), 1432 Elo on LMArena. The adjustable thinking levels feature is the practical standout &#8212; one model handles both cheap classification tasks and heavier reasoning by dialing a parameter. 
&#128279; <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/">Google Blog</a></p><p><strong>GPT-5.3 Instant</strong> (OpenAI, March 3&#8211;4) &#8212; Now the default ChatGPT model. Fewer brittle refusals, better web synthesis, reduced hallucinations on flagged prompts. A polish update, not a capability leap, but the explicit direction toward helpfulness over caution is notable. &#128279; <a href="https://aibusiness.com/generative-ai/openai-says-chatgpt-instant-5-3-is-less-cringe-more-accurate">AI Business</a></p><p><strong>Phi-4-reasoning-vision-15B</strong> (Microsoft, March 6) &#8212; 15B multimodal with dedicated reasoning architecture. Positioned as the &#8220;sweet spot&#8221; for production agents where frontier is overkill but real reasoning matters. Runs on one A100.  &#128279; <a href="https://huggingface.co/models?search=phi-4-reasoning-vision">HuggingFace</a></p><div><hr></div><h2>&#129504; Papers That Matter</h2><p><strong><a href="https://arxiv.org/abs/2603.04816">Scaling Laws for Reranking in Information Retrieval</a></strong> &#8212; The first systematic study of how rerankers scale with model size, data, and compute in multi-stage retrieval. <strong>Key finding:</strong> Scaling the reranker isn&#8217;t always the right move &#8212; there are inflection points where adding compute to first-stage retrieval outperforms a bigger reranker, and the optimal candidate-set/reranker-size combination is non-obvious. </p><p><strong><a href="https://arxiv.org/abs/2603.02153">RAG Fusion in Production: Lessons from an Industry Deployment</a></strong> &#8212; Multi-query retrieval with RRF increases raw recall, but that improvement largely evaporates once a fixed-budget reranker is applied. Hit@10 dropped from 0.51 to 0.48 in several configurations compared to a single-query baseline. Measurement framework for evaluation is the key contribution: test end-to-end under your actual constraints, not recall in isolation. Required reading before you add multi-query fusion to any production RAG stack. </p><p><strong><a href="https://arxiv.org/abs/2603.03988">SORT: Systematically Optimized Ranking Transformer for Industrial-Scale Recommenders</a></strong> &#8212; What makes this notable is the rare combination: +6.35% orders and +5.97% GMV in real A/B tests <em>alongside</em> a 44.67% reduction in serving latency and 121.33% throughput improvement. If you&#8217;re running ranking pipelines at scale and the transformer-doesn&#8217;t-work-in-prod story has hit you, read the system design choices in this paper carefully. </p><p><strong><a href="https://arxiv.org/abs/2603.03630">Behind the Prompt: The Agent-User Problem in Information Retrieval</a></strong> &#8212; As AI agents increasingly act as the &#8220;user&#8221; in retrieval systems, classical IR&#8217;s core assumption &#8212; that observed behavior reveals human intent &#8212; mathematically breaks down. The paper proves this non-identifiability isn&#8217;t a detection problem awaiting a better classifier; it&#8217;s structural. With Claude handling 50% of coding use cases and agentic traffic growing fast, this is a foundational issue for everyone building or evaluating retrieval systems. One to read carefully. </p><div><hr></div><h2>&#128221; Some Good Reads</h2><p><strong>Donald Knuth: &#8220;Claude&#8217;s Cycles&#8221;</strong> &#8212; The moment of the week outside the governance story. Knuth, whose skepticism of generative AI was on record, opened his note with &#8220;Shock! 
Shock!&#8221; after Claude Opus 4.6 solved an open combinatorics problem he&#8217;d been working on for weeks: finding a general construction for decomposing odd-sized 3D directed graphs into Hamiltonian cycles. Over 90 minutes and 31 systematic explorations, Claude found it. Knuth proved it formally. He coined the term &#8220;Claude-like decompositions&#8221; and concluded: &#8220;It seems I&#8217;ll have to revise my opinions about generative AI one of these days.&#8221; This category of evidence &#8212; real researchers, real problems, real results &#8212; matters more than any benchmark.  <a href="https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf">Stanford PDF</a> | <a href="https://simonwillison.net/2026/Mar/3/donald-knuth/">Simon Willison</a></p><p><strong>Cursor&#8217;s &#8220;Third Era&#8221;</strong> (Latent Space) &#8212; Agent usage at Cursor now outnumbers Tab autocomplete 2:1. More than one-third of internal PRs are written by cloud agents running in dedicated VMs. The company also acquired Graphite and Autotab. At $2B+ ARR, this is the clearest evidence yet that &#8220;agentic coding&#8221; has crossed from curiosity to workflow primitive. The chart showing the ratio inversion is the kind of inflection point data that will look obvious in retrospect. <a href="https://www.latent.space/p/cursor-third-era">Latent Space</a></p><div><hr></div><h2>&#128161; What This Week Was Really About</h2><p><strong>AI governance became a hard business constraint, not a soft value statement.</strong> The Anthropic blacklisting created a fork in the AI industry: labs that accept government use-case terms without guardrails get federal revenue; labs with ethical limits don&#8217;t. OpenAI lost a senior executive proving the other side of the same coin. This bifurcation &#8212; government-accessible AI vs. principled AI &#8212; will define market positioning for years. The Pentagon precedent is being set right now.</p><p><strong>Computer use crossed a threshold.</strong> GPT-5.4 hitting above human-level on OSWorld-Verified (75.0% vs. 72.4%) is the kind of benchmark that means agents can now reliably navigate desktop environments. The bottleneck on automation shifts from &#8220;can the model do it&#8221; to &#8220;do you trust it enough to let it.&#8221; Combine that with Cursor&#8217;s 2:1 agent-to-tab ratio and it&#8217;s clear: agentic work is the baseline, not the frontier.</p><p><strong>The best open-source model family lost its architects.</strong> Junyang Lin, Yu Bowen, and Hui Binyuan &#8212; the three people most responsible for Qwen&#8217;s run as the dominant open-source model family &#8212; all left Alibaba within weeks of each other after an internal reorganization. Whether Qwen 3.5 becomes a swan song or a temporary dip depends on how fast Alibaba can rebuild. The open-source ecosystem is less robust than it looked three months ago.</p><div><hr></div><h2>&#9889; Quick Hits</h2><ul><li><p><strong>vLLM v0.17.0</strong> &#8212; FlashAttention 4, 30.8% throughput gains with async scheduling, full Qwen3.5 support, Realtime WebSocket API for audio. Upgrade is worth it if you&#8217;re self-hosting inference. <a href="https://github.com/vllm-project/vllm/releases">GitHub</a></p></li><li><p><strong>Block (Square) cut 40% of workforce citing AI productivity</strong> &#8212; 10,000 &#8594; under 6,000 employees, with Q4 gross profit <em>up</em> 24% YoY. One of the clearest examples of a profitable company restructuring around AI, not survival. 
<a href="https://apnews.com/article/block-dorsey-layoffs-ai-jobs-18e00a0b278977b0a87893f55e3db7bb">AP News</a></p></li><li><p><strong>Databricks KARL beats Claude 4.6 and GPT-5.2 on enterprise knowledge tasks at 33% lower cost and 47% lower latency</strong> &#8212; RL-trained agent, entirely synthetic training data, a few thousand GPU hours. Zaharia is opening the pipeline to customers. <a href="https://www.databricks.com/blog/meet-karl-faster-agent-enterprise-knowledge-powered-custom-rl">Databricks Blog</a></p></li><li><p><strong>Apple replacing Core ML with &#8220;Core AI&#8221; at WWDC 2026</strong> &#8212; Targets on-device LLMs, diffusion models, agentic workflows. If you&#8217;re building iOS ML apps, your stack is about to change. <a href="https://appleinsider.com/articles/26/03/01/wwdc-2026-to-introduce-core-ai-as-replacement-for-core-ml">AppleInsider</a></p></li><li><p><strong>Anthropic hit $19B ARR &#8212; doubled from $9B in roughly two months</strong> &#8212; Mostly Claude Code and enterprise. Closing in on OpenAI&#8217;s $20B. <a href="https://seekingalpha.com/news/4560518-anthropic-approaches-20b-revenue-run-rate-amid-pentagon-clash-over-ai-use">Seeking Alpha</a></p></li><li><p><strong>Aravind Srinivas (Perplexity): &#8220;The orchestration is the product. The model is a tool.&#8221;</strong> &#8212; Clearest articulation yet of the model-commoditization thesis from someone building a product on top of it. Perplexity Computer + Voice Mode launched this week too. <a href="https://indianexpress.com/article/technology/artificial-intelligence/perplexity-ai-introduces-voice-mode-in-perplexity-computer-10566216/">Indian Express</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Vector Search at Scale: The Production Engineer's Guide]]></title><description><![CDATA[IVF partitions space. PQ compresses memory. Together they make 100M vector search actually possible]]></description><link>https://www.mlwhiz.com/p/vector-search-at-scale-the-missing</link><guid isPermaLink="false">https://www.mlwhiz.com/p/vector-search-at-scale-the-missing</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Tue, 03 Feb 2026 04:14:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3239!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is Part 5 of the <strong>RecSys for MLEs</strong> series. We&#8217;re now taking a brief foray into a vital production infra &#8594; The vector databases.</p><div><hr></div><p>In my previous posts in this series, we talked about the topics below. 
Do take a look at them:</p><ol><li><p><a href="https://www.mlwhiz.com/p/the-recommendation-engine-under-the">High-level architecture of recommendation systems</a>, </p></li><li><p><a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">the history of recommendation systems</a>, </p></li><li><p>the <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">fundamentals of Recsys</a>, and </p></li><li><p><a href="https://www.mlwhiz.com/p/building-youtube-scale-recommendation">two tower retrieval</a>.</p></li></ol><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3239!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3239!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png 424w, https://substackcdn.com/image/fetch/$s_!3239!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png 848w, https://substackcdn.com/image/fetch/$s_!3239!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png 1272w, https://substackcdn.com/image/fetch/$s_!3239!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3239!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png" width="1024" height="848" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:848,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1124685,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/186369693?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3239!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png 424w, https://substackcdn.com/image/fetch/$s_!3239!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png 848w, 
https://substackcdn.com/image/fetch/$s_!3239!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png 1272w, https://substackcdn.com/image/fetch/$s_!3239!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc74954af-c742-4d2b-926d-309f24e75c1c_1024x848.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Yesterday, a colleague asked me about Product Quantization. I tried to explain it and twenty minutes later, we were both completely lost.</p><p>It started simply enough: &#8220;What is PQ Quantization?&#8221; Then we went straight to IVFPQ. </p><p>I started explaining&#8212;IVF partitions the space, PQ compresses the vectors, you combine them for speed and memory efficiency. Easy, right?</p><p>&#8220;Okay, but what&#8217;s IVF doing exactly?&#8221;</p><p>I started explaining Voronoi cells and centroids. He asked how that connects to PQ. </p><p>At some point, we both realized neither of us could clearly articulate how all these pieces fit together. I knew IVF-PQ <em>worked</em> as I had used FAISS dozens of times. But explaining it from scratch was pretty hard.</p><p>So I went back and actually figured it out. </p><p>This is the explanation I wish I&#8217;d given him yesterday. 
</p><p>So, get a Coffee and let&#8217;s go!!!</p><p>Here&#8217;s what we&#8217;ll cover in this detailed breakdown of Vector Search:</p><ol><li><p><strong>Brute Force </strong>&#8212; Doesn&#8217;t Scale</p></li><li><p><strong>IVF (Inverted File Index)</strong> &#8212; Partitioning space for sub-linear search</p></li><li><p><strong>Product Quantization (PQ)</strong> &#8212; Compressing vectors by 64x</p></li><li><p><strong>IVF-PQ Combined</strong> &#8212; Best of both worlds</p></li><li><p><strong>Metadata Filtering</strong> &#8212; Combining semantic search with structured filters</p></li><li><p><strong>Vector Databases</strong> &#8212; FAISS, Milvus, Pinecone, and friends</p></li><li><p><strong>Benchmarking</strong> &#8212; Measuring what matters</p></li><li><p><strong>Best Practices</strong> &#8212; Configuration, tuning, and pitfalls</p></li><li><p><strong>Conclusion</strong> &#8212; Putting it all together</p></li></ol><h2>1. The Problem: Brute Force Doesn&#8217;t Scale</h2><p>First, let&#8217;s understand why naive nearest neighbor search fails at scale.</p><pre><code><code>import numpy as np
import time

def brute_force_search(query, database, k=10):
    """Find k nearest neighbors by computing all distances."""
    distances = np.sum((database - query) ** 2, axis=1)
    return np.argsort(distances)[:k]
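
# np.argpartition(distances, k)[:k] would avoid the full O(n log n) sort here,
# but the dominant cost is unavoidable: all n distances above still get computed.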

# Simulate different scales
for n in [10_000, 100_000, 1_000_000, 10_000_000]:
    d = 128
    database = np.random.randn(n, d).astype(np.float32)
    query = np.random.randn(d).astype(np.float32)

    start = time.time()
    indices = brute_force_search(query, database)
    elapsed = time.time() - start

    memory_gb = (n * d * 4) / (1024**3)
    print(f"n={n:&gt;10,}: {elapsed*1000:&gt;8.1f}ms, memory={memory_gb:.2f}GB")</code></code></pre><p>Output:</p><pre><code><code>n =     10,000:     25.2ms, memory=0.00GB
n =    100,000:     36.7ms, memory=0.05GB
n =  1,000,000:    391.3ms, memory=0.48GB
n = 10,000,000:   4536.0ms, memory=4.77GB</code></code></pre><p><strong>The problem is O(n &#215; d)</strong> &#8212; linear in both dataset size and dimension. At 1 billion vectors, you&#8217;re looking at 453 seconds per query &#8212; which is totally unacceptable! Also, the size of the dataset is going to be pretty high so you couldn&#8217;t really fit it into the memory.</p><p>So, we will need techniques that provide:</p><ol><li><p><strong>Sub-linear search time</strong> &#8212; Don&#8217;t look at every vector.</p></li><li><p><strong>Memory compression</strong> &#8212; Store vectors more efficiently.</p></li><li><p><strong>Approximate results</strong> &#8212; Trade some accuracy for massive speedups.</p></li></ol><div><hr></div><h2>2. IVF: Inverted File Index</h2><p>The first key insight that we will go into is &#8594; <strong>don&#8217;t search everything</strong>. Instead, partition the space into regions and only search the relevant ones.</p><p>We can do this using IVF, which uses k-means clustering to divide the vector space into <code>nlist</code> partitions (also called Voronoi cells). Each partition is defined by a centroid, and thereby every database vector is assigned to its nearest centroid. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P4di!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P4di!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png 424w, https://substackcdn.com/image/fetch/$s_!P4di!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png 848w, https://substackcdn.com/image/fetch/$s_!P4di!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png 1272w, https://substackcdn.com/image/fetch/$s_!P4di!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P4di!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png" width="1156" height="634" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:1156,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:188321,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/186369693?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P4di!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png 424w, https://substackcdn.com/image/fetch/$s_!P4di!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png 848w, https://substackcdn.com/image/fetch/$s_!P4di!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png 1272w, https://substackcdn.com/image/fetch/$s_!P4di!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2657c1-3da6-49b5-aa62-8299780fce14_1156x634.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is how the whole IVF process works &#8594; </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2fVS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2fVS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png 424w, https://substackcdn.com/image/fetch/$s_!2fVS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png 848w, https://substackcdn.com/image/fetch/$s_!2fVS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png 1272w, https://substackcdn.com/image/fetch/$s_!2fVS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2fVS!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png" width="1200" height="1283.2417582417581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1557,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:690889,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/186369693?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2fVS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png 424w, https://substackcdn.com/image/fetch/$s_!2fVS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png 848w, https://substackcdn.com/image/fetch/$s_!2fVS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png 1272w, https://substackcdn.com/image/fetch/$s_!2fVS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f03f438-241e-4623-83d0-feb7807bcaaa_1468x1570.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 
2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We have three distinct phases here. For better understanding, let&#8217;s think of this with some numbers. Assume that we have a total of 1M embeddings that we want to index.</p><p><strong>A. Training Phase:</strong></p><p>We begin by applying the <strong>k-means algorithm</strong> to a representative sample of our embedding dataset. This process identifies <code>nlist</code> centroids, which define the boundaries of our <strong>Voronoi cells</strong>.</p><ul><li><p><strong>Example:</strong> If we set <code>nlist = 1024</code>, the vector space is partitioned into 1024 distinct clusters.</p></li></ul><p><strong>B. Indexing Phase:</strong></p><p>Once we have <code>nlist</code> cluster centroids, we effectively index our entire dataset using them. Particularly, for each database vector, we find its nearest centroid and store the vector in that centroid&#8217;s &#8220;inverted list&#8221;. So we will have a total of <code>nlist</code> lists. </p><ul><li><p><strong>Example:</strong> In our case, this will come out to be 1024 lists. Each list will have on average 1M/1024 points &#8594; 976 points. </p></li></ul><p><strong>C. Search Phase:</strong></p><p>Now, rather than performing an exhaustive (brute-force) search across all 1M points, we use the <code>nprobe</code> parameter to limit our scope. </p><ol><li><p><strong>Centroid Selection:</strong> When a query vector arrives, we first calculate its distance to all <code>nlist</code> centroids to find the <code>nprobe</code> closest ones. If <code>nprobe = 10 </code>we will select the 10 closest centroids.</p></li><li><p><strong>Localized Search:</strong> Now, rather than searching across all vectors, we will only search the vectors contained within the 10 specific Voronoi cells. 
So<strong>, </strong>instead of 1,000,000 comparisons, we perform only ~9,760 comparisons now, reducing the computational load by over <strong>99%</strong>.</p></li></ol><p>The idea here is that rather than searching in all cells, we search in the top 10 cells closest to the query vector.</p><p><em><strong>Why does it improve speed?</strong></em></p><ul><li><p><strong>Brute force:</strong> We needed to search all n vectors &#8594; O(n &#215; d)</p></li><li><p><strong>IVF:</strong> We just searched nprobe &#215; (n/nlist) vectors &#8594; O(nprobe &#215; n/nlist &#215; d) </p></li><li><p><strong>Speedup Factor:</strong> nlist/nprobe</p></li></ul><p>For 1M vectors with nlist=1024 and nprobe=10:</p><ul><li><p>Brute force: 1,000,000 distance computations</p></li><li><p>IVF: 10 &#215; (1M/1024) &#8776; 9,766 distance computations</p></li><li><p><strong>Speedup: ~100x!</strong></p></li></ul><p>Now, here is how we can implement this using FAISS.</p><pre><code>import faiss
import numpy as np
import time

# Simulate different scales
for n in [100_000, 1_000_000, 10_000_000]:
    d = 128
    database = np.random.randn(n, d).astype(np.float32)
    query = np.random.randn(1, d).astype(np.float32)

    # Number of Voronoi cells
    nlist = 1024
    # The coarse quantizer (how vectors get assigned to centroids)
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist)

    # Train the centroids, then add the vectors.
    # Training on the whole set is slow; a representative subset is enough.
    index.train(database[:100_000])
    index.add(database)

    # Search only the 10 closest cells (the FAISS default is nprobe = 1)
    index.nprobe = 10

    start = time.time()
    distances, indices = index.search(query, 10)
    elapsed = time.time() - start

    print(f"n={n:&gt;10,}: {elapsed*1000:&gt;8.1f}ms")</code></pre><p>Output:</p><pre><code>n=   100,000:      0.2ms
n= 1,000,000:      0.4ms
n=10,000,000:      1.3ms</code></pre><p>This is all good and great, but as you know, there is no free lunch. The Trade-off here is between Recall vs Speed.</p><p>While we have increased speed, if the true nearest neighbor is in a partition we didn&#8217;t probe, we&#8217;ll miss it.</p><p>Also note that IVF doesn&#8217;t reduce memory. It just reduces the search space&#8212;we are still storing full float32 vectors. And that&#8217;s where PQ comes in.</p><div><hr></div><h2>3. Product Quantization (PQ): Vector Compression</h2><p>IVF speeds up search but doesn&#8217;t help with memory. At 1 billion 128-dimensional vectors in float32, you need <strong>512 GB of RAM</strong>. Product Quantization effectively compresses this to <strong>8 GB</strong>&#8212;a 64x reduction. But How? </p><p>It is a Divide and Quantize paradigm.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RiVB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RiVB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png 424w, https://substackcdn.com/image/fetch/$s_!RiVB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png 848w, https://substackcdn.com/image/fetch/$s_!RiVB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png 1272w, https://substackcdn.com/image/fetch/$s_!RiVB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RiVB!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png" width="1200" height="1450.5494505494505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1760,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:977864,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/186369693?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RiVB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png 424w, 
https://substackcdn.com/image/fetch/$s_!RiVB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png 848w, https://substackcdn.com/image/fetch/$s_!RiVB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png 1272w, https://substackcdn.com/image/fetch/$s_!RiVB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80059ada-9a8d-4fbd-9aa9-4369f2fefee1_1464x1770.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>PQ works by:</p><ol><li><p><strong>Splitting</strong> each vector into m subvectors</p></li><li><p><strong>Clustering</strong> each subspace independently with k centroids</p></li><li><p><strong>Encoding</strong> each subvector as the ID of its nearest centroid</p></li></ol><p>Let&#8217;s look at an example through a 128-dimensional vector with m=8 subvectors and k=256 centroids:</p><p><strong>Original vector (512 bytes):</strong></p><pre><code><code>x = [0.23, -1.45, 0.78, ..., 0.45]  # 128 floats &#215; 4 bytes = 512 bytes</code></code></pre><p><strong>Split into 8 subvectors (16 dims each):</strong></p><pre><code><code>u&#185; = [0.23, -1.45, ..., 0.89]   # dims 1-16
u&#178; = [1.12, 0.34, ..., -0.56]   # dims 17-32
...
u&#8312; = [-0.12, 0.67, ..., 0.45]  # dims 113-128</code></code></pre><p><strong>Find the nearest centroid in each codebook:</strong></p><pre><code><code>u&#185; &#8594; codebook 1 &#8594; nearest centroid: index 42
u&#178; &#8594; codebook 2 &#8594; nearest centroid: index 189
...
u&#8312; &#8594; codebook 8 &#8594; nearest centroid: index 201</code></code></pre><p><strong>Final PQ code (8 bytes!):</strong></p><pre><code><code>pq_code = [42, 189, 7, 255, 91, 123, 56, 201]  # 8 uint8 values</code></code></pre><p><strong>Compression: 512 bytes &#8594; 8 bytes = 64x!</strong></p><p>Till now, we talked about the codebook in a handwavy way, but here is the whole implementation on how PQ works.</p>
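<p>For a quick, self-contained illustration of the encoding step before the full walkthrough, here is a minimal sketch (my own illustration, not the article&#8217;s implementation) using FAISS&#8217;s <code>IndexPQ</code> with the same parameters as the example above (d=128, m=8 subvectors, 8-bit codes, i.e. 256 centroids per codebook):</p><pre><code>import faiss
import numpy as np

# Illustrative sketch: d=128 dims, m=8 subvectors, 8-bit codes
# (256 centroids per codebook), mirroring the worked example above.
d, m, nbits = 128, 8, 8
train = np.random.randn(10_000, d).astype(np.float32)

index = faiss.IndexPQ(d, m, nbits)
index.train(train)                    # learns the 8 codebooks via k-means

codes = index.sa_encode(train[:3])    # compact codes: one uint8 per subvector
print(codes.shape, codes.dtype)       # (3, 8) uint8 &#8594; 8 bytes per vector, 64x smaller</code></pre>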
      <p>
          <a href="https://www.mlwhiz.com/p/vector-search-at-scale-the-missing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How YouTube Finds Your Next Video in Milliseconds]]></title><description><![CDATA[A deep dive into two-tower retrieval, in-batch negatives, and the tricks that make it work]]></description><link>https://www.mlwhiz.com/p/building-youtube-scale-recommendation</link><guid isPermaLink="false">https://www.mlwhiz.com/p/building-youtube-scale-recommendation</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Mon, 26 Jan 2026 04:56:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wHH_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wHH_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wHH_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png 424w, https://substackcdn.com/image/fetch/$s_!wHH_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png 848w, https://substackcdn.com/image/fetch/$s_!wHH_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!wHH_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wHH_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png" width="1024" height="565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1047997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/185466534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wHH_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png 424w, 
https://substackcdn.com/image/fetch/$s_!wHH_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png 848w, https://substackcdn.com/image/fetch/$s_!wHH_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png 1272w, https://substackcdn.com/image/fetch/$s_!wHH_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b8d41c-7f4a-4937-bc64-3c7087117843_1024x565.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is Part 4 of the <strong>RecSys for MLEs</strong> series. We&#8217;re now in the heart of the modern recommendation stack&#8212;the retrieval layer.</p><p>In my previous posts in this series, we discussed the <a href="https://www.mlwhiz.com/p/the-recommendation-engine-under-the">high-level architecture of recommendation systems</a>, <a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">the history of recommendation systems</a> and the <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">fundamentals of Recsys</a> . But theory only goes so far; nothing truly clicks until you see the implementation and understand the engineering trade-offs required at scale. 
Today, we&#8217;re diving into the dominant architecture powering candidate generation at YouTube, Pinterest, Airbnb, and virtually every large-scale system: <strong>The Two-Tower Model.</strong></p><h3>What We&#8217;ll Cover Today:</h3><ul><li><p><strong>The Scale Problem:</strong> Why the &#8220;brute force&#8221; approach to scoring fails at the billion-item scale.</p></li><li><p><strong>Architectural Decoupling:</strong> How the Two-Tower design allows for sub-millisecond retrieval through offline precomputation.</p></li><li><p><strong>YouTube&#8217;s Canonical Design:</strong> Learning from the features that defined the standard, including watch history and the &#8220;Example Age&#8221; trick.</p></li><li><p><strong>Training at Scale:</strong> Implementing <strong>In-Batch Negatives</strong> to turn a massive classification problem into a tractable one.</p></li><li><p><strong>Debiasing &amp; Optimization:</strong> Using <strong>LogQ Correction</strong> to fix popularity bias and <strong>Hard Negative Mining</strong> to improve model discrimination.</p></li><li><p><strong>Hands-on Implementation:</strong> A complete PyTorch (Code from Scratch) walkthrough using the MovieLens-1M dataset.</p></li></ul><p>By the end of this post, you&#8217;ll understand why two-tower models are the &#8220;gold standard&#8221; for scale and how to build one that actually generalizes to real-world users.</p><div><hr></div><h2><strong>The Scale Problem</strong></h2><p>Before diving into architecture, let&#8217;s understand the problem we&#8217;re solving. Imagine you&#8217;re building YouTube&#8217;s recommendation system.</p><p>You have:</p><ul><li><p><strong>2 billion+</strong> videos in your catalog</p></li><li><p><strong>2 billion+</strong> users visiting daily</p></li><li><p><strong>~100 milliseconds</strong> to return recommendations when someone opens the app</p></li></ul><p>The naive approach would be to score every video for every user. That requires 2 billion forward passes per request, because every item has to be scored for that one user. At 1 ms per inference, that&#8217;s <strong>23 days of compute. Per request.</strong> Obviously impossible. Would you wait 23 days to see your recommendations?</p><p>This is the fundamental issue with retrieval: you need to find the best items, but you can&#8217;t afford to look at all of them.</p><p><strong>Idea: The Two-Stage Solution</strong></p><p>The main idea behind a two-tower system is simple: why do everything in one step? We can break the problem into two parts:</p><ul><li><p><strong>Retrieval:</strong> Just find me the best candidates to score. It doesn&#8217;t need to be perfect &#8212; it just needs to not miss good candidates.</p></li><li><p><strong>Ranking</strong>: Score these candidates.</p></li></ul><p>This division of labor is what makes billion-scale recommendations tractable.</p><div><hr></div><h2><strong>But Why Are There Two Towers?</strong></h2><p>Two-tower models solve the retrieval speed problem with one clever insight: <strong>if user and item representations are independent, we can precompute item embeddings offline</strong>.</p><p>The key constraint is that <strong>the two towers never interact until the final dot product</strong>. No cross-features, no attention between user and item, no shared layers.
This seems limiting, but it&#8217;s exactly what enables the incredible scale achieved by such systems:</p><ul><li><p><strong>Offline precomputation</strong>: Compute all item embeddings once, store them in an index</p></li><li><p><strong>Fast online serving</strong>: At request time, only compute the user embedding (a single forward pass)</p></li><li><p><strong>Approximate nearest neighbor search</strong>: Use FAISS/ScaNN to find the top-K similar items in milliseconds</p></li></ul><div><hr></div><h2><strong>The DSSM Origins</strong></h2><p>I always love to look at the history that got us to the current state. Two-tower architectures trace back to Microsoft&#8217;s <strong><a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf">Deep Structured Semantic Model</a> (DSSM)</strong>, published in 2013 for web search.</p><p>The problem DSSM solved: given a search query &#8220;machine learning tutorials,&#8221; how do you find relevant documents among billions of web pages? Traditional approaches relied on keyword matching (BM25), but this misses semantic similarity &#8212; a document about &#8220;ML courses&#8221; is relevant even without exact word overlap.</p><p><em><strong>DSSM&#8217;s insight was to embed queries and documents into the same vector space, where semantic similarity corresponds to geometric proximity.</strong></em></p><p>Now, the same principle applies directly to recommendations:</p><ul><li><p><strong>Query &#8594; User</strong> (what they want)</p></li><li><p><strong>Document &#8594; Item</strong> (what we&#8217;re recommending)</p></li><li><p><strong>Relevance &#8594; Engagement probability</strong></p></li></ul><p>DSSM showed this could work at scale, and YouTube, Pinterest, and others adapted it for recommendations. The main thing to understand is that we are standing on the shoulders of giants.</p><div><hr></div><h2><strong>YouTube&#8217;s Deep Neural Network (2016)</strong></h2><p>Google&#8217;s &#8220;<a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf">Deep Neural Networks for YouTube Recommendations</a>&#8221; paper remains the canonical reference for production two-tower systems. It&#8217;s worth understanding in detail because its lessons transfer directly to any retrieval system you&#8217;ll build.</p><h3><strong>A. The Model Architecture</strong></h3>
      <p>
          <a href="https://www.mlwhiz.com/p/building-youtube-scale-recommendation">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The 3-Stage Funnel Behind Every Modern Recommender System]]></title><description><![CDATA[Two-Tower models, vector databases, cross-encoders&#8212;and how they work together at scale]]></description><link>https://www.mlwhiz.com/p/the-recommendation-engine-under-the</link><guid isPermaLink="false">https://www.mlwhiz.com/p/the-recommendation-engine-under-the</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Tue, 20 Jan 2026 05:00:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lDrm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lDrm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lDrm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lDrm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lDrm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lDrm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lDrm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lDrm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lDrm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!lDrm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lDrm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386fcbb2-0f42-4732-8cfc-9f1cf5af5fd5_1024x559.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You&#8217;re the lead engineer at YouTube. </p><p>A user just opened the app.</p><p>You have 200 milliseconds to return your recommendations to the user.</p><p>In that time, you need to scan 5 billion videos and surface the 10 they want to watch right now. Not 10 random videos. Not 10 popular videos. The <em>perfect</em> 10 for this specific user at this specific moment.</p><p>Miss the window? They close the app. </p><p>Show irrelevant content? They close the app. </p><p>Recommend something they watched yesterday? They close the app.</p><p>Here&#8217;s some envelope math: if your model takes just 10 milliseconds to score a single video, scoring the full catalog would take 500 days. And you have less than a second.</p><p>This is the brutal reality of production recommender systems. <em><strong>Training a state-of-the-art model is only 20% of the work. The other 80% is figuring out how to actually serve it.</strong></em></p><p>In <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">Post 1</a>, we covered the fundamental techniques of recsys systems. In <a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">Post 2</a>, we traced the history from collaborative filtering to deep learning. This post is about what happens after you have a model&#8212;how Google, Netflix, and Spotify serve recommendations to billions of users without melting their servers or losing their audience?</p><p>The answer isn&#8217;t a single clever algorithm. 
It&#8217;s a <a href="https://www.mlwhiz.com/p/crack-ml-system-design-interviews">system design</a> principle: <strong>don&#8217;t solve the whole problem at once.</strong></p><p>The architecture behind every massive recommender system is in essence just a pipeline of filters. It starts with a wide net and ends with sort of a microscope.</p><p>Instead of running your smartest model on every item, you split the problem into three stages&#8212;each with a different job and a different mathematical objective.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GtUj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GtUj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png 424w, https://substackcdn.com/image/fetch/$s_!GtUj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png 848w, https://substackcdn.com/image/fetch/$s_!GtUj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png 1272w, https://substackcdn.com/image/fetch/$s_!GtUj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GtUj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png" width="234" height="554.8064516129032" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2940,&quot;width&quot;:1240,&quot;resizeWidth&quot;:234,&quot;bytes&quot;:267316,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/184989949?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GtUj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png 424w, https://substackcdn.com/image/fetch/$s_!GtUj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png 848w, 
https://substackcdn.com/image/fetch/$s_!GtUj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png 1272w, https://substackcdn.com/image/fetch/$s_!GtUj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05c9f382-070b-4fac-bde1-e45e815a16b6_1240x2940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Stage 1 is Candidate Generation (The &#8220;Retrieval&#8221; Layer).</strong> The goal here is high recall&#8212;we need to go from billions of items down to just a few hundred fast. The idea is simple: we don&#8217;t care if we include some bad items, as long as we catch all or most of the relevant ones. To achieve this scale, we rely on fast, &#8220;approximate&#8221; algorithms like Vector Databases, Quantization and Two-Tower Models which we are going to discuss in this post.</p><p><strong>Stage 2 is Scoring (The &#8220;Ranking&#8221; Layer).</strong> Once we have whittled the list down to a few hundred, we switch our goal to high precision. Because the list is small, we can now afford to spend expensive compute power to analyze the items deeply. This is where we deploy our heavy Deep Learning models, such as Cross-Encoders, Transformers, MMOE Models to determine the exact order of preference.</p><p><strong>Stage 3 is Re-Ranking (The &#8220;Business&#8221; Layer).</strong> We now have the top dozen items, but we need to optimize for policy. The model might love these 10 items, but are they actually good for the product? This stage uses rule-based logic to handle diversity, fairness, and removing clickbait.</p><div><hr></div><h1>Stage 1: The Retrieval Layer (Candidate Generation)</h1><p>The goal of this layer is fast recall. Out of billions of items, grab a few hundred that are relevant to the user.</p><p>Note that perfect ordering doesn&#8217;t matter here. It&#8217;s fine if the 5th best item lands at position 10, or if some irrelevant items slip through. 
What&#8217;s not fine is missing the best items entirely&#8212;because if retrieval misses it, ranking never sees it and our models don&#8217;t get to learn about good items at all.</p><p>To achieve this we rely on a specific neural architecture that has become the industry standard for this task: the <strong>Two-Tower Model</strong> (also known as a Bi-Encoder or a Dual Encoder).</p><h2>The Two-Tower Architecture</h2><p>We briefly touched on this in <a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">Post 2</a>, but let&#8217;s look at it again a bit deeper this time.</p><p>A Two-Tower model basically consists of two independent deep neural networks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4F3k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4F3k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4F3k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg 848w, https://substackcdn.com/image/fetch/$s_!4F3k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4F3k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4F3k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4F3k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg 424w, https://substackcdn.com/image/fetch/$s_!4F3k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!4F3k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!4F3k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc15ca72e-e334-4cf2-8163-8a6b0ad52a35_1024x559.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>The User Tower (Query Tower):</strong> This network takes everything we know about the user right now&#8212;their watch history, their current location, the time of day, their declared interests&#8212;and passes these through several layers of a Deep Neural Network (DNN). The output is a single, dense vector (embedding), usually of a fixed size like 128 or 256 dimensions. Let&#8217;s call this vector <code>u</code></p></li><li><p><strong>The Item Tower (Candidate Tower):</strong> This network takes everything we know about an item&#8212;its title, description, tags, video thumbnails, audio transcript&#8212;and processes it through its own DNN. The output is also a single dense vector in the exact same mathematical space as the user vector. Let&#8217;s call this vector <code>v</code>.</p></li></ol><p><strong>How do they interact?</strong></p><p>During training, both towers learn to place users and items they like close together in vector space. We measure &#8220;closeness&#8221; with a dot product (or cosine similarity).</p><p>                                                          Score = u . v</p><p>If the score is high, the vectors point in similar directions, and it&#8217;s a good match.</p><h3>Training (InfoNCE Loss)</h3><p>How do we actually train this? 
We want the dot product </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{u} \\cdot \\mathbf{v}_{pos}&quot;,&quot;id&quot;:&quot;CPDTIOYYED&quot;}" data-component-name="LatexBlockToDOM"></div><p>to be high for items the user watched, and low for items they didn&#8217;t.</p><p>Standard classification is inefficient here because of the massive class imbalance (1 positive vs. 5 billion negatives). Instead, we use <strong>Softmax Cross-Entropy with In-Batch Negatives</strong> (often called the <strong>InfoNCE</strong> loss).</p><p>For a batch of size <code>B</code>, we treat the i-th user-item pair as positive, and <em>every other item in the batch</em> as a negative for that user.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = -\\sum_{i=1}^{B} \\log \\frac{\\exp(\\mathbf{u}_i \\cdot \\mathbf{v}_i / \\tau)}{\\sum_{j=1}^{B} \\exp(\\mathbf{u}_i \\cdot \\mathbf{v}_j / \\tau)}\n\n&quot;,&quot;id&quot;:&quot;YZQSBZNPEQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, &#964; (tau) is a temperature hyperparameter that can be tuned. This allows us to train efficiently on massive datasets without explicitly mining billions of negative samples.</p><h3>The &#8220;Hack&#8221;: Decouple these Towers</h3><p>So, we have trained our two-tower model. But if we ran both towers in real-time for every item, we&#8217;d gain nothing&#8212;still billions of forward passes.</p>
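<p>Before we get to that, here is a minimal PyTorch sketch of the in-batch-negatives (InfoNCE) loss described above. The tensors and shapes below are placeholders standing in for the tower outputs; it is an illustration, not the hands-on implementation from later in the post:</p><pre><code>import torch
import torch.nn.functional as F

def in_batch_infonce(user_emb, item_emb, tau=0.05):
    # user_emb, item_emb: [B, D] tensors where row i is a positive pair;
    # every other row of item_emb acts as a negative for user i.
    logits = user_emb @ item_emb.T / tau                          # [B, B] dot products
    labels = torch.arange(logits.size(0), device=logits.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)                        # softmax over the batch

# Toy usage with random "tower outputs"
B, D = 256, 128
loss = in_batch_infonce(torch.randn(B, D), torch.randn(B, D))
print(loss.item())</code></pre><p>With a batch of size <code>B</code>, every positive pair is contrasted against <code>B-1</code> negatives for free, which is exactly what makes this tractable at scale.</p>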
      <p>
          <a href="https://www.mlwhiz.com/p/the-recommendation-engine-under-the">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Recommendation Systems Learned to Think]]></title><description><![CDATA[Recsys Series Part 2: From collaborative filtering breakthroughs to generative AI agents that can chat about your preferences and explain their reasoning]]></description><link>https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender</link><guid isPermaLink="false">https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Sat, 04 Oct 2025 04:01:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!H0z2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Post 2 in my comprehensive RecSys series. In <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">Post 1</a>, we looked at the <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">fundamental techniques</a>&#8212;collaborative filtering, content-based filtering, and matrix factorization. But how did we get here? And where are we heading?</em></p><p>Picture this: It&#8217;s 1994, and the entire web has maybe 10,000 websites. A small team at the University of Minnesota launches something called GroupLens to help people find interesting Usenet articles. The system could barely handle a few thousand users, but it introduced a revolutionary idea: computers could predict what you&#8217;d like based on what similar people enjoyed.</p><p>Fast forward 30 years and we are here. </p><p>Netflix serves 230 million subscribers with AI-powered recommendations processing billions of interactions in real-time. Spotify&#8217;s Discover Weekly creates 40 million personalized playlists every single week. And now, ChatGPT-style models are starting to generate recommendations through natural conversation.</p><p>In this post, we&#8217;ll trace the complete evolution of recommendation systems, understand not just what happened but why it happened, and see how to implement the key innovations yourself. This historical perspective will give you the context to understand why certain approaches dominate different scenarios&#8212;knowledge that&#8217;s crucial for building effective systems today.</p><p>Ready to journey through 30 years of RecSys evolution? 
Let&#8217;s dive in.</p><div><hr></div><h2>The Pre-Internet Era: When Computers First Learned to Recommend</h2><h3>Grundy (1979): The First Digital Librarian</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H0z2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H0z2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!H0z2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!H0z2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!H0z2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H0z2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1701014,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/174500679?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H0z2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!H0z2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!H0z2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!H0z2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca892af1-5526-48ec-a7ac-93c721a064b2_1024x1024.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Okay, I will start with some history. It still blows my mind that Elaine Rich created what might be the world&#8217;s first recommender system, called <a href="https://www.cs.utexas.edu/~ear/CogSci.pdf">Grundy</a>, back in 1979. I mean, how awesome is it that someone thought about this problem in 1979, when personal computers barely existed. Forget computers&#8212;even data was a thing of the future!</p><p>This &#8220;computer librarian&#8221; would interview users about their reading preferences and then classify them into stereotypes like &#8220;mystery lover&#8221; or &#8220;sci-fi fan.&#8221;</p><p>Grundy&#8217;s approach was dead simple:</p><ol><li><p>Ask users direct questions about their preferences</p></li><li><p>Classify them into predefined categories</p></li><li><p>Recommend books from those categories</p></li></ol><p>Sure, it was basic, but Grundy solved a fundamental problem that still bugs us today: the cold start problem. When you have zero data about a new user, how do you make recommendations? <em><strong>Grundy&#8217;s solution:</strong></em> just ask them directly.</p><p>Grundy proved that explicitly asking about preferences could bootstrap recommendation systems. This insight is still everywhere today&#8212;Netflix&#8217;s thumbs up/down ratings, Spotify&#8217;s music taste onboarding, and TikTok&#8217;s initial &#8220;what interests you?&#8221; questions all trace back to Grundy&#8217;s core insight that sometimes you just need to ask users what they want.</p><div><hr></div><h3>Information Retrieval and the Rise of Content-Based Filtering</h3><p>While Grundy focused on user characteristics, researchers were working in parallel on analyzing item characteristics. Back in the 1960s, Gerard Salton introduced the Vector Space Model, which basically said &#8220;hey, what if we represent text documents as numerical vectors in a high-dimensional space?&#8221; Interestingly, Salton never actually wrote the paper he&#8217;s most famous for&#8212;there&#8217;s literally a paper called &#8220;<a href="https://www.ideals.illinois.edu/items/1790">The Most Influential Paper Gerard Salton Never Wrote</a>&#8221;. 
But this was an extraordinary discovery that still matters today when we think of documents as embedding vectors.</p><p>The trick to making this work was figuring out the weight of each term in the vector. This led to Term Frequency-Inverse Document Frequency (TF-IDF). Don&#8217;t let the fancy name scare you&#8212;it&#8217;s actually pretty intuitive:</p><ul><li><p><strong>Term Frequency (TF)</strong>: How often does a word appear in a document?</p></li></ul><ul><li><p><strong>Inverse Document Frequency (IDF)</strong>: How rare is this word across all documents?</p></li></ul><p>The idea is actually very intuitive. Words that appear frequently in a specific document but rarely elsewhere get high scores - these are the words that really define what that document is about. </p><p>This content-based approach worked well for its time, but it had a fundamental limitation: it could only recommend items similar to what you&#8217;d already consumed. If you only watched action movies, you&#8217;d never discover you might love romantic comedies. The system was trapped in what we call a &#8220;content bubble.&#8221;  </p><div><hr></div><h3>Tapestry: The Birth of &#8220;Collaborative Filtering&#8221;</h3><p>The big conceptual leap from content-based analysis to leveraging user communities happened at Xerox PARC in 1992. Researchers built the &#8220;<a href="http://www.bitsavers.org/pdf/xerox/parc/techReports/CSL-92-10_Using_Collaborative_Filtering_to_Weave_an_Information_Tapestry.pdf">Tapestry</a>&#8221; system to help users manage the crazy flow of electronic documents and emails. Now a user could create queries like, &#8220;Show me all documents that my colleague &#8216;dave&#8217; has marked as &#8216;important.&#8217;&#8221;</p><p>Tapestry coined the term &#8220;collaborative filtering&#8221; and introduced the idea that you could create filters based on other people&#8217;s actions. Pretty revolutionary stuff for 1992!</p><p>Tapestry proved that collective intelligence could work in digital systems. The insight that &#8220;people who agreed in the past will probably agree again in the future&#8221; became one of the foundational pillar for RecSys.</p><div><hr></div><h3>GroupLens: Automating the Wisdom of Crowds</h3><p>The first system to actually automate collaborative filtering was <a href="http://ccs.mit.edu/papers/ccswp165.html">GroupLens</a>, built in 1994 at the University of Minnesota. And this is actually where most literature starts coming into view. It was designed to help users navigate the flood of articles on Usenet newsgroups, and it transformed collaborative filtering from a manual process into something algorithmic.</p><p>Instead of manually deciding whose opinions to trust, GroupLens would automatically find users with similar tastes and use their preferences to make recommendations.</p><p>This became known as <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">User-Based Collaborative Filtering</a>:</p><ol><li><p>Find users similar to you</p></li><li><p>Look at what they liked that you haven&#8217;t seen</p></li><li><p>Recommend those items</p></li></ol><div><hr></div><h3>The Scalability Problem and Amazon&#8217;s Solution</h3><p>But there was a problem. As the web exploded, user-based collaborative filtering became painfully slow. Finding similar users for millions of people in real-time? Not happening.</p><p>GroupLens worked great for thousands of users, but by 2000, Amazon had millions. 
The user similarity matrix grows as O(n&#178;) - that&#8217;s terabytes for millions of users. </p><p>This scalability nightmare led to a brilliant innovation from <a href="https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf">Amazon</a>. They flipped the problem on its head with <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">item-based collaborative filtering</a> (Item-Item CF). </p><p><em><strong>Instead of &#8220;Users like you also liked...&#8221;, they said &#8220;Users who bought The Matrix also bought Blade Runner.&#8221;</strong></em></p><p>For most platforms, items &lt;&lt; users. Amazon in 2000 had millions of users but maybe hundreds of thousands of products. The genius move was to pre-compute item similarities offline, then at recommendation time, just do fast lookups. </p><div><hr></div><h3>The Matrix Factorization Era - Catalyzed by the Netflix Prize (2006-2009)</h3>
<div><hr></div><h3>The Matrix Factorization Era - Catalyzed by the Netflix Prize (2006-2009)</h3>
      <p>
          <a href="https://www.mlwhiz.com/p/the-algorithmic-journey-of-recommender">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[RecSys Fundamentals: The Art and Science of Digital Matchmaking]]></title><description><![CDATA[Recsys Series Part 1: Master the three core approaches that power every modern recommendation engine]]></description><link>https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms</link><guid isPermaLink="false">https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Sat, 27 Sep 2025 03:56:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tRx3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tRx3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tRx3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png 424w, https://substackcdn.com/image/fetch/$s_!tRx3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png 848w, https://substackcdn.com/image/fetch/$s_!tRx3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!tRx3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tRx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png" width="1396" height="1186" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1186,&quot;width&quot;:1396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3977815,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/174113804?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tRx3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png 424w, 
https://substackcdn.com/image/fetch/$s_!tRx3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png 848w, https://substackcdn.com/image/fetch/$s_!tRx3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png 1272w, https://substackcdn.com/image/fetch/$s_!tRx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1ba0231-c4a6-4c66-83fa-9f0087d46ea7_1396x1186.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Ever wonder how Netflix seems to read your mind, serving up that perfect binge-worthy series just when you need it? Or how Spotify crafts playlists that feel like they were made specifically for you? The magic behind these eerily accurate suggestions isn&#8217;t actually magic at all&#8212;it&#8217;s <strong>Recommendation Systems</strong>, or <strong>RecSys</strong> as we call them in the industry.</p><p>I&#8217;ve spent the last four years building these systems from the inside out&#8212;first at Meta, where I helped creators discover their audiences, and now at Roku, where I&#8217;m working to surface the most compelling content for your living room. What I&#8217;ve learned is that while these systems might seem like black magic, they&#8217;re actually built on surprisingly intuitive principles that anyone can understand.</p><p><strong>This is the first post in my comprehensive RecSys series</strong>, the purpose of which is to take you from complete beginner to recommendation system expert. </p><p>In this inaugural post, we&#8217;ll dive deep into the fundamental techniques that power every recommendation engine&#8212;collaborative filtering, content-based filtering, and hybrid approaches. 
More importantly, you&#8217;ll learn how to implement each technique in code and build your own working movie recommender from scratch with these fundamental approaches.</p><p>Future posts in this series will cover advanced topics like deep learning approaches, real-time systems, and scaling challenges. </p><p>But first, let&#8217;s understand the fundamentals.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PMpI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PMpI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png 424w, https://substackcdn.com/image/fetch/$s_!PMpI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png 848w, https://substackcdn.com/image/fetch/$s_!PMpI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png 1272w, https://substackcdn.com/image/fetch/$s_!PMpI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PMpI!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png" width="1200" height="600" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:453137,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/174113804?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!PMpI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png 424w, https://substackcdn.com/image/fetch/$s_!PMpI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png 848w, https://substackcdn.com/image/fetch/$s_!PMpI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png 1272w, 
https://substackcdn.com/image/fetch/$s_!PMpI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F069d864f-3233-4011-9cc1-fd60376867d1_3840x1920.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Yes, this is our playground here</figcaption></figure></div><p>Ready to peek behind the curtain? Let&#8217;s get started.</p><div><hr></div><h2>What&#8217;s a Recommendation System Anyway?</h2><p>At its heart, a RecSys is a sophisticated matchmaker between users and items. The &#8220;item&#8221; could be anything&#8212;a movie, a product, a news article, or even a potential friend on a dating app. The goal is elegantly simple: predict what you&#8217;ll love and show it to you before you even know you want it.</p><p>This creates a powerful win-win: users no longer have to fight infinite scroll paralysis or choice overload, and businesses get massive improvements in engagement, sales, and customer satisfaction.</p><p>But there&#8217;s no single &#8220;right way&#8221; to build this system. Just like there are different approaches to matchmaking in the real world, recommendation systems use different strategies to connect users with content.</p><div><hr></div><h2>The Three Main RecSys Playbooks</h2><p>We have three main battle-tested strategies, each with its own strengths and weaknesses.</p><h3>A. Content-Based Filtering</h3><p>Imagine a super-smart personal shopper who remembers everything you&#8217;ve ever liked. If you&#8217;re a fan of action flicks with Dwayne &#8220;The Rock&#8221; Johnson, this system will immediately start showing you more of his movies, regardless of what anyone else thinks of them.</p><h5>Step 1: Get Item Embeddings</h5><p>We represent each item as a feature vector. For a movie, this could include its genre, actors, director, or keywords from the plot. So we might create a matrix of features like this. 
This could in practice be a very complex matrix with TFIDF, transformers or what not:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TSz-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TSz-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png 424w, https://substackcdn.com/image/fetch/$s_!TSz-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png 848w, https://substackcdn.com/image/fetch/$s_!TSz-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!TSz-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TSz-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png" width="1456" height="735" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1572622,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/174113804?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TSz-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png 424w, https://substackcdn.com/image/fetch/$s_!TSz-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png 848w, https://substackcdn.com/image/fetch/$s_!TSz-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!TSz-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e97d69-5ff2-426b-85bf-5236e42d82a7_2174x1098.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex 
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h5>Step 2: Get profile </h5><p>Your personal profile is then built by taking a weighted average of the features of all the items you&#8217;ve enjoyed.</p><p>Assuming you enjoyed Jumanji and Fast And Furious:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dfqE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dfqE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png 424w, https://substackcdn.com/image/fetch/$s_!dfqE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png 848w, https://substackcdn.com/image/fetch/$s_!dfqE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png 1272w, https://substackcdn.com/image/fetch/$s_!dfqE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dfqE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png" width="1456" height="211" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:211,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:457943,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/174113804?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dfqE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png 424w, https://substackcdn.com/image/fetch/$s_!dfqE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png 848w, https://substackcdn.com/image/fetch/$s_!dfqE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png 1272w, https://substackcdn.com/image/fetch/$s_!dfqE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7a2f446f-3c9c-4c55-a98e-d2a4612ee59a_2168x314.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h5>Step 3: Get predictions</h5><p>The system recommends new items that are &#8220;similar&#8221; to your profile. We often use <strong>cosine similarity</strong> for this, which essentially measures the angle between your profile vector and the item&#8217;s vector. 
A smaller angle means they&#8217;re a closer match.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wtC8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wtC8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png 424w, https://substackcdn.com/image/fetch/$s_!wtC8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png 848w, https://substackcdn.com/image/fetch/$s_!wtC8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png 1272w, https://substackcdn.com/image/fetch/$s_!wtC8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wtC8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png" width="1082" height="192" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:192,&quot;width&quot;:1082,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34151,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/174113804?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wtC8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png 424w, https://substackcdn.com/image/fetch/$s_!wtC8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png 848w, https://substackcdn.com/image/fetch/$s_!wtC8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png 1272w, https://substackcdn.com/image/fetch/$s_!wtC8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6257d1d2-1715-4083-b827-e902f52a3459_1082x192.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>
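<p>Putting the three steps together, here&#8217;s a rough end-to-end sketch. The movies, genre weights and ratings below are made up for illustration; a real system would swap the hand-built features for TF-IDF scores or learned embeddings:</p>
<pre><code class="language-python">import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: item embeddings (hand-made genre features, invented for illustration).
#                      action  comedy  family  romance
item_features = np.array([
    [0.8, 0.7, 0.9, 0.0],   # Jumanji
    [1.0, 0.2, 0.0, 0.1],   # Fast And Furious
    [0.0, 0.8, 0.1, 1.0],   # a romantic comedy
    [0.9, 0.1, 0.0, 0.2],   # another action movie
])
titles = ["Jumanji", "Fast And Furious", "Rom-com", "Action flick"]

# Step 2: the user profile is a rating-weighted average of liked items' features.
liked = [0, 1]                          # you enjoyed Jumanji and Fast And Furious
liked_ratings = np.array([5.0, 4.0])    # hypothetical ratings used as weights
profile = liked_ratings @ item_features[liked] / liked_ratings.sum()

# Step 3: rank items by cosine similarity between your profile and each item vector.
scores = cosine_similarity(profile.reshape(1, -1), item_features).ravel()
scores[liked] = -np.inf                 # don't recommend what you've already watched
print(titles[int(np.argmax(scores))])
</code></pre>
<p>With these made-up numbers the profile leans heavily toward action, so the unseen action movie outranks the rom-com.</p>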
      <p>
          <a href="https://www.mlwhiz.com/p/the-recommenders-playbook-algorithms">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[What Production Deployments Taught Me About ReAct vs Function Calling]]></title><description><![CDATA[From Prototype to Production: Hard-Won Lessons for AI Agent Development]]></description><link>https://www.mlwhiz.com/p/what-production-deployments-taught</link><guid isPermaLink="false">https://www.mlwhiz.com/p/what-production-deployments-taught</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Sun, 21 Sep 2025 01:52:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!XJPC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XJPC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XJPC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png 424w, https://substackcdn.com/image/fetch/$s_!XJPC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png 848w, https://substackcdn.com/image/fetch/$s_!XJPC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png 1272w, https://substackcdn.com/image/fetch/$s_!XJPC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XJPC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png" width="1456" height="783" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:461872,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/170018714?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XJPC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png 424w, 
https://substackcdn.com/image/fetch/$s_!XJPC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png 848w, https://substackcdn.com/image/fetch/$s_!XJPC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png 1272w, https://substackcdn.com/image/fetch/$s_!XJPC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a90ad92-f9c1-4abe-80b1-941f991a2d03_2680x1442.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Building an AI agent that works in a demo is easy. Building one that works reliably in production? That's where things get interesting.</p><p>After deploying a few AI agents to production, I've learned that the gap between "it works on my machine" and "it works for real users" is enormous.</p><p>The core challenge isn't just making AI agents that can reason and act. It's making them fast, reliable, cost-effective, and debuggable when they inevitably break. It's handling edge cases, managing token costs, implementing proper logging, and building fallback systems that gracefully handle failures.</p><p><em><strong>This guide isn't another theoretical overview of ReAct and Function Calling. It's a practical deep-dive into the production realities of building AI agents that scale. We'll cover the fundamental patterns, but more importantly, we'll dive into the hard-won lessons that only surface when real users start hammering your systems with unexpected queries, network timeouts, and edge cases you never considered.</strong></em></p><p>By the end, you'll understand not just how to build AI agents, but how to build ones that survive contact with production. Because in the real world, the difference between a working agent and a reliable agent is everything.</p>
      <p>
          <a href="https://www.mlwhiz.com/p/what-production-deployments-taught">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The 5 Principles That Turn ML Projects Into Business Impact]]></title><description><![CDATA[Why 70% of machine learning projects fail to deliver value, and the engineering-first approach that ensures yours won't be one of them]]></description><link>https://www.mlwhiz.com/p/the-5-principles-that-turn-ml-projects</link><guid isPermaLink="false">https://www.mlwhiz.com/p/the-5-principles-that-turn-ml-projects</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Fri, 18 Jul 2025 23:15:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!X7re!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53a167ef-2ba9-4191-a98b-b947c6245251_1846x1764.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I was at Meta, I got to learn about a simple word: <strong>IMPACT</strong>. This word was literally everywhere. You want to do something, and your manager will ask: But Rahul, what&#8217;s the Impact? You cannot do anything at Meta without getting this word thrown at you. Honestly, while I hated it back then, it now makes me think hard before focusing on any problem at hand. </p><p>As machine learning engineers, we aim to build sophisticated models, fine-tune complex architectures, and push the boundaries of algorithmic performance. Yet, a harsh reality persists: <em><strong>a staggering number of our projects never deliver tangible business value.</strong></em> After spending years of my life on hundreds of ML deployments, I&#8217;ve noticed that up to 70% of projects falter not on the algorithm, but in the gap between the model and the real world.</p><p>And the secret to success isn't a more complex model or a novel architecture. It's found in a disciplined, engineering-first approach that prioritizes smart problem selection, continuous iteration, and a relentless focus on production realities.</p><div><hr></div><h3>The Issues with Current ML</h3><p>Alright, let's get to the gist of what I am saying. Honestly, I have noticed that the technical challenge of training a model is often the easiest part of the journey. The real hurdles are the not-so-fun, often-overlooked steps that come before and after. Here&#8217;s where most projects stop delivering Impact:</p><ul><li><p><strong>Poor Problem Formulation:</strong> We start with "Let's build a recommender system" instead of "We need to increase user engagement by 15% on product pages, which translates to $X in new revenue." We chase a solution before we've even defined the problem.</p></li></ul>
      <p>
          <a href="https://www.mlwhiz.com/p/the-5-principles-that-turn-ml-projects">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The #1 Reason Your GenAI Project Will Fail in Production (and the 4 Pillars to Prevent It)]]></title><description><![CDATA[A Technical Guide to Overcoming Operational Nightmares and Exploding Costs in LLM Deployments]]></description><link>https://www.mlwhiz.com/p/from-prototype-to-production-mlops</link><guid isPermaLink="false">https://www.mlwhiz.com/p/from-prototype-to-production-mlops</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Wed, 09 Jul 2025 02:56:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!V50E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75841406-9bf4-4174-b601-214c0aa6f9e7_2476x1876.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So, you've done it. You've wrangled with a Large Language Model, stitched together a slick <a href="https://www.mlwhiz.com/p/streamlit?utm_source=publication-search">Streamlit</a> or Gradio demo, and your GenAI prototype is the talk of the company.</p><p>The VPs are happy. The product managers are already thinking up new features to add. Everyone is convinced that this system is about to print money.</p><p><strong>And you, my friend, are about to enter a world of pain.</strong></p><p>The journey from a clever prototype to a scalable, reliable, and&#8212;most importantly&#8212;not-bankrupting production application is hard. The very things that make generative models so magical also make them an operational nightmare. </p><p>Welcome to <strong>GenAIOps or LLMOps</strong>.</p><p>I've worked on productionizing GPT-powered apps, and let me tell you: the old playbook doesn't work. At all.</p><p>This post covers what I wish someone had told me before I learned these lessons the hard way, by failing and firefighting production issues more than once. We'll dig into the MLOps practices that are required for GenAI, plus the cost optimization tricks that'll save you from explaining a five-figure OpenAI bill to your manager. </p><p>So, let&#8217;s get started.</p><p><em><strong>Are you ready to level up your LLM skills? Check out the <a href="https://imp.i384100.net/QjGrWz">Generative AI Engineering with LLMs Specialization</a> on Coursera! This comprehensive program takes you from LLM basics to advanced production engineering, covering RAG, fine-tuning, and building complete AI systems. Want to go from dabbling to deployment? 
This is your next step!</strong></em></p><div><hr></div><h2>The New Frontier: Why Your Old MLOps Playbook is Obsolete</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S8-a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S8-a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png 424w, https://substackcdn.com/image/fetch/$s_!S8-a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png 848w, https://substackcdn.com/image/fetch/$s_!S8-a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!S8-a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S8-a!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png" width="1200" height="1037.6373626373627" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1259,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:549324,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.mlwhiz.com/i/167553258?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S8-a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png 424w, https://substackcdn.com/image/fetch/$s_!S8-a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png 848w, https://substackcdn.com/image/fetch/$s_!S8-a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png 1272w, https://substackcdn.com/image/fetch/$s_!S8-a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80d05190-39a5-47aa-b572-1287b7940e10_1850x1600.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Ok, so as I said, your previous MLOps setup? Yeah, it's gotten obsolete now.</p><p>You cannot try to jam a RAG chatbot into the same deployment pipeline you use for a fraud detection model. This new technology needs a totally new infra as well as a new style of thinking around it. </p><p><strong>This fundamental operational gap &#8212; the lack of a robust GenAIOps strategy tailored for generative AI &#8212; is the #1 reason why promising GenAI projects fail in production.</strong></p><p>But what has exactly changed, and why does it matter for your production deployment?</p>
      <p>
          <a href="https://www.mlwhiz.com/p/from-prototype-to-production-mlops">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Join my new subscriber chat]]></title><description><![CDATA[A private space for us to converse and connect]]></description><link>https://www.mlwhiz.com/p/join-my-new-subscriber-chat</link><guid isPermaLink="false">https://www.mlwhiz.com/p/join-my-new-subscriber-chat</guid><dc:creator><![CDATA[Rahul Agarwal]]></dc:creator><pubDate>Sat, 05 Jul 2025 02:53:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KYZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0f63c9a-2296-4c96-a2f9-52648999bb00_2000x1000.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today I&#8217;m announcing a brand new addition to my Substack publication: MLWhiz | AI Unwrapped subscriber chat.</p><p>This is a conversation space exclusively for subscribers&#8212;kind of like a group chat or live hangout. I&#8217;ll post questions and updates that come my way, and you can jump into the discussion.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://open.substack.com/pub/mlwhiz/chat&quot;,&quot;text&quot;:&quot;Join chat&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://open.substack.com/pub/mlwhiz/chat"><span>Join chat</span></a></p>
      <p>
          <a href="https://www.mlwhiz.com/p/join-my-new-subscriber-chat">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>