MLWhiz | AI Unwrapped

MLWhiz Weekly AI/ML Newsletter #1

Here is what happened this week.

Rahul Agarwal
Mar 11, 2026

🏆 Story of the Week: The AI Governance War Just Got Real

This was the week AI governance stopped being an abstract policy debate and started showing up as lost contracts, executive resignations, and a company being blacklisted by the US government.

The sequence of events reads like a thriller. The Pentagon demanded that AI labs agree to “all lawful use” of their models — including mass domestic surveillance and fully autonomous lethal weapons. Anthropic refused, citing specific ethical red lines around human oversight of lethal force and mass surveillance of Americans without judicial oversight. The Pentagon’s response: terminate a $200M contract, formally designate Anthropic a “supply chain risk” — a designation never previously applied to an American company — and hand the contract to OpenAI. By mid-week, federal agencies (Treasury, State, HHS) were actively phasing out Claude and switching to Grok and Codex, which had accepted the terms. The GSA removed Anthropic from USAi.gov entirely.

Then the story got stranger. OpenAI took the deal, but quietly walked back parts of it after backlash — adding two sentences about human oversight of lethal force. On March 7, Caitlin Kalinowski, OpenAI’s head of robotics and consumer hardware, resigned over the Pentagon deal. She named what nobody in executive suites wanted to name: surveillance without judicial oversight, lethal autonomy without human authorization. Meanwhile, Anthropic’s own tech was still powering live strike operations against Iran — inside Palantir’s systems, which had apparently integrated Claude before the ban took effect. The irony was sharp: the lab that held the line against military AI was more embedded in active combat than the one that accepted the contract.

What this week exposed is that “AI safety” and “AI ethics” are no longer differentiators you can just put in a marketing brief. They’re now business risks. If you hold the line, you lose government revenue and get blacklisted. If you don’t, you lose your own people. There’s no clean version of this story, and both OpenAI and Anthropic came out of the week looking different than they went in. The principle at stake — whether AI companies can negotiate safety terms with the US military — will define which labs can scale in government markets for the next decade, and at what moral cost. Watch Anthropic’s legal challenge closely. It’s the most important case in AI governance since... well, ever.

🔗 LA Times — Anthropic Vows Legal Fight

🔗 Forbes — OpenAI’s Robotics Chief Resigns Over Pentagon Deal

🔗 CBS News — Pentagon-Anthropic Feud

🔗 Nextgov — Agencies Begin Shedding Anthropic Contracts


🤖 Models That Dropped This Week

GPT-5.4 and GPT-5.4 Pro (OpenAI, March 5–6) — The headline capability is native computer use built into the base model, not bolted on. On OSWorld-Verified it hit 75.0%, above the human baseline of 72.4% and up from 47.3% for GPT-5.2. The full package: 1M-token context, the coding ability of GPT-5.3-Codex folded in (57.7% on SWE-Bench Pro), tool search that cuts token usage by 47% in tests, and a thinking mode that shows you its plan upfront. ARC-AGI-2 went from 54.2% (GPT-5.2 Pro) to 83.3% (GPT-5.4 Pro) — a 29-point jump in one generation. Artificial Analysis scores it 57 on their Intelligence Index, up from 51. Cost: $2.50/$15 per 1M input/output tokens vs. $1.75/$14 for GPT-5.2. 🔗 OpenAI announcement
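The pricing delta is easier to judge on a concrete workload. A quick back-of-the-envelope in Python (the per-million rates come from the announcement; the monthly token volumes below are a made-up assumption):

```python
# Rough cost comparison at the published per-1M-token rates.
# The rates come from the announcement; the workload is a made-up assumption.
PRICES = {  # $ per 1M tokens
    "gpt-5.4": {"in": 2.50, "out": 15.00},
    "gpt-5.2": {"in": 1.75, "out": 14.00},
}

def cost(model: str, in_tokens: float, out_tokens: float) -> float:
    """Dollar cost for a given number of input/output tokens."""
    p = PRICES[model]
    return (in_tokens * p["in"] + out_tokens * p["out"]) / 1_000_000

# Hypothetical workload: 50M input + 10M output tokens per month.
for model in PRICES:
    print(f"{model}: ${cost(model, 50e6, 10e6):,.2f}/month")
# gpt-5.4: $275.00/month vs. gpt-5.2: $227.50/month, roughly 21% more.
```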

Gemini 3.1 Flash-Lite (Google DeepMind, March 3) — The sharpest answer yet to the high-volume inference use case: $0.25/M input tokens, 2.5× faster time-to-first-token than Gemini 2.5 Flash, 86.9% on GPQA Diamond (strong for this price tier), and 1432 Elo on LMArena. The adjustable thinking levels feature is the practical standout — one model handles both cheap classification and heavier reasoning by dialing a single parameter. 🔗 Google Blog
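If the adjustable thinking levels work the way the thinking-budget knob on today's 2.5-series Flash models does, per-request dialing would look roughly like this. A minimal sketch with the google-genai SDK, assuming the model id below and that the existing ThinkingConfig parameter carries over — neither is confirmed:

```python
# Sketch of dialing "thinking" per request with the google-genai SDK.
# Assumptions: the model id, and that Gemini 3.1 Flash-Lite keeps the
# ThinkingConfig knob that the 2.5-series Flash models expose today.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

def ask(prompt: str, thinking_budget: int) -> str:
    resp = client.models.generate_content(
        model="gemini-3.1-flash-lite",  # assumed id, not confirmed
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=thinking_budget),
        ),
    )
    return resp.text

# Cheap classification: no thinking. Harder reasoning: a larger budget.
label = ask("Classify the sentiment: 'the checkout flow is broken again'", 0)
plan = ask("Plan a 3-step migration from REST to gRPC for this service", 2048)
```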

GPT-5.3 Instant (OpenAI, March 3–4) — Now the default ChatGPT model. Fewer brittle refusals, better web synthesis, reduced hallucinations on flagged prompts. A polish update, not a capability leap, but the explicit direction toward helpfulness over caution is notable. 🔗 AI Business

Phi-4-reasoning-vision-15B (Microsoft, March 6) — A 15B-parameter multimodal model with a dedicated reasoning architecture. Positioned as the “sweet spot” for production agents where a frontier model is overkill but real reasoning matters. Runs on a single A100. 🔗 HuggingFace
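For anyone wanting to kick the tires, a single-GPU load would presumably follow the pattern of earlier Phi multimodal releases. A minimal sketch, assuming the Hugging Face repo id below (guessed from the model name, not confirmed):

```python
# Minimal single-GPU load sketch. The repo id is guessed from the model name
# and the call pattern follows earlier Phi multimodal releases; both are
# assumptions, not confirmed details.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Phi-4-reasoning-vision-15B"  # assumed repo id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~30 GB of weights at bf16 fits one A100
    device_map="cuda",
    trust_remote_code=True,
)

# Text-only prompt shown here; pass a PIL image via `images=` for vision input.
inputs = processor(text="Summarize the reasoning steps needed to debug a "
                        "flaky integration test.", return_tensors="pt")
inputs = inputs.to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```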


🧠 Papers That Matter

Scaling Laws for Reranking in Information Retrieval — The first systematic study of how rerankers scale with model size, data, and compute in multi-stage retrieval. Key finding: Scaling the reranker isn’t always the right move — there are inflection points where adding compute to first-stage retrieval outperforms a bigger reranker, and the optimal candidate-set/reranker-size combination is non-obvious.
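For readers who haven't built one, the setup under study is retrieve-then-rerank: a cheap first stage proposes k candidates and an expensive cross-encoder rescores only those k, so the candidate-set size and the reranker's size are the two budget knobs the paper trades off. A minimal sketch with sentence-transformers (the model names are illustrative defaults, not from the paper):

```python
# Retrieve-then-rerank: a cheap bi-encoder proposes k candidates, then a
# heavier cross-encoder rescores only those k. The paper's question is how
# to split a fixed budget between k and the reranker's size.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")              # stage 1
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stage 2

def search(query: str, corpus: list[str], k: int = 100, top_n: int = 10):
    # Stage 1: fast recall over the whole corpus.
    corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
    query_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: expensive pairwise scoring on just the k candidates.
    scores = reranker.predict([(query, c) for c in candidates])
    return sorted(zip(candidates, scores), key=lambda x: -x[1])[:top_n]
```

Growing k improves recall but multiplies the reranker's cost linearly, which is exactly why the paper finds inflection points where spending that compute on the first stage wins instead.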

