MLWhiz Weekly AI/ML Newsletter #1
Here is what happened this week.
🏆 Story of the Week: The AI Governance War Just Got Real
This was the week AI governance stopped being an abstract policy debate and started showing up as lost contracts, executive resignations, and a company being blacklisted by the US government.
The sequence of events reads like a thriller. The Pentagon demanded that AI labs agree to “all lawful use” of their models, including mass domestic surveillance and fully autonomous lethal weapons. Anthropic refused, citing two specific ethical red lines: human oversight of lethal force, and no mass surveillance of Americans without judicial review. The Pentagon’s response: terminate a $200M contract, formally designate Anthropic a “supply chain risk” — a designation never previously applied to an American company — and hand the contract to OpenAI. By mid-week, federal agencies (Treasury, State, HHS) were actively phasing out Claude and switching to Grok and Codex, which had accepted the terms. The GSA removed Anthropic from USAi.gov entirely.
Then the story got stranger. OpenAI took the deal, but quietly walked back parts of it after backlash — adding two sentences about human oversight of lethal force. On March 7, Caitlin Kalinowski, OpenAI’s head of robotics and consumer hardware, resigned over the Pentagon deal. She named what nobody in executive suites wanted to name: surveillance without judicial oversight, lethal autonomy without human authorization. Meanwhile, Anthropic’s own tech was still powering strikes against Iran, inside Palantir’s systems, which had apparently integrated Claude before the ban took effect. The irony was sharp: the lab that held the line against military AI was more embedded in active combat than the one that accepted the contract.
What this week exposed is that “AI safety” and “AI ethics” are no longer differentiators you can just put in a marketing brief. They’re now business risks. If you hold the line, you lose government revenue and get blacklisted. If you don’t, you lose your own people. There’s no clean version of this story, and both OpenAI and Anthropic came out of the week looking different than they went in. The principle at stake — whether AI companies can negotiate safety terms with the US military — will define which labs can scale in government markets for the next decade, and at what moral cost. Watch Anthropic’s legal challenge closely. It’s the most important case in AI governance since... well, ever.
🔗 LA Times — Anthropic Vows Legal Fight
🔗 Forbes — OpenAI’s Robotics Chief Resigns Over Pentagon Deal
🔗 CBS News — Pentagon-Anthropic Feud
🔗 Nextgov — Agencies Begin Shedding Anthropic Contracts
🤖 Models That Dropped This Week
GPT-5.4 and GPT-5.4 Pro (OpenAI, March 5–6) — The headline capability is native computer use built into the base model, not bolted on. On OSWorld-Verified it hit 75.0%, above human-level (72.4%) and up from 47.3% for GPT-5.2. The full package: 1M token context, coding ability from GPT-5.3-Codex folded in (57.7% on SWE-Bench Pro), tool search that cuts token usage by 47% in tests, and a thinking mode that shows you its plan upfront. ARC-AGI-2 went from 54.2% (GPT-5.2 Pro) to 83.3% (GPT-5.4 Pro) — a 29-point jump in one generation. Artificial Analysis gives it a score of 57 on their Intelligence Index (up from 51). Cost: $2.50/$15 per 1M in/out tokens vs. $1.75/$14 for GPT-5.2. 🔗 OpenAI announcement
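If you're doing the upgrade math, a quick cost sketch using only the per-token prices quoted above; the workload mix (calls per month, tokens per call) is invented purely for illustration:

```python
# Back-of-envelope cost comparison from the quoted per-1M-token prices.
# The workload mix below (100k calls, 4k in / 800 out tokens) is made up.

PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.2": (1.75, 14.00),
}

def monthly_cost(model: str, calls: int, in_tok: int, out_tok: int) -> float:
    """Dollar cost for `calls` requests of in_tok/out_tok tokens each."""
    p_in, p_out = PRICES[model]
    return calls * (in_tok * p_in + out_tok * p_out) / 1_000_000

for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 100_000, 4_000, 800):,.0f}/month")
# gpt-5.4: $2,200/month vs. gpt-5.2: $1,820/month, a ~21% premium here.
```

Note the asymmetry: input pricing rose ~43% while output rose ~7%, so prompt-heavy workloads (long-context RAG against that new 1M window) will feel the bump hardest.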
Gemini 3.1 Flash-Lite (Google DeepMind, March 3) — The sharpest answer to the high-volume inference use case. $0.25/M input tokens, 2.5× faster time-to-first-token (TTFT) than Gemini 2.5 Flash, 86.9% on GPQA Diamond (strong for this price tier), 1432 Elo on LMArena. The adjustable thinking levels feature is the practical standout — one model handles both cheap classification tasks and heavier reasoning by dialing a parameter. 🔗 Google Blog
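The thinking dial maps naturally onto the thinking-budget config the google-genai SDK already exposes for Gemini 2.5. A minimal sketch assuming 3.1 Flash-Lite keeps the same config surface; the model id string is my guess, not confirmed:

```python
# pip install google-genai  (assumes GEMINI_API_KEY is set in the env)
from google import genai
from google.genai import types

client = genai.Client()

def ask(prompt: str, thinking_budget: int) -> str:
    # budget 0 = no reasoning tokens (cheap classification path);
    # a larger budget buys deeper reasoning from the same model.
    resp = client.models.generate_content(
        model="gemini-3.1-flash-lite",  # assumed id; check the docs
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=thinking_budget)
        ),
    )
    return resp.text

print(ask("Sentiment of 'great battery, awful screen'?", 0))        # cheap
print(ask("Plan a three-step migration from REST to gRPC.", 2048))  # deep
```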
GPT-5.3 Instant (OpenAI, March 3–4) — Now the default ChatGPT model. Fewer brittle refusals, better web synthesis, reduced hallucinations on flagged prompts. A polish update, not a capability leap, but the explicit direction toward helpfulness over caution is notable. 🔗 AI Business
Phi-4-reasoning-vision-15B (Microsoft, March 6) — 15B multimodal with dedicated reasoning architecture. Positioned as the “sweet spot” for production agents where frontier models are overkill but real reasoning matters. Runs on one A100. 🔗 HuggingFace
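No quickstart in the card yet as far as I can tell, so here's a minimal loading sketch modeled on previous Phi multimodal releases. The repo id, processor behavior, and trust_remote_code requirement are all assumptions; check the actual HuggingFace card:

```python
# pip install transformers accelerate torch
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Phi-4-reasoning-vision-15B"  # assumed repo id

# 15B params in bf16 is roughly 30 GB of weights, hence "runs on one A100".
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = processor(text="What is 17 * 24? Think step by step.",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```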
🧠 Papers That Matter
Scaling Laws for Reranking in Information Retrieval — The first systematic study of how rerankers scale with model size, data, and compute in multi-stage retrieval. Key finding: Scaling the reranker isn’t always the right move — there are inflection points where adding compute to first-stage retrieval outperforms a bigger reranker, and the optimal candidate-set/reranker-size combination is non-obvious.
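To see why the optimum is non-obvious, here's a toy sweep over the search space the paper formalizes: (candidate-set size k, reranker size) pairs under one latency budget. Every number below is synthetic; only the shape of the trade-off is the point:

```python
# Toy feasibility sweep for a two-stage retrieve-then-rerank pipeline.
# Latency numbers are synthetic stand-ins, not measurements.

BUDGET_MS = 50.0
RERANKERS = {"small": 0.04, "base": 0.15, "large": 0.40}  # ms per candidate

def first_stage_ms(k: int) -> float:
    return 2.0 + 0.002 * k  # ANN retrieval cost grows slowly with k

feasible = []
for name, per_candidate in RERANKERS.items():
    for k in (50, 100, 200, 500, 1000):
        total = first_stage_ms(k) + per_candidate * k
        if total <= BUDGET_MS:
            feasible.append((total, name, k))

for total, name, k in sorted(feasible):
    print(f"reranker={name:<5} k={k:<4} latency={total:5.1f} ms")
# "large" only fits with k <= 100, while "small" can afford k=1000.
# Which feasible point wins on quality is exactly what the paper's
# scaling study measures, and where its inflection points show up.
```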
RAG Fusion in Production: Lessons from an Industry Deployment — Multi-query retrieval with RRF (reciprocal rank fusion) increases raw recall, but the improvement largely evaporates once a fixed-budget reranker is applied; Hit@10 actually dropped from 0.51 to 0.48 in several configurations versus a single-query baseline. The evaluation framework is the key contribution: measure end-to-end under your actual constraints, not recall in isolation. Required reading before you add multi-query fusion to any production RAG stack.
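For reference, the fusion step itself is tiny; the paper's warning is about what happens downstream of it. A minimal reciprocal rank fusion sketch (k=60 is the conventional constant from the original RRF paper):

```python
# Minimal reciprocal rank fusion: score(d) = sum_q 1 / (k + rank_q(d)).
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse per-query ranked doc-id lists into a single ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query rewrites -> three ranked lists from the first-stage retriever.
print(rrf([
    ["d3", "d1", "d7"],
    ["d1", "d3", "d9"],
    ["d7", "d1", "d2"],
])[:3])  # ['d1', 'd3', 'd7']: d1 wins on consistently high ranks
```

Everything after this call (dedup, truncating to the reranker's fixed candidate budget) is where the paper finds the recall gains get squeezed back out.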
SORT: Systematically Optimized Ranking Transformer for Industrial-Scale Recommenders — What makes this notable is the rare combination: +6.35% orders and +5.97% GMV in real A/B tests alongside a 44.67% reduction in serving latency and 121.33% throughput improvement. If you’re running ranking pipelines at scale and have been burned by the transformers-don’t-work-in-prod story, read the system design choices in this paper carefully.
Behind the Prompt: The Agent-User Problem in Information Retrieval — As AI agents increasingly act as the “user” in retrieval systems, classical IR’s core assumption — that observed behavior reveals human intent — mathematically breaks down. The paper proves this non-identifiability isn’t a detection problem awaiting a better classifier; it’s structural. With Claude handling 50% of coding use cases and agentic traffic growing fast, this is a foundational issue for everyone building or evaluating retrieval systems. One to read carefully.
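For the shape of the argument, in the standard statistical sense (my notation, not necessarily the paper's): non-identifiability means two distinct latent intents can induce exactly the same distribution over observable behavior, so no classifier trained on behavior alone can separate them.

```latex
% Non-identifiability, standard form (notation mine):
% \theta_h = human intent, \theta_a = agent policy, b = observed behavior.
\[
  \exists\, \theta_h \neq \theta_a \quad \text{s.t.} \quad
  P(b \mid \theta_h) \;=\; P(b \mid \theta_a) \quad \forall\, b .
\]
% When the induced behavior distributions coincide, any detector f(b)
% carries irreducible error; more training data cannot separate them.
```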
📝 Some Good Reads
Donald Knuth: “Claude’s Cycles” — The moment of the week outside the governance story. Knuth, whose skepticism of generative AI was on record, opened his note with “Shock! Shock!” after Claude Opus 4.6 solved an open combinatorics problem he’d been working on for weeks: finding a general construction for decomposing odd-sized 3D directed graphs into Hamiltonian cycles. Over 90 minutes and 31 systematic explorations, Claude found it. Knuth proved it formally. He coined the term “Claude-like decompositions” and concluded: “It seems I’ll have to revise my opinions about generative AI one of these days.” This category of evidence — real researchers, real problems, real results — matters more than any benchmark. Stanford PDF | Simon Willison
Cursor’s “Third Era” (Latent Space) — Agent usage at Cursor now outnumbers Tab autocomplete 2:1. More than one-third of internal PRs are written by cloud agents running in dedicated VMs. The company also acquired Graphite and Autotab. At $2B+ ARR, this is the clearest evidence yet that “agentic coding” has crossed from curiosity to workflow primitive. The chart showing the ratio inversion is the kind of inflection point data that will look obvious in retrospect. Latent Space
💡 What This Week Was Really About
AI governance became a hard business constraint, not a soft value statement. The Anthropic blacklisting created a fork in the AI industry: labs that accept government use-case terms without guardrails get federal revenue; labs with ethical limits don’t. OpenAI lost a senior executive: the other side of the same coin. This bifurcation — government-accessible AI vs. principled AI — will define market positioning for years. The Pentagon precedent is being set right now.
Computer use crossed a threshold. GPT-5.4 hitting above human-level on OSWorld-Verified (75.0% vs. 72.4%) is the kind of benchmark that means agents can now reliably navigate desktop environments. The bottleneck on automation shifts from “can the model do it” to “do you trust it enough to let it.” Combine that with Cursor’s 2:1 agent-to-tab ratio and it’s clear: agentic work is the baseline, not the frontier.
The best open-source model family lost its architects. Junyang Lin, Yu Bowen, and Hui Binyuan — the three people most responsible for Qwen’s run as the dominant open-source model family — all left Alibaba within weeks of each other after an internal reorganization. Whether Qwen 3.5 becomes a swan song or a temporary dip depends on how fast Alibaba can rebuild. The open-source ecosystem is less robust than it looked three months ago.
⚡ Quick Hits
vLLM v0.17.0 — FlashAttention 4, 30.8% throughput gains with async scheduling, full Qwen3.5 support, Realtime WebSocket API for audio. Upgrade is worth it if you’re self-hosting inference. GitHub
Block (Square) cut 40% of workforce citing AI productivity — 10,000 → under 6,000 employees, with Q4 gross profit up 24% YoY. One of the clearest examples of a profitable company restructuring around AI, not survival. AP News
Databricks KARL beats Claude 4.6 and GPT-5.2 on enterprise knowledge tasks at 33% lower cost and 47% lower latency — RL-trained agent, entirely synthetic training data, a few thousand GPU hours. Zaharia is opening the pipeline to customers. Databricks Blog
Apple replacing Core ML with “Core AI” at WWDC 2026 — Targets on-device LLMs, diffusion models, agentic workflows. If you’re building iOS ML apps, your stack is about to change. AppleInsider
Anthropic hit $19B ARR — more than doubled from $9B in roughly two months — mostly Claude Code and enterprise. Closing in on OpenAI’s $20B. Seeking Alpha
Aravind Srinivas (Perplexity): “The orchestration is the product. The model is a tool.” — Clearest articulation yet of the model-commoditization thesis from someone building a product on top of it. Perplexity Computer + Voice Mode launched this week too. Indian Express

