Build AI
Posts
LLM-as-a-judge: the measurement problem
You've built something and you need to know if it works. So you do what's sensible—you ask an LLM to grade it. Factual accuracy, code quality, agent outputs. The machine judges the machine, and you get a number you can act on. Except that number
Continue reading -
Claude Opus 4.5: effort control
Claude Opus 4.5 is the newest model from Anthropic, the folks behind the Claude family. Think of it as their latest and smartest tool for handling really complicated tasks: like having an assistant who can juggle lots of jobs at once and still keep everything running smoothly. So,
Continue reading -
Apps SDK Brings Custom UI to ChatGPT
This week's edition covers building custom interfaces in ChatGPT, Google's Veo 3.1 video generation with native audio, multi-turn agent evaluation, and monitoring agent reasoning.
Continue reading -
OpenAI: apps inside ChatGPT
OpenAI has just launched something called the Apps SDK, and it’s a bit like giving developers a new set of building blocks for ChatGPT. Instead of just chatting, you can now create apps that live right inside the conversation, with their own custom look and feel. The SDK builds
Continue reading -
Veo 3.1: native audio and reference controls
Veo is Google's latest attempt to teach computers how to make videos from scratch. Now in version 3.1, it's available for anyone willing to pay for early access, either through Google AI Studio or Vertex AI. You can choose between the regular version or a
Continue reading -
Pydantic: Evals
Pydantic Evals is a tool for Python that lets you watch, step by step, how your AI agents go about solving problems. It’s made by the same people who built the popular Pydantic data validation library. What makes it interesting is that it doesn’t just check if your
Continue reading -
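The idea the teaser describes can be sketched with the standard library alone: run each test case through an agent, record the steps it took along the way, and check the final output. Note this is a conceptual sketch of the pattern, not Pydantic Evals' real API; `Case`, `Result`, and the toy `shout` agent are invented names.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    name: str
    inputs: str
    expected: str

@dataclass
class Result:
    case: Case
    output: str
    trace: list = field(default_factory=list)  # steps the agent took

    @property
    def passed(self) -> bool:
        # Final-answer check; the trace shows *how* the agent got there.
        return self.output == self.case.expected

def run_case(case: Case, task) -> Result:
    trace = []
    output = task(case.inputs, trace)  # the task records its steps into trace
    return Result(case, output, trace)

# Toy "agent": uppercases its input, logging each step it takes.
def shout(text: str, trace: list) -> str:
    trace.append(f"received: {text}")
    out = text.upper()
    trace.append(f"returned: {out}")
    return out

results = [run_case(c, shout) for c in [Case("greeting", "hi", "HI")]]
print(all(r.passed for r in results), results[0].trace)
```

The point of keeping the trace alongside the pass/fail verdict is exactly what the post highlights: a case can pass while the trace reveals the agent took a wasteful or fragile path.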
LangSmith: multi-turn evaluation
Imagine you’re chatting with an AI, asking it to help you book a flight. It might give you the right answer to every single question you ask, but somehow, you still end up without a ticket. That’s where multi-turn evaluations come in. Instead of just checking if each
Continue reading -
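The flight-booking example above is precisely the gap between per-turn and trajectory-level checks, and it fits in a few lines. The conversation and both checks below are invented illustrations, not LangSmith's API.

```python
# A conversation as (user, assistant) turns.
convo = [
    ("Find flights to Paris", "Here are three options: A, B, C."),
    ("Book option B", "Option B is a great choice!"),  # polite, but no booking
]

def turn_level_ok(convo) -> bool:
    # Naive per-turn check: every reply is non-empty (each answer "looks" fine).
    return all(reply for _, reply in convo)

def trajectory_ok(convo) -> bool:
    # Multi-turn check: did the conversation as a whole achieve the goal?
    return any("booked" in reply.lower() for _, reply in convo)

print(turn_level_ok(convo), trajectory_ok(convo))  # True False
```

Every turn passes the local check, yet the trajectory check fails: the user still has no ticket.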
Give Claude Memory and Skills via API
This week's edition covers Anthropic's new memory and Agent Skills APIs for building agents, Karpathy's transparent LLM training pipeline, on-device inference with Windows ML, and circuit-based interpretability tools that cut data requirements by 150x.
Continue reading -
Karpathy: nanochat
nanochat is Karpathy’s attempt to strip LLM training down to its bare essentials. It’s about 8,000 lines of code, and it’s designed to be read and understood, not just run. Unlike the big, complicated frameworks you find in production, this one is all about showing you
Continue reading -
The health tech paradox
Picture this: you buy a shiny new health gadget that claims it will look after you, no effort required. It sounds like the dream. But there’s a problem. Even the most hands-off technology still asks something from you. Mild cognitive impairment, or MCI, is a condition in which your memory and thinking
Continue reading -
Google Research: the language of biology
Imagine you could talk to cells and ask them what they’re up to. That’s more or less what Cell2Sentence-Scale (C2S-Scale) lets you do. Built by Google Research and Yale, it’s an open-source model that takes the huge, messy data from single-cell RNA sequencing—basically, a readout of
Continue reading -
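The core trick behind cell sentences is ranking a cell's genes by expression level and reading the ordered gene names as text a language model can consume. A minimal sketch of that conversion, with made-up expression values (the gene symbols are real, the numbers are toy data):

```python
def cell_to_sentence(expression: dict[str, float], k: int = 5) -> str:
    # Rank genes by expression (highest first) and keep the top k names;
    # the ordered list of gene names is the cell's "sentence".
    ranked = sorted(expression, key=expression.get, reverse=True)
    return " ".join(ranked[:k])

# Toy expression readout for one cell (gene symbol -> expression level).
cell = {"CD3D": 8.1, "GAPDH": 12.5, "IL7R": 5.2, "MALAT1": 15.0, "CD8A": 3.3}
print(cell_to_sentence(cell, k=3))  # MALAT1 GAPDH CD3D
```

Once a cell is a sentence, standard language-model machinery (pretraining, prompting, generation) applies to it directly, which is what makes the approach appealing.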
Anthropic: Agent Skills
Agent Skills is a new way for Claude to learn new tricks. Imagine you could hand Claude a folder full of instructions, code, and resources, and Claude would know exactly when to use them. That’s what Agent Skills does: it lets you teach Claude how to handle specific jobs,
Continue reading -
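Per Anthropic's description, a skill is a folder whose `SKILL.md` opens with frontmatter (a name and a description) that tells Claude when the skill applies. Below is a rough sketch of reading that frontmatter with the standard library; the parser and the sample skill are illustrative assumptions, not Anthropic's tooling.

```python
def parse_skill(skill_md: str) -> dict:
    # SKILL.md starts with a '---'-delimited frontmatter block holding the
    # skill's name and description; the model uses these to decide when to
    # load the rest of the folder.
    head = skill_md.split("---")[1]
    meta = {}
    for line in head.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

skill = """---
name: pdf-report
description: Fill in and summarize PDF report templates.
---
# Instructions
1. Open the template...
"""
print(parse_skill(skill))
```

The instructions, scripts, and other resources live in the body of the file and alongside it in the folder; only the small frontmatter needs to be scanned up front.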
SAE steering: Delta Token Confidence
Imagine you’re trying to understand what’s going on inside an AI’s mind. Sparse Autoencoders, or SAEs, are a tool that lets us break down the AI’s thoughts into features we can actually make sense of. Developers use these features to guide what the model does—whether
Continue reading -
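Generic SAE steering (the general recipe, not the specific Delta Token Confidence method this post covers) boils down to adding a scaled feature direction to a model activation: h' = h + α·d. A toy sketch with plain lists standing in for tensors:

```python
def steer(activation: list[float], feature_dir: list[float],
          alpha: float) -> list[float]:
    # Steering adds a scaled SAE feature direction to the activation:
    # h' = h + alpha * d. Larger alpha pushes the model harder toward
    # whatever concept the feature encodes.
    return [h + alpha * d for h, d in zip(activation, feature_dir)]

h = [0.2, -0.5, 1.0]   # toy residual-stream activation
d = [1.0, 0.0, -1.0]   # toy decoder direction for one SAE feature
print(steer(h, d, alpha=2.0))  # [2.2, -0.5, -1.0]
```

Choosing α is the hard part in practice: too small does nothing, too large degrades the model's outputs, which is the kind of problem confidence-based measures aim to address.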
Gemini: computer use
Google DeepMind has just released something called the Gemini 2.5 Computer Use model. In plain English, it’s an AI that can use computers almost like a person does. It can click buttons, type into forms, and scroll through pages – all those little things you do every day on
Continue reading -
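The loop behind a computer-use model is conceptually simple: show the model the screen, let it propose an action, execute it, and repeat until it says it is done. The sketch below mocks the model so it runs offline; the action schema here is an invented illustration, not Gemini's actual API.

```python
# Stand-in for the model: given a screenshot, propose the next UI action.
def fake_model(screenshot: str, step: int) -> dict:
    scripted = [
        {"type": "click", "x": 120, "y": 80},
        {"type": "type", "text": "hello"},
        {"type": "done"},
    ]
    return scripted[step]

def run_agent(max_steps: int = 10) -> list:
    log = []
    for step in range(max_steps):
        action = fake_model(screenshot="<pixels>", step=step)
        if action["type"] == "done":
            break
        log.append(action)  # a real harness would click/type here,
                            # then capture a fresh screenshot
    return log

print(run_agent())
```

A real harness replaces `fake_model` with a model call and actually performs each click or keystroke before taking the next screenshot; the cap on steps guards against the agent looping forever.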
Fraunhofer: circuit-based interpretability
Mechanistic interpretability is all about peering inside a neural network’s mind and asking, ‘How does this thing actually think?’ The usual way is to throw billions of words at the model and then ask another AI to explain what’s going on. That’s slow, expensive, and often gives
Continue reading -
Claude: memory tool
Imagine you’re building something with Claude, and you want it to remember things from one conversation to the next. Anthropic has just released a memory tool for their Claude API that lets you do exactly that. Instead of Claude forgetting everything between sessions, you can now give it a
Continue reading -
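The memory tool runs on the client side: the model emits file-style commands and your code executes them against storage you control. The handler below assumes command names like `create`, `view`, and `delete` modeled loosely on Anthropic's memory tool documentation; treat the exact schema as an assumption and check the docs before relying on it.

```python
# Client-side store backing Claude's memory tool calls. A real handler
# would use a directory on disk; a dict keeps the sketch self-contained.
store: dict[str, str] = {}

def handle_memory(call: dict) -> str:
    cmd = call["command"]
    if cmd == "create":   # write (or overwrite) a memory file
        store[call["path"]] = call["file_text"]
        return "ok"
    if cmd == "view":     # read a file back for the model
        return store.get(call["path"], "(empty)")
    if cmd == "delete":
        store.pop(call["path"], None)
        return "ok"
    return f"unsupported: {cmd}"

handle_memory({"command": "create", "path": "/memories/user.md",
               "file_text": "Prefers concise answers."})
print(handle_memory({"command": "view", "path": "/memories/user.md"}))
```

Because the store lives in your code, persistence across sessions is entirely yours to provide: whatever you save here is what Claude "remembers" next time.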
Harvard: ML systems textbook
Imagine you want to build machine learning systems that actually work in the real world, not just in a classroom. Harvard has put together a textbook for exactly that. It’s based on their CS249r course, and while the full book comes out in 2026, you can already download the
Continue reading -
Windows ML: on-device inference
Windows ML is a tool from Microsoft that lets developers run AI models right on the user's computer. Instead of sending data off to some distant server, everything happens on the device itself. The big news is that Windows ML is now ready for everyone to use. Developers
Continue reading -
OpenAI Open-Sources Agentic Commerce Protocol: A Standard for AI Transactions
The Agentic Commerce Protocol is a set of rules that lets AI agents buy things for you. OpenAI and Stripe built it together, along with some merchants. The idea is simple: it tells AIs, users, and businesses how to talk to each other so that buying things is easy, but
Continue reading -
Google Releases Open Protocol for Agent-Initiated Payments
The Agent Payments Protocol, or AP2, is a new set of rules for how AI agents can move money around safely. Google didn’t do this alone. They worked with more than 60 other companies—big names like Mastercard, American Express, PayPal, Coinbase, Salesforce, and ServiceNow. AP2 builds on earlier
Continue reading -
Claude Sonnet 4.5: A New AI Model That Excels at Coding and Building Agents
Claude Sonnet 4.5 is the newest AI model from Anthropic, and it’s built for people who want to create smarter apps and digital assistants. It belongs to the Claude 4 family, and you can use it through Anthropic’s website, their mobile app, or even a command-line tool
Continue reading