Pydantic Evals is a Python library for evaluating AI agents: it lets you examine, step by step, how an agent goes about solving a problem. It's made by the same team that built the popular Pydantic data validation library.
What makes it interesting is that it doesn't just check whether your agent got the answer right. It checks whether your agent thought things through in the right way. It keeps track of which tools the agent called, in what order, and what intermediate steps it took along the way. You can then write tests to make sure your agent is reasoning soundly, not just guessing well.
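Here's roughly what a basic evaluation looks like. This is a minimal sketch based on the pydantic_evals API (Case, Dataset, evaluate_sync); the answer_question function and the case contents are made up for illustration, so check the details against the current docs.

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance


# Hypothetical agent under test -- stand in your real agent call here.
async def answer_question(question: str) -> str:
    return 'Paris'


dataset = Dataset(
    cases=[
        Case(
            name='capital_of_france',
            inputs='What is the capital of France?',
            expected_output='Paris',
        ),
    ],
    # Built-in evaluators score each case; this one just checks the output type.
    evaluators=[IsInstance(type_name='str')],
)

report = dataset.evaluate_sync(answer_question)
report.print()
```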
This solves a big problem with testing agents: sometimes getting the right answer isn't enough, because how you got there matters. An agent might stumble onto the right result by accident, and that same approach could fall apart next time. Because Pydantic Evals reads the OpenTelemetry traces your agent emits (OpenTelemetry being a standard way to record what's happening inside your code), your tests look at the same data your agent produces when it's running for real.
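Checking the process means writing an evaluator that looks at the trace instead of the output. The sketch below assumes the evaluator context exposes the recorded span tree as ctx.span_tree with a find method, as described in the pydantic_evals docs, and that spans are named after the tools the agent called; the 'search_docs' tool is hypothetical.

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class CalledSearchTool(Evaluator[str, str]):
    """Pass only if the run recorded a span for a (hypothetical) 'search_docs' tool."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        # ctx.span_tree holds the OpenTelemetry spans captured while this case ran.
        # find() and node.name are assumed from the pydantic_evals docs -- verify
        # them against the version you're using.
        matching = ctx.span_tree.find(lambda node: 'search_docs' in node.name)
        return len(matching) > 0
```

You would add CalledSearchTool() to a dataset's evaluators list alongside any checks on the final output, so a run only passes if the agent both answered correctly and actually consulted the search tool.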
Why does this matter? If you're building agents that need to follow a careful process, maybe for safety, or for rules you can't break, this lets you check that they're actually following it rather than getting lucky. And if you're a data scientist trying to improve an agent, you can see whether a change genuinely improved how it works or just happened to score better this time.
You do have to set up OpenTelemetry in your code, which takes a bit of work. But the reward is that your tests will catch mistakes in how your agent thinks, not just what it says. That’s something you’d miss if you only looked at the final answer.
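How much setup that takes depends on your stack. If your agents are built with Pydantic AI, one low-effort route is the Logfire SDK, which is itself an OpenTelemetry wrapper; this is a sketch of that path, and any other OpenTelemetry SDK configuration that records spans should work too.

```python
import logfire

# Configure an OpenTelemetry-compatible SDK. With 'if-token-present', nothing is
# sent anywhere unless a Logfire token is set, so local test runs stay offline.
logfire.configure(send_to_logfire='if-token-present')

# Instrument Pydantic AI so model calls and tool calls each emit a span that
# evaluators can later inspect.
logfire.instrument_pydantic_ai()
```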