Imagine you’re chatting with an AI, asking it to help you book a flight. It might give you the right answer to every single question you ask, but somehow, you still end up without a ticket. That’s where multi-turn evaluations come in. Instead of just checking if each reply is correct, they look at the whole conversation—did the agent actually help you get what you wanted?

Most tests just ask, ‘Did the AI answer this question right?’ But that misses the bigger picture. Sometimes, the real problems only show up when you look at the whole back-and-forth, not just one reply at a time.

Multi-turn evaluations ask three big questions: What was the user actually trying to do? Did they get what they wanted in the end? And how did the conversation flow—did the agent use the right tools at the right time, or did it get lost along the way?
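
To make those three questions concrete, here's a minimal sketch of what a conversation-level evaluator could look like in Python. Everything in it (the `Turn` class, `evaluate_conversation`, the substring and tool-name checks) is hypothetical and deliberately simplistic; real evaluators typically lean on an LLM judge or task-specific rules rather than keyword matching, but the shape is the same: one function, one whole transcript, three scores.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str                      # "user", "assistant", or "tool"
    content: str
    tool_name: str | None = None   # set when role == "tool"

def evaluate_conversation(turns: list[Turn], expected_goal: str,
                          required_tools: list[str]) -> dict:
    """Score a *finished* conversation on the three questions above."""
    user_messages = [t.content for t in turns if t.role == "user"]
    tools_called = [t.tool_name for t in turns if t.role == "tool"]
    final_reply = next((t.content for t in reversed(turns)
                        if t.role == "assistant"), "")
    return {
        # 1. Intent: did the user actually ask for the goal we assume they had?
        "intent_stated": any(expected_goal.lower() in m.lower()
                             for m in user_messages),
        # 2. Outcome: does the last assistant reply reflect the goal being met?
        "goal_achieved": expected_goal.lower() in final_reply.lower(),
        # 3. Trajectory: did the agent call every tool the task requires?
        "trajectory_ok": all(tool in tools_called for tool in required_tools),
    }

# Toy example: a successful flight-booking conversation
convo = [
    Turn("user", "Can you get me a flight to Berlin on Friday?"),
    Turn("tool", "3 matching flights", tool_name="search_flights"),
    Turn("assistant", "I found three options; the cheapest leaves at 9am."),
    Turn("user", "Book the 9am one."),
    Turn("tool", "confirmation #A1B2", tool_name="book_flight"),
    Turn("assistant", "Done, your flight to Berlin is booked."),
]
print(evaluate_conversation(convo, "flight to Berlin",
                            ["search_flights", "book_flight"]))
# -> {'intent_stated': True, 'goal_achieved': True, 'trajectory_ok': True}
```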

Instead of judging each message as it comes, these evaluations wait until the whole conversation is over. Only then do they look back and ask, ‘Did this actually work?’ Tools like LangSmith are building this kind of whole-conversation evaluation into the standard workflow for agent builders.
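
Here's one way to picture that "score only at the end" flow, again as a hedged sketch: the names (`run_session`, `judge_prompt`, the `agent` callable) are made up for illustration and don't come from LangSmith's API. The point is simply that nothing gets graded per message; the judge only ever sees the transcript once the session is closed.

```python
def run_session(agent, user_turns: list[str]) -> list[dict]:
    """Drive a multi-turn session to completion and return the full transcript."""
    transcript = []
    for message in user_turns:
        transcript.append({"role": "user", "content": message})
        reply = agent(transcript)               # the agent sees the full history
        transcript.append({"role": "assistant", "content": reply})
    return transcript                           # evaluation happens only after this

def judge_prompt(transcript: list[dict], goal: str) -> str:
    """Fold the whole conversation into one prompt for an LLM-as-judge
    (the judge call itself is out of scope for this sketch)."""
    lines = [f'{t["role"]}: {t["content"]}' for t in transcript]
    return (
        "The user's goal was: " + goal + "\n\n"
        + "\n".join(lines)
        + "\n\nDid the assistant accomplish the goal? Answer yes or no, "
          "and point out any turn where the conversation went off track."
    )
```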

Why does this matter? Because real users don’t just fire off one question and leave—they have goals that take a few steps to reach. If you only check each answer on its own, you’ll miss the moments when the agent goes off track, forgets what you wanted, or just leads you in circles. Multi-turn evaluation is how you spot the failures that single-answer checks never see.

Read more at LangChain's blog