Evaluating How Well AI Systems Perform
An Introduction To AI Evals And Systematically Assessing AI Outputs
AI is becoming increasingly embedded in our products and processes. While our systems are getting more powerful, capable, and efficient, they are also becoming more opaque. Unlike traditional systems, which follow clearly defined rules, we can’t always predict what outputs AI systems will produce. Slight differences in inputs can lead to drastically different results. AI evals (short for evaluations) have become an essential means of systematically assessing these variable outputs. We can no longer rely solely on traditional software testing methods because AI fundamentally changes how we build products and how those products work. For product teams, AI evals are essential for validating that AI is being leveraged in the right way and for ensuring that products remain reliable and useful.
What Are Evals
AI evals are structured tests that assess whether an AI system is performing as intended. They measure the quality and effectiveness of an AI system. Traditional software development involves predictable outputs. We have structured ways of testing these outputs (like unit tests, user acceptance tests, etc.) because we know what our code is going to do when we run it. AI outputs, however, are non-deterministic: the same input can produce different outputs. This makes conventional tests inadequate. Instead, evals must accommodate the messy realities of AI systems, where the definition of a good output can be subjective. AI evals provide a clear, unambiguous metric for the performance of an AI system. Ideally, improved performance on an AI eval correlates with some meaningful improvement in business metrics.
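At its core, an eval turns that clear, unambiguous metric into something you can compute: run a fixed set of test cases through the system, grade each output, and report a pass rate. A minimal sketch; the harness shape is the point, and the toy "model" and grading rule below are illustrative placeholders, not a real system:

```python
# Minimal eval harness sketch: run fixed test cases through an AI system,
# grade each output as pass/fail, and report the overall pass rate.

def run_eval(system, cases, grade):
    """Run each case through `system` and grade the output as pass/fail."""
    results = [grade(case, system(case["input"])) for case in cases]
    return sum(results) / len(results)  # fraction of cases that passed

# Toy stand-ins so the sketch runs without a real model:
def toy_system(user_input):
    return user_input.upper()  # pretend "model" that shouts back the input

def toy_grade(case, output):
    return case["expected"] in output  # pass if the expected text appears

cases = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "refund policy", "expected": "REFUND"},
]

print(run_eval(toy_system, cases, toy_grade))  # → 1.0
```

In a real eval, `system` would call your model and `grade` would encode your quality bar (exact match, a rubric, or another model as judge), but the loop stays the same.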
AI evals establish the reliability of AI systems.
Evaluating AI is like conducting a driving test. We examine how the AI system performs in a variable environment (like testing a driver in a busy city). We evaluate its ability to interpret signals, make sound decisions, and react appropriately in unpredictable scenarios. Just like a driver must prove road readiness, an AI system must pass an eval to prove it can consistently behave as intended, without going off the rails. Presently, product teams are likely getting a lot of pressure to both add AI to their products and use AI in the development process. However, the challenge is using AI to deliver reliable outcomes. AI evals are how we accomplish this.
Evals are the only way you can break down each step in the system and measure specifically what impact an individual change might have on a product, giving you the data and confidence to take the right next step.
— Aman Khan, Head of Product at Arize AI
Why Are Evals Necessary
The variability of AI outputs means there’s always a risk of something going wrong.
AI systems can find ways to work around rules and restrictions. When they go from A to B, it’s not always clear how they got there, which makes it hard to prevent bad outputs. These systems can drift into unsafe territory very quickly. They can hallucinate (confidently present false information). They can amplify biases in the datasets they are trained on. They can struggle with variations in user input. They can be vulnerable to exploitation by malicious users intending to use AI outputs for harmful purposes. An eval helps a product team understand how well their product solves a user problem and verify whether it delivers the right user experience. People care a lot about bad experiences. Even if only 1 out of 1,000 interactions with your product is bad, users will remember it, and it can be hard to regain trust once it’s been lost. Therefore, you must understand exactly when, where, how, and why your AI system fails. You also need to know how often this happens and how many users are impacted.
AI evals help us design products that can accommodate the variability of AI outputs.
While AI models are remarkably capable, they still don’t have access to all the data that’s out there. They simply cannot be trained on every possible scenario. They will encounter unexpected information, and they will likely respond to it unpredictably. This makes AI evals essential for understanding and managing the reliability of AI systems. They help us determine how likely a system is to accomplish its goal, and exactly when, where, and how it is likely to fail. Once we have a clearer picture of the system’s performance, we can develop the right design for the product. We add affordances that alert the user to issues with the AI outputs. While this may require changing the user interface or user interactions, it improves the overall outcomes of using the AI system. For example, if a travel booking AI is going to purchase a ticket for a customer and it passes its eval only 80% of the time, there should be an additional review/confirmation step before the purchase is executed. A moment of annoyance is a lot better than an upset customer who got tickets to the wrong place.
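The travel-booking example above can be expressed as a simple design rule: let the measured pass rate choose the product flow. A sketch under assumed numbers; the 95% threshold and the function names are illustrative, not recommendations:

```python
# Illustrative sketch: use a measured eval pass rate to decide whether a
# booking flow auto-executes or routes through a user confirmation step.

CONFIRMATION_THRESHOLD = 0.95  # below this, require an explicit user confirm

def booking_flow(eval_pass_rate, draft_order, user_confirms):
    """Decide whether to auto-execute or insert a review step."""
    if eval_pass_rate >= CONFIRMATION_THRESHOLD:
        return "purchased"              # reliable enough to auto-execute
    if user_confirms(draft_order):      # otherwise, ask the user first
        return "purchased"
    return "cancelled"

# With an 80% pass rate, the flow routes through the confirmation step:
result = booking_flow(0.80, {"destination": "Paris"}, lambda order: True)
print(result)  # → purchased
```

The threshold itself is a product decision: the lower the pass rate, the more friction (review steps, previews, undo) the design should add.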
If the model gets it right 60% of the time, you build a very different product than if the model gets it right 95% of the time, versus if the model gets it right 99.5% of the time.
— Kevin Weil, OpenAI CPO
AI Evaluation Methods
While AI evals can take many forms, the two basic methods are as follows:
Human evals involve using human feedback to assess your AI system’s outputs. You have human graders assess the outputs, labeling them as good or bad. You could also integrate user feedback loops into the product, in the form of a UI element next to the AI output (thumbs up/thumbs down, retry button, rating, comment box, etc.). While the quality of evals can be higher with this method (because people understand nuance), human evals are not feasible at a large scale.
LLM-as-a-judge evals involve using another AI model to assess your AI system’s outputs. Instead of relying on human graders, you assign an LLM the role of evaluator, provide it with context on the evaluation task (purpose, criteria, etc.), and give it sufficient examples of what good and bad outputs look like. While this approach is highly scalable, some human oversight is still necessary because the judge model can itself hallucinate.
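The LLM-as-a-judge setup is mostly a prompt plus a thin wrapper. A sketch, with a stub standing in for a real model API (the prompt wording, criteria, and `call_llm` interface are all assumptions for illustration):

```python
# Sketch of an LLM-as-a-judge eval: a rubric prompt with good/bad examples,
# and a wrapper that turns the judge's verdict into pass/fail.

JUDGE_PROMPT = """You are an evaluator for a customer-support assistant.
Criteria: the reply must be grounded in the provided context, polite in
tone, and must not promise refunds the policy does not allow.

Good example: "Per our policy, refunds are available within 30 days."
Bad example: "Sure, I'll refund you right away, no questions asked!"

Context: {context}
Assistant reply: {output}

Answer with a single word: PASS or FAIL."""

def judge(call_llm, context, output):
    """`call_llm` is a placeholder for whatever model API you use."""
    verdict = call_llm(JUDGE_PROMPT.format(context=context, output=output))
    return verdict.strip().upper().startswith("PASS")

# A stub judge model so the sketch runs without any API key; it only
# inspects the reply section of the prompt, using a crude keyword rule:
def stub_llm(prompt):
    reply = prompt.split("Assistant reply:")[1]
    return "FAIL" if "no questions asked" in reply else "PASS"

print(judge(stub_llm, "30-day refund policy", "Refunds within 30 days."))  # → True
```

In practice, `call_llm` would hit your model provider, and you would spot-check the judge’s verdicts against human labels before trusting them at scale.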
How Can AI Evals Be Used
Evals can be operationalized by product teams in many different ways:
Guardrails: The eval can be used to prevent workflows from executing (if they fail an eval or exceed a failure threshold), which would require a retry or a switch to an alternative workflow. For example, the product team could automatically stop a critical task because of a failed eval and ask the user to complete an action (confirm, retry, etc.) before proceeding to reduce risk.
Optimization: The evals can be used to fine-tune AI outputs, identifying higher-performing versions of a workflow (ones with greater success on evals). For example, the product team could use evaluation failure rates to assess and optimize how an AI system performs specific tasks.
Monitoring: The eval can be used to continuously test workflows to detect performance issues. For example, the product team could use evals to flag instances where the AI models fail or underperform and respond either autonomously (by creating detailed logs, triggering alerts, rolling back a release, etc.) or manually (by investigating the issue).
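All three uses above can be wired around a single measured failure rate. A minimal sketch; the threshold values, the alerting choice, and the function names are assumptions, not recommendations:

```python
# Sketch: one eval failure rate feeding a guardrail (block and reroute),
# a monitoring signal (warn when failures creep up), and, implicitly,
# optimization (compare rates across workflow versions).

import logging

FAIL_THRESHOLD = 0.10   # guardrail: block if more than 10% of checks fail
ALERT_THRESHOLD = 0.05  # monitoring: warn if failures creep past 5%

def handle_eval_result(failure_rate, execute, fallback):
    if failure_rate > FAIL_THRESHOLD:
        return fallback()               # guardrail: block and reroute
    if failure_rate > ALERT_THRESHOLD:
        logging.warning("eval failure rate elevated: %.1f%%", failure_rate * 100)
    return execute()                    # safe enough to proceed

print(handle_eval_result(0.02, lambda: "executed", lambda: "blocked"))  # → executed
print(handle_eval_result(0.20, lambda: "executed", lambda: "blocked"))  # → blocked
```

For optimization, the same `handle_eval_result` failure rates, collected per workflow version, tell you which variant to keep.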
AI Evals And Product Management
AI evals are a product responsibility, not just a technical exercise.
Since PMs define product goals, they are best positioned to test whether their AI systems’ output aligns with what they are trying to accomplish. A well-designed eval helps answer critical product questions: Is the AI system achieving its purpose? How does the variability of AI outputs impact the user experience? Is it ready for production? If not, what specifically needs to change? Getting answers to these questions can help PMs make informed decisions about the product’s development, release, and future iterations. By identifying where systems fall short, evals reduce the risk of costly errors that might erode user trust.
You don’t necessarily need a deep understanding of AI to design an eval. However, you do need to understand the critical steps it’s going to take to get from input to output. The process of creating an eval helps you identify the clusters of scenarios your AI system will face. The eval itself helps you test your system to evaluate how well it actually performs and diagnose points of failure. The results of the eval help you determine what you need to change to ensure you can still deliver the intended user experience. In the future, as AI becomes more embedded into technology as a whole, the ability to create evals will emerge as a core skill for PMs.
Evals force you into the shoes of your user. You can no longer just hypothesize what they might do; you must articulate that understanding in writing.
— Peter Yang, Product Lead at Roblox
For more detailed insight into how to implement AI evals, check out these resources: AI Evals: Everything You Need to Know to Start, Mastering AI Evals: A Complete Guide for PMs
Case Study
Evaluating an AI chatbot on a Food Delivery app
Imagine that you are designing an AI chatbot for a food delivery app. You want to let your users enter a request like “I want a large cheese pizza under $20.” The bot might ask a few clarifying questions, such as “Are you fine with waiting 30 mins?”, “Do you prefer a specific restaurant?”, etc. It will look up the options based on the user’s preferences (delivery time, ratings, etc.). Then, it will present the best option and place the order (once the user confirms). The experience would be similar to placing an order over the phone, but would require a lot less effort. Users just “tell” the AI what they want.
From a product perspective, the purpose of such a feature might be to reduce friction in the ordering process, ideally resulting in more orders being placed. To understand what could potentially go wrong, we need to map out what the AI system needs to do after the user enters their request:
Understand the user’s intent → It might fail to understand the user’s requirements (the type of food, quantity, price, etc.). For example, it thinks the user wants 20 pizzas instead of a pizza under $20.
Identify relevant context → It might fail to identify relevant information from the user’s profile or past orders. For example, it does not realize that the user always orders pizza from the same place.
Ask follow-up questions → It might fail to ask relevant follow-up questions. For example, it asks the user about their budget, even though they already mentioned it.
Figure out which system to call → It might fail to identify the right systems necessary (to get information, take an action, etc.). For example, it does not realize it needs to check restaurant reviews data when identifying restaurants.
Perform the right actions → It might fail to interact with systems as necessary. For example, it searches for restaurants using the wrong parameters, selects a restaurant with bad reviews, fills out an order for pasta instead of pizza, etc.
Construct a response → It might fail to respond to the user appropriately. For example, it uses an inappropriate tone when responding to the user with the order.
Complete the overall task → It might fail at accomplishing the overall goal (placing an order). For example, the user might be upset (because of the tone, poor restaurant selection, incorrect order, etc.) and abandon the order.
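The step-level failures above can each be probed with a small, targeted check, so a failure can be localized to one stage of the pipeline. A toy sketch of an eval for the first step (understanding intent); the keyword-based parser here is an illustrative stand-in for a real model, not how one would actually be built:

```python
# Sketch of a step-level eval for the chatbot: check whether the
# "understand the user's intent" step extracted the right food and budget.

def parse_request(text):
    """Toy intent parser: pull out the food type and the price cap."""
    food = "pizza" if "pizza" in text else "unknown"
    budget = None
    if "under $" in text:
        budget = float(text.split("under $")[1].split()[0])
    return {"food": food, "budget": budget}

def eval_intent(parsed, expected):
    """Step-level eval: did the system understand the user's intent?"""
    return parsed["food"] == expected["food"] and parsed["budget"] == expected["budget"]

case = {
    "input": "I want a large cheese pizza under $20",
    "expected": {"food": "pizza", "budget": 20.0},
}

print(eval_intent(parse_request(case["input"]), case["expected"]))  # → True
```

Analogous checks for the later steps (did it call the right system, fill the right order, use the right tone) give you a per-step failure rate rather than a single pass/fail for the whole conversation, which is what lets you diagnose where the pipeline breaks.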
As we see here, each step the AI system must take can fail in some way. We can use evals at each step to understand the frequency and the severity of these failures. This helps us figure out how to respond appropriately. Even if the AI passes an eval 95% of the time, we must understand what those 5% of failures are. For example, if an AI eval classified the AI output’s tone as bad, is the chatbot dismissive? Is it rude? Is it aggressive? Is it outright hostile? Even if only a small percentage of users encounter these types of messages, the experience might be damaging enough to erode customer relationships. We need to critically examine our AI systems to understand where/when AI evals are necessary.
Conclusion
AI evals help product teams enhance the quality of their product.
As AI becomes more deeply integrated into products, we need to get better at understanding and managing its inherent unpredictability. We have to put greater care and consideration into shaping and directing AI systems. Evals give us powerful insight into how these systems get from input to output. They help us ensure that our product works (and continues to work) the way we want it to. They help us verify if the AI passes critical use cases, identify the scenarios where the AI fails, and make the right changes to address issues. While implementing evals might be tedious, they are an important tool for ensuring that AI systems remain safe, reliable, and effective. After all, if an AI system can’t do a task right, it’s not useful, no matter how fast or how efficient it is.
Thanks For Reading
References
Product Growth | AI Evals: Everything You Need to Know to Start
The Product Compass | Mastering AI Evals: A Complete Guide for PMs
Towards Data Science | What Exactly Is an “Eval” and Why Should Product Managers Care?
Behind The Craft | The AI Skill That Will Define Your PM Career in 2025 | Aman Khan (Arize)
Lenny’s Newsletter | Beyond vibe checks: A PM’s complete guide to evals


