I’ve been doing some deeper dives lately into LLMs and AI solutions, and the latest rabbit hole I found myself running down was “How do we measure the performance of LLMs, and by extension AI solutions?” This question, for me, goes back to a critical point of any technology solution: every solution is only as good as its metrics and monitoring.
Let’s face it, the solutions being built today are absolutely critical to supporting business processes and the basic functions of the modern-day enterprise. And along those lines, by building AI into every technology solution that exists, we are also creating scenarios in which we require those systems to stand up to the rigor and support requirements of the enterprise. There’s no way around it: you can’t add AI into applications without meeting the needs of the enterprise.
And when we look at more traditional applications, monitoring is more straightforward. Languages like C# and Python provide libraries that enable abstraction of metrics capture, along with logging, to support any new application. And the implementation pattern isn’t anything crazy: you install the NuGet package or the PyPI module, and then start implementing logging commands within your code.
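For the traditional case, that pattern really is that simple. Here’s a minimal sketch using Python’s standard-library `logging` module; the service name and `process_order` function are hypothetical stand-ins for your own application code:

```python
import logging
import time

# Standard-library logging setup; in C# the equivalent would be a NuGet
# logging package, but the pattern is the same.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("order-service")  # hypothetical service name

def process_order(order_id: str) -> bool:
    """Hypothetical business function instrumented with basic logging/metrics."""
    start = time.perf_counter()
    logger.info("processing order %s", order_id)
    # ... business logic would go here ...
    elapsed_ms = (time.perf_counter() - start) * 1000
    logger.info("order %s processed in %.2f ms", order_id, elapsed_ms)
    return True
```

Because the application logic is deterministic, a log line plus a timer at each step captures most of what you need.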
But with AI, that prospect is not as simple. Many of these models and solutions are essentially a black box, which can make monitoring less deterministic from an implementation perspective.
And even more than that, if we expand the aperture on this topic, how do we know we implemented the best LLM for our data? How can I tell whether Claude or gpt-4o is better? These are not easy questions, but that’s what I’m looking at today.
So let’s start with this:
Why is LLM evaluation challenging?
LLMs are hard to evaluate because their output is not binary, and it is situational. When an LLM is asked a question, it can give back a response whose acceptability sits on a sliding scale. For a concrete example of what I mean here, let’s look at the simple “summarize a document” ask. If I ask gpt-4o to summarize a document, the evaluation of its response for correctness alone could differ between two situations.
If I’m asking for a legal document to be parsed, the requirements around precision and errors are significantly higher than if I’m summarizing a document for a customer service chatbot. Hallucinations in the legal context could have extremely damaging consequences.
Additionally, when we look at a response, there are a lot of factors that could potentially impact the “acceptability of the answer.” How accurate was it? How concise? How complete? How relevant? These are all questions that can impact the evaluation of a model. So this brings us back to “What metrics and areas should I be monitoring?”
What are the key metrics to track for LLM solutions?
The following are generally considered the key metrics for evaluating an LLM or AI model’s responses. I’ve broken these down into groups to help show how each is measured:
Measuring Solution Performance:
When we look at any solution we build with AI, scale and meeting the needs of the user become increasingly important. The following metrics are very focused on how we measure the solution and its overall performance:
- Efficiency / Latency: These are your more operational metrics, and include things like inference latency, throughput, and token usage. These should absolutely be monitored from a cost, scale, and performance perspective.
- User Engagement: These are critical metrics for capturing the success of your solution and its ability to deliver on the needs of the organization or enterprise. Are your users seeing value and engaging with these features? More on specifics further down.
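To make the efficiency side concrete, here is a minimal sketch of capturing inference latency and token usage around a model call. The `ask_model` function and its crude word-count token proxy are assumptions for illustration; in a real solution you would call your provider’s SDK and read the token counts it returns:

```python
import time
from dataclasses import dataclass

@dataclass
class LLMMetrics:
    """Rolling operational metrics for an LLM-backed endpoint."""
    calls: int = 0
    total_latency_s: float = 0.0
    total_tokens: int = 0

    def record(self, latency_s: float, tokens: int) -> None:
        self.calls += 1
        self.total_latency_s += latency_s
        self.total_tokens += tokens

    @property
    def avg_latency_s(self) -> float:
        return self.total_latency_s / self.calls if self.calls else 0.0

metrics = LLMMetrics()

def ask_model(prompt: str) -> str:
    """Stand-in for a real inference call; replace with your provider's SDK."""
    start = time.perf_counter()
    response = "stubbed model response"  # hypothetical output
    # Crude proxy: real SDKs report prompt/completion token counts directly.
    tokens_used = len(prompt.split()) + len(response.split())
    metrics.record(time.perf_counter() - start, tokens_used)
    return response
```

The point is that these operational numbers are fully deterministic to capture, even when the model’s output quality is not.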
Measuring the LLM:
When we focus on the LLM and its performance, there are specific metrics we will want to be looking at. These metrics are designed predominantly to help evaluate models and compare them against one another.
- Accuracy / Correctness: For the purposes of LLMs, this metric refers to how well the output aligns with the ground truth or the expected answer. This is often measured with an “F1 score”; more on that later.
- Relevance: This metric refers to how well the output addresses the question or task that the LLM was given. This is something that is typically reviewed by humans, and in my research I could not find a mathematical way of calculating this metric.
- Coherence / Fluency: How well structured is the output? Is it well-organized, logical, and easy for a human to understand and follow? In my research, I found that these are measured by “crowdworker rating” qualitatively and by “perplexity” quantitatively; more on those later.
- Coverage / Completeness: Are all the key parts of the request addressed in the output?
- Hallucination Rate: Tracks how often the model outputs incorrect or fabricated information that is not supported by the underlying data.
- Bias / Fairness: Analyzing the output for undesired bias and unequal performance across demographics. These are captured by a “bias score” or a “toxicity score,” but more on them later.
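Of the metrics above, perplexity is the most mechanical to compute: it is the exponential of the negative mean log-probability the model assigned to the tokens it generated. Here is a small sketch, assuming you can obtain per-token log-probabilities from your model (most inference APIs can return these):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-probability. Lower means the model
    found the text more predictable (roughly, more fluent)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.5 to every token has perplexity ≈ 2:
# it is effectively choosing between two equally likely options per token.
print(perplexity([math.log(0.5)] * 4))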
Now at first blush, these probably prompt the question, “If I’m not building LLMs, why do I care?” The answer is you absolutely need to care, for the following reasons:
- If you are doing any fine-tuning, this is how you measure the effectiveness of that fine-tuning effort.
- If you are building a solution, the question of “Is this the right model for the job?” is going to come up. These metrics provide a means of measuring and comparing models to ensure you are implementing the right one.
Business or Product KPIs
Finally, we need to be able to quantitatively measure the ability of the solution to meet the needs of the business. Productivity KPIs seem obvious here, but I find they are the most overlooked. Is AI delivering on the promise? Sometimes the best way to evaluate the solution is by measuring the problem it’s trying to solve. For example, if I build a solution to help my field sellers, am I seeing faster close rates? Am I seeing new opportunities? Am I seeing better engagement with customers?
What is an “F1 score?”
So the high level on this metric is that it “harmonizes precision and recall to give an objective measure of the output.” And those are definitely words, and notionally that makes sense. But I also hate that definition because it tells me nothing about the “how.” So let’s dive into the concepts of “precision” and “recall.”
What is “precision?”
Precision in this context refers to the accuracy of the responses coming back: of all the positives the model predicted, how many were actually correct? Formally, precision = true positives / (true positives + false positives). It is usually represented on a 0 to 1.0 scale, with 1.0 being perfect.
What is “Recall?”
Recall represents the portion of actual positives that the model correctly identified. Ultimately this measures “Of all the real positive instances, how many did the model find?” Formally, recall = true positives / (true positives + false negatives). So this focuses on the model’s ability to recognize a correct response.
Why does this matter?
Ultimately this matters because it shows the model’s ability to provide and identify a correct response. These aren’t blanket metrics, but rather metrics that relate to how the model is behaving, to identify whether it’s within parameters. For example, a model that is helping to summarize a meeting might have a lower F1 requirement than a model working in a legal context. The F1 score provides a balanced metric combining these two values.
Another important thing to consider when looking at the F1 score is complexity: the more complex the task the model is given, the lower the F1 score tends to be, as the definition of “correctness” becomes more ambiguous. For reasoning models and other complex tasks specifically, there are other ways of numerically capturing this metric. One example is Deita complexity scoring.
What’s next?
As I went down this particular rabbit hole, I quickly discovered how massive this topic is. So I’m building another post to dive deeper, specifically looking at questions like:
How can I measure LLM performance? How can I evaluate fine-tuning? What are the public benchmarks? And how do I monitor agentic solutions?