
Background
LLMs and generative AI (GenAI) now assist professionals across a growing number of workflows in many settings: large companies, financial institutions, academic research, and even high-stakes industries such as healthcare and law. The outputs that LLMs produce within these workflows influence decisions with real consequences. Yet standards for evaluating the performance of these AI systems and workflows have not kept pace with the speed of real-world deployment.
Why Evaluation Is Important
Generative LLM tasks typically don’t have a single, absolutely ‘correct’ answer. Standard metrics such as accuracy, F1 scores (the harmonic mean of precision and recall), and BLEU scores (a measure of n-gram overlap with a reference text) have emerged to evaluate LLM outputs, but their usefulness is limited. For instance, F1 scores help evaluate how well an LLM performs on classification tasks such as spam detection or sentiment analysis, but they say little about the quality of the system’s reasoning, contextual relevance, clarity of writing, or instruction following. High scores on these standard metrics can create a false sense of security about the LLM’s performance. That is a problem for critical business workflows, where a more nuanced and thorough approach is needed to judge whether an LLM or GenAI workflow is actually performing well.
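For concreteness, here is a minimal Python sketch of how an F1 score is computed for a toy spam classifier. The labels are invented purely for illustration; the point is that a single aggregate number like this says nothing about reasoning, relevance, or clarity.

```python
# Minimal sketch: computing an F1 score for a toy spam-classification task.
# The labels below are made up purely for illustration.

y_true = ["spam", "ham", "spam", "spam", "ham", "ham"]   # ground truth
y_pred = ["spam", "ham", "ham",  "spam", "spam", "ham"]  # model predictions

tp = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))
fp = sum(t == "ham"  and p == "spam" for t, p in zip(y_true, y_pred))
fn = sum(t == "spam" and p == "ham"  for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# A high F1 here reveals nothing about reasoning quality, clarity, or
# instruction following -- the dimensions evals are meant to cover.
```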
Introducing Evals
Evals (short for evaluations) help you assess how well an LLM or AI model’s output aligns with what the task or the user actually needs. You need evals to measure how effectively your LLM performs in the context of the workflow it is embedded in and the task it is supposed to carry out. Evals go beyond correctness or factual accuracy to other dimensions of LLM/GenAI workflow performance, such as usefulness, clarity, instruction following, reliability, actionability, and ethical alignment (for a deeper dive, see AI ethics and LLMs).
Because different domains prioritize different kinds of ‘quality’, evaluations become the subjective anchor of quality, helping answer the question, ‘Would a domain expert trust and use this output?’ Evals also become a way for businesses to capture the complexity and nuance of the specific processes within which these LLM- and GenAI-powered workflows are embedded, and to measure whether they are doing what they’re supposed to do.
Given the high stakes of GenAI-powered workflows and the money invested in setting them up, it falls to businesses, and to the people in the right roles, to create robust evaluations so they can get meaningful signals about the performance of their many agents and AI systems at scale.
Where to Start with Evals
The place to start when designing evals is to think about the dimensions relevant to evaluating the LLM’s performance within the scenario or business process it is plugged into.
Let’s take a use case where many of us have already benefited from LLMs and GenAI: a meeting transcription and summarization service, one of the most common enterprise GenAI deployments. Some of the dimensions we would want to evaluate its performance on include:
- Content Accuracy / Correctness – Is the bot summarizing facts from the meeting correctly? Is it hallucinating information or misrepresenting facts?
- Structure and Organization – Is there a logical flow to the summary? Are the topics and decisions being categorized effectively?
- Comprehension – Are all speakers’ comments being captured properly? How are pauses/silences being accounted for?
To assess whether the GenAI-powered meeting summary actually meets the needs of its consumers, we can go a step deeper with our evaluations. Employees who read meeting summaries are often looking for clear records of who owns which action items from the discussion. So if we needed to evaluate the quality of the action items captured in the summary, we would want to measure (a sample rubric sketch follows this list):
- Specificity – Are the action items captured vague or specific?
- Ownership – Are the right people assigned as owners for action items?
- Timeline – Were the important dates and deadlines captured?
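As an illustration, here is a minimal Python sketch of how these dimensions might be encoded as a 1-5 scoring rubric. The dimension names, descriptions, and equal weighting are assumptions chosen for illustration, not a prescribed schema.

```python
# Minimal sketch: encoding the action-item dimensions as a 1-5 rubric.
# Dimension wording and the simple average are illustrative assumptions.

RUBRIC = {
    "specificity": "Are the action items vague (1) or concrete and unambiguous (5)?",
    "ownership":   "Are the right people assigned as owners (1 = wrong/missing, 5 = all correct)?",
    "timeline":    "Were important dates and deadlines captured (1 = none, 5 = all)?",
}

def aggregate(scores: dict[str, int]) -> float:
    """Average the per-dimension scores; assumes every dimension was rated 1-5."""
    if set(scores) != set(RUBRIC):
        raise ValueError("every rubric dimension needs a score")
    return sum(scores.values()) / len(scores)

# Example: scores a reviewer (human or LLM judge) might return for one summary.
print(aggregate({"specificity": 4, "ownership": 5, "timeline": 3}))  # 4.0
```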
The core idea is to think through the dimensions on which a task’s output needs to be evaluated before you can tell whether the system is performing well. A factually correct output is not always the ‘right’ output for your use case, workflow, or process. For inspiration, Stanford’s benchmarking work, the Holistic Evaluation of Language Models (HELM), has shown how multidimensional scoring surfaces weaknesses that are invisible to single-number metrics.
Ways to Instrument Evals
If you’re interested in instrumenting evaluations in your own work setting, there are three broad ways to do it:
- Human feedback: This can come from two sources. The first is the consumers of your AI model’s or LLM’s outputs; this feedback is very valuable because you hear directly from the people interacting with the model, provided you can get them to share it. It usually takes the form of a binary input (good/bad), which you have probably seen as the thumbs-up/thumbs-down buttons when interacting with your favorite LLM, though it can also be a free-text comment. The second source is structured feedback and labeling of LLM conversation logs by subject-matter experts (SMEs). This method doesn’t scale well and is time-consuming, but it is very useful and is, in fact, a prerequisite for the ‘LLM as a judge’ method described next. SME feedback typically takes the form of scores within a range (e.g., 1-5) on multiple dimensions.
- LLM as a judge: As the name suggests, this is where you use one LLM, driven by a carefully written prompt, to grade the outputs of another LLM or GenAI-powered workflow. The more powerful LLMs and newer reasoning models can label data and conversations well when given clear instructions and requirements (specified through the prompt). This method scales well, but it requires some pre-labeled data to serve as a reference for the judging LLM, and you will need to spell out how you want the evaluations done in a well-written prompt (a minimal sketch follows this list). Frameworks such as the open-source OpenAI Evals quick-start let you define prompts, reference answers, and scoring rubrics that scale.
- Other categories (heuristic-based, statistical, comparative): There are other ways to perform evaluations, such as a script in a language like Python that applies pre-defined rules to grade outputs (a small example also follows this list), or a comparative setup that runs two different LLMs on the same task and picks whichever is more appropriate for the circumstances.
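To make the ‘LLM as a judge’ approach concrete, here is a minimal Python sketch using the OpenAI Python client. The prompt wording, model name, 1-5 scale, and the four dimensions are assumptions chosen for illustration; a production judge would be calibrated against SME-labeled examples, and other providers or frameworks (such as OpenAI Evals) support similar patterns.

```python
# Minimal LLM-as-a-judge sketch using the OpenAI Python client (openai>=1.0).
# Prompt wording, model name, and the 1-5 scale are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a meeting summary against its transcript.
Score each dimension from 1 (poor) to 5 (excellent) and reply with JSON only:
{"accuracy": int, "specificity": int, "ownership": int, "timeline": int}"""

def judge(transcript: str, summary: str, model: str = "gpt-4o-mini") -> dict:
    """Ask a judging model to grade a summary; returns per-dimension scores."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Transcript:\n{transcript}\n\nSummary:\n{summary}"},
        ],
    )
    # Assumes the model follows the JSON-only instruction; add error handling in practice.
    return json.loads(response.choices[0].message.content)

# Example usage (toy strings):
# scores = judge("Alice: ship the report by Friday...", "Alice owns the report, due Friday.")
# print(scores)  # e.g. {"accuracy": 5, "specificity": 4, "ownership": 5, "timeline": 5}
```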
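And here is a minimal sketch of the heuristic-based category: a Python script applying pre-defined rules to a meeting summary. The specific rules, regular expressions, and thresholds are illustrative assumptions, not a standard. Checks like these are cheap and deterministic, so they are often run on every output and used alongside human feedback and LLM judges rather than in place of them.

```python
# Minimal heuristic-eval sketch: rule-based checks on an action-item summary.
# The rules and thresholds below are illustrative assumptions, not a standard.
import re

def heuristic_checks(summary: str) -> dict:
    """Cheap, deterministic signals that complement human and LLM-based evals."""
    return {
        # Does at least one action item name an owner? (looks for "Owner:" labels)
        "has_owners": bool(re.search(r"owner:\s*\w+", summary, re.IGNORECASE)),
        # Is at least one deadline-like date mentioned?
        "has_deadline": bool(re.search(r"\b(\d{4}-\d{2}-\d{2}|by \w+day)\b", summary, re.IGNORECASE)),
        # Crude length guardrail: summaries under 20 words are probably too thin.
        "long_enough": len(summary.split()) >= 20,
    }

print(heuristic_checks("Owner: Alice. Ship the quarterly report by Friday ..."))
```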
Conclusion
Evals offer a comprehensive framework for assessing LLM outputs against user and task requirements, extending beyond traditional metrics. By focusing on subjective quality and domain-specific nuances, evals help ensure that high-stakes GenAI applications perform as intended. Implementing robust evaluations is vital for businesses to gain meaningful insight into AI system performance and make informed decisions in critical workflows.
About the Author
Rajat is a Senior Staff Product Manager for Analytics and AI at ServiceNow where he leads the building of AI & ML products that provide insights and Sales-ready information for 1000+ strong Sales and Product teams and leaders. Rajat is a prolific contributor to the Business Technology, SaaS and IT communities, as a judge, speaker and writer.
Disclaimer: The author is completely responsible for the content of this article. The opinions expressed are their own and do not represent IEEE’s position, nor that of the Computer Society, nor its Leadership.