Created
Aug 10, 2024 10:37 PM
Answer Correctness (or Correctness) is one of the most important and commonly used evaluation metrics for LLM applications. Correctness is typically scored from 0 to 1, with 1 indicating a correct answer and 0 indicating an incorrect one.
info
Although numerous general-purpose Correctness metrics exist, our users find it most useful to create a custom Correctness metric for their custom LLM application. In deepeval, this can be accomplished through G-Eval.
Assessing Correctness involves comparing an LLM's actual output with the ground truth, but the process is not as straightforward as it may seem. There are important things to consider such as:
  • Determining what constitutes your ground truth (selecting evaluation parameters)
  • Defining the evaluation steps/criteria for assessing actual output against ground truth
  • Establishing an appropriate threshold that separates passing Correctness scores from failing ones

How to create your Correctness Metric

1. Instantiate a GEval object

Begin creating your Correctness metric by instantiating a GEval object, choosing your evaluation LLM, and naming the metric accordingly.
tip
G-Eval is most effective when employing a model from the GPT-4 model family as your evaluation LLM, especially when it comes to assessing correctness.
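A minimal sketch of this step, assuming deepeval is installed and that `gpt-4o` is an acceptable evaluation model (any GPT-4-family model name could be substituted). The helper name `make_correctness_metric` is hypothetical; `GEval` and its `name`, `model`, `evaluation_params`, and `evaluation_steps` arguments come from deepeval. The import sits inside the function so the snippet loads even where deepeval is absent.

```python
def make_correctness_metric(evaluation_params, evaluation_steps, model="gpt-4o"):
    """Return a G-Eval metric named "Correctness".

    `evaluation_params` and `evaluation_steps` are covered in steps 2 and 3.
    The default model name is an assumption; swap in your own evaluation LLM.
    """
    # Imported lazily so this module loads even where deepeval is not installed.
    from deepeval.metrics import GEval

    return GEval(
        name="Correctness",
        model=model,
        evaluation_params=evaluation_params,
        evaluation_steps=evaluation_steps,
    )
```

Keeping the parameters and steps as arguments makes it easy to try several definitions of Correctness against the same evaluation LLM.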

2. Select your evaluation parameters

G-Eval allows you to select parameters that are relevant for evaluation by providing a list of LLMTestCaseParams, which includes:
  • LLMTestCaseParams.INPUT
  • LLMTestCaseParams.ACTUAL_OUTPUT
  • LLMTestCaseParams.EXPECTED_OUTPUT
  • LLMTestCaseParams.CONTEXT
  • LLMTestCaseParams.RETRIEVAL_CONTEXT
ACTUAL_OUTPUT should always be included in your evaluation_params, as this is what every Correctness metric directly evaluates. As mentioned earlier, Correctness is determined by how well the actual output aligns with the ground truth, and what counts as ground truth varies between applications. The ground truth is best represented by EXPECTED_OUTPUT: the expected output serves as the ideal reference for the actual output, with an exact match earning a score of 1.
If the expected output is unavailable, you can alternatively compare the actual output with the CONTEXT, which serves as the ideal retrieval context for a RAG application. This comparison comes with its own set of evaluation criteria, however, which we will explore in the following step.
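The choice described above could be sketched as follows. The helper names `choose_param_names` and `to_evaluation_params` are hypothetical; only `LLMTestCaseParams` and its members come from deepeval, and the lookup by member name assumes it is a standard Python enum.

```python
def choose_param_names(has_expected_output: bool) -> list[str]:
    """Pick the names of the LLMTestCaseParams members to evaluate.

    ACTUAL_OUTPUT is always included; the reference is EXPECTED_OUTPUT when a
    labelled answer exists and CONTEXT otherwise.
    """
    reference = "EXPECTED_OUTPUT" if has_expected_output else "CONTEXT"
    return ["ACTUAL_OUTPUT", reference]


def to_evaluation_params(names: list[str]):
    """Convert member names into LLMTestCaseParams values for `evaluation_params`."""
    # Imported lazily so the helper above can be used without deepeval installed.
    from deepeval.test_case import LLMTestCaseParams

    return [LLMTestCaseParams[name] for name in names]
```

For a dataset with labelled answers, `to_evaluation_params(choose_param_names(True))` would yield the list to pass as `evaluation_params`.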

3. Define your evaluation criteria

G-Eval lets you either provide criteria from which it generates evaluation steps to assess your evaluation_params, or directly input the evaluation steps yourself. It's always recommended to supply your own evaluation_steps when building a custom Correctness metric, as this allows you to have more control over how Correctness is defined.
Here is a simple example of how one might define a basic Correctness metric:
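One possible version of such a basic metric is sketched below. The wording of the steps is illustrative rather than prescribed, the model name `gpt-4o` is an assumption, and the function name `build_basic_correctness_metric` is hypothetical; `GEval` and `LLMTestCaseParams` are deepeval's own classes.

```python
# Illustrative wording only; adjust to your own definition of Correctness.
BASIC_CORRECTNESS_STEPS = [
    "Check whether any fact in the actual output contradicts a fact in the expected output.",
    "Lightly penalize omitted details and focus on whether the main idea is preserved.",
    "Vague language and differing opinions are acceptable.",
]


def build_basic_correctness_metric(model="gpt-4o"):
    """Build a GEval metric that compares ACTUAL_OUTPUT with EXPECTED_OUTPUT."""
    # Lazy import: deepeval is only required when the function is called.
    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCaseParams

    return GEval(
        name="Correctness",
        model=model,  # assumed model name; use your own evaluation LLM
        evaluation_params=[
            LLMTestCaseParams.EXPECTED_OUTPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
        ],
        evaluation_steps=BASIC_CORRECTNESS_STEPS,
    )
```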
Here's a more complex set of evaluation_steps, where detail is crucial to ensuring Correctness:
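A stricter set of steps might look like the sketch below. The wording is illustrative and the constant name `DETAILED_CORRECTNESS_STEPS` is hypothetical; the list would be passed as the `evaluation_steps` argument of `GEval`.

```python
# Illustrative steps for a detail-oriented Correctness metric.
# Pass this list as `evaluation_steps=` when constructing GEval.
DETAILED_CORRECTNESS_STEPS = [
    "Compare the actual output directly with the expected output to verify factual accuracy.",
    "Check that every element mentioned in the expected output is present "
    "and correctly represented in the actual output.",
    "Heavily penalize any discrepancy in details, numerical values, or other "
    "information between the actual and expected outputs.",
]
```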
Here's another example metric which prioritizes general factual correctness over minutiae:
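A more forgiving set of steps could be sketched like this. Again the wording is illustrative and the constant name `LENIENT_CORRECTNESS_STEPS` is hypothetical; the list is meant for the `evaluation_steps` argument of `GEval`.

```python
# Illustrative steps that favour overall factual agreement over fine detail.
# Pass this list as `evaluation_steps=` when constructing GEval.
LENIENT_CORRECTNESS_STEPS = [
    "Check whether any fact in the actual output contradicts a fact in the expected output.",
    "Do not penalize missing minor details as long as the main facts are correct.",
    "Differences in wording, tone, or opinion are acceptable.",
]
```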
Each evaluation dataset is unique, so it's important to iteratively adjust your evaluation_steps until your Correctness metric produces scores that align with your expectations. Whether this means giving more importance to detail, numerical values, structure, or even defining a new set of evaluation steps relative to the context instead of the expected output, is up for experimentation. The key is to keep refining the metrics until they deliver the desired scores.
note
G-Eval metrics remain relatively stable across multiple evaluations, despite the variability of LLM responses. Therefore, once you establish a satisfactory set of evaluation_steps, your Correctness metric should be relatively robust.
Congratulations 🎉! You've just learnt how to build a Correctness metric for your custom LLM application. In the following sections, we'll go over how to iterate on your evaluation_steps and how to select an appropriate threshold for your Correctness metric.

Iterating on your evaluation_steps

You may wonder what it means to iterate on your Correctness metric until it aligns with your expectations. The answer is to have expectations! Once you establish an evaluation dataset and decide to assess your test cases for correctness, it's essential to establish a baseline benchmark by initially identifying which cases should score well and which should not, based on the needs of your LLM application.
Here is an example based on a detail-oriented Correctness metric:
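A baseline of this kind can be written down as plain data before any evaluation is run. The sketch below uses hypothetical test cases and a hypothetical `alignment` helper; the default cutoff of 0.5 is an assumption, and nothing here depends on deepeval.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkCase:
    """One test case plus the outcome you expect from the metric."""
    input: str
    expected_output: str
    actual_output: str
    should_pass: bool


QUESTION = "How tall is Mount Everest?"
REFERENCE = "Mount Everest is 8,849 metres tall."

# Hypothetical cases for a detail-oriented metric: a wrong or missing number
# should fail even when the rest of the answer is reasonable.
BENCHMARK = [
    BenchmarkCase(QUESTION, REFERENCE,
                  "Mount Everest stands 8,849 metres above sea level.", True),
    BenchmarkCase(QUESTION, REFERENCE,
                  "Mount Everest is about 8,000 metres tall.", False),
    BenchmarkCase(QUESTION, REFERENCE,
                  "Mount Everest is the tallest mountain on Earth.", False),
]


def alignment(scores, cases, threshold=0.5):
    """Fraction of cases where (score >= threshold) matches `should_pass`."""
    if len(scores) != len(cases):
        raise ValueError("scores and cases must have the same length")
    hits = sum((score >= threshold) == case.should_pass
               for score, case in zip(scores, cases))
    return hits / len(cases)
```

An alignment below 1.0 points at the cases where the metric and your expectations disagree, which is where the evaluation steps need attention.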
Having a benchmark helps guide the development of your metric, and the primary method to align your evaluations with this baseline is by adjusting your evaluation_steps, as detailed in step 3 above.

Finding the Right Threshold

You may initially achieve an 80% or even over 90% alignment with your expectations simply by tweaking the evaluation_steps. However, it's very common to hit a plateau at this stage. Identifying the correct threshold becomes essential at this point. It is the crucial step in refining your custom metric to fully meet your expectations, and it's much simpler than you think!

Step 1: Perform Correctness Evaluation

First, perform the Correctness evaluation on your dataset:
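One way to collect the scores is sketched below. It assumes the metric follows deepeval's convention that `measure(test_case)` runs the evaluation and stores the result on `metric.score`; the helper name `collect_scores` is hypothetical.

```python
def collect_scores(metric, test_cases):
    """Run `metric` on every test case and return the scores in order.

    `metric` is expected to behave like a deepeval metric: `measure(test_case)`
    runs the evaluation and leaves the result on `metric.score`.
    """
    scores = []
    for test_case in test_cases:
        metric.measure(test_case)
        scores.append(metric.score)
    return scores
```

With deepeval, `metric` would be the `GEval` object built earlier and each test case an `LLMTestCase`. Each `measure` call invokes the evaluation LLM, so a large dataset takes time and incurs API cost.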

Step 2: Determine the Threshold

Next, determine the percentage of test cases you expect to be correct, extract all the test scores, and calculate the threshold accordingly:
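The calculation can be sketched in plain Python. This is one reasonable reading of the step, not necessarily the exact formula the guide used: rank the scores from highest to lowest and take the score of the last case that should still pass. The function name and the `expected_pass_rate` parameter (a fraction between 0 and 1) are assumptions.

```python
def calculate_threshold(scores, expected_pass_rate):
    """Return the lowest score among the cases expected to pass.

    `expected_pass_rate` is the fraction of test cases you expect to be
    correct, for example 0.6 for 60%.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < expected_pass_rate <= 1:
        raise ValueError("expected_pass_rate must be in (0, 1]")

    ranked = sorted(scores, reverse=True)
    # Number of cases that should pass; at least one so the index is valid.
    n_pass = max(1, round(len(ranked) * expected_pass_rate))
    return ranked[n_pass - 1]


# Hypothetical scores: the top 60% (three of five) should pass.
example_threshold = calculate_threshold([0.9, 0.4, 0.8, 0.95, 0.3], 0.6)
```

The resulting value can then be passed as the `threshold` argument when constructing the `GEval` metric. Tied scores at the cutoff will let slightly more cases pass than the expected rate.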
By following these steps, you can fine-tune the threshold to ensure your evaluation metrics align closely with your expectations, achieving the level of precision required for your specific needs.
Alan_Hsu