Answer Correctness (or Correctness) is one of the most important and commonly used evaluation metrics for LLM applications. Correctness is typically scored from 0 to 1, with 1 indicating a correct answer and 0 indicating an incorrect one.
> **Info:** Although numerous general-purpose Correctness metrics exist, our users find it most useful to create a custom Correctness metric for their custom LLM application. In `deepeval`, this can be accomplished through G-Eval.

Assessing Correctness involves comparing an LLM's actual output with the ground truth, but the process is not as straightforward as it may seem. There are important things to consider, such as:
- Determining what constitutes your ground truth (selecting evaluation parameters)
- Defining the evaluation steps/criteria for assessing actual output against ground truth
- Establishing an appropriate threshold for your correctness score
## How to create your Correctness Metric
### 1. Instantiate a GEval object

Begin creating your Correctness metric by instantiating a `GEval` object, choosing your evaluation LLM, and naming the metric accordingly.

> **Tip:** G-Eval is most effective when employing a model from the GPT-4 model family as your evaluation LLM, especially when it comes to assessing correctness.
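As a minimal sketch (assuming a recent `deepeval` release; constructor arguments may vary slightly between versions), instantiation looks something like this:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Name the metric and choose the evaluation LLM. The evaluation
# parameters and criteria are covered in the next two steps.
correctness_metric = GEval(
    name="Correctness",
    model="gpt-4",  # your evaluation LLM
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```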
### 2. Select your evaluation parameters
G-Eval allows you to select parameters that are relevant for evaluation by providing a list of `LLMTestCaseParams`, which includes:

- `LLMTestCaseParams.INPUT`
- `LLMTestCaseParams.ACTUAL_OUTPUT`
- `LLMTestCaseParams.EXPECTED_OUTPUT`
- `LLMTestCaseParams.CONTEXT`
- `LLMTestCaseParams.RETRIEVAL_CONTEXT`
`ACTUAL_OUTPUT` should always be included in your `evaluation_params`, as this is what every Correctness metric directly evaluates. As mentioned earlier, Correctness is determined by how well the actual output aligns with the ground truth, and what serves as ground truth can vary. The ground truth is best represented by `EXPECTED_OUTPUT`: the expected output serves as the ideal reference for the actual output, with an exact match earning a score of 1.

If the expected output is unavailable, you can alternatively compare the actual output with the `CONTEXT`, which serves as the ideal retrieval context for a RAG application. This comparison comes with its own set of evaluation criteria, however, which we will explore in the following step.
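As a sketch, the two setups differ only in their `evaluation_params`:

```python
from deepeval.test_case import LLMTestCaseParams

# Ground truth available: compare the actual output against the expected output.
params_with_ground_truth = [
    LLMTestCaseParams.ACTUAL_OUTPUT,
    LLMTestCaseParams.EXPECTED_OUTPUT,
]

# No expected output: fall back to the ideal retrieval context instead.
params_with_context = [
    LLMTestCaseParams.ACTUAL_OUTPUT,
    LLMTestCaseParams.CONTEXT,
]
```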
### 3. Define your evaluation criteria
`GEval` lets you either provide a `criteria` from which it generates evaluation steps to assess your `evaluation_params`, or directly supply the `evaluation_steps` yourself. The latter is always recommended when building a custom Correctness metric, as it gives you more control over how Correctness is defined.

Here is a simple example of how one might define a basic Correctness metric:
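A minimal sketch (the evaluation step below is illustrative, not the only valid phrasing):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    model="gpt-4",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
    ],
)
```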
Here's a more complex set of `evaluation_steps`, where detail is crucial to ensuring Correctness:
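For example (a hypothetical set of steps; tune the wording to your own application):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

detail_oriented_metric = GEval(
    name="Correctness",
    model="gpt-4",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
        "Heavily penalize omission of detail present in the 'expected output'.",
        "Penalize vague or contradicting numbers, dates, and named entities.",
    ],
)
```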
Here's another example metric which prioritizes general factual correctness over minutiae:
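A sketch of such a more lenient variant (again, illustrative steps only):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

lenient_metric = GEval(
    name="Correctness",
    model="gpt-4",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    evaluation_steps=[
        "Determine whether the 'actual output' is factually consistent with the 'expected output' overall.",
        "Vague language or omission of minor details is acceptable.",
        "Only penalize clear factual contradictions, not differences in phrasing or level of detail.",
    ],
)
```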
Each evaluation dataset is unique, so it's important to iteratively adjust your `evaluation_steps` until your Correctness metric produces scores that align with your expectations. Whether this means giving more weight to detail, numerical values, or structure, or even defining a new set of evaluation steps relative to the context instead of the expected output, is up for experimentation. The key is to keep refining the metric until it delivers the desired scores.

> **Note:** G-Eval metrics remain relatively stable across multiple evaluations, despite the variability of LLM responses. Therefore, once you establish a satisfactory set of `evaluation_steps`, your Correctness metric should be relatively robust.

Congratulations 🎉! You've just learnt how to build a Correctness metric for your custom LLM application. In the next section, we'll go over how to select an appropriate threshold for your Correctness metric.
## Iterating on your evaluation_steps
You may wonder what it means to iterate on your Correctness metric until it aligns with your expectations. The answer is to have expectations! Once you've assembled an evaluation dataset and decided to assess your test cases for correctness, it's essential to establish a baseline benchmark by first identifying which cases should score well and which should not, based on the needs of your LLM application.
Here is an example based on a detail-oriented Correctness metric:
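A hypothetical benchmark (the test cases and the pass/fail expectations below are illustrative only) could look like this:

```python
from deepeval.test_case import LLMTestCase

benchmark = [
    # Should PASS: reproduces the expected output, including the exact date.
    LLMTestCase(
        input="When did Apollo 11 land on the moon?",
        actual_output="Apollo 11 landed on the moon on July 20, 1969.",
        expected_output="Apollo 11 landed on the moon on July 20, 1969.",
    ),
    # Should FAIL under a detail-oriented metric: correct in spirit,
    # but omits the specific date the expected output contains.
    LLMTestCase(
        input="When did Apollo 11 land on the moon?",
        actual_output="Apollo 11 landed on the moon in 1969.",
        expected_output="Apollo 11 landed on the moon on July 20, 1969.",
    ),
]
```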
Having a benchmark helps guide the development of your metric, and the primary way to align your evaluations with this baseline is by adjusting your `evaluation_steps`, as detailed in step 3 above.

## Finding the Right Threshold
You may initially achieve 80% or even over 90% alignment with your expectations simply by tweaking the `evaluation_steps`. However, it's very common to hit a plateau at this stage, and identifying the correct threshold then becomes essential. It is the final step in refining your custom metric to fully meet your expectations, and it's much simpler than you might think!

### Step 1: Perform Correctness Evaluation
First, perform the Correctness evaluation on your dataset:
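Continuing from the `benchmark` list and the metric defined above (the `measure` call and `score` attribute follow `deepeval`'s metric interface; verify against your version), a sketch:

```python
# Score every test case with the Correctness metric and collect the results.
scores = []
for test_case in benchmark:
    correctness_metric.measure(test_case)
    scores.append(correctness_metric.score)
```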
### Step 2: Determine the Threshold
Next, determine the percentage of test cases you expect to be correct, extract all the test scores, and calculate the threshold accordingly:
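One simple heuristic (an assumption, not the only approach): if you expect 80% of your test cases to be correct, place the threshold at the 20th percentile of the collected scores, so that roughly the top 80% pass:

```python
import numpy as np

expected_pass_rate = 0.8  # assumption: 80% of test cases should count as correct

# Place the cutoff so that roughly the top 80% of scores pass.
threshold = float(np.percentile(scores, (1 - expected_pass_rate) * 100))

correctness_metric.threshold = threshold
```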
By following these steps, you can fine-tune the threshold to ensure your evaluation metrics align closely with your expectations, achieving the level of precision required for your specific needs.