Answer Correctness (or Correctness) is one of the most important and commonly used evaluation metrics for LLM applications. Correctness is typically scored from 0 to 1, with 1 indicating a correct answer and 0 indicating an incorrect one.
> **Info:** Although numerous general-purpose Correctness metrics exist, our users find it most useful to create a custom Correctness metric for their custom LLM application. In `deepeval`, this can be accomplished through G-Eval.

Assessing Correctness involves comparing an LLM's actual output with the ground truth, but the process is not as straightforward as it may seem. There are important things to consider, such as:
- Determining what constitutes your ground truth (selecting evaluation parameters)
- Defining the evaluation steps/criteria for assessing actual output against ground truth
- Establishing an appropriate threshold for your correctness score
## How to create your Correctness Metric
### 1. Instantiate a GEval object

Begin creating your Correctness metric by instantiating a `GEval` object, choosing your evaluation LLM, and naming the metric accordingly.

> **Tip:** G-Eval is most effective when employing a model from the GPT-4 model family as your evaluation LLM, especially when it comes to assessing correctness.
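As a minimal sketch (assuming a recent `deepeval` release; constructor arguments may vary slightly between versions), instantiation looks something like this:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Name the metric and choose the evaluation LLM. The evaluation
# parameters and criteria are covered in the next two steps.
correctness_metric = GEval(
    name="Correctness",
    model="gpt-4",  # your evaluation LLM
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```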
### 2. Select your evaluation parameters
G-Eval allows you to select parameters that are relevant for evaluation by providing a list of `LLMTestCaseParams`, which includes:

- `LLMTestCaseParams.INPUT`
- `LLMTestCaseParams.ACTUAL_OUTPUT`
- `LLMTestCaseParams.EXPECTED_OUTPUT`
- `LLMTestCaseParams.CONTEXT`
- `LLMTestCaseParams.RETRIEVAL_CONTEXT`
`ACTUAL_OUTPUT` should always be included in your `evaluation_params`, as this is what every Correctness metric directly evaluates. As mentioned earlier, Correctness is determined by how well the actual output aligns with the ground truth, and what serves as ground truth can vary. The ground truth is best represented by `EXPECTED_OUTPUT`: the expected output serves as the ideal reference for the actual output, with an exact match earning a score of 1.

If the expected output is unavailable, you can alternatively compare the actual output with the `CONTEXT`, which serves as the ideal retrieval context for a RAG application. This comparison comes with its own set of evaluation criteria, however, which we will explore in the following step.
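As a sketch, the two setups differ only in their `evaluation_params`:

```python
from deepeval.test_case import LLMTestCaseParams

# Ground truth available: compare the actual output against the expected output.
params_with_ground_truth = [
    LLMTestCaseParams.ACTUAL_OUTPUT,
    LLMTestCaseParams.EXPECTED_OUTPUT,
]

# No expected output: fall back to the ideal retrieval context instead.
params_with_context = [
    LLMTestCaseParams.ACTUAL_OUTPUT,
    LLMTestCaseParams.CONTEXT,
]
```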
### 3. Define your evaluation criteria
`GEval` lets you either provide a `criteria` from which it generates evaluation steps to assess your `evaluation_params`, or directly supply the `evaluation_steps` yourself. The latter is always recommended when building a custom Correctness metric, as it gives you more control over how Correctness is defined.

Here is a simple example of how one might define a basic Correctness metric:
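A minimal sketch (the evaluation step below is illustrative, not the only valid phrasing):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    model="gpt-4",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
    ],
)
```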
Here's a more complex set of `evaluation_steps`, where detail is crucial to ensuring Correctness:
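For example (a hypothetical set of steps; tune the wording to your own application):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

detail_oriented_metric = GEval(
    name="Correctness",
    model="gpt-4",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
        "Heavily penalize omission of detail present in the 'expected output'.",
        "Penalize vague or contradicting numbers, dates, and named entities.",
    ],
)
```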
Here's another example metric which prioritizes general factual correctness over minutiae:
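A sketch of such a more lenient variant (again, illustrative steps only):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

lenient_metric = GEval(
    name="Correctness",
    model="gpt-4",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    evaluation_steps=[
        "Determine whether the 'actual output' is factually consistent with the 'expected output' overall.",
        "Vague language or omission of minor details is acceptable.",
        "Only penalize clear factual contradictions, not differences in phrasing or level of detail.",
    ],
)
```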
Each evaluation dataset is unique, so it's important to iteratively adjust your `evaluation_steps` until your Correctness metric produces scores that align with your expectations. Whether this means giving more weight to detail, numerical values, or structure, or even defining a new set of evaluation steps relative to the context instead of the expected output, is up for experimentation. The key is to keep refining the metric until it delivers the desired scores.

> **Note:** G-Eval metrics remain relatively stable across multiple evaluations, despite the variability of LLM responses. Therefore, once you establish a satisfactory set of `evaluation_steps`, your Correctness metric should be relatively robust.

Congratulations 🎉! You've just learnt how to build a Correctness metric for your custom LLM application. In the next section, we'll go over how to select an appropriate threshold for your Correctness metric.
## Iterating on your evaluation_steps
You may wonder what it means to iterate on your Correctness metric until it aligns with your expectations. The answer is to have expectations! Once you've assembled an evaluation dataset and decided to assess your test cases for correctness, it's essential to establish a baseline benchmark by first identifying which cases should score well and which should not, based on the needs of your LLM application.
Here is an example based on a detail-oriented Correctness metric:
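A hypothetical benchmark (the test cases and the pass/fail expectations below are illustrative only) could look like this:

```python
from deepeval.test_case import LLMTestCase

benchmark = [
    # Should PASS: reproduces the expected output, including the exact date.
    LLMTestCase(
        input="When did Apollo 11 land on the moon?",
        actual_output="Apollo 11 landed on the moon on July 20, 1969.",
        expected_output="Apollo 11 landed on the moon on July 20, 1969.",
    ),
    # Should FAIL under a detail-oriented metric: correct in spirit,
    # but omits the specific date the expected output contains.
    LLMTestCase(
        input="When did Apollo 11 land on the moon?",
        actual_output="Apollo 11 landed on the moon in 1969.",
        expected_output="Apollo 11 landed on the moon on July 20, 1969.",
    ),
]
```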
Having a benchmark helps guide the development of your metric, and the primary way to align your evaluations with this baseline is by adjusting your `evaluation_steps`, as detailed in step 3 above.

## Finding the Right Threshold
You may initially achieve 80% or even over 90% alignment with your expectations simply by tweaking the `evaluation_steps`. However, it's very common to hit a plateau at this stage, and identifying the correct threshold then becomes essential. It is the final step in refining your custom metric to fully meet your expectations, and it's much simpler than you might think!

### Step 1: Perform Correctness Evaluation
First, perform the Correctness evaluation on your dataset:
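Continuing from the `benchmark` list and the metric defined above (the `measure` call and `score` attribute follow `deepeval`'s metric interface; verify against your version), a sketch:

```python
# Score every test case with the Correctness metric and collect the results.
scores = []
for test_case in benchmark:
    correctness_metric.measure(test_case)
    scores.append(correctness_metric.score)
```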
### Step 2: Determine the Threshold
Next, determine the percentage of test cases you expect to be correct, extract all the test scores, and calculate the threshold accordingly:
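One simple heuristic (an assumption, not the only approach): if you expect 80% of your test cases to be correct, place the threshold at the 20th percentile of the collected scores, so that roughly the top 80% pass:

```python
import numpy as np

expected_pass_rate = 0.8  # assumption: 80% of test cases should count as correct

# Place the cutoff so that roughly the top 80% of scores pass.
threshold = float(np.percentile(scores, (1 - expected_pass_rate) * 100))

correctness_metric.threshold = threshold
```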
By following these steps, you can fine-tune the threshold to ensure your evaluation metrics align closely with your expectations, achieving the level of precision required for your specific needs.