unitxt.llm_as_judge module

class unitxt.llm_as_judge.LLMJudge(data_classification_policy: List[str] = None, main_score: str = __required__, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = __required__, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], inference_engine: unitxt.inference.InferenceEngine = __required__, evaluator_name: EvaluatorNameEnum = None, check_positional_bias: bool = True, context_fields: Union[str, List[str], Dict[str, str]] = ['context'], generate_summaries: bool = False, format: str = 'formats.chat_api', include_prompts_in_result: bool = True, criteria_field: str = None, criteria: unitxt.llm_as_judge_constants.Criteria = None)[source]

Bases: BulkInstanceMetric

A metric class to evaluate instances using LLM as a Judge.

Evaluations are performed in two steps. First, the LLM is asked to generate an assessment following a chain-of-thought (CoT) approach based on the criteria. Then, the same LLM is asked to select one of the available options. Optionally, a summary of the assessment can be generated for easy consumption by end users.
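
A minimal configuration sketch is shown below, using the LLMJudgeDirect subclass documented later on this page. The model id, provider, and the "question" context field are illustrative, and criteria_field="criteria" assumes each task_data instance carries its own criteria; the remaining arguments are the flow-control fields from the signature above.

from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgeDirect

judge = LLMJudgeDirect(
    inference_engine=CrossProviderInferenceEngine(
        model="llama-3-3-70b-instruct",  # illustrative model id
        provider="watsonx",              # illustrative provider
    ),
    criteria_field="criteria",       # read the criteria from each task_data instance
    context_fields=["question"],     # task_data fields shown to the judge as context
    check_positional_bias=True,      # repeat the option selection with the options reversed
    generate_summaries=False,        # set True to add an LLM-written summary of the assessment
    include_prompts_in_result=True,  # keep the judge prompts in the returned results
)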

before_process_multi_stream()[source]

Checks the correctness of the criteria-related fields before processing multiple streams.

Raises:

UnitxtError – If neither ‘criteria’ nor ‘criteria_field’ is set.

clean_results(results: dict | list)[source]

Cleans the results by removing None values and empty lists and dictionaries.

Parameters:

results (Union[dict, list]) – The results to clean.

Returns:

The cleaned results.

Return type:

Union[dict, list]

context_fields: str | List[str] | Dict[str, str] = ['context']

Fields to be used as context. If a dict is provided, the keys are used as the final names in the prompts, while the values are used to access the context variable values in the task_data object.
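
The variants below are illustrative (the task_data field names are hypothetical); they show the string, list, and dict forms described above.

context_fields = "question"                         # a single task_data field
context_fields = ["question", "reference_answer"]   # several fields, shown under their own names
context_fields = {
    "Question": "question",              # key: the name used in the judge prompt
    "Ground truth": "reference_answer",  # value: the key looked up in task_data
}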

get_contexts(task_data: List[Dict[str, Any]]) List[Dict[str, str]][source]

Extracts and parses context fields from task data.

Parameters:

task_data (List[Dict[str, Any]]) – The task data containing context information.

Returns:

A list of parsed context dictionaries.

Return type:

List[Dict[str, str]]

get_criteria(task_data, eval_count)[source]

Retrieves the evaluation criteria from the criteria_field or from self.

Parameters:
  • task_data (List[Dict[str, Any]]) – The task data containing criteria information.

  • eval_count (int) – The number of evaluations to perform.

Returns:

A list of criteria for evaluation.

Return type:

List[Criteria]

Raises:

UnitxtError – If the criteria field is not found in the task data.

perform_evaluation_step(instances: list, task: Task, template: Template, previous_messages: List[Dict[str, str]] | None = None)[source]

Performs an evaluation step by generating predictions for the given instances.

Parameters:
  • instances (list) – The list of instances to evaluate.

  • task (Task) – The task associated with the instances.

  • template (Template) – The template used for generating predictions.

  • previous_messages (Optional[List[Dict[str, str]]]) – Previous messages for context.

Returns:

A tuple containing prompts, raw predictions, and processed predictions. Raw predictions differ from processed predictions only in the completion step, where the processors.match_closest_option is used.

Return type:

Tuple[List[str], List[str], List[str]]

prepare()[source]

Prepares the LLMJudge instance by setting up context fields and evaluator name.

class unitxt.llm_as_judge.LLMJudgeDirect(data_classification_policy: List[str] = None, main_score: str = 'llm_as_judge', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['llm_as_judge']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], inference_engine: unitxt.inference.InferenceEngine = __required__, evaluator_name: EvaluatorNameEnum = None, check_positional_bias: bool = True, context_fields: Union[str, List[str], Dict[str, str]] = ['context'], generate_summaries: bool = False, format: str = 'formats.chat_api', include_prompts_in_result: bool = True, criteria_field: str = None, criteria: unitxt.llm_as_judge_constants.CriteriaWithOptions = None)[source]

Bases: LLMJudge

LLMJudgeDirect is a specialized evaluation metric that performs Direct Assessment, using an LLM to score responses against a predefined evaluation criterion.

Direct Assessment is an evaluation paradigm in which the LLM selects one of a predefined set of options based on an assessment criterion. This approach can be used for Likert-scale scoring (e.g., 1-5) or selecting from semantically conditioned literals (e.g., Yes/No, Pass/Fail).
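
The snippet below sketches an options-based (pass/fail) setup. CriteriaWithOptions appears in this class’s signature and option_map is referenced in the compute() documentation below; the CriteriaOption helper, its exact constructor fields, and the model/provider values are assumptions that may differ from the installed version.

from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgeDirect
from unitxt.llm_as_judge_constants import CriteriaOption, CriteriaWithOptions

groundedness = CriteriaWithOptions(
    name="answer_groundedness",
    description="Is every claim in the response supported by the provided context?",
    options=[
        CriteriaOption(name="Yes", description="All claims are supported by the context."),
        CriteriaOption(name="No", description="At least one claim is unsupported."),
    ],
    # option_map turns the selected option into the numeric score reported by the metric
    option_map={"Yes": 1.0, "No": 0.0},
)

judge = LLMJudgeDirect(
    inference_engine=CrossProviderInferenceEngine(model="llama-3-3-70b-instruct", provider="watsonx"),
    criteria=groundedness,
    context_fields=["context"],
)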

before_process_multi_stream()[source]

Ensures that the criteria is of type CriteriaWithOptions, raising an exception otherwise.

compute(references: List[List[str]], predictions: List[str], task_data: List[Dict[str, Any]]) List[Dict][source]

Performs direct assessment evaluation on the given predictions and references.

This method evaluates the quality of the predictions by calculating scores for each instance based on a criterion.

Returns:

List[Dict]

A list of dictionaries containing the evaluation results for each instance. The results include the computed scores for each prediction. Each result will have the score_name as a prefix, which is the criterion name if only one criterion is used, or “llm_as_judge” if several criteria were used.

Explanation of fields:

  • score: a float representing the evaluation score for the response. The value is calculated from criteria.option_map[selected_option].

  • using_<evaluator_name>: Equal to score.

  • positional_bias: Boolean indicating whether the assessment detected positional bias. Its final value is selected_option != positional_bias_selected_option

  • selected_option: The criteria option that the evaluator chose (e.g., “Could be Improved”). It is calculated by processing option_selection_completion using processors.match_closest_option

  • positional_bias_selected_option: The criteria option that the evaluator chose when checking positional bias.

  • assessment: The inference engine’s generated text using the prompts.assessment prompt.

  • positional_bias_assessment: The inference engine’s generated text using the prompts.positional_bias_assessment prompt.

  • summary: An LLM-generated summary of the assessment.

  • positional_bias_summary: An LLM-generated summary of the positional bias assessment.

  • prompts: A dictionary of prompts used in different stages of evaluation.
    • assessment: The prompt used to instruct the model on how to assess the response.

    • positional_bias_assessment: The prompt used to instruct the model on how to assess the response in the positional bias check.

    • summarization: The prompt used to generate a summary of the assessment.

    • option_selection: The prompt used to generate a final judgement.

    • positional_bias_option_selection: The prompt used to generate a final judgement in the positional bias check.

  • option_selection_completion: The inference engine’s generated text using prompts.option_selection.

  • positional_bias_option_selection_completion: The inference engine’s generated text using prompts.positional_bias_option_selection.

  • criteria: A JSON-like string representing the evaluation criteria’s artifact.

Result example:

[
    {
        "answer_relevance": 1,
        "answer_relevance_using_granite3.0-2b_litellm": 1,
        "answer_relevance_positional_bias": false,
        "answer_relevance_selected_option": "Could be Improved",
        "answer_relevance_positional_bias_selected_option": "Could be Improved",
        "answer_relevance_assessment": "To assess the quality of the response, l...",
        "answer_relevance_positional_bias_assessment": "To assess the quality of the response, l...",
        "answer_relevance_summary": "A response about apprenticeships during ...",
        "answer_relevance_positional_bias_summary": "A response about apprenticeships during ...",
        "answer_relevance_prompts": {
            "assessment": [
                {
                    "role": "user",
                    "content": "You are presented with a response gener..."
                }
            ],
            "positional_bias_assessment": [
                {
                    "role": "user",
                    "content": "You are presented with a response gener..."
                }
            ],
            "summarization": [
                {
                    "role": "user",
                    "content": "Transform the following assessment into ..."
                }
            ],
            "option_selection": [
                {
                    "content": "You are presented with a response gener...",
                    "role": "user"
                },
                {
                    "content": "To assess the quality of the response, l...",
                    "role": "assistant"
                },
                {
                    "content": "Now consider the evaluation criteria and...",
                    "role": "user"
                }
            ],
            "posional_bias_option_selection": [
                {
                    "content": "You are presented with a response gener...",
                    "role": "user"
                },
                {
                    "content": "To assess the quality of the response, l...",
                    "role": "assistant"
                },
                {
                    "content": "Now consider the evaluation criteria and...",
                    "role": "user"
                }
            ]
        },
        "answer_relevance_option_selection_completion": "Could be Improved",
        "answer_relevance_positional_bias_option_selection_completion": "Could be Improved",
        "answer_relevance_criteria": "{    \"__type__\": \"criteria_with_options..."
    }
]

reduction_map: Dict[str, List[str]] = {'mean': ['llm_as_judge']}

A mapping used for score aggregation. By default, it will take the value of {'mean': [<default_main_score_name>]}.

class unitxt.llm_as_judge.LLMJudgePairwise(data_classification_policy: List[str] = None, main_score: str = '1_winrate', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['score']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], inference_engine: unitxt.inference.InferenceEngine = __required__, evaluator_name: EvaluatorNameEnum = None, check_positional_bias: bool = True, context_fields: Union[str, List[str], Dict[str, str]] = ['context'], generate_summaries: bool = False, format: str = 'formats.chat_api', include_prompts_in_result: bool = True, criteria_field: str = None, criteria: unitxt.llm_as_judge_constants.Criteria = None)[source]

Bases: LLMJudge

A judge for pairwise comparison evaluations, where two or more responses are compared to determine which one is preferred based on a criterion.
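
A hedged configuration sketch follows. The Criteria description field and the model/provider values are assumptions; the shape of predictions (one dict of system responses per instance, or one list per instance) follows the compute() documentation below.

from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgePairwise
from unitxt.llm_as_judge_constants import Criteria

pairwise_judge = LLMJudgePairwise(
    inference_engine=CrossProviderInferenceEngine(model="llama-3-3-70b-instruct", provider="watsonx"),
    criteria=Criteria(
        name="answer_relevance",
        description="Which response better answers the user's question?",
    ),
    context_fields=["question"],
)

# One instance with three competing systems: the judge runs every pairwise
# contest and reports a per-system winrate and ranking (see the result example below).
predictions = [
    {
        "system1": "The temperature is 30 degrees Celsius.",
        "system2": "It is 30 degrees.",
        "system3": "I do not know.",
    }
]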

before_process_multi_stream()[source]

Verifies that the criteria is of the correct type before processing the multi-stream data.

compute(references: List[List[str]], predictions: List[str], task_data: List[Dict[str, str]]) List[Dict][source]

Executes the pairwise comparison evaluation, including assessment, summarization, and option selection. It computes the winrate and ranking for the responses.

Parameters:
  • references (List[List[str]]) – A list of reference responses for comparison.

  • predictions (List[str]) – A list of predicted responses.

  • task_data (List[Dict[str, str]]) – Task data to be used for evaluation.

Returns:

List[Dict[str,Dict]]

The results of the evaluation, including winrate, ranking, and other metrics.

For each instance result, the following metrics are included per response/system. Each metric name is prefixed with the system name (if predictions were provided as a list of dicts) or with the system’s 1-based index (if predictions were provided as a list of lists).

All list-valued fields are arrays of length len(systems) - 1. For any index i, contest_results[i] is the outcome of this response’s contest against compared_to[i].

Explanation of fields:

  • summaries: A list of LLM-generated summaries explaining the comparison results for each response.

  • contest_results: A list of boolean values indicating whether the response won in each comparison.

  • selections: A list of the selected system names, representing the preferred response in each comparison.

  • compared_to: A list of system names that were compared against the given response.

  • assessments: A list of LLM-generated assessments explaining the reasoning behind the evaluation results.

  • positional_bias_assessments: A list of LLM-generated assessments focused on detecting positional bias in the evaluation.

  • option_selection_outputs: A list of response names selected as the best choice based on the evaluation.

  • positional_bias: A list of boolean values indicating whether positional bias was detected in the contest.

  • positional_bias_selection: A list of response names representing the selected option when considering positional bias.

  • prompts: A dictionary of prompts used in different stages of evaluation.
    • assessment: The prompt used to instruct the model on how to assess the responses.

    • positional_bias_assessment: The prompt used to instruct the model on how to assess positional bias.

    • option_selection: The prompt used to guide the model in selecting the best response.

    • positional_bias_option_selection: The prompt used for selecting the best response while checking for positional bias.

    • summary: The prompt used to generate a summary of the assessment.

  • winrate: A float representing the proportion of comparisons the response won.

  • llm_as_judge: Equal to winrate.

  • ranking: An integer representing the ranking position of the response based on the evaluation results. Best is 1.

  • response_name: A string identifying the response in the evaluation.

Result example:

[
    {
        "system1_contest_results": [
            true,
            true
        ],
        "system1_selections": [
            "system1",
            "system1"
        ],
        "system1_compared_to": [
            "system2",
            "system3"
        ],
        "system1_assessments": [
            "To determine the better response accordi...",
            "To determine the better response accordi..."
        ],
        "system1_positional_bias_assessments": [
            "To determine the better response accordi...",
            "To determine the better response accordi..."
        ],
        "system1_option_selection_outputs": [
            "system1",
            "system1"
        ],
        "system1_positional_bias": [
            false,
            false
        ],
        "system1_positional_bias_selection": [
            "system1",
            "system1"
        ],
        "system1_prompts": {
            "assessment": [
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ],
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ]
            ],
            "positional_bias_assessment": [
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ],
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ]
            ],
            "option_selection": [
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ],
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ]
            ],
            "positional_bias_option_selection": [
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ],
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ]
            ]
        },
        "system1_winrate": 1.0,
        "system1_llm_as_judge": 1.0,
        "system1_ranking": 1,
        "system1_response_name": "system1",
        "system2_contest_results": [
            false,
            true
        ],
        "system2_selections": [
            "system1",
            "system2"
        ],
        "system2_compared_to": [
            "system1",
            "system3"
        ],
        "system2_assessments": [
            "To determine the better response accordi...",
            "To determine the better response accordi..."
        ],
        "system2_positional_bias_assessments": [
            "To determine the better response accordi...",
            "To determine the better response accordi..."
        ],
        "system2_option_selection_outputs": [
            "system1",
            "system2"
        ],
        "system2_positional_bias": [
            false,
            false
        ],
        "system2_positional_bias_selection": [
            "system1",
            "system2"
        ],
        "system2_prompts": {
            "assessment": [
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ],
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ]
            ],
            "positional_bias_assessment": [
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ],
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ]
            ],
            "option_selection": [
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ],
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ]
            ],
            "positional_bias_option_selection": [
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ],
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ]
            ]
        },
        "system2_winrate": 0.5,
        "system2_llm_as_judge": 0.5,
        "system2_ranking": 2,
        "system2_response_name": "system2",
        "system3_contest_results": [
            false,
            false
        ],
        "system3_selections": [
            "system1",
            "system2"
        ],
        "system3_compared_to": [
            "system1",
            "system2"
        ],
        "system3_assessments": [
            "To determine the better response accordi...",
            "To determine the better response accordi..."
        ],
        "system3_positional_bias_assessments": [
            "To determine the better response accordi...",
            "To determine the better response accordi..."
        ],
        "system3_option_selection_outputs": [
            "system1",
            "system2"
        ],
        "system3_positional_bias": [
            false,
            false
        ],
        "system3_positional_bias_selection": [
            "system1",
            "system2"
        ],
        "system3_prompts": {
            "assessment": [
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ],
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ]
            ],
            "positional_bias_assessment": [
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ],
                [
                    {
                        "role": "user",
                        "content": "You are provided a pair of responses (Re..."
                    }
                ]
            ],
            "option_selection": [
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ],
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ]
            ],
            "positional_bias_option_selection": [
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ],
                [
                    {
                        "content": "You are provided a pair of responses (Re...",
                        "role": "user"
                    },
                    {
                        "content": "To determine the better response accordi...",
                        "role": "assistant"
                    },
                    {
                        "content": "Now considering the evaluation criteria,...",
                        "role": "user"
                    }
                ]
            ]
        },
        "system3_winrate": 0.0,
        "system3_llm_as_judge": 0.0,
        "system3_ranking": 3,
        "system3_response_name": "system3",
        "criteria": "{    \"__type__\": \"criteria\",    \"name\"..."
    }
]

prepare()[source]

Prepares the pairwise comparison by initializing the necessary templates and tasks. These tasks will be used to assess, summarize, and select options from candidate responses.

reduction_map: Dict[str, List[str]] = {'mean': ['score']}

A mapping specifying how scores should be reduced. By default, it will be {'mean': ['score']}.