unitxt.llm_as_judge module¶
- class unitxt.llm_as_judge.LLMJudge(data_classification_policy: List[str] = None, main_score: str = __required__, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = __required__, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], inference_engine: unitxt.inference.InferenceEngine = __required__, evaluator_name: EvaluatorNameEnum = None, check_positional_bias: bool = True, context_fields: Union[str, List[str], Dict[str, str]] = ['context'], generate_summaries: bool = False, format: str = 'formats.chat_api', include_prompts_in_result: bool = True, criteria_field: str = None, criteria: unitxt.llm_as_judge_constants.Criteria = None)[source]¶
Bases:
BulkInstanceMetric
A metric class to evaluate instances using LLM as a Judge.
Evaluations are performed in two steps. First, the LLM is asked to generate a chain-of-thought (CoT) assessment of the response based on the criteria. Then, the same LLM is asked to select one of the available options based on that assessment. Optionally, a summary of the assessment can be generated for easy consumption by end users.
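The following is a minimal, illustrative sketch (not the library's internal code) of how the two steps chain together as chat messages, mirroring the prompt structure visible in the result examples below; the prompt texts and option names are placeholders:

assessment_messages = [
    {"role": "user", "content": "You are presented with a response ... <criteria, context, response>"},
]
# Step 1: the judge model produces a chain-of-thought assessment.
assessment = "To assess the quality of the response, ..."

# Step 2: the assessment is appended to the conversation and the model is asked
# to choose one of the criterion's options; the completion is then mapped to the
# closest valid option (processors.match_closest_option).
option_selection_messages = assessment_messages + [
    {"role": "assistant", "content": assessment},
    {"role": "user", "content": "Now consider the evaluation criteria and choose one of: Excellent, Could be Improved, Bad."},
]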
- before_process_multi_stream()[source]¶
Checks the correctness of the criteria-related fields before processing multiple streams.
- Raises:
UnitxtError – If neither ‘criteria’ nor ‘criteria_field’ is set.
- clean_results(results: dict | list)[source]¶
Cleans the results by removing None values and empty lists and dictionaries.
- Parameters:
results (Union[dict, list]) – The results to clean.
- Returns:
The cleaned results.
- Return type:
Union[dict, list]
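A minimal standalone sketch of the cleaning behavior described above (the actual implementation is the LLMJudge.clean_results method; the function below is only illustrative):

def clean_results(results):
    # Recursively drop None values and empty lists/dicts from dicts and lists.
    if isinstance(results, dict):
        cleaned = {k: clean_results(v) for k, v in results.items()}
        return {k: v for k, v in cleaned.items() if v is not None and v != [] and v != {}}
    if isinstance(results, list):
        cleaned = [clean_results(v) for v in results]
        return [v for v in cleaned if v is not None and v != [] and v != {}]
    return results

clean_results({"a": None, "b": [], "c": {"d": 1, "e": {}}})  # -> {"c": {"d": 1}}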
- context_fields: str | List[str] | Dict[str, str] = ['context']¶
Fields to be used as context. If a dict is provided, the keys are used as the final names in the prompts, while the values are used to access the context variable values in the task_data object.
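Illustrative examples of the three accepted forms (the field names used here, such as "question", are placeholders rather than fields the library defines):

context_fields = "question"                      # a single task_data field
context_fields = ["question", "reference_doc"]   # several fields, shown under their own names
context_fields = {"Context": "question"}         # the prompt shows "Context", read from task_data["question"]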
- get_contexts(task_data: List[Dict[str, Any]]) List[Dict[str, str]] [source]¶
Extracts and parses context fields from task data.
- Parameters:
task_data (List[Dict[str, Any]]) – The task data containing context information.
- Returns:
A list of parsed context dictionaries.
- Return type:
List[Dict[str, str]]
- get_criteria(task_data, eval_count)[source]¶
Retrieves the evaluation criteria from the criteria_field or from self.
- Parameters:
task_data (List[Dict[str, Any]]) – The task data containing criteria information.
eval_count (int) – The number of evaluations to perform.
- Returns:
A list of criteria for evaluation.
- Return type:
List[Criteria]
- Raises:
UnitxtError – If the criteria field is not found in the task data.
- perform_evaluation_step(instances: list, task: Task, template: Template, previous_messages: List[Dict[str, str]] | None = None)[source]¶
Performs an evaluation step by generating predictions for the given instances.
- Parameters:
instances (list) – The instances to evaluate.
task (Task) – The task used for this evaluation step.
template (Template) – The template used to generate the prompts for this step.
previous_messages (Optional[List[Dict[str, str]]]) – Previous chat messages to prepend to the prompts, if any.
- Returns:
A tuple containing prompts, raw predictions, and processed predictions. Raw predictions differ from processed predictions only in the option selection (completion) step, where processors.match_closest_option is applied.
- Return type:
Tuple[List[str], List[str], List[str]]
- class unitxt.llm_as_judge.LLMJudgeDirect(data_classification_policy: List[str] = None, main_score: str = 'llm_as_judge', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['llm_as_judge']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], inference_engine: unitxt.inference.InferenceEngine = __required__, evaluator_name: EvaluatorNameEnum = None, check_positional_bias: bool = True, context_fields: Union[str, List[str], Dict[str, str]] = ['context'], generate_summaries: bool = False, format: str = 'formats.chat_api', include_prompts_in_result: bool = True, criteria_field: str = None, criteria: unitxt.llm_as_judge_constants.CriteriaWithOptions = None)[source]¶
Bases:
LLMJudge
LLMJudgeDirect is a specialized evaluation metric that performs Direct Assessment, using an LLM to score responses against a predefined evaluation criterion.
Direct Assessment is an evaluation paradigm in which the LLM selects one of a predefined set of options based on an assessment criterion. This approach can be used for Likert-scale scoring (e.g., 1-5) or selecting from semantically conditioned literals (e.g., Yes/No, Pass/Fail).
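A hedged usage sketch based on the constructor signature documented above. CrossProviderInferenceEngine, the CriteriaWithOptions/CriteriaOption constructor fields, and the model name are assumptions about the wider unitxt API rather than guarantees of this page:

from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgeDirect
from unitxt.llm_as_judge_constants import CriteriaOption, CriteriaWithOptions

# A three-option criterion; option_map turns the selected option into the numeric score.
criteria = CriteriaWithOptions(
    name="answer_relevance",
    description="Is the response relevant to the question?",
    options=[
        CriteriaOption(name="Excellent", description="The response fully answers the question."),
        CriteriaOption(name="Could be Improved", description="The response is only partially relevant."),
        CriteriaOption(name="Bad", description="The response is not relevant."),
    ],
    option_map={"Excellent": 1.0, "Could be Improved": 0.5, "Bad": 0.0},
)

metric = LLMJudgeDirect(
    inference_engine=CrossProviderInferenceEngine(model="llama-3-3-70b-instruct"),  # any InferenceEngine instance
    criteria=criteria,
    context_fields=["question"],   # task_data fields shown to the judge
    check_positional_bias=True,
    generate_summaries=False,
)

The resulting metric can then be used wherever unitxt expects a metric, producing the per-instance fields described under compute() below.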
- before_process_multi_stream()[source]¶
Ensures that the criteria is of type CriteriaWithOptions, raising an exception otherwise.
- compute(references: List[List[str]], predictions: List[str], task_data: List[Dict[str, Any]]) List[Dict] [source]¶
Performs direct assessment evaluation on the given predictions and references.
This method evaluates the quality of the predictions by calculating scores for each instance based on a criterion.
Returns:¶
- List[Dict]
A list of dictionaries containing the evaluation results for each instance. The results include the computed scores for each prediction. Each result field is prefixed with the score_name, which is the criterion name if a single criterion was used, or “llm_as_judge” if several criteria were used.
Explanation of fields:
score: a float representing the evaluation score for the response. The value is taken from criteria.option_map[selected_option] (see the sketch after the result example below).
using_<evaluator_name>: Equal to score.
positional_bias: Boolean indicating whether the assessment detected positional bias. Its final value is selected_option != positional_bias_selected_option.
selected_option: The criteria option that the evaluator chose (e.g., “Could be Improved”). It is calculated by processing option_selection_completion using processors.match_closest_option.
positional_bias_selected_option: The criteria option that the evaluator chose when checking positional bias.
assessment: The inference engine’s generated text using the prompts.assessment prompt.
positional_bias_assessment: The inference engine’s generated text using the prompts.positional_bias_assessment prompt.
summary: An LLM-generated summary of the assessment.
positional_bias_summary: An LLM-generated summary of the positional bias assessment.
- prompts: A dictionary of prompts used in different stages of evaluation.
assessment: The prompt used to instruct the model on how to assess the response.
positional_bias_assessment: The prompt used to instruct the model on how to assess the response in the positional bias check.
summarization: The prompt used to generate a summary of the assessment.
option_selection: The prompt used to generate a final judgement.
positional_bias_option_selection: The prompt used to generate a final judgement in the positional bias check.
option_selection_completion: The inference engine’s generated text using prompts.option_selection.
positional_bias_option_selection_completion: The inference engine’s generated text using prompts.positional_bias_option_selection.
criteria: A JSON-like string representing the evaluation criteria’s artifact.
Result example:
[ { "answer_relevance": 1, "answer_relevance_using_granite3.0-2b_litellm": 1, "answer_relevance_positional_bias": false, "answer_relevance_selected_option": "Could be Improved", "answer_relevance_positional_bias_selected_option": "Could be Improved", "answer_relevance_assessment": "To assess the quality of the response, l...", "answer_relevance_positional_bias_assessment": "To assess the quality of the response, l...", "answer_relevance_summary": "A response about apprenticeships during ...", "answer_relevance_positional_bias_summary": "A response about apprenticeships during ...", "answer_relevance_prompts": { "assessment": [ { "role": "user", "content": "You are presented with a response gener..." } ], "positional_bias_assessment": [ { "role": "user", "content": "You are presented with a response gener..." } ], "summarization": [ { "role": "user", "content": "Transform the following assessment into ..." } ], "option_selection": [ { "content": "You are presented with a response gener...", "role": "user" }, { "content": "To assess the quality of the response, l...", "role": "assistant" }, { "content": "Now consider the evaluation criteria and...", "role": "user" } ], "posional_bias_option_selection": [ { "content": "You are presented with a response gener...", "role": "user" }, { "content": "To assess the quality of the response, l...", "role": "assistant" }, { "content": "Now consider the evaluation criteria and...", "role": "user" } ] }, "answer_relevance_option_selection_completion": "Could be Improved", "answer_relevance_positional_bias_option_selection_completion": "Could be Improved", "answer_relevance_criteria": "{ \"__type__\": \"criteria_with_options..." } ]
- reduction_map: Dict[str, List[str]] = {'mean': ['llm_as_judge']}¶
A mapping used for score aggregation. By default, it will take the value of
{'mean': [<default_main_score_name>]}
.
- class unitxt.llm_as_judge.LLMJudgePairwise(data_classification_policy: List[str] = None, main_score: str = '1_winrate', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['score']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], inference_engine: unitxt.inference.InferenceEngine = __required__, evaluator_name: EvaluatorNameEnum = None, check_positional_bias: bool = True, context_fields: Union[str, List[str], Dict[str, str]] = ['context'], generate_summaries: bool = False, format: str = 'formats.chat_api', include_prompts_in_result: bool = True, criteria_field: str = None, criteria: unitxt.llm_as_judge_constants.Criteria = None)[source]¶
Bases:
LLMJudge
A judge for pairwise comparison evaluations, where two or more responses are compared to determine which one is preferred based on a criterion.
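A hedged usage sketch based on the constructor signature documented above. CrossProviderInferenceEngine, the Criteria constructor fields, and the model name are assumptions about the wider unitxt API rather than guarantees of this page:

from unitxt.inference import CrossProviderInferenceEngine
from unitxt.llm_as_judge import LLMJudgePairwise
from unitxt.llm_as_judge_constants import Criteria

metric = LLMJudgePairwise(
    inference_engine=CrossProviderInferenceEngine(model="llama-3-3-70b-instruct"),  # any InferenceEngine instance
    criteria=Criteria(
        name="answer_relevance",
        description="Which response better answers the user's question?",
    ),
    context_fields=["question"],
)

# Each prediction is then a group of responses to compare for one instance,
# e.g. {"system1": "...", "system2": "...", "system3": "..."} (a dict keyed by
# system name) or ["...", "...", "..."] (a list, whose results are indexed from 1).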
- before_process_multi_stream()[source]¶
Verifies that the criteria is of the correct type before processing the multi-stream data.
- compute(references: List[List[str]], predictions: List[str], task_data: List[Dict[str, str]]) List[Dict] [source]¶
Executes the pairwise comparison evaluation, including assessment, summarization, and option selection, and computes the winrate and ranking for each response (a minimal sketch of the winrate/ranking aggregation follows the result example below).
- Parameters:
references (List[List[str]]) – A list of reference responses for comparison.
predictions (List[str]) – A list of predicted responses.
task_data (List[Dict[str, str]]) – Task data to be used for evaluation.
Returns:¶
- List[Dict[str,Dict]]
The results of the evaluation, including winrate, ranking, and other metrics.
For each instance result, the following metrics are included per response/system. Each metric name is prefixed with the system’s name if predictions were provided as a list of dicts, or with its index (starting from 1) if predictions were provided as a list of lists.
All per-comparison fields are arrays of length len(systems) - 1. For any index i, contest_results[i] is the result of the response’s contest against compared_to[i].
Explanation of fields:
summaries: A list of LLM-generated summaries explaining the comparison results for each response.
contest_results: A list of boolean values indicating whether the response won in each comparison.
selections: A list of the selected system names, representing the preferred response in each comparison.
compared_to: A list of system names that were compared against the given response.
assessments: A list of LLM-generated assessments explaining the reasoning behind the evaluation results.
positional_bias_assessments: A list of LLM-generated assessments focused on detecting positional bias in the evaluation.
option_selection_outputs: A list of response names selected as the best choice based on the evaluation.
positional_bias: A list of boolean values indicating whether positional bias was detected in the contest.
positional_bias_selection: A list of response names representing the selected option when considering positional bias.
- prompts: A dictionary of prompts used in different stages of evaluation.
assessment: The prompt used to instruct the model on how to assess the responses.
positional_bias_assessment: The prompt used to instruct the model on how to assess positional bias.
option_selection: The prompt used to guide the model in selecting the best response.
positional_bias_option_selection: The prompt used for selecting the best response while checking for positional bias.
summary: The prompt used to generate a summary of the assessment.
winrate: A float representing the proportion of comparisons the response won.
llm_as_judge: Equal to winrate.
ranking: An integer representing the ranking position of the response based on the evaluation results. Best is 1.
response_name: A string identifying the response in the evaluation.
Result example:
[ { "system1_contest_results": [ true, true ], "system1_selections": [ "system1", "system1" ], "system1_compared_to": [ "system2", "system3" ], "system1_assessments": [ "To determine the better response accordi...", "To determine the better response accordi..." ], "system1_positional_bias_assessments": [ "To determine the better response accordi...", "To determine the better response accordi..." ], "system1_option_selection_outputs": [ "system1", "system1" ], "system1_positional_bias": [ false, false ], "system1_positional_bias_selection": [ "system1", "system1" ], "system1_prompts": { "assessment": [ [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ], [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ] ], "positional_bias_assessment": [ [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ], [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ] ], "option_selection": [ [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ], [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ] ], "positional_bias_option_selection": [ [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ], [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ] ] }, "system1_winrate": 1.0, "system1_llm_as_judge": 1.0, "system1_ranking": 1, "system1_response_name": "system1", "system2_contest_results": [ false, true ], "system2_selections": [ "system1", "system2" ], "system2_compared_to": [ "system1", "system3" ], "system2_assessments": [ "To determine the better response accordi...", "To determine the better response accordi..." ], "system2_positional_bias_assessments": [ "To determine the better response accordi...", "To determine the better response accordi..." ], "system2_option_selection_outputs": [ "system1", "system2" ], "system2_positional_bias": [ false, false ], "system2_positional_bias_selection": [ "system1", "system2" ], "system2_prompts": { "assessment": [ [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ], [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ] ], "positional_bias_assessment": [ [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ], [ { "role": "user", "content": "You are provided a pair of responses (Re..." 
} ] ], "option_selection": [ [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ], [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ] ], "positional_bias_option_selection": [ [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ], [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ] ] }, "system2_winrate": 0.5, "system2_llm_as_judge": 0.5, "system2_ranking": 2, "system2_response_name": "system2", "system3_contest_results": [ false, false ], "system3_selections": [ "system1", "system2" ], "system3_compared_to": [ "system1", "system2" ], "system3_assessments": [ "To determine the better response accordi...", "To determine the better response accordi..." ], "system3_positional_bias_assessments": [ "To determine the better response accordi...", "To determine the better response accordi..." ], "system3_option_selection_outputs": [ "system1", "system2" ], "system3_positional_bias": [ false, false ], "system3_positional_bias_selection": [ "system1", "system2" ], "system3_prompts": { "assessment": [ [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ], [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ] ], "positional_bias_assessment": [ [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ], [ { "role": "user", "content": "You are provided a pair of responses (Re..." } ] ], "option_selection": [ [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ], [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ] ], "positional_bias_option_selection": [ [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ], [ { "content": "You are provided a pair of responses (Re...", "role": "user" }, { "content": "To determine the better response accordi...", "role": "assistant" }, { "content": "Now considering the evaluation criteria,...", "role": "user" } ] ] }, "system3_winrate": 0.0, "system3_llm_as_judge": 0.0, "system3_ranking": 3, "system3_response_name": "system3", "criteria": "{ \"__type__\": \"criteria\", \"name\"..." } ]
- prepare()[source]¶
Prepares the pairwise comparison by initializing the necessary templates and tasks. These tasks will be used to assess, summarize, and select options from candidate responses.
- reduction_map: Dict[str, List[str]] = {'mean': ['score']}¶
A mapping specifying how scores should be reduced. By default, it will be
{'mean': ['score']}
.