Mistral Large Instruct Watsonx Judge
metrics.rag.end_to_end.answer_relevance.mistral_large_instruct_watsonx_judge
TaskBasedLLMasJudge(
    inference_model="engines.classification.mistral_large_watsonx",
    template="templates.rag_eval.answer_relevance.judge_answer_relevance_numeric",
    task="tasks.rag_eval.answer_relevance.binary",
    format=None,
    main_score="answer_relevance_judge",
    prediction_field="answer",
    infer_log_probs=False,
    judge_to_generator_fields_mapping={
        "ground_truths": "reference_answers",
    },
)
Explanation about TaskBasedLLMasJudge
LLM-as-judge-based metric class for evaluating correctness of generated predictions.
This class can use any task and matching template to evaluate the predictions. All task and template fields are taken from the instance's task_data. The instances sent to the judge can be either:
1. a unitxt dataset, in which case the predictions are copied to a specified field of the task, or
2. dictionaries with the fields required by the task and template.
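The two input forms described above can be pictured with a small standalone sketch. It is not unitxt's implementation: the helper `build_judge_input` and the `question` field are hypothetical names chosen for illustration, and only the `answer` field name comes from the `prediction_field` setting shown in this entry.

```python
from typing import Any, Dict, Optional


def build_judge_input(
    task_data: Dict[str, Any],
    prediction: Optional[str] = None,
    prediction_field: Optional[str] = None,
) -> Dict[str, Any]:
    """Return the fields a judge would see for one instance (illustrative only)."""
    judge_input = dict(task_data)  # copy so the caller's dict is left untouched
    # Form 1: a dataset instance carries a separate prediction, which is
    # copied into the field that the judge task reads.
    if prediction_field is not None and prediction is not None:
        judge_input[prediction_field] = prediction
    # Form 2: a plain dictionary that already holds every required field
    # passes through unchanged.
    return judge_input


# Form 1: prediction supplied separately and copied into "answer".
dataset_style = build_judge_input(
    {"question": "What is the capital of France?"},  # "question" is an assumed field
    prediction="Paris",
    prediction_field="answer",
)

# Form 2: the dictionary already contains the "answer" field.
dict_style = build_judge_input(
    {"question": "What is the capital of France?", "answer": "Paris"}
)
```

Under these assumptions both forms end up as the same dictionary, which is what lets one task and template serve either kind of input.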
- Args:
- main_score (str):
The main score label used for evaluation.
- task (str):
The task that the LLM judge runs. It defines the input and output format of the judge model.
- template (Template):
The template used when generating inputs for the judge llm.
- format (Format):
The format used when generating inputs for judge llm.
- system_prompt (SystemPrompt):
The system prompt used when generating inputs for judge llm.
- strip_system_prompt_and_format_from_inputs (bool):
Whether to strip the system prompt and formatting from the inputs that the model being judged received, when they are inserted into the LLM-as-judge prompt.
- inference_model (InferenceEngine):
The inference engine that runs the judge LLM.
- reduction_map (dict):
A dictionary specifying the reduction method for the metric.
- batch_size (int):
The batch size, i.e. the number of instances processed together.
- infer_log_probs (bool):
Whether to perform the inference using logprobs. If true, the template's post-processing must support the logprobs output.
- judge_to_generator_fields_mapping (Dict[str, str]):
Optional mapping between the names of the fields in the generator task and the judge task. For example, if the generator task uses "reference_answers" and the judge task expects "ground_truth", include {"ground_truth": "reference_answers"} in this dictionary.
- prediction_field (str):
If specified and a prediction exists, the prediction is copied to this field name in task_data.
- include_meta_data (bool):
Whether to include the per-instance inference metadata in the returned results.
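As a rough illustration of judge_to_generator_fields_mapping, the sketch below renames generator fields to the names a judge task expects, using the mapping from this catalog entry. The helper `map_generator_fields_to_judge` is hypothetical, the sample data is invented, and keeping the original generator field alongside the new one is an assumption, not a statement about how unitxt behaves internally.

```python
from typing import Any, Dict


def map_generator_fields_to_judge(
    generator_task_data: Dict[str, Any],
    judge_to_generator_fields_mapping: Dict[str, str],
) -> Dict[str, Any]:
    """Expose generator fields under the names the judge task expects."""
    judge_task_data = dict(generator_task_data)  # shallow copy, input is not mutated
    # Keys are judge-side field names, values are generator-side field names.
    for judge_field, generator_field in judge_to_generator_fields_mapping.items():
        if generator_field in generator_task_data:
            judge_task_data[judge_field] = generator_task_data[generator_field]
    return judge_task_data


# Invented sample instance; only "reference_answers" and "ground_truths"
# come from the mapping shown in this entry.
generator_task_data = {
    "question": "What is the capital of France?",
    "reference_answers": ["Paris"],
}

mapped = map_generator_fields_to_judge(
    generator_task_data,
    {"ground_truths": "reference_answers"},
)
```

After the call, `mapped` carries the reference answers under `ground_truths` as well, so a judge template written against that name can read them without changes to the generator task.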
References: templates.rag_eval.answer_relevance.judge_answer_relevance_numeric, engines.classification.mistral_large_watsonx, tasks.rag_eval.answer_relevance.binary
Read more about catalog usage here.