Llama 3 1 70B Instruct Wml Q A
metrics.rag.answer_relevance.llama_3_1_70b_instruct_wml_q_a
type: TaskBasedLLMasJudge
inference_model: engines.classification.llama_3_1_70b_instruct_wml
template: templates.rag_eval.answer_relevance.judge_answer_relevance
task: tasks.rag_eval.answer_relevance.binary
format: formats.empty
main_score: answer_relevance_q_a
prediction_field: answer
infer_log_probs: False
Explanation about TaskBasedLLMasJudge
LLM-as-judge-based metric class for evaluating correctness of generated predictions.
This class can use any task and matching template to evaluate the predictions. All task and template fields are taken from the instance's task_data. The instances sent to the judge can be either:
1. a unitxt dataset, in which case the predictions are copied to a specified field of the task, or
2. dictionaries with the fields required by the task and template.
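The two input forms described above can be illustrated with a minimal plain-Python sketch. It is not the unitxt implementation: `build_judge_input` is a hypothetical helper, and the `question` field name is an assumption made for illustration. Only `answer` comes from this entry, as its `prediction_field`.

```python
from typing import Any, Dict, Optional


def build_judge_input(
    task_data: Dict[str, Any],
    prediction: Optional[Any] = None,
    prediction_field: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: prepare the fields that a judge task would read."""
    judge_input = dict(task_data)  # copy so the caller's dict is not modified
    # Form 1: a prediction exists and a target field is configured,
    # so the prediction is copied into that field.
    if prediction_field is not None and prediction is not None:
        judge_input[prediction_field] = prediction
    # Form 2: nothing to copy; the dict already holds the required fields.
    return judge_input


# Form 1: dataset-style instance, prediction supplied separately.
from_dataset = build_judge_input(
    {"question": "What is the capital of France?"},  # assumed field name
    prediction="Paris",
    prediction_field="answer",
)

# Form 2: a dictionary that already carries every field the judge needs.
from_dict = build_judge_input(
    {"question": "What is the capital of France?", "answer": "Paris"}
)
```

In both cases the judge ends up seeing the same set of fields. The difference is only in who places the prediction there.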
- Attributes:
    main_score (str): The main score label used for evaluation.
    task (str): The type of task the LLM-as-judge runs. This defines the input and output format of the judge model.
    template (Template): The template used when generating inputs for the judge LLM.
    format (Format): The format used when generating inputs for the judge LLM.
    system_prompt (SystemPrompt): The system prompt used when generating inputs for the judge LLM.
    strip_system_prompt_and_format_from_inputs (bool): Whether to strip the system prompt and formatting from the inputs that the model being judged received, when they are inserted into the LLM-as-judge prompt.
    inference_model (InferenceEngine): The module that runs inference for the judge LLM.
    reduction_map (dict): A dictionary specifying the reduction method for the metric.
    batch_size (int): The number of instances processed in each batch.
    infer_log_probs (bool): Whether to perform the inference using logprobs. If true, the template's post-processing must support the logprobs output.
    judge_to_generator_fields_mapping (Dict[str, str]): Optional mapping between the names of the fields in the generator task and the judge task. For example, if the generator task uses "reference_answers" and the judge task expects "ground_truth", include {"ground_truth": "reference_answers"} in this dictionary.
    prediction_field: If set and a prediction exists, the prediction is copied to this field name in task_data.
    include_meta_data (bool): Whether to include the per-instance inference metadata in the returned results.
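As an illustration of the direction of judge_to_generator_fields_mapping (keys are judge-task field names, values are generator-task field names), here is a minimal plain-Python sketch. `map_generator_fields_to_judge` is a hypothetical helper and not part of unitxt. Whether the library keeps or drops the original generator field is not stated here, so this sketch keeps it as an assumption.

```python
from typing import Any, Dict


def map_generator_fields_to_judge(
    task_data: Dict[str, Any],
    judge_to_generator_fields_mapping: Dict[str, str],
) -> Dict[str, Any]:
    """Hypothetical helper: expose generator fields under judge field names."""
    mapped = dict(task_data)  # copy; the original generator fields are kept
    for judge_field, generator_field in judge_to_generator_fields_mapping.items():
        # Only map fields that the generator task actually produced.
        if generator_field in task_data:
            mapped[judge_field] = task_data[generator_field]
    return mapped


generator_task_data = {
    "question": "What is the capital of France?",  # assumed field name
    "reference_answers": ["Paris"],
}

judge_task_data = map_generator_fields_to_judge(
    generator_task_data,
    {"ground_truth": "reference_answers"},
)
```

After the call, `judge_task_data` carries the reference answers under `ground_truth`, the name the judge task in the example above expects.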
References: templates.rag_eval.answer_relevance.judge_answer_relevance, engines.classification.llama_3_1_70b_instruct_wml, tasks.rag_eval.answer_relevance.binary, formats.empty
Read more about catalog usage here.