📄 Llama 4 Maverick Watsonx Judge

metrics.rag.external_rag.faithfulness.llama_4_maverick_watsonx_judge

TaskBasedLLMasJudge(
    inference_model="engines.classification.llama_4_maverick_17b_128e_instruct_fp8_watsonx",
    template="templates.rag_eval.faithfulness.judge_with_question_simplified_verbal",
    task="tasks.rag_eval.faithfulness.binary",
    format=None,
    main_score="faithfulness_judge",
    prediction_field="answer",
    infer_log_probs=False,
    judge_to_generator_fields_mapping={},
)

Explanation about TaskBasedLLMasJudge

LLM-as-judge-based metric class for evaluating correctness of generated predictions.

This class can use any task and matching template to evaluate the predictions. All task and template fields are taken from the instance's task_data. The instances sent to the judge can be either:

1. a unitxt dataset, in which case the predictions are copied to a specified field of the task, or
2. dictionaries with the fields required by the task and template.
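The first case, where a prediction is copied into a field of the task, can be pictured with the short sketch below. It is not unitxt's implementation: `copy_prediction_to_field` is a hypothetical helper, the instance contents are invented, and the sketch assumes instances are plain dictionaries with a `task_data` entry. The target field `answer` matches `prediction_field="answer"` in the configuration above.

```python
from copy import deepcopy


def copy_prediction_to_field(instance, prediction, prediction_field):
    """Return a copy of `instance` with `prediction` stored in its task_data.

    Hypothetical helper for illustration only; not part of unitxt.
    """
    updated = deepcopy(instance)
    task_data = updated.setdefault("task_data", {})
    # Copy only when a field name is given and a prediction exists.
    if prediction_field and prediction is not None:
        task_data[prediction_field] = prediction
    return updated


# Invented example instance; the field names here are assumptions.
instance = {
    "task_data": {
        "question": "Who wrote Hamlet?",
        "contexts": ["Hamlet is a play by William Shakespeare."],
    }
}
judged = copy_prediction_to_field(instance, "William Shakespeare", "answer")
print(judged["task_data"]["answer"])
```

The helper works on a deep copy so the original instance stays unchanged, which keeps the sketch easy to reason about; the library may well mutate instances in place.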

Args:
main_score (str):

The main score label used for evaluation.

task (str):

The type of task the LLM-as-judge runs. This defines the input and output format of the judge model.

template (Template):

The template used when generating inputs for the judge LLM.

format (Format):

The format used when generating inputs for the judge LLM.

system_prompt (SystemPrompt):

The system prompt used when generating inputs for the judge LLM.

strip_system_prompt_and_format_from_inputs (bool):

Whether to strip the system prompt and formatting from the inputs that the model being judged received, when they are inserted into the LLM-as-judge prompt.

inference_model (InferenceEngine):

The inference engine that runs the judge LLM.

reduction_map (dict):

A dictionary specifying the reduction method for the metric.

batch_size (int):

The number of instances processed in each batch.

infer_log_probs (bool):

Whether to perform the inference using logprobs. If true, the template's post-processing must support the logprobs output.

judge_to_generator_fields_mapping (Dict[str, str]):

Optional mapping between the names of the fields in the generator task and the judge task. For example, if the generator task uses "reference_answers" and the judge task expects "ground_truth", include {"ground_truth": "reference_answers"} in this dictionary.

prediction_field (str):

If specified, and a prediction exists, copy the prediction to this field name in task_data.

include_meta_data (bool):

Whether to include the per-instance inference metadata in the returned results.
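To make the direction of judge_to_generator_fields_mapping concrete, here is a minimal sketch of how such a mapping could be applied. It is an illustration under stated assumptions rather than the library's code: `map_generator_fields` is a hypothetical helper and the sample data is invented. Keys are judge field names and values are generator field names, as in the `ground_truth` / `reference_answers` example from the argument description.

```python
def map_generator_fields(task_data, judge_to_generator_fields_mapping):
    """Build judge-side fields from generator-side task_data.

    Hypothetical helper for illustration only; not part of unitxt.
    Mapping keys are judge field names, values are generator field names.
    """
    judge_data = dict(task_data)  # shallow copy; the input is left untouched
    for judge_field, generator_field in judge_to_generator_fields_mapping.items():
        # Skip generator fields that are absent instead of raising.
        if generator_field in task_data:
            judge_data[judge_field] = task_data[generator_field]
    return judge_data


# Invented generator-side data for the example.
generator_task_data = {
    "question": "What is the capital of France?",
    "reference_answers": ["Paris"],
}
mapping = {"ground_truth": "reference_answers"}
judge_task_data = map_generator_fields(generator_task_data, mapping)
print(judge_task_data["ground_truth"])
```

With an empty mapping, as in the configuration on this page (`judge_to_generator_fields_mapping={}`), the sketch returns the fields unchanged. Whether the original generator field is kept alongside the judge field is a choice made here for simplicity.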

References: engines.classification.llama_4_maverick_17b_128e_instruct_fp8_watsonx, templates.rag_eval.faithfulness.judge_with_question_simplified_verbal, tasks.rag_eval.faithfulness.binary

Read more about catalog usage here.