unitxt.llm_as_judge module¶
- class unitxt.llm_as_judge.LLMAsJudge(__tags__: ~typing.Dict[str, str] = {}, data_classification_policy: ~typing.List[str] = None, main_score: str = 'llm_as_judge', prediction_type: ~typing.Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: ~typing.List[str] = None, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, reduction_map: ~typing.Dict[str, ~typing.List[str]] | None = None, implemented_reductions: ~typing.List[str], task: ~typing.Literal['rating.single_turn', 'rating.single_turn_with_reference', 'pairwise_comparative_rating.single_turn'], template: ~unitxt.templates.Template, format: ~unitxt.formats.Format = None, system_prompt: ~unitxt.system_prompts.SystemPrompt = None, strip_system_prompt_and_format_from_inputs: bool = True, inference_model: ~unitxt.inference.InferenceEngine, batch_size: int = 32)¶
Bases:
BulkInstanceMetric
LLM-as-judge based metric class for evaluating correctness.
- main_score¶
The main score label used for evaluation.
- Type:
str
- task¶
The type of task the llm-as-judge runs. This defines the input and output format of the judge model.
- Type:
Literal[“rating.single_turn”, “rating.single_turn_with_reference”, “pairwise_comparative_rating.single_turn”]
- system_prompt¶
The system prompt used when generating inputs for the judge llm.
- Type:
SystemPrompt
- strip_system_prompt_and_format_from_inputs¶
Whether to strip the system prompt and formatting from the inputs that the model being judged received, before they are inserted into the llm-as-judge prompt.
- Type:
bool
- inference_model¶
The inference engine used to run the judge llm.
- Type:
InferenceEngine
- reduction_map¶
A dictionary specifying the reduction method for the metric.
- Type:
dict
- batch_size¶
The number of instances sent to the judge model in each inference batch.
- Type:
int
- prediction_type: Type | str = typing.Any¶
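As a minimal sketch of the configuration values the class signature above declares, the snippet below mirrors the `task` parameter's three Literal options and shows a plausible `reduction_map` (the dict shape `{"mean": [...]}` and the use of the `llm_as_judge` main score are assumptions for illustration, not taken from this page):

```python
from typing import Literal, get_args

# The three judge tasks accepted by the `task` parameter,
# copied from the Literal type in the class signature.
JudgeTask = Literal[
    "rating.single_turn",
    "rating.single_turn_with_reference",
    "pairwise_comparative_rating.single_turn",
]

# Hypothetical reduction_map: average the main score
# ("llm_as_judge", the default main_score) across instances.
reduction_map = {"mean": ["llm_as_judge"]}

# Inspect the allowed task names at runtime.
supported_tasks = get_args(JudgeTask)
```

`get_args` lets calling code validate a user-supplied task string against the Literal before constructing the metric.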