unitxt.llm_as_judge module

class unitxt.llm_as_judge.LLMAsJudge(__tags__: typing.Dict[str, str] = {}, main_score: str = 'llm_as_judge', prediction_type: str = None, single_reference_per_prediction: bool = False, n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: typing.List[str] = None, caching: bool = None, apply_to_streams: typing.List[str] = None, dont_apply_to_streams: typing.List[str] = None, reduction_map: typing.Dict[str, typing.List[str]] = None, implemented_reductions: typing.List[str], batch_size: int = 32, recipe: str, inference_model: unitxt.inference.InferenceEngine)

Bases: BulkInstanceMetric

Metric class that uses an LLM as a judge to evaluate the correctness of predictions.

main_score

The main score used for evaluation.

Type: str

reduction_map

A dictionary mapping each reduction method to the list of score names it is applied to.

Type: dict

batch_size

The number of instances processed in each batch.

Type: int

recipe

The unitxt recipe that will be used to create the judge dataset.

Type: str

inference_model

The inference engine used to run the judge model.

Type: InferenceEngine

prepare(self): Initialization method for the metric.

compute(self, references, predictions, additional_inputs): Method to compute the metric.

Usage::

    metric = LLMAsJudge(recipe=recipe, inference_model=inference_model)
    scores = metric.compute(references, predictions, additional_inputs)
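
The following is a minimal, self-contained sketch of the general LLM-as-judge flow that a metric of this kind follows: build one judge prompt per instance, send the prompts to an inference backend in chunks of batch_size, parse each reply into a score, and reduce with a mean. It does not use unitxt. ToyJudgeMetric, fake_judge, the prompt layout, the "question" key in additional_inputs, and the 1 to 10 rating scale are all hypothetical and chosen only for illustration; the actual class builds its judge dataset from the configured recipe and runs it through an InferenceEngine.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class ToyJudgeMetric:
    """Hypothetical stand-in for an LLM-as-judge metric (not the unitxt class)."""

    # Hypothetical callable playing the role of the inference engine:
    # takes a list of prompts and returns one text reply per prompt.
    infer: Callable[[List[str]], List[str]]
    main_score: str = "llm_as_judge"
    batch_size: int = 32

    def _build_prompt(self, question: str, prediction: str, references: List[str]) -> str:
        # Assumed prompt layout; a real recipe would define its own template.
        return (
            f"Question: {question}\n"
            f"Reference answers: {'; '.join(references)}\n"
            f"Candidate answer: {prediction}\n"
            "Rate the candidate from 1 to 10. Reply with the number only."
        )

    def compute(self, references, predictions, additional_inputs) -> List[Dict[str, float]]:
        prompts = [
            self._build_prompt(inp["question"], pred, refs)
            for refs, pred, inp in zip(references, predictions, additional_inputs)
        ]
        replies: List[str] = []
        # Send the prompts to the judge in chunks of at most batch_size.
        for start in range(0, len(prompts), self.batch_size):
            replies.extend(self.infer(prompts[start:start + self.batch_size]))
        # Parse each reply into a per-instance score dictionary.
        return [{self.main_score: float(reply.strip())} for reply in replies]


def fake_judge(prompts: List[str]) -> List[str]:
    """Deterministic fake judge: '10' for an exact reference match, else '1'."""
    replies = []
    for prompt in prompts:
        lines = prompt.splitlines()
        refs = lines[1].split(": ", 1)[1].split("; ")
        candidate = lines[2].split(": ", 1)[1]
        replies.append("10" if candidate in refs else "1")
    return replies


metric = ToyJudgeMetric(infer=fake_judge, batch_size=2)
references = [["Paris"], ["4", "four"], ["blue"]]
predictions = ["Paris", "five", "blue"]
additional_inputs = [
    {"question": "Capital of France?"},
    {"question": "2 + 2?"},
    {"question": "Color of a clear sky?"},
]

scores = metric.compute(references, predictions, additional_inputs)
# Mean reduction over the per-instance scores (assumed reduction method).
mean_score = mean(s[metric.main_score] for s in scores)
print(scores, mean_score)
```

In this sketch, swapping fake_judge for a callable that wraps a real model would leave the rest of the flow unchanged, which is the main reason for keeping the inference backend as a separate field.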