unitxt.llm_as_judge module

class unitxt.llm_as_judge.LLMAsJudge(__tags__: Dict[str, str] = {}, main_score: str = 'llm_as_judge', prediction_type: str = None, single_reference_per_prediction: bool = False, n_resamples: int = 1000, confidence_level: float = 0.95, ci_scores: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = None, implemented_reductions: List[str], batch_size: int = 32, recipe: str, inference_model: InferenceEngine)

Bases: BulkInstanceMetric

LLM-as-judge metric class for evaluating the correctness of predictions.

main_score

The main score used for evaluation.

Type:

str

reduction_map

A dictionary mapping each reduction method to the list of score names it is applied to.

Type:

dict

batch_size

The number of instances processed together in each batch (bulk).

Type:

int

recipe

The unitxt recipe that will be used to create the judge dataset.

Type:

str

inference_model


The inference engine used to run the judge model.

Type:

InferenceEngine
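
As a rough illustration of how batch_size and reduction_map relate to per-instance scores, the sketch below uses plain Python with no unitxt imports. The {"mean": [...]} shape of the map and the helper names batched and reduce_scores are assumptions made for illustration; this is not the library's implementation.

```python
from statistics import mean
from typing import Dict, Iterator, List


def batched(items: List, batch_size: int) -> Iterator[List]:
    """Yield consecutive chunks of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def reduce_scores(
    instance_scores: List[Dict[str, float]],
    reduction_map: Dict[str, List[str]],
) -> Dict[str, float]:
    """Aggregate per-instance scores; only a 'mean' reduction is sketched here."""
    global_scores = {}
    for reduction, score_names in reduction_map.items():
        if reduction != "mean":
            raise ValueError(f"unsupported reduction in this sketch: {reduction}")
        for name in score_names:
            global_scores[name] = mean(s[name] for s in instance_scores)
    return global_scores


# Four made-up per-instance judge scores, keyed by the main score name.
instance_scores = [{"llm_as_judge": v} for v in (1.0, 0.0, 1.0, 1.0)]

# With batch_size=3 the four instances are split into chunks of 3 and 1.
batches = list(batched(instance_scores, batch_size=3))

# The assumed reduction map averages the 'llm_as_judge' score over all instances.
global_scores = reduce_scores(instance_scores, {"mean": ["llm_as_judge"]})
```

In this sketch the batch size only affects how instances are grouped for processing; the reduced score is computed over all instances regardless of how they were chunked.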

prepare(self)

Initialization method for the metric.

compute(self, references, predictions, additional_inputs)

Computes the metric for a batch of references, predictions, and additional inputs.

Usage:

metric = LLMAsJudge(recipe=recipe, inference_model=inference_model)
scores = metric.compute(references, predictions, additional_inputs)
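
The calling convention of compute can be sketched with a self-contained stand-in that does not import unitxt. Everything here is hypothetical: the ToyJudgeMetric class, the fake_judge stub that replaces a real inference engine, the prompt layout, and the "question" key in additional_inputs are assumptions chosen only to show the shape of the inputs and outputs.

```python
from typing import Any, Callable, Dict, List


class ToyJudgeMetric:
    """Hypothetical stand-in for an LLM-as-judge metric; not unitxt code."""

    main_score = "llm_as_judge"

    def __init__(self, judge: Callable[[List[str]], List[float]], batch_size: int = 32):
        self.judge = judge            # plays the role of the inference engine
        self.batch_size = batch_size  # number of prompts sent to the judge at once

    def compute(
        self,
        references: List[List[str]],
        predictions: List[str],
        additional_inputs: List[Dict[str, Any]],
    ) -> List[Dict[str, float]]:
        # Build one judge prompt per instance (assumed layout, first reference only).
        prompts = [
            f"Question: {inp['question']}\nReference: {refs[0]}\nAnswer: {pred}"
            for refs, pred, inp in zip(references, predictions, additional_inputs)
        ]
        scores: List[float] = []
        for start in range(0, len(prompts), self.batch_size):
            scores.extend(self.judge(prompts[start:start + self.batch_size]))
        # One score dictionary per instance, keyed by the main score name.
        return [{self.main_score: s} for s in scores]


def fake_judge(prompts: List[str]) -> List[float]:
    """Stub judge: 1.0 when the answer matches the reference exactly, else 0.0."""
    verdicts = []
    for prompt in prompts:
        fields = dict(line.split(": ", 1) for line in prompt.splitlines())
        verdicts.append(1.0 if fields["Reference"] == fields["Answer"] else 0.0)
    return verdicts


references = [["4"], ["Paris"]]
predictions = ["4", "Lyon"]
additional_inputs = [{"question": "What is 2 + 2?"}, {"question": "Capital of France?"}]

metric = ToyJudgeMetric(fake_judge, batch_size=1)
scores = metric.compute(references, predictions, additional_inputs)
```

A real judge would return a model-generated verdict instead of an exact-match check; the stub is there only so the sketch runs without a model.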