unitxt.metrics module¶

class unitxt.metrics.ANLS(data_classification_policy: List[str] = None, main_score: str = 'anls', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['anls']}, reference_field: str = 'references', prediction_field: str = 'prediction', threshold: float = 0.5)[source]¶

Bases: InstanceMetric

Average Normalized Levenshtein Similarity for text comparison.

Range: [0, 1] (higher is better). Measures string similarity between texts using a length-normalized edit distance.

Reference: https://arxiv.org/abs/1704.00560 (ICDAR 2019 Robust Reading Challenge)
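To make the score described above concrete, here is a minimal plain-Python sketch of a per-instance ANLS value. It does not call unitxt. The helper names `levenshtein` and `anls_score` are hypothetical, and the normalization (strip and lowercase) and the rule that similarity drops to 0 once the normalized distance reaches `threshold` follow the common ANLS definition, which is assumed here rather than confirmed to match this class line for line.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls_score(prediction: str, references: list, threshold: float = 0.5) -> float:
    """Best thresholded similarity between the prediction and any reference."""
    best = 0.0
    pred = prediction.strip().lower()  # assumed normalization
    for ref in references:
        ref = ref.strip().lower()
        longest = max(len(pred), len(ref))
        # Normalized Levenshtein distance in [0, 1]; two empty strings count as identical.
        nl = levenshtein(pred, ref) / longest if longest else 0.0
        # Assumed rule: distances at or above the threshold score zero.
        similarity = 1.0 - nl if nl < threshold else 0.0
        best = max(best, similarity)
    return best
```

With the default `threshold` of 0.5, a prediction that differs from a five-character reference by one substitution scores 0.8, while a prediction sharing no characters with the reference scores 0.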

compute(references: List[Any], prediction: Any, task_data: List[Dict]) → dict[source]¶: ANLS image-text accuracy metric.

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['anls']}¶

class unitxt.metrics.Accuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['accuracy']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

Measures exact match accuracy between prediction and references.

Range: [0, 1] (higher is better). Returns 1.0 if the prediction matches any reference, 0.0 otherwise.

Reference: https://en.wikipedia.org/wiki/Accuracy_and_precision
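The any-reference rule stated above can be sketched in a few lines of plain Python. The function names are hypothetical and the code does not call unitxt. Comparing string forms of the values is an assumption about how non-string predictions would be handled.

```python
def exact_match_accuracy(prediction, references) -> float:
    """Return 1.0 when the prediction equals at least one reference, else 0.0."""
    # Assumption: values are compared by their string form.
    return float(str(prediction) in [str(ref) for ref in references])


def mean_accuracy(predictions, references_per_instance) -> float:
    """Average the per-instance scores, in the spirit of the 'mean' reduction."""
    scores = [
        exact_match_accuracy(pred, refs)
        for pred, refs in zip(predictions, references_per_instance)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, two instances where only the first prediction appears among its references average to 0.5.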

ci_scores: List[str] = ['accuracy']¶

prediction_type: Type | str = typing.Any¶

reduction_map: Dict[str, List[str]] = {'mean': ['accuracy']}¶

class unitxt.metrics.AccuracyFast(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'accuracy', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[IntermediateType] = None)[source]¶

Bases: ReductionInstanceMetric[str, Dict[str, float]]

reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.AggregationReduction(data_classification_policy: List[str] = None)[source]¶: Bases: Artifact, Generic[IntermediateType]

class unitxt.metrics.BertScore(data_classification_policy: List[str] = None, device: str | NoneType = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'f1', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['bert_score'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.DictReduction = None, model_name: str = __required__, batch_size: int = 32, model_layer: int = None)[source]¶

Bases: MapReduceMetric[str, Dict[str, float]], TorchDeviceMixin

Computes BERTScore using contextual embeddings for text evaluation.

Range: [0, 1] (higher is better). Measures semantic similarity using BERT-based token embeddings.

Reference: https://arxiv.org/abs/1904.09675
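The greedy-matching idea behind BERTScore can be illustrated without loading a model. The sketch below is a toy under stated assumptions: it takes already-computed token vectors as plain lists of floats (a real run would obtain them from the model named by `model_name`), skips importance weighting and baseline rescaling, and uses the hypothetical names `cosine` and `greedy_bertscore`. It does not call unitxt or the `bert_score` package.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_bertscore(cand_vecs, ref_vecs) -> dict:
    """Match each token to its most similar counterpart on the other side.

    Both arguments are non-empty lists of token vectors.
    """
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With a candidate of two orthogonal unit vectors and a reference containing only the first of them, precision is 0.5 and recall is 1.0.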

reduction: DictReduction = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.BinaryAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy_binary'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['accuracy_binary']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

Computes accuracy for binary classification tasks.

Range: [0, 1] (higher is better). Uses a 0.5 threshold for float predictions.
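A minimal sketch of the thresholding step follows. It is plain Python with a hypothetical function name. Treating a value as positive only when it is strictly greater than the threshold, and applying the same rule to the reference, are assumptions and not confirmed behavior of this class.

```python
def binary_accuracy(prediction: float, reference: float, threshold: float = 0.5) -> float:
    """Score one instance: 1.0 if the thresholded labels agree, else 0.0."""
    # Assumption: strictly greater than the threshold means the positive class.
    predicted_label = int(prediction > threshold)
    reference_label = int(reference > threshold)
    return float(predicted_label == reference_label)
```

Under these assumptions a prediction of exactly 0.5 is mapped to the negative class.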

ci_scores: List[str] = ['accuracy_binary']¶

prediction_type¶: alias of Union[float, int]

reduction_map: Dict[str, List[str]] = {'mean': ['accuracy_binary']}¶

class unitxt.metrics.BinaryMaxAccuracy(data_classification_policy: List[str] = None, main_score: str = 'max_accuracy_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric

Finds optimal accuracy and threshold for binary classification.

Range: [0, 1] (higher is better). Tests all possible thresholds to maximize accuracy.
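The threshold search can be sketched as a brute-force scan. This is an illustration in plain Python, not the class's implementation. The function name is hypothetical, candidate thresholds are taken from the distinct prediction scores plus infinity, and a score counts as positive when it is greater than or equal to the threshold. All three choices are assumptions.

```python
def max_accuracy_over_thresholds(scores, labels):
    """Return (best_accuracy, best_threshold) for float scores and 0/1 labels."""
    if not labels:
        return 0.0, None
    # Only thresholds at observed scores can change the predicted labels;
    # infinity covers the "predict everything negative" case.
    candidates = sorted(set(scores)) + [float("inf")]
    best_acc, best_thr = -1.0, None
    for thr in candidates:
        correct = sum(int(s >= thr) == y for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:  # ties keep the lowest threshold
            best_acc, best_thr = acc, thr
    return best_acc, best_thr
```

Scanning the sorted distinct scores keeps the search exhaustive while trying at most one threshold per instance.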

prediction_type¶: alias of Union[float, int]

class unitxt.metrics.BinaryMaxF1(data_classification_policy: List[str] = None, main_score: str = 'max_f1_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['max_f1_binary', 'max_f1_binary_neg'], ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: F1Binary

Finds optimal F1 score and threshold for binary classification.

Range: [0, 1] (higher is better). Tests all possible thresholds to maximize F1 score.
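The same kind of scan works for F1. The sketch below is a plain-Python illustration with hypothetical function names. It assumes 0/1 reference labels, treats a score as positive when it is greater than or equal to the threshold, and returns 0.0 when F1 is undefined. None of this is confirmed to match the class's internals.

```python
def f1_at_threshold(scores, labels, threshold: float) -> float:
    """F1 of the positive class after thresholding the scores."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def max_f1_over_thresholds(scores, labels):
    """Return (best_f1, best_threshold), scanning thresholds at observed scores."""
    best_f1, best_thr = 0.0, None
    for thr in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, thr)
        if best_thr is None or f1 > best_f1:  # ties keep the lowest threshold
            best_f1, best_thr = f1, thr
    return best_f1, best_thr
```

Because F1 ignores true negatives, the threshold that maximizes F1 can differ from the one that maximizes accuracy on the same data.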

ci_scores: List[str] = ['max_f1_binary', 'max_f1_binary_neg']¶

class unitxt.metrics.BulkInstanceMetric(data_classification_policy: List[str] = None, main_score: str = __required__, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = __required__, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'])[source]¶: Bases: StreamOperator, MetricWithConfidenceInterval

class unitxt.metrics.CharEditDistance(data_classification_policy: List[str] = None, main_score: str = 'char_edit_distance', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['char_edit_distance'], ci_method: str = 'BCa', _requirements_list: List[str] = ['editdistance'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['char_edit_distance']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

Computes character-level edit distance between texts.

Range: [0, ∞) (lower is better). Measures the minimum number of character edits needed to transform the prediction into the reference.

Reference: https://en.wikipedia.org/wiki/Edit_distance

ci_scores: List[str] = ['char_edit_distance']¶

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['char_edit_distance']}¶

class unitxt.metrics.CharEditDistanceAccuracy(data_classification_policy: List[str] = None, main_score: str = 'char_edit_dist_accuracy', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['char_edit_dist_accuracy'], ci_method: str = 'BCa', _requirements_list: List[str] = ['editdistance'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['char_edit_dist_accuracy']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: CharEditDistance

ci_scores: List[str] = ['char_edit_dist_accuracy']¶

reduction_map: Dict[str, List[str]] = {'mean': ['char_edit_dist_accuracy']}¶

class unitxt.metrics.ConfidenceIntervalMixin(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True)[source]¶: Bases: Artifact

class unitxt.metrics.CorrelationMetric(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = ['scipy'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: MapReduceMetric[float, Tuple[float, float]]

Computes Spearman rank correlation coefficient.

Range: [-1, 1] (higher absolute value is better). Measures the monotonic relationship between predictions and references.

Reference: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
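As a sketch of what the coefficient measures, Spearman correlation can be computed as the Pearson correlation of the two rank vectors. The plain-Python code below uses hypothetical helper names and average ranks for ties. The `_requirements_list` above points to `scipy`, which is presumably what the class relies on. This standalone version is only an illustration.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y) -> float:
    """Pearson correlation; NaN when either input is constant or empty."""
    n = len(x)
    if n == 0:
        return float("nan")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")


def spearman(predictions, references) -> float:
    """Spearman rank correlation as Pearson correlation of the ranks."""
    return pearson(average_ranks(predictions), average_ranks(references))
```

Because only ranks matter, any strictly increasing relationship, linear or not, yields a coefficient of 1.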

prediction_type¶: alias of float

class unitxt.metrics.CustomF1(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶

Bases: GlobalMetric

prediction_type: Type | str = typing.Any¶

class unitxt.metrics.CustomF1Fuzzy(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True, min_score_for_match: float = __required__)[source]¶: Bases: CustomF1

class unitxt.metrics.Detector(data_classification_policy: List[str] = None, main_score: str = 'detector_score', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['detector_score']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], batch_size: int = 32, model_name: str = __required__)[source]¶

Bases: BulkInstanceMetric

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['detector_score']}¶

class unitxt.metrics.DictReduction(data_classification_policy: List[str] = None)[source]¶: Bases: AggregationReduction[Dict[str, float]]

class unitxt.metrics.EvaluationInput(prediction: PredictionType, references: List[PredictionType], task_data: Dict[str, Any])[source]¶: Bases: tuple, Generic[PredictionType]

class unitxt.metrics.ExactMatchMM(data_classification_policy: List[str] = None, main_score: str = 'exact_match_mm', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['exact_match_mm']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

Multi-modal exact match metric with flexible matching patterns.

Range: [0, 1] (higher is better). Handles various answer formats like single characters, options, and “the answer is X”.

static exact_match(pred, gt)¶: Adapted from MMStar.

prediction_type: Type | str = typing.Any¶

reduction_map: Dict[str, List[str]] = {'mean': ['exact_match_mm']}¶

class unitxt.metrics.F1(data_classification_policy: List[str] = None, main_score: str = 'f1_macro', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric

Computes macro-averaged F1 score across all classes.

Range: [0, 1] (higher is better). Balances precision and recall, giving equal weight to all classes.

Reference: https://en.wikipedia.org/wiki/F-score
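Macro averaging computes one F1 per class and then takes their unweighted mean. The plain-Python sketch below illustrates this for single-label predictions. The function names are hypothetical, the label set is taken as the union of predicted and reference labels, and undefined per-class F1 is set to 0.0. These are assumptions and the class itself may choose labels differently. The `_requirements_list` above suggests the class delegates to `scikit-learn`.

```python
def per_class_f1(predictions, references) -> dict:
    """One-vs-rest F1 for every label seen in predictions or references."""
    scores = {}
    for label in sorted(set(predictions) | set(references)):
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(predictions, references) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(predictions, references)
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Since every class contributes equally to the mean, a rare class with poor F1 lowers the macro score as much as a frequent one would.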

prediction_type¶: alias of str

class unitxt.metrics.F1Binary(data_classification_policy: List[str] = None, main_score: str = 'f1_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['f1_binary', 'f1_binary_neg'], ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric

Computes F1 score for binary classification tasks.

Range: [0, 1] (higher is better). Uses a 0.5 threshold for float predictions and balances precision and recall.

Reference: https://en.wikipedia.org/wiki/F-score

ci_scores: List[str] = ['f1_binary', 'f1_binary_neg']¶

prediction_type¶: alias of Union[float, int]

class unitxt.metrics.F1BinaryPosOnly(data_classification_policy: List[str] = None, main_score: str = 'f1_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['f1_binary', 'f1_binary_neg'], ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1Binary

class unitxt.metrics.F1Fast(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'f1', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = ['scikit-learn', 'regex'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', averages: List[Literal['f1', 'macro', 'micro', 'per_class']] = ['f1', 'micro', 'macro', 'per_class'], ignore_punc: bool = True, ignore_case: bool = True)[source]¶

Bases: MapReduceMetric[str, Tuple[int, int]]

Computes F1 score across all classes.

Range: [0, 1] (higher is better). Balances precision and recall, giving equal weight to all classes.

Reference: https://en.wikipedia.org/wiki/F-score

averages: List[Literal['f1', 'macro', 'micro', 'per_class']] = ['f1', 'micro', 'macro', 'per_class']¶

class unitxt.metrics.F1Macro(data_classification_policy: List[str] = None, main_score: str = 'f1_macro', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1

class unitxt.metrics.F1MacroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'f1_macro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1MultiLabel

class unitxt.metrics.F1Micro(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: F1

Computes micro-averaged F1 score across all classes.

Range: [0, 1] (higher is better). Aggregates predictions and references globally before computing F1.

Reference: https://en.wikipedia.org/wiki/F-score
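Micro averaging pools true positives, false positives, and false negatives over all classes before computing a single F1. The sketch below is plain Python with a hypothetical function name, written for single-label predictions. Taking the label set from the union of predictions and references is an assumption.

```python
def micro_f1(predictions, references) -> float:
    """F1 computed from counts pooled across all classes."""
    tp = fp = fn = 0
    for label in set(predictions) | set(references):
        for pred, ref in zip(predictions, references):
            tp += int(pred == label and ref == label)
            fp += int(pred == label and ref != label)
            fn += int(pred != label and ref == label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

In the single-label setting each wrong prediction adds exactly one false positive and one false negative, so the pooled F1 equals plain accuracy.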

class unitxt.metrics.F1MicroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'f1_micro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1MultiLabel

class unitxt.metrics.F1MultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'f1_macro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric, PackageRequirementsMixin

prediction_type¶: alias of List[str]

class unitxt.metrics.F1Strings(data_classification_policy: List[str] = None, main_score: str = 'f1_strings', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = {'spacy': 'Please pip install spacy'}, requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['f1_strings']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['f1_strings']}¶

class unitxt.metrics.F1Weighted(data_classification_policy: List[str] = None, main_score: str = 'f1_weighted', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1

class unitxt.metrics.FaithfulnessHHEM(data_classification_policy: List[str] = None, main_score: str = 'hhem_score', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['hhem_score']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], batch_size: int = 2, model_name: str = 'vectara/hallucination_evaluation_model')[source]¶

Bases: BulkInstanceMetric

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['hhem_score']}¶

class unitxt.metrics.FinQAEval(data_classification_policy: List[str] = None, main_score: str = 'program_accuracy', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['program_accuracy', 'execution_accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['program_accuracy', 'execution_accuracy']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

ci_scores: List[str] = ['program_accuracy', 'execution_accuracy']¶

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['program_accuracy', 'execution_accuracy']}¶

class unitxt.metrics.FixedGroupAbsvalNormCohensHParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_cohens_h_paraphrase', <function FixedGroupAbsvalNormCohensHParaphraseAccuracy.<lambda> at 0x7f80a13dbca0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_cohens_h_paraphrase', <function FixedGroupAbsvalNormCohensHParaphraseAccuracy.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupAbsvalNormCohensHParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_cohens_h_paraphrase', <function FixedGroupAbsvalNormCohensHParaphraseStringContainment.<lambda> at 0x7f80a13dbe50>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_cohens_h_paraphrase', <function FixedGroupAbsvalNormCohensHParaphraseStringContainment.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupAbsvalNormHedgesGParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_hedges_g_paraphrase', <function FixedGroupAbsvalNormHedgesGParaphraseAccuracy.<lambda> at 0x7f80a13ed040>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_hedges_g_paraphrase', <function FixedGroupAbsvalNormHedgesGParaphraseAccuracy.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupAbsvalNormHedgesGParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_hedges_g_paraphrase', <function FixedGroupAbsvalNormHedgesGParaphraseStringContainment.<lambda> at 0x7f80a13ed1f0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['absval_norm_hedges_g_paraphrase', <function FixedGroupAbsvalNormHedgesGParaphraseStringContainment.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupMeanAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f80a14a5310>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, True]}}¶

class unitxt.metrics.FixedGroupMeanBaselineAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_baseline', <function FixedGroupMeanBaselineAccuracy.<lambda> at 0x7f80a144da60>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_baseline', <function FixedGroupMeanBaselineAccuracy.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupMeanBaselineStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_baseline', <function FixedGroupMeanBaselineStringContainment.<lambda> at 0x7f80a144ddc0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_baseline', <function FixedGroupMeanBaselineStringContainment.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupMeanParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_paraphrase', <function FixedGroupMeanParaphraseAccuracy.<lambda> at 0x7f80a144dc10>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_paraphrase', <function FixedGroupMeanParaphraseAccuracy.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupMeanParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_paraphrase', <function FixedGroupMeanParaphraseStringContainment.<lambda> at 0x7f80a144df70>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean_paraphrase', <function FixedGroupMeanParaphraseStringContainment.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupMeanStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f80a14a5310>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, True]}}¶

class unitxt.metrics.FixedGroupNormCohensHParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_cohens_h_paraphrase', <function FixedGroupNormCohensHParaphraseAccuracy.<lambda> at 0x7f80a13db5e0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_cohens_h_paraphrase', <function FixedGroupNormCohensHParaphraseAccuracy.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupNormCohensHParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_cohens_h_paraphrase', <function FixedGroupNormCohensHParaphraseStringContainment.<lambda> at 0x7f80a13db790>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_cohens_h_paraphrase', <function FixedGroupNormCohensHParaphraseStringContainment.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupNormHedgesGParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_hedges_g_paraphrase', <function FixedGroupNormHedgesGParaphraseAccuracy.<lambda> at 0x7f80a13db940>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_hedges_g_paraphrase', <function FixedGroupNormHedgesGParaphraseAccuracy.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupNormHedgesGParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_hedges_g_paraphrase', <function FixedGroupNormHedgesGParaphraseStringContainment.<lambda> at 0x7f80a13dbaf0>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['norm_hedges_g_paraphrase', <function FixedGroupNormHedgesGParaphraseStringContainment.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupPDRParaphraseAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['pdr_paraphrase', <function FixedGroupPDRParaphraseAccuracy.<lambda> at 0x7f80a13db160>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['pdr_paraphrase', <function FixedGroupPDRParaphraseAccuracy.<lambda>>, True]}}¶

class unitxt.metrics.FixedGroupPDRParaphraseStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['pdr_paraphrase', <function FixedGroupPDRParaphraseStringContainment.<lambda> at 0x7f80a13db310>, True]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['pdr_paraphrase', <function FixedGroupPDRParaphraseStringContainment.<lambda>>, True]}}¶

class unitxt.metrics.FuzzyNer(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.List[typing.Tuple[str, str]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True, min_score_for_match: float = 0.750001)[source]¶

Bases: CustomF1Fuzzy

prediction_type¶: alias of List[Tuple[str, str]]

class unitxt.metrics.GlobalMetric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: StreamOperator, MetricWithConfidenceInterval

A class for computing metrics that require joint calculations over all instances and are not just an aggregation of the scores of individual instances.

For example, macro_F1 requires calculation of recall and precision per class, so all instances of the class need to be considered. Accuracy, on the other hand, is just an average of the accuracy of all the instances.
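The difference between the two kinds of score can be illustrated with a small standalone sketch. It does not use unitxt; the function names `accuracy` and `macro_f1` and the toy labels are assumptions made for illustration only. Accuracy is written as a mean of per-instance scores, while macro F1 first needs per-class counts gathered over the whole dataset.

```python
from typing import List


def accuracy(references: List[str], predictions: List[str]) -> float:
    # Each instance gets its own score; the global score is their plain average.
    per_instance = [float(r == p) for r, p in zip(references, predictions)]
    return sum(per_instance) / len(per_instance)


def macro_f1(references: List[str], predictions: List[str]) -> float:
    # Precision and recall are defined per class, so every instance of a
    # class must be seen before any per-class F1 can be computed.
    pairs = list(zip(references, predictions))
    f1_per_class = []
    for label in sorted(set(references) | set(predictions)):
        tp = sum(r == label and p == label for r, p in pairs)
        fp = sum(r != label and p == label for r, p in pairs)
        fn = sum(r == label and p != label for r, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_per_class.append(f1)
    # Macro averaging: unweighted mean of the per-class F1 values.
    return sum(f1_per_class) / len(f1_per_class)


references = ["cat", "cat", "cat", "dog"]
predictions = ["cat", "cat", "dog", "dog"]

print(accuracy(references, predictions))  # 0.75
print(macro_f1(references, predictions))  # roughly 0.733 (mean of 0.8 and 2/3)
```

Averaging per-instance scores cannot reproduce the macro F1 value, which is why such metrics are computed jointly over all instances.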

abstract compute(references: List[List[Any]], predictions: List[Any], task_data: List[Any]) → dict[source]¶

Computes a scores dictionary on a list of references, predictions and input.

This function is called once per instance, and then another time over all data instances.

Returns:: a dictionary of scores, set as the instance scores when called on a single data instance and as the global score when called on all the data instances
Return type:: dict
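The calling pattern described for `compute` can be sketched without unitxt as follows. The toy `compute` function, the `score_stream` driver, and the `toy_accuracy` score name are hypothetical and only mirror the documented signature; the real class does additional work such as confidence-interval estimation.

```python
from typing import Any, Dict, List, Tuple


def compute(
    references: List[List[Any]], predictions: List[Any], task_data: List[Any]
) -> Dict[str, float]:
    # Toy metric: fraction of predictions that appear among their references.
    hits = sum(pred in refs for refs, pred in zip(references, predictions))
    return {"toy_accuracy": hits / len(predictions)}


def score_stream(
    references: List[List[Any]], predictions: List[Any], task_data: List[Any]
) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    # Hypothetical driver: call compute once per instance (lists of length one),
    # then one more time over all the data instances together.
    instance_scores = [
        compute([refs], [pred], [data])
        for refs, pred, data in zip(references, predictions, task_data)
    ]
    global_score = compute(references, predictions, task_data)
    return instance_scores, global_score


references = [["a"], ["b", "c"], ["d"]]
predictions = ["a", "c", "x"]
task_data = [{}, {}, {}]

instance_scores, global_score = score_stream(references, predictions, task_data)
print(instance_scores)  # one score dictionary per instance
print(global_score)     # one score dictionary for the whole dataset
```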

class unitxt.metrics.GraniteGuardianAgenticRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.AGENTIC: 'agentic_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶: Bases: GraniteGuardianBase

class unitxt.metrics.GraniteGuardianAssistantRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶: Bases: GraniteGuardianBase

class unitxt.metrics.GraniteGuardianBase(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = None, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶

Bases: InstanceMetric

Return metric for different kinds of “risk” from the Granite-3.0 Guardian model.

available_risks: Dict[RiskType, List[str]] = {RiskType.AGENTIC: ['function_call'], RiskType.ASSISTANT_MESSAGE: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], RiskType.RAG: ['context_relevance', 'groundedness', 'answer_relevance'], RiskType.USER_MESSAGE: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior']}¶

reduction_map: Dict[str, List[str]] = {}¶

wml_params = {'decoding_method': 'greedy', 'max_new_tokens': 20, 'return_options': {'input_text': True, 'input_tokens': False, 'top_n_tokens': 5}, 'temperature': 0}¶

class unitxt.metrics.GraniteGuardianCustomRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.CUSTOM_RISK: 'custom_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶: Bases: GraniteGuardianBase

class unitxt.metrics.GraniteGuardianRagRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.RAG: 'rag_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶: Bases: GraniteGuardianBase

class unitxt.metrics.GraniteGuardianUserRisk(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['torch', 'transformers'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {}, reference_field: str = 'references', prediction_field: str = 'prediction', wml_model_name: str = 'ibm/granite-guardian-3-8b', hf_model_name: str = 'ibm-granite/granite-guardian-3.1-8b', inference_engine: unitxt.inference.LogProbInferenceEngine = None, generation_params: Dict = None, risk_name: str = None, risk_type: <enum 'RiskType = <RiskType.USER_MESSAGE: 'user_risk'>, risk_definition: Union[str, NoneType] = None, user_message_field: str = 'user', assistant_message_field: str = 'assistant', context_field: str = 'context', tools_field: str = 'tools', available_risks: Dict[unitxt.metrics.RiskType, List[str]] = {<RiskType.USER_MESSAGE: 'user_risk'>: ['harm', 'social_bias', 'jailbreak', 'violence', 'profanity', 'unethical_behavior'], <RiskType.ASSISTANT_MESSAGE: 'assistant_risk'>: ['harm', 'social_bias', 'violence', 'profanity', 'unethical_behavior'], <RiskType.RAG: 'rag_risk'>: ['context_relevance', 'groundedness', 'answer_relevance'], <RiskType.AGENTIC: 'agentic_risk'>: ['function_call']})[source]¶: Bases: GraniteGuardianBase

class unitxt.metrics.GroupMean(data_classification_policy: List[str] = None)[source]¶: Bases: GroupReduction

class unitxt.metrics.GroupMeanAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f80a14a5310>, False]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, False]}}¶
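This page does not spell out how a `group_mean` reduction with a `nan_mean` aggregation behaves. One plausible reading, sketched below under that assumption, is that instance scores are first averaged within each group (ignoring NaN values) and the group scores are then averaged. The helper names, the `group_ids` argument, and the toy data are hypothetical, and the trailing boolean in `agg_func` is not modelled here.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def nan_mean(values: List[float]) -> float:
    # Mean over the non-NaN entries; NaN if nothing is left to average.
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


def group_mean(
    instance_scores: List[float], group_ids: List[str]
) -> Tuple[float, Dict[str, float]]:
    # Step 1: collect instance scores by group.
    by_group = defaultdict(list)
    for score, gid in zip(instance_scores, group_ids):
        by_group[gid].append(score)
    # Step 2: aggregate within each group.
    group_scores = {gid: nan_mean(scores) for gid, scores in by_group.items()}
    # Step 3: average the group-level scores.
    return nan_mean(list(group_scores.values())), group_scores


scores = [1.0, 0.0, 1.0, 1.0, float("nan")]
groups = ["q1", "q1", "q1", "q2", "q2"]

overall, per_group = group_mean(scores, groups)
print(per_group)  # q1 averages to 2/3, q2 to 1.0
print(overall)    # 5/6, whereas the plain mean of the non-NaN scores is 0.75
```

Under this reading every group carries equal weight regardless of its size, so the result can differ from a plain instance-level mean.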

class unitxt.metrics.GroupMeanStringContainment(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f80a14a5310>, False]}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StringContainmentOld

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, False]}}¶

class unitxt.metrics.GroupMeanTokenOverlap(data_classification_policy: List[str] = None, main_score: str = 'f1', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['f1', 'precision', 'recall'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean at 0x7f80a14a5310>, False], 'score_fields': ['f1', 'precision', 'recall']}}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: TokenOverlap

reduction_map: Dict[str, List[str]] = {'group_mean': {'agg_func': ['mean', <function nan_mean>, False], 'score_fields': ['f1', 'precision', 'recall']}}¶

class unitxt.metrics.GroupMetric(data_classification_policy: List[str] = None, n_resamples: int = None, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = None, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', metric: unitxt.metrics.MapReduceMetric[PredictionType, IntermediateType] = __required__, group_id_field: str = __required__, item_id_field: str = __required__, in_group_reduction: unitxt.metrics.GroupReduction = None, cross_group_reduction: unitxt.metrics.GroupReduction = None)[source]¶

Bases: MapReduceMetric[PredictionType, IntermediateType], Generic[PredictionType, IntermediateType]

cross_group_reduction: GroupReduction = GroupMean(__type__='group_mean', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

in_group_reduction: GroupReduction = GroupMean(__type__='group_mean', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.GroupReduction(data_classification_policy: List[str] = None)[source]¶: Bases: AggregationReduction[Tuple[str, Dict[str, float]]]

class unitxt.metrics.HuggingfaceBulkMetric(data_classification_policy: List[str] = None, main_score: str = __required__, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = __required__, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], hf_metric_name: str = __required__, hf_metric_fields: List[str] = __required__, hf_compute_args: dict = {}, hf_additional_input_fields: List = [])[source]¶

Bases: BulkInstanceMetric

hf_compute_args: dict = {}¶

class unitxt.metrics.HuggingfaceInstanceMetric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = <class 'unitxt.dataclass.Undefined'>, reference_field: str = 'references', prediction_field: str = 'prediction', hf_metric_name: str = __required__, hf_metric_fields: List[str] = __required__, hf_compute_args: dict = {})[source]¶

Bases: InstanceMetric

hf_compute_args: dict = {}¶

class unitxt.metrics.HuggingfaceMetric(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = None, hf_main_score: str = None, scale: float = 1.0, scaled_fields: list = None, hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [])[source]¶: Bases: GlobalMetric

class unitxt.metrics.InstanceMetric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = <class 'unitxt.dataclass.Undefined'>, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: StreamOperator, MetricWithConfidenceInterval

Class for metrics for which a global score can be calculated by aggregating the instance scores (possibly with additional instance inputs).

InstanceMetric currently allows two reductions:

‘mean’, which calculates the mean of instance scores,
‘group_mean’, which first applies an aggregation function specified in the reduction_map to instance scores grouped by the field grouping_field, and returns the mean of the group scores. This reduction requires grouping_field to be set; if grouping_field is None, grouping is disabled. See _validate_group_mean_reduction for formatting instructions.
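The difference between the ‘mean’ and ‘group_mean’ reductions can be sketched in plain Python, independent of unitxt. The instance layout (a flat dict with a group_id key and an accuracy score) and both helper names are assumptions made for illustration only; they are not the library's internal representation.

```python
from collections import defaultdict
from statistics import mean


def mean_reduction(instances, score_name):
    """'mean': average the instance scores directly."""
    return mean(inst[score_name] for inst in instances)


def group_mean_reduction(instances, score_name, group_field, agg_func=mean):
    """'group_mean': aggregate within each group, then average the group scores."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst[group_field]].append(inst[score_name])
    group_scores = [agg_func(scores) for scores in groups.values()]
    return mean(group_scores)


# Hypothetical instances: group "a" has three members, group "b" has one.
instances = [
    {"group_id": "a", "accuracy": 1.0},
    {"group_id": "a", "accuracy": 0.0},
    {"group_id": "a", "accuracy": 0.0},
    {"group_id": "b", "accuracy": 1.0},
]

overall = mean_reduction(instances, "accuracy")
grouped = group_mean_reduction(instances, "accuracy", "group_id")
```

Here the plain mean weights every instance equally, while the grouped mean weights every group equally, so the large group "a" no longer dominates the result.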

get_group_scores(instances: List[dict], score_names: List[str], group_aggregation_func, prepend_score_prefix: bool)[source]¶

Group scores by the group_id and subgroup_type fields of each instance, and compute group_aggregation_func by group.

Parameters:

instances (list) – List of observation instances with instance-level scores (fields) computed.
score_names (list) – List of instance score names in each instance to apply the aggregation function.
group_aggregation_func (Callable) – Aggregation function that accepts a list of numeric scores or, if self.subgroup_column is not None, a dict of scores keyed by subgroup_column value. The function returns a single score for the group.
prepend_score_prefix (bool) – if True - prepend the score_prefix to the score names in the returned dicts. Set to False if down the stream such a prepending is expected.

Returns:

List of dicts, each corresponding to a group of instances (defined by ‘group_id’), with an aggregate group score for each score_name.

class unitxt.metrics.IsCodeMixed(data_classification_policy: List[str] = None, main_score: str = 'is_code_mixed', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['is_code_mixed']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], inference_model: unitxt.inference.InferenceEngine = None)[source]¶

Bases: BulkInstanceMetric

Uses a generative model to assess whether a given text is code-mixed.

Our goal is to identify whether a text is code-mixed, i.e., contains a mixture of different languages. The model is asked to identify the language of the text; if the model response begins with a number we take this as an indication that the text is code-mixed, for example:

Model response: “The text is written in 2 different languages” (code-mixed), vs.
Model response: “The text is written in German” (not code-mixed)

Note that this metric is quite tailored to specific model-template combinations, as it relies on the assumption that the model will complete the answer prefix “The text is written in ___” in a particular way.
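The decision rule described above can be sketched as a small string check. This is only an illustration of the heuristic, not the metric's actual code: the function name is hypothetical, and stripping the answer prefix from the response is an assumption about how the response is presented.

```python
def looks_code_mixed(model_response: str, prefix: str = "The text is written in") -> bool:
    """Return True if the completion after the answer prefix begins with a digit."""
    remainder = model_response.strip()
    # Assumption: the response may or may not repeat the answer prefix.
    if remainder.startswith(prefix):
        remainder = remainder[len(prefix):]
    remainder = remainder.lstrip()
    return bool(remainder) and remainder[0].isdigit()


mixed = looks_code_mixed("The text is written in 2 different languages")
single = looks_code_mixed("The text is written in German")
```

A model that answers "two languages" in words would not be caught by this check, which is one reason the metric depends on the model-template combination.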

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['is_code_mixed']}¶

class unitxt.metrics.JaccardIndex(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'jaccard_index', prediction_type: Any | str = typing.Union[list, set], single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[IntermediateType] = None)[source]¶

Bases: ReductionInstanceMetric[str, Dict[str, float]]

Computes Jaccard similarity coefficient between prediction and reference sets.

Range: [0, 1] (higher is better) Measures overlap as intersection over union of two sets.

Reference: https://en.wikipedia.org/wiki/Jaccard_index
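As a minimal sketch of intersection over union, written in plain Python rather than with the unitxt class: the function name and the convention of returning 1.0 for two empty sets are assumptions for illustration.

```python
def jaccard_index(prediction, reference):
    """Intersection over union of two collections, treated as sets."""
    pred_set, ref_set = set(prediction), set(reference)
    union = pred_set | ref_set
    if not union:
        # Assumed convention: two empty sets count as identical.
        return 1.0
    return len(pred_set & ref_set) / len(union)


# Two shared items ("b", "c") out of four distinct items overall.
score = jaccard_index(["a", "b", "c"], ["b", "c", "d"])
```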

prediction_type¶: alias of Union[list, set]

reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.JaccardIndexString(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'jaccard_index', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[~IntermediateType] = None, splitter: unitxt.operators.FieldOperator = __required__)[source]¶

Bases: JaccardIndex

Calculates JaccardIndex on strings.

Requires setting the ‘splitter’ to a FieldOperator (such as Split or RegexSplit) to tokenize the predictions and references into lists of strings tokens.

These tokens are passed to the JaccardIndex as lists.
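The tokenize-then-compare flow can be illustrated without unitxt. In this sketch a whitespace split via the standard re module stands in for the ‘splitter’ operator; the function names are hypothetical.

```python
import re


def whitespace_tokens(text: str) -> list:
    """Stand-in for a splitter operator: split on runs of whitespace."""
    return re.split(r"\s+", text.strip())


def jaccard_on_strings(prediction: str, reference: str) -> float:
    pred_tokens = set(whitespace_tokens(prediction))
    ref_tokens = set(whitespace_tokens(reference))
    union = pred_tokens | ref_tokens
    return len(pred_tokens & ref_tokens) / len(union) if union else 1.0


# Shared tokens: "the", "brown"; six distinct tokens in total.
string_score = jaccard_on_strings("the quick brown fox", "the lazy brown dog")
```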

prediction_type¶: alias of str

class unitxt.metrics.KPA(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶

Bases: CustomF1

prediction_type¶: alias of str

class unitxt.metrics.KendallTauMetric(data_classification_policy: List[str] = None, main_score: str = 'kendalltau_b', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['scipy'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric

Computes Kendall’s tau rank correlation coefficient.

Range: [-1, 1] (higher absolute value is better) Measures strength of ordinal association between predictions and references.

Reference: https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
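The main score name suggests the tau-b variant, which corrects for ties. The following is a minimal pure-Python sketch of tau-b over all pairs, assuming that variant; it is not the library's implementation (the requirements list scipy), and the quadratic loop is only suitable for small inputs.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = math.sqrt((pairs + tied_x_only) * (pairs + tied_y_only))
    return (concordant - discordant) / denom if denom else math.nan


tau_same = kendall_tau_b([1, 2, 3, 4], [1, 2, 3, 4])
tau_reversed = kendall_tau_b([1, 2, 3, 4], [4, 3, 2, 1])
tau_partial = kendall_tau_b([1, 2, 3], [1, 3, 2])
```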

prediction_type¶: alias of float

class unitxt.metrics.KeyValueExtraction(data_classification_policy: List[str] = None, main_score: str = '', prediction_type: Any | str = typing.Dict[str, str], single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, metric: unitxt.base_metric.Metric = __required__)[source]¶

Bases: GlobalMetric

prediction_type¶: alias of Dict[str, str]

class unitxt.metrics.LlamaIndexCorrectness(data_classification_policy: List[str] = ['public'], main_score: str = '', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = 'correctness_', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['llama-index-core', 'llama-index-llms-openai'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = '', openai_models: List[str] = ['gpt-3.5-turbo'], anthropic_models: List[str] = [], mock_models: List[str] = ['mock'])[source]¶

Bases: LlamaIndexLLMMetric

LlamaIndex based metric class for evaluating correctness.

compute(references: List[str], prediction: str, task_data: Dict) → Dict[str, Any][source]¶

Method to compute the correctness metric.

Parameters:

references (List[str]) – List of reference instances.
prediction (str) – The predicted instance.
task_data (Dict) – Additional input data.

Returns:

Dict of computed scores and feedback.

Return type:

Dict[str, Any]

Raises:

AssertionError – If the input does not meet the expected format.

prepare()[source]¶: Initialization method for the metric. Initializes the CorrectnessEvaluator with the OpenAI model.

class unitxt.metrics.LlamaIndexFaithfulness(data_classification_policy: List[str] = ['public'], main_score: str = '', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = 'faithfulness_', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['llama-index-core', 'llama-index-llms-openai'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = '', openai_models: List[str] = ['gpt-3.5-turbo'], anthropic_models: List[str] = [], mock_models: List[str] = ['mock'])[source]¶

Bases: LlamaIndexLLMMetric

LlamaIndex based metric class for evaluating faithfulness.

prepare()[source]¶: Initialization method for the metric. Initializes the FaithfulnessEvaluator with the OpenAI model.

class unitxt.metrics.LlamaIndexLLMMetric(data_classification_policy: List[str] = ['public'], main_score: str = '', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['llama-index-core', 'llama-index-llms-openai'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = '', openai_models: List[str] = ['gpt-3.5-turbo'], anthropic_models: List[str] = [], mock_models: List[str] = ['mock'])[source]¶

Bases: InstanceMetric

anthropic_models: List[str] = []¶

data_classification_policy: List[str] = ['public']¶

external_api_models = ['gpt-3.5-turbo']¶

mock_models: List[str] = ['mock']¶

openai_models: List[str] = ['gpt-3.5-turbo']¶

prediction_type¶: alias of str

class unitxt.metrics.MAP(data_classification_policy: List[str] = None, main_score: str = 'map', prediction_type: Any | str = typing.Union[typing.List[str], typing.List[int]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['map'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['map']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: RetrievalMetric

Mean Average Precision for information retrieval evaluation.

Range: [0, 1] (higher is better) Averages precision values at ranks where relevant documents are retrieved.

Reference: https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision
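A minimal sketch of average precision for a single query, following the textbook definition (precision at each rank holding a relevant document, divided by the number of relevant documents). The function name and the normalization choice are assumptions; the library's handling of edge cases may differ.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k that hold a relevant document."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Relevant documents appear at ranks 1 and 3: (1/1 + 2/3) / 2.
ap_first = average_precision(["d1", "d2", "d3", "d4"], ["d1", "d3"])
# The single relevant document appears at rank 2: 1/2.
ap_second = average_precision(["d5", "d6"], ["d6"])
# MAP is the mean of the per-query values.
map_score = (ap_first + ap_second) / 2
```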

ci_scores: List[str] = ['map']¶

reduction_map: Dict[str, List[str]] = {'mean': ['map']}¶

class unitxt.metrics.MRR(data_classification_policy: List[str] = None, main_score: str = 'mrr', prediction_type: Any | str = typing.Union[typing.List[str], typing.List[int]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['mrr'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['mrr']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: RetrievalMetric

Mean Reciprocal Rank for information retrieval evaluation.

Range: [0, 1] (higher is better) Measures the average of reciprocal ranks of first relevant items.

Reference: https://en.wikipedia.org/wiki/Mean_reciprocal_rank
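As an illustration, the reciprocal rank of one query is 1 divided by the position of the first relevant item, and MRR averages that value over queries. The helper below is a hypothetical sketch; scoring a query with no relevant item as 0 is an assumed convention.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


queries = [
    (["d1", "d2", "d3"], ["d1"]),  # first relevant item at rank 1
    (["d4", "d5", "d6"], ["d6"]),  # first relevant item at rank 3
    (["d7", "d8"], ["d9"]),        # no relevant item retrieved
]
mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in queries) / len(queries)
```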

ci_scores: List[str] = ['mrr']¶

reduction_map: Dict[str, List[str]] = {'mean': ['mrr']}¶

class unitxt.metrics.MapReduceMetric(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶: Bases: StreamOperator, Metric, ConfidenceIntervalMixin, Generic[PredictionType, IntermediateType]

class unitxt.metrics.MatthewsCorrelation(data_classification_policy: List[str] = None, main_score: str = 'matthews_correlation', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = 'matthews_correlation', hf_main_score: str = None, scale: float = 1.0, scaled_fields: list = None, hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [], str_to_id: dict = {})[source]¶

Bases: HuggingfaceMetric

Computes Matthews correlation coefficient for classification.

Range: [-1, 1] (higher is better) Balanced metric for binary classification, handles class imbalance well.

Reference: https://en.wikipedia.org/wiki/Phi_coefficient
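For the binary case, the coefficient can be computed from the four confusion-matrix counts. The sketch below is a plain-Python illustration rather than the Hugging Face implementation that this class wraps; returning 0.0 when the denominator is zero is an assumed convention.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred, positive=1):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined coefficient is reported as 0.0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# TP=1, FN=1, TN=2, FP=0.
mcc = matthews_corrcoef_binary([1, 1, 0, 0], [1, 0, 0, 0])
```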

prediction_type¶: alias of str

class unitxt.metrics.MaxAccuracy(data_classification_policy: List[str] = None, main_score: str = 'accuracy', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['accuracy'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'max': ['accuracy']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: Accuracy

Calculate the maximal accuracy over all instances as the global score.

reduction_map: Dict[str, List[str]] = {'max': ['accuracy']}¶

class unitxt.metrics.MaxReduction(data_classification_policy: List[str] = None)[source]¶: Bases: DictReduction

class unitxt.metrics.MeanReduction(data_classification_policy: List[str] = None)[source]¶: Bases: DictReduction

class unitxt.metrics.MeanSquaredError(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'mean_squared_error', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = True, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: MapReduceMetric[float, float]

Computes mean squared error between predictions and references.

Range: [0, ∞) (lower is better) Measures average squared differences between predicted and true values.
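Because the class is a MapReduceMetric, the computation splits naturally into a per-instance step and an aggregation step. A minimal sketch of that split, with hypothetical function names that do not mirror the library's method names:

```python
from statistics import mean


def squared_error(prediction: float, reference: float) -> float:
    """Map step: score a single instance."""
    return (prediction - reference) ** 2


def mean_squared_error(predictions, references):
    """Reduce step: average the per-instance squared errors."""
    return mean(squared_error(p, r) for p, r in zip(predictions, references))


# Squared errors are 0.25, 0.25 and 0.0.
mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
```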

prediction_type¶: alias of float

class unitxt.metrics.Meteor(data_classification_policy: List[str] = None, main_score: str = 'meteor', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['meteor'], ci_method: str = 'BCa', _requirements_list: List[str] = ['nltk>=3.6.6'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['meteor']}, reference_field: str = 'references', prediction_field: str = 'prediction', alpha: float = 0.9, beta: int = 3, gamma: float = 0.5)[source]¶

Bases: InstanceMetric

ci_scores: List[str] = ['meteor']¶

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['meteor']}¶

class unitxt.metrics.MeteorFast(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'meteor', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['nltk>=3.6.6'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[IntermediateType] = None, alpha: float = 0.9, beta: int = 3, gamma: float = 0.5)[source]¶

Bases: ReductionInstanceMetric[str, Dict[str, float]]

reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.MetricBasedNer(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.List[typing.Tuple[str, str]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True, min_score_for_match: float = 0.75, metric: unitxt.base_metric.Metric = __required__)[source]¶

Bases: CustomF1Fuzzy

Calculates F1 metrics for NER by comparing entities using a provided Unitxt metric.

While the Ner metric uses exact match to compare entities and FuzzyNer uses fuzzy matching, this customizable metric can use any Unitxt metric to compare entities, including LLM as Judge. The metric must accept string predictions and references as input. The similarity threshold is set by the ‘min_score_for_match’ attribute.

Example: MetricBasedNer(metric=Rouge(), min_score_for_match=0.9)

MetricBasedNer(metric="metrics.llm_as_judge.direct.watsonx.llama3_3_70b[criteria=metrics.llm_as_judge.direct.criteria.correctness_based_on_ground_truth,context_fields=ground_truth]")
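The thresholded matching idea can be sketched without unitxt. Here difflib.SequenceMatcher from the standard library stands in for the pluggable similarity metric, and the matching and F1 bookkeeping are assumptions for illustration; the library's actual matching logic may differ.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity metric returning a value in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def has_match(entity, candidates, min_score_for_match):
    """An entity matches if a candidate has the same type and is similar enough."""
    text, label = entity
    return any(
        label == cand_label and similarity(text, cand_text) >= min_score_for_match
        for cand_text, cand_label in candidates
    )


def thresholded_ner_f1(predicted, reference, min_score_for_match=0.75):
    matched_pred = sum(has_match(e, reference, min_score_for_match) for e in predicted)
    matched_ref = sum(has_match(e, predicted, min_score_for_match) for e in reference)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_ref / len(reference) if reference else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


predicted = [("Barack Obama", "Person"), ("Paris", "Location")]
reference = [("Barak Obama", "Person"), ("London", "Location")]
# The misspelled name still clears the threshold; the two locations do not match.
f1 = thresholded_ner_f1(predicted, reference)
```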

prediction_type¶: alias of List[Tuple[str, str]]

class unitxt.metrics.MetricPipeline(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, preprocess_steps: List[unitxt.operator.StreamingOperator] | NoneType = [], postprocess_steps: List[unitxt.operator.StreamingOperator] | NoneType = [], postpreprocess_steps: List[unitxt.operator.StreamingOperator] | NoneType = None, metric: unitxt.base_metric.Metric = None)[source]¶: Bases: MultiStreamOperator, Metric

class unitxt.metrics.MetricWithConfidenceInterval(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = None, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa')[source]¶

Bases: Metric

static average_item_scores(instances: List[dict], score_name: str)[source]¶

Calculate mean of a set of instance scores (given by score_name), omitting NaN values.

Parameters:

instances – list of dicts of each instance’s instance scores.
score_name – name of the score field to compute the mean for.

compute_global_confidence_intervals(references, predictions, task_data, score_name)[source]¶: Computes confidence intervals for a set of references and predictions.

static max_item_scores(instances: List[dict], score_name: str)[source]¶

Calculate max of a set of instance scores (given by score_name), omitting NaN values.

Parameters:

instances – list of dicts of each instance’s instance scores.
score_name – name of the score field to compute the max for.

resample_from_non_nan(values)[source]¶

Given an array values, will replace any NaN values with elements resampled with replacement from the non-NaN ones.

Here we deal with samples on which the metric could not be computed. These are edge cases, for example when the sample contains only empty strings. The CI describes the distribution around the statistic (e.g. the mean); it does not cover cases in which the metric is not computable. Therefore, we ignore these edge cases when computing the CI.

In theory there would be several ways to deal with this:

1. Skip the errors and return a shorter array => this fails because SciPy requires this callback (i.e. the statistic() callback) to return an array of the same size as the number of resamples.
2. Put np.nan for the errors => this fails because in that case the CI itself becomes np.nan, so one edge case can fail the whole CI computation.
3. Replace the errors with a sampling from the successful cases => this is what is implemented.

This resampling ensures that, where possible, the BCa confidence interval returned by bootstrap will not be NaN, since bootstrap does not ignore NaNs. However, if there are 0 or 1 non-NaN values, or all non-NaN values are equal, the resulting distribution will be degenerate (only one unique value), so the CI will still be NaN since there is no variability. In this case, the CI is essentially an interval of length 0 equaling the mean itself.
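The replacement strategy can be illustrated with a short standard-library sketch. It mirrors the idea only: the real method works on arrays, and the explicit rng argument here is an assumption added to make the example reproducible.

```python
import math
import random


def resample_from_non_nan(values, rng=None):
    """Replace each NaN with a value drawn, with replacement, from the non-NaN ones."""
    rng = rng or random.Random(0)
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan or len(non_nan) == len(values):
        # Nothing to sample from, or nothing to replace: return the values unchanged.
        return list(values)
    return [rng.choice(non_nan) if math.isnan(v) else v for v in values]


values = [0.2, float("nan"), 0.8, float("nan")]
filled = resample_from_non_nan(values, random.Random(0))
```

The output has the same length as the input, which is the property the statistic() callback needs, and every former NaN now holds one of the observed scores.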

score_based_confidence_interval(instances: List[dict], score_names: List[str], aggregation_func=None, ci_score_prefix='')[source]¶

Compute confidence intervals based on existing scores, already computed on the input instances.

Unlike GlobalMetric, this is simply a function of the instance scores (possibly taking into account the task_data field), so they don’t need to be recomputed after every bootstrap draw.

Parameters:

instances – The instances for which the confidence intervals are computed; should already have the relevant instance scores calculated.
score_names – List of instance score field names to compute a confidence interval for.
aggregation_func – A function with arguments instances and field_name. It is applied to the list of instances (which may include the task_data field, as well as the prediction and references) and the field_name. If the argument is None, the default is simply to take the mean of field_name over the instances after resampling.
ci_score_prefix – An optional string prefix to the score_name in the CI. Useful in cases where the aggregation_func is something other than the mean

Returns:

Dict of confidence interval values
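The key point, that instance scores are resampled rather than recomputed, can be shown with a small bootstrap sketch. Note that this uses a simple percentile interval instead of the BCa method named by ci_method, purely to keep the example short; the function name and seed argument are hypothetical.

```python
import random
from statistics import mean


def percentile_bootstrap_ci(scores, n_resamples=1000, confidence_level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of precomputed instance scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Each draw resamples the existing scores; no metric is recomputed.
    stats = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    alpha = (1 - confidence_level) / 2
    low = stats[int(alpha * n_resamples)]
    high = stats[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return low, high


instance_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
ci_low, ci_high = percentile_bootstrap_ci(instance_scores)
```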

class unitxt.metrics.MetricsEnsemble(data_classification_policy: List[str] = None, main_score: str = 'ensemble_score', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['ensemble_score']}, reference_field: str = 'references', prediction_field: str = 'prediction', metrics: List[unitxt.base_metric.Metric | str] = __required__, weights: List[float] = None)[source]¶

Bases: InstanceMetric, ArtifactFetcherMixin

Metrics Ensemble class for creating ensemble of given metrics.

Parameters:

main_score (str) – The main score label used for evaluation.
metrics (List[Union[Metric, str]]) – List of metrics to be ensembled.
weights (List[float]) – Weight of each of the metrics.
reduction_map (Dict[str, List[str]]) – Specifies the reduction method of the global score. InstanceMetric currently allows two reductions (see its definition in the InstanceMetric class). This class defines its default value to reduce by the mean of the main score.
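One plausible reading of the parameters above is a weighted sum of the individual metric scores for each instance. The sketch below illustrates that reading only; the function name and the choice of equal weights when weights is None are assumptions, not behavior confirmed by this page.

```python
def ensemble_score(metric_scores, weights=None):
    """Weighted sum of per-metric scores for a single instance."""
    if weights is None:
        # Assumption: default to equal weights that sum to 1.
        weights = [1 / len(metric_scores)] * len(metric_scores)
    if len(weights) != len(metric_scores):
        raise ValueError("weights and metric_scores must have the same length")
    return sum(score * weight for score, weight in zip(metric_scores, weights))


equal = ensemble_score([0.8, 0.4])
weighted = ensemble_score([0.8, 0.4], weights=[0.75, 0.25])
```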

reduction_map: Dict[str, List[str]] = {'mean': ['ensemble_score']}¶

class unitxt.metrics.MetricsList(data_classification_policy: List[str] = None, items: List[unitxt.artifact.Artifact] = [])[source]¶: Bases: ListCollection

class unitxt.metrics.MultiTurnMetric(data_classification_policy: List[str] = None, n_resamples: int = None, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = None, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', metric: unitxt.metrics.MapReduceMetric[PredictionType, IntermediateType] = __required__, group_id_field: str = 'conversation/id', item_id_field: str = 'conversation/dialog', in_group_reduction: unitxt.metrics.GroupReduction = None, cross_group_reduction: unitxt.metrics.GroupReduction = None)[source]¶: Bases: GroupMetric[PredictionType, IntermediateType], Generic[PredictionType, IntermediateType]

class unitxt.metrics.MultiTurnToolCallingMetric(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'argument_schema_validation', prediction_type: Any | str = typing.List[unitxt.types.ToolCall], single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = ['jsonschema-rs'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[IntermediateType] = None)[source]¶

Bases: ReductionInstanceMetric[str, Dict[str, float]]

Compares each predicted tool call with the list of reference tool calls.

prediction_type¶: alias of List[ToolCall]

reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.NDCG(data_classification_policy: List[str] = None, main_score: str = 'nDCG', prediction_type: Any | str = typing.Union[float, NoneType], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric

Normalized Discounted Cumulative Gain: measures the quality of ranking with respect to ground truth ranking scores.

Range: [0, 1] (higher is better)

As this measures ranking, it is a global metric that can only be calculated over groups of instances. In the common use case where the instances are grouped by different queries, i.e., where the task is to provide a relevance score for a search result w.r.t. a query, an nDCG score is calculated for each query (specified in the “query” input field of an instance) and the final score is the average across all queries. Note that the expected scores are relevance scores (i.e., higher is better) and not rank indices. The absolute value of the scores is only meaningful for the reference scores; for the predictions, only the ordering of the scores affects the outcome. For example, predicted scores of [80, 1, 2] and [0.8, 0.5, 0.6] will receive the same nDCG score w.r.t. a given set of reference scores.
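The ordering-only property described above can be checked with a minimal pure-Python sketch. It assumes linear gain with a log2 rank discount and ignores ties; the helper names `dcg` and `ndcg` and the reference scores are hypothetical and are not part of unitxt, whose implementation relies on scikit-learn.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(reference_scores: Sequence[float], predicted_scores: Sequence[float]) -> float:
    """nDCG for one query: DCG of the predicted ranking over the ideal DCG."""
    # Only the ordering induced by the predicted scores is used.
    order = sorted(
        range(len(predicted_scores)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    ranked_relevances = [reference_scores[i] for i in order]
    ideal_dcg = dcg(sorted(reference_scores, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical reference relevance scores for three search results of one query.
reference = [1.0, 3.0, 2.0]
score_a = ndcg(reference, [80, 1, 2])
score_b = ndcg(reference, [0.8, 0.5, 0.6])
print(score_a == score_b)
```

Both prediction lists rank the first item highest and the second item lowest, so they yield the same score even though their magnitudes differ.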

Reference: https://en.wikipedia.org/wiki/Discounted_cumulative_gain

prediction_type¶: alias of Optional[float]

class unitxt.metrics.NER(data_classification_policy: List[str] = None, main_score: str = 'f1_micro', prediction_type: Any | str = typing.List[typing.Tuple[str, str]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, zero_division: float = 0.0, report_per_group_scores: bool = True)[source]¶

Bases: CustomF1

F1 metric that receives as input a list of (Entity, EntityType) pairs.

prediction_type¶: alias of List[Tuple[str, str]]

class unitxt.metrics.NLTKMixin(data_classification_policy: List[str] = None)[source]¶: Bases: Artifact

class unitxt.metrics.NormalizedSacrebleu(data_classification_policy: List[str] = None, _requirements_list: Union[List[str], Dict[str, str]] = ['sacrebleu'], requirements: Union[List[str], Dict[str, str]] = [], n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = ['sacrebleu'], return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'sacrebleu', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', language_to_tokenizer: Union[Dict[str, str], NoneType] = None, tokenize: str = None, lowercase: bool = False, force: bool = False, smooth_method: str = 'exp', smooth_value: Union[float, NoneType] = None, use_effective_order: bool = True, max_ngram_order: int = 4)[source]¶

Bases: MapReduceMetric[str, SacreBleuStats], PackageRequirementsMixin

SacreBLEU metric implementation using MapReduceMetric pattern.

This implementation uses the official sacrebleu library for tokenization and BLEU computation, while supporting the map-reduce pattern for proper corpus-level evaluation that matches the behavior of the HuggingFace version.

Range: [0, 1] (higher is better)

Reference: Post, M. 2018. A Call for Clarity in Reporting BLEU Scores.

ci_score_names: List[str] = ['sacrebleu']¶

map(prediction: str, references: List[str], task_data: Dict[str, Any]) → SacreBleuStats[source]¶: Map function: compute BLEU statistics for a single instance using sacrebleu.

prediction_type¶: alias of str

reduce(intermediates: List[SacreBleuStats]) → Dict[str, Any][source]¶: Reduce function: aggregate statistics and compute corpus BLEU using sacrebleu.

class unitxt.metrics.Pearsonr(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = ['pearsonr'], return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'pearsonr', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = ['scipy'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: CorrelationMetric

Computes Pearson correlation coefficient.

Range: [-1, 1] (higher absolute value is better) Measures linear relationship between predictions and references.

Reference: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
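As an illustration of the quantity being measured, here is a small sketch of the textbook Pearson formula in plain Python. The function name `pearson_r` is hypothetical; the class itself lists scipy as its requirement, and this sketch does not reproduce that code path.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) when either input is constant.
    return covariance / math.sqrt(var_x * var_y)


predictions = [1.0, 2.0, 3.0, 4.0]
references = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(predictions, references))
```

A perfectly linear increasing relationship gives a value of 1 (up to floating-point error), and reversing one of the lists gives -1.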

ci_score_names: List[str] = ['pearsonr']¶

class unitxt.metrics.Perplexity(data_classification_policy: List[str] = None, main_score: str = 'perplexity', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reduction_map: Dict[str, List[str]] = {'mean': ['perplexity']}, implemented_reductions: List[str] = ['mean', 'weighted_win_rate'], source_template: str = __required__, target_template: str = __required__, batch_size: int = 32, model_name: str = __required__, single_token_mode: bool = False)[source]¶

Bases: BulkInstanceMetric

Computes perplexity of generating target text given source context.

Range: [1, ∞) (lower is better) Measures how well a language model predicts the target sequence.

Reference: https://en.wikipedia.org/wiki/Perplexity
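The stated range of [1, ∞) follows from the textbook definition of perplexity as the exponential of the negative mean token log-probability. The sketch below shows only that definition; the function name `perplexity` and the input format (natural-log probabilities of the target tokens) are assumptions, and the class's actual computation with a language model may differ.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability of the target tokens."""
    if not token_log_probs:
        raise ValueError("at least one target token is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical target of four tokens, each predicted with probability 0.5.
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half))
```

A model that assigns probability 1 to every target token reaches the lower bound of 1; lower token probabilities push the value up.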

compute(references: List[List[Any]], predictions: List[Any], task_data: List[Dict]) → List[Dict[str, Any]][source]¶

Computes the likelihood of generating text Y after text X - P(Y|X).

Parameters:

predictions – the list of Y texts = the targets of the generation
references – the list of lists of X texts = the sources of the generation

Returns:

the likelihood of generating text Y_i after each text X_i_j = P(Y_i|X_i_1), …, P(Y_i|X_i_n) for every i.

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['perplexity']}¶

class unitxt.metrics.PrecisionBinary(data_classification_policy: List[str] = None, main_score: str = 'precision_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['f1_binary', 'f1_binary_neg'], ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1Binary

class unitxt.metrics.PrecisionMacroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'precision_macro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1MultiLabel

class unitxt.metrics.PrecisionMicroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'precision_micro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1MultiLabel

class unitxt.metrics.PredictionLength(data_classification_policy: List[str] = None, main_score: str = 'prediction_length', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['prediction_length']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

Returns the length of the prediction.

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['prediction_length']}¶

class unitxt.metrics.RandomForestMetricsEnsemble(data_classification_policy: List[str] = None, main_score: str = 'ensemble_score', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['ensemble_score']}, reference_field: str = 'references', prediction_field: str = 'prediction', metrics: List[unitxt.base_metric.Metric | str] = __required__, weights: List[float] = None)[source]¶

Bases: MetricsEnsemble

This class extends the MetricsEnsemble base class and uses a pre-trained scikit-learn Random Forest classification model to combine and aggregate scores from multiple judges.

load_weights method: Loads model weights from a dictionary representation of a random forest classifier.
ensemble method: Decodes the RandomForestClassifier object and predicts a score based on the given instance.

class unitxt.metrics.RecallBinary(data_classification_policy: List[str] = None, main_score: str = 'recall_binary', prediction_type: Any | str = typing.Union[float, int], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['f1_binary', 'f1_binary_neg'], ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1Binary

class unitxt.metrics.RecallMacroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'recall_macro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1MultiLabel

class unitxt.metrics.RecallMicroMultiLabel(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['scikit-learn'], requirements: List[str] | Dict[str, str] = [], main_score: str = 'recall_micro', prediction_type: Any | str = typing.List[str], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: F1MultiLabel

class unitxt.metrics.ReductionInstanceMetric(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[~IntermediateType] = __required__)[source]¶: Bases: MapReduceMetric[PredictionType, IntermediateType], Generic[PredictionType, IntermediateType]

class unitxt.metrics.ReflectionToolCallingMetric(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'overall_valid', prediction_type: Union[Any, str] = <class 'unitxt.types.ToolCall'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = {'llmevalkit': 'Install with "pip install \'git+ssh://git@github.ibm.com/MLT/LLMEvalKit.git\'".\nTo gain access please reach the team.'}, requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[~IntermediateType] = None, runtime_pipeline: bool = True)[source]¶

Bases: ReductionInstanceMetric[str, Dict[str, float]]

Measures syntactic and semantic validity of tool calls.

The final output contains two main fields: “semantic” and “static” (i.e., syntactic). Under “semantic” we define two types of metrics: general and function selection.

General metrics evaluate the overall quality and correctness of the tool call. These metrics include:

General hallucination check: Evaluate whether each parameter value in the function call is correct, is directly supported by the provided conversation history, and adheres to the tool specifications.

Value format alignment: Check if the format of the parameter values aligns with the expected formats defined in the tool specifications.

Function selection metrics evaluate the appropriateness of the selected function for the given context. These metrics include:

Function selection appropriateness: Assess whether the chosen function is suitable for the task at hand.

Agentic constraints satisfaction: Assess whether the proposed tool call satisfies all agentic constraints required for execution.

Static metrics evaluate the syntactic validity of the tool call. They include the following metrics:

- non_existent_function: tool name not found.
- non_existent_parameter: argument name not in tool spec.
- incorrect_parameter_type: argument type mismatch.
- missing_required_parameter: required argument missing.
- allowed_values_violation: argument value outside allowed set.
- json_schema_violation: call violates JSON schema.
- empty_api_spec: no tool spec provided.
- invalid_api_spec: tool spec is invalid.
- invalid_tool_call: call is not a valid tool invocation.
- overall_valid: validity of the call (main score).
- score: alias of overall_valid.

Here is an example of an aggregated reflection output after calling reduce. The range of each score is [0, 1] (higher indicates fewer errors).

{
    "static_non_existent_function": 1.0,
    "static_non_existent_parameter": 1.0,
    "static_incorrect_parameter_type": 1.0,
    "static_missing_required_parameter": 1.0,
    "static_allowed_values_violation": 1.0,
    "static_json_schema_violation": 1.0,
    "static_empty_api_spec": 1.0,
    "static_invalid_api_spec": 1.0,
    "static_invalid_tool_call": 1.0,
    "semantic_general_hallucination_check": 0.0,
    "semantic_general_value_format_alignment": 0.0,
    "semantic_avg_score_general": 1.0,
    "semantic_function_selection_appropriateness": 0.0,
    "semantic_agentic_constraints_satisfaction": 0.0,
    "semantic_avg_score_function_selection": 1.0,
    "overall_valid": 1.0
}

Where overall_valid is the final decision made by the reflection pipeline, indicating whether the tool call is valid or not.

Before aggregation, each metric also contains evidence, an explanation, a more fine-grained score, etc.

Reference: https://github.ibm.com/MLT/LLMEvalKit

map_stream(items: Iterable[Tuple[ToolCall, None, Dict[str, Any]]], *, max_concurrency: int = 8) → List[Dict[str, Any]][source]¶: Run self.map in parallel over an iterable and return results in order.

prediction_type[source]¶: alias of ToolCall

reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.ReflectionToolCallingMetricSyntactic(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'overall_valid', prediction_type: Union[Any, str] = <class 'unitxt.types.ToolCall'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = {'llmevalkit': 'Install with "pip install \'git+ssh://git@github.ibm.com/MLT/LLMEvalKit.git\'".\nTo gain access please reach the team.'}, requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[~IntermediateType] = None)[source]¶

Bases: ReductionInstanceMetric[str, Dict[str, float]]

Measures syntactic and schema validity of tool calls.

Range: [0, 1] (higher indicates fewer errors). Each metric returns 1.0 if the tool call is valid for that metric, 0.0 otherwise. overall_valid equals 1.0 if all metrics are valid, 0.0 otherwise. The global score is the fraction of valid instances across the dataset.

Scores:

- non_existent_function: tool name not found.
- non_existent_parameter: argument name not in tool spec.
- incorrect_parameter_type: argument type mismatch.
- missing_required_parameter: required argument missing.
- allowed_values_violation: argument value outside allowed set.
- json_schema_violation: call violates JSON schema.
- empty_api_spec: no tool spec provided.
- invalid_api_spec: tool spec is invalid.
- invalid_tool_call: call is not a valid tool invocation.
- overall_valid: validity of the call (main score).
- score: alias of overall_valid.
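The relationship between the per-check scores, overall_valid, and the dataset-level score can be sketched as follows. This is an illustration under stated assumptions, not the LLMEvalKit pipeline: the helpers `with_overall_valid` and `mean_reduce` are hypothetical, and the per-instance check results are supplied by hand rather than computed from a tool call.

```python
from typing import Dict, List

# Check names taken from the score list above.
CHECKS = [
    "non_existent_function",
    "non_existent_parameter",
    "incorrect_parameter_type",
    "missing_required_parameter",
    "allowed_values_violation",
    "json_schema_violation",
    "empty_api_spec",
    "invalid_api_spec",
    "invalid_tool_call",
]


def with_overall_valid(check_scores: Dict[str, float]) -> Dict[str, float]:
    """Add overall_valid (1.0 only if every check passed) and its alias, score."""
    scores = dict(check_scores)
    all_passed = all(scores[name] == 1.0 for name in CHECKS)
    scores["overall_valid"] = 1.0 if all_passed else 0.0
    scores["score"] = scores["overall_valid"]
    return scores


def mean_reduce(instances: List[Dict[str, float]]) -> Dict[str, float]:
    """Average every score across instances, as a mean reduction would."""
    return {
        key: sum(instance[key] for instance in instances) / len(instances)
        for key in instances[0]
    }


# Two hypothetical instances: one fully valid, one missing a required argument.
valid_call = {name: 1.0 for name in CHECKS}
broken_call = dict(valid_call, missing_required_parameter=0.0)
aggregated = mean_reduce([with_overall_valid(valid_call), with_overall_valid(broken_call)])
print(aggregated["overall_valid"])
```

With one valid and one invalid instance, the aggregated overall_valid is 0.5, which is the fraction of valid instances.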

Reference: https://github.ibm.com/MLT/LLMEvalKit

prediction_type[source]¶: alias of ToolCall

reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.ReflectionToolCallingMixin[source]¶

Bases: object

static convert_tool_call(prediction: ToolCall)[source]¶

static convert_tools_inventory(tools)[source]¶

class unitxt.metrics.RegardMetric(data_classification_policy: List[str] = None, main_score: str = 'regard', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['transformers', 'torch', 'tqdm'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, model_name: str = 'sasha/regardv3', batch_size: int = 32)[source]¶

Bases: GlobalMetric

prediction_type: Type | str = typing.Any¶

class unitxt.metrics.RelaxedCorrectness(data_classification_policy: List[str] = None, main_score: str = 'relaxed_overall', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric

prediction_type¶: alias of str

relaxed_correctness(prediction, target, max_relative_change: float = 0.05) → bool[source]¶

Calculates relaxed correctness.

The correctness tolerates a certain error ratio defined by max_relative_change. See https://arxiv.org/pdf/2203.10244.pdf, end of section 5.1: “Following Methani et al. (2020), we use a relaxed accuracy measure for the numeric answers to allow a minor inaccuracy that may result from the automatic data extraction process. We consider an answer to be correct if it is within 5% of the gold answer. For non-numeric answers, we still need an exact match to consider an answer to be correct.”

This function is taken from https://github.com/QwenLM/Qwen-VL/blob/34b4c0ee7b07726371b960911f249fe61b362ca3/eval_mm/evaluate_vqa.py#L113

Parameters:

target – List of target string.
prediction – List of predicted string.
max_relative_change – Maximum relative change.

Returns:

Whether the prediction was correct given the specified tolerance.
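The rule quoted above (numeric answers within a relative tolerance, exact match otherwise) can be sketched as a standalone function. The name `relaxed_match`, the percent handling, and the case-insensitive string comparison are assumptions made for this sketch; it takes one prediction string and one target string and is not a copy of the method's source.

```python
from typing import Optional


def _to_float(text: str) -> Optional[float]:
    """Parse a number, treating a trailing % as a percentage; None if not numeric."""
    try:
        if text.endswith("%"):
            return float(text.rstrip("%")) / 100.0
        return float(text)
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, max_relative_change: float = 0.05) -> bool:
    """True if a numeric prediction is within the tolerance, or strings match exactly."""
    prediction_value = _to_float(prediction)
    target_value = _to_float(target)
    # A zero target has no defined relative change, so it falls back to string match.
    if prediction_value is not None and target_value:
        relative_change = abs(prediction_value - target_value) / abs(target_value)
        return relative_change <= max_relative_change
    return prediction.lower() == target.lower()


print(relaxed_match("104", "100"))
print(relaxed_match("106", "100"))
print(relaxed_match("Yes", "yes"))
```

With the default tolerance of 0.05, a prediction of 104 against a target of 100 is accepted (4% off) and 106 is rejected (6% off).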

class unitxt.metrics.RemoteMetric(data_classification_policy: List[str] = ['public', 'proprietary'], main_score: str = None, prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, endpoint: str = __required__, metric_name: str = __required__, api_key: str = None)[source]¶

Bases: StreamOperator, Metric

A metric that runs another metric remotely.

main_score: the score updated by this metric.
endpoint: the remote host that supports the remote metric execution.
metric_name: the name of the metric that is executed remotely.
api_key: optional, passed to the remote metric with the input, allows secure authentication.

data_classification_policy: List[str] = ['public', 'proprietary']¶

set_confidence_interval_calculation(return_confidence_interval: bool)[source]¶

Confidence intervals are always disabled for RemoteMetric.

No need to do anything.

set_n_resamples(n_resample)[source]¶: Since confidence intervals are always disabled for remote metrics, this is a no-op.

static wrap_inner_metric_pipeline_metric(metric_pipeline: MetricPipeline, remote_metrics_endpoint: str) → MetricPipeline[source]¶

Wrap the inner metric in a MetricPipeline with a RemoteMetric.

When executing the returned MetricPipeline, the inner metric will be computed remotely (pre and post processing steps in the MetricPipeline will be computed locally).

class unitxt.metrics.RerankRecall(data_classification_policy: List[str] = None, main_score: str = 'recall_at_5', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = None, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['pandas', 'pytrec_eval'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, query_id_field: str = 'query_id', passage_id_field: str = 'passage_id', at_k: List[int] = [1, 2, 5])[source]¶

Bases: GlobalMetric

RerankRecall: measures the quality of reranking with respect to ground truth ranking scores.

Range: [0, 1] (higher is better)

This metric measures ranking performance across a dataset. The references for a query will have a score of 1 for the gold passage and 0 for all other passages. The model returns scores in [0, 1] for each (passage, query) pair. This metric measures recall at k by testing that the predicted score for the gold (passage, query) pair is at least the k’th highest for all passages for that query. A query receives 1 if so, and 0 if not. The 1’s and 0’s are averaged across the dataset.

query_id_field selects the field containing the query id for an instance.
passage_id_field selects the field containing the passage id for an instance.
at_k selects the values of k used to compute recall.
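The recall-at-k procedure described above can be sketched in a few lines. This is an illustrative stand-in, not the class's code (which lists pandas and pytrec_eval as requirements): the function name `recall_at_k`, the tuple layout of the rows, and the sample data are hypothetical, and ties between predicted scores are broken by input order.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each row: (query_id, passage_id, reference_score, predicted_score).
Row = Tuple[str, str, int, float]


def recall_at_k(rows: List[Row], k: int) -> float:
    """Fraction of queries whose gold passage is among the k highest predicted scores."""
    by_query: Dict[str, List[Tuple[float, int]]] = defaultdict(list)
    for query_id, _passage_id, reference, predicted in rows:
        by_query[query_id].append((predicted, reference))

    hits = []
    for passages in by_query.values():
        top_k = sorted(passages, key=lambda pair: pair[0], reverse=True)[:k]
        hits.append(1.0 if any(reference == 1 for _, reference in top_k) else 0.0)
    return sum(hits) / len(hits)


rows = [
    ("q1", "p1", 1, 0.9), ("q1", "p2", 0, 0.5), ("q1", "p3", 0, 0.1),
    ("q2", "p1", 0, 0.7), ("q2", "p2", 1, 0.6), ("q2", "p3", 0, 0.2),
]
for k in (1, 2, 5):
    print(k, recall_at_k(rows, k))
```

In this data the gold passage is ranked first for q1 and second for q2, so recall is 0.5 at k=1 and 1.0 at k=2 and k=5.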

Reference: https://en.wikipedia.org/wiki/Information_retrieval#Recall

at_k: List[int] = [1, 2, 5]¶

class unitxt.metrics.RetrievalAtK(data_classification_policy: List[str] = None, main_score: str = None, prediction_type: Any | str = typing.Union[typing.List[str], typing.List[int]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = None, reference_field: str = 'references', prediction_field: str = 'prediction', k_list: List[int] = __required__)[source]¶: Bases: RetrievalMetric

class unitxt.metrics.RetrievalMetric(data_classification_policy: List[str] = None, main_score: str = <class 'unitxt.dataclass.Undefined'>, prediction_type: Union[Any, str] = typing.Union[typing.List[str], typing.List[int]], single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = <class 'unitxt.dataclass.Undefined'>, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

prediction_type¶: alias of Union[List[str], List[int]]

class unitxt.metrics.Reward(data_classification_policy: List[str] = None, device: str | NoneType = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'reward_score', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['transformers'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = __required__, batch_size: int = 32)[source]¶: Bases: MapReduceMetric[str, float], TorchDeviceMixin

class unitxt.metrics.RiskType(value)[source]¶

Bases: str, Enum

Risk type for the Granite Guardian models.

class unitxt.metrics.RocAuc(data_classification_policy: List[str] = None, main_score: str = 'roc_auc', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['scikit-learn'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric

Computes Area Under the ROC Curve for binary classification.

Range: [0, 1] (higher is better) Measures discriminative ability across all classification thresholds.

Reference: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

prediction_type¶: alias of float

class unitxt.metrics.RootMeanReduction(data_classification_policy: List[str] = None)[source]¶: Bases: DictReduction

class unitxt.metrics.RootMeanSquaredError(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'root_mean_squared_error', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = True, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: MeanSquaredError

Computes root mean squared error between predictions and references.

Range: [0, ∞) (lower is better) Square root of mean squared error, same units as original values.
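For reference, the quantity itself can be written as a short sketch in plain Python. The function name `root_mean_squared_error` and the sample values are hypothetical, and the sketch assumes one reference value per prediction.

```python
import math
from typing import Sequence


def root_mean_squared_error(predictions: Sequence[float], references: Sequence[float]) -> float:
    """Square root of the mean of squared differences between paired values."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predictions, references)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 1 and 7 give a mean squared error of 25 and an RMSE of 5.
print(root_mean_squared_error([1.0, 10.0], [2.0, 3.0]))
```

Because of the square root, the result is expressed in the same units as the predictions, unlike the mean squared error.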

class unitxt.metrics.Rouge(data_classification_policy: List[str] = None, main_score: str = 'rougeL', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], ci_method: str = 'BCa', _requirements_list: List[str] = ['nltk', 'rouge_score'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']}, reference_field: str = 'references', prediction_field: str = 'prediction', rouge_types: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], sent_split_newline: bool = True)[source]¶

Bases: InstanceMetric, NLTKMixin

Computes ROUGE scores for text summarization evaluation.

Range: [0, 1] (higher is better) Measures n-gram overlap between prediction and reference texts.

Reference: https://en.wikipedia.org/wiki/ROUGE_(metric)

ci_scores: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']}¶

rouge_types: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶

class unitxt.metrics.RougeHF(data_classification_policy: List[str] = None, main_score: str = 'rougeL', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], ci_method: str = 'BCa', _requirements_list: List[str] = ['nltk', 'rouge_score'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']}, reference_field: str = 'references', prediction_field: str = 'prediction', hf_metric_name: str = 'rouge', hf_metric_fields: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], hf_compute_args: dict = {}, rouge_types: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], sent_split_newline: bool = True)[source]¶

Bases: NLTKMixin, HuggingfaceInstanceMetric

HuggingFace implementation of ROUGE metrics for text evaluation.

Range: [0, 1] (higher is better) Uses HuggingFace’s ROUGE implementation for n-gram overlap scoring.

Reference: https://en.wikipedia.org/wiki/ROUGE_(metric)

ci_scores: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶

hf_metric_fields: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']}¶

rouge_types: List[str] = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']¶

class unitxt.metrics.SQLExecutionAccuracy(data_classification_policy: List[str] = None, main_score: str = 'non_empty_execution_accuracy', prediction_type: Any | str = 'Any', single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'gold_sql_runtime', 'predicted_sql_runtime'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = ['sqlglot', 'func_timeout'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'non_empty_gold_df', 'gold_sql_runtime', 'predicted_sql_runtime', 'pred_to_gold_runtime_ratio', 'gold_error', 'predicted_error']}, reference_field: str = 'references', prediction_field: str = 'prediction', sql_timeout: float = 60.0)[source]¶

Bases: InstanceMetric

all_metrics = ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'non_empty_gold_df', 'gold_sql_runtime', 'predicted_sql_runtime', 'pred_to_gold_runtime_ratio', 'gold_error', 'predicted_error']¶

ci_scores: List[str] = ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'gold_sql_runtime', 'predicted_sql_runtime']¶

reduction_map: Dict[str, List[str]] = {'mean': ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'non_empty_gold_df', 'gold_sql_runtime', 'predicted_sql_runtime', 'pred_to_gold_runtime_ratio', 'gold_error', 'predicted_error']}¶

class unitxt.metrics.SQLExecutionLogicAccuracy(data_classification_policy: List[str] = None, main_score: str = 'non_empty_execution_accuracy', prediction_type: Any | str = 'Any', single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'gold_sql_runtime', 'predicted_sql_runtime'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = ['sqlglot', 'func_timeout'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'non_empty_gold_df', 'gold_sql_runtime', 'predicted_sql_runtime', 'pred_to_gold_runtime_ratio', 'gold_error', 'predicted_error']}, reference_field: str = 'references', prediction_field: str = 'prediction', sql_timeout: float = 60.0)[source]¶

Bases: InstanceMetric

all_metrics = ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'non_empty_gold_df', 'gold_sql_runtime', 'predicted_sql_runtime', 'pred_to_gold_runtime_ratio', 'gold_error', 'predicted_error']¶

ci_scores: List[str] = ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'gold_sql_runtime', 'predicted_sql_runtime']¶

reduction_map: Dict[str, List[str]] = {'mean': ['execution_accuracy', 'non_empty_execution_accuracy', 'subset_non_empty_execution_accuracy', 'execution_accuracy_bird', 'non_empty_gold_df', 'gold_sql_runtime', 'predicted_sql_runtime', 'pred_to_gold_runtime_ratio', 'gold_error', 'predicted_error']}¶

class unitxt.metrics.SQLNonExecutionAccuracy(data_classification_policy: List[str] = None, main_score: str = 'sqlglot_equivalence', prediction_type: Any | str = 'Any', single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match', 'sql_syntactic_equivalence'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = ['sqlglot', 'sqlparse'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match', 'sql_syntactic_equivalence']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

all_metrics = ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match', 'sql_syntactic_equivalence']¶

ci_scores: List[str] = ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match', 'sql_syntactic_equivalence']¶

reduction_map: Dict[str, List[str]] = {'mean': ['sqlglot_validity', 'sqlparse_validity', 'sqlglot_equivalence', 'sqlglot_optimized_equivalence', 'sqlparse_equivalence', 'sql_exact_match', 'sql_syntactic_equivalence']}¶

class unitxt.metrics.SacreBleuStats(counts: List[int], totals: List[int], sys_len: int, ref_len: int)[source]¶: Bases: object

class unitxt.metrics.SafetyMetric(data_classification_policy: List[str] = None, device: Union[str, NoneType] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = ['safety'], return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'safety', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['transformers', 'torch'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reward_name: str = 'OpenAssistant/reward-model-deberta-v3-large-v2', batch_size: int = 10, critical_threshold: int = -5, high_threshold: int = -4, medium_threshold: int = -3)[source]¶

Bases: MapReduceMetric[str, Tuple[float, str]], TorchDeviceMixin

The Safety Metric from the paper Unveiling Safety Vulnerabilities of Large Language Models.

As detailed in the paper, automatically evaluating the potential harm caused by LLMs requires a harmlessness metric. The model under test is prompted with each question in the dataset, and the corresponding responses are evaluated using a metric that considers both the input and the output. The paper uses the “OpenAssistant/reward-model-deberta-v3-large-v2” reward model, though other models such as “sileod/deberta-v3-large-tasksource-rlhf-reward-model” can also be employed.

ci_score_names: List[str] = ['safety']¶

prediction_type¶: alias of str

class unitxt.metrics.SentenceBert(data_classification_policy: List[str] = None, device: str | NoneType = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'sbert_score', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] = ['sentence_transformers'], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', model_name: str = __required__, batch_size: int = 32)[source]¶

Bases: MapReduceMetric[str, float], TorchDeviceMixin

Computes semantic similarity using Sentence-BERT embeddings.

Range: [-1, 1] (higher is better) Measures cosine similarity between sentence-level embeddings.
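The cosine similarity mentioned above can be sketched as follows. This is only the final comparison step on two already-computed embedding vectors; the function name is hypothetical, and in practice the vectors would come from the Sentence-BERT model named by model_name rather than being written by hand.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```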

class unitxt.metrics.SequentialSuccess(data_classification_policy: List[str] = None, threshold: float = 0.5)[source]¶: Bases: GroupReduction

class unitxt.metrics.Spearmanr(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = ['spearmanr'], return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'spearmanr', prediction_type: Union[Any, str] = <class 'float'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = ['scipy'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: CorrelationMetric

Computes Spearman rank correlation coefficient.

Range: [-1, 1] (higher absolute value is better) Measures monotonic relationship between predictions and references.

Reference: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
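To make the notion of a monotonic relationship concrete, here is a small dependency-free sketch of the Spearman coefficient: rank both sequences (ties get the average rank) and take the Pearson correlation of the ranks. The helper names are hypothetical, and the class itself lists scipy as its requirement rather than code like this.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks of x and y (NaN if either is constant)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)
```

Since only ranks matter, any strictly increasing transformation of the predictions (for example squaring positive values) leaves the coefficient at 1.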

ci_score_names: List[str] = ['spearmanr']¶

class unitxt.metrics.Squad(data_classification_policy: List[str] = None, main_score: str = 'f1', prediction_type: Any | str = typing.Dict[str, typing.Any], single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = 'squad', hf_main_score: str = None, scale: float = 100.0, scaled_fields: list = ['f1', 'exact_match'], hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [])[source]¶

Bases: HuggingfaceMetric

Stanford Question Answering Dataset (SQuAD) evaluation metric.

Range: [0, 100] (higher is better) Computes F1 score and exact match for question answering tasks.

Reference: https://arxiv.org/abs/1606.05250

prediction_type¶: alias of Dict[str, Any]

scaled_fields: list = ['f1', 'exact_match']¶

class unitxt.metrics.Statistic(data, score_names, scorer)[source]¶

Bases: object

Statistic for which the confidence interval is to be calculated.

statistic must be a callable that accepts len(data) samples as separate arguments and returns the resulting statistic. If vectorized is set True, statistic must also accept a keyword argument axis and be vectorized to compute the statistic along the provided axis.

mean(idx)[source]¶

std(idx)[source]¶

class unitxt.metrics.StringContainment(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'string_containment', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[IntermediateType] = None)[source]¶

Bases: ReductionInstanceMetric[str, Dict[str, float]]

Checks if any reference string is contained within the prediction.

Range: [0, 1] (higher is better) Returns 1.0 if any reference appears as substring in prediction.
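The check described above can be sketched in a few lines. The function name is hypothetical, and the case-sensitive substring match is an assumption made for illustration.

```python
from typing import List


def string_containment(prediction: str, references: List[str]) -> float:
    """1.0 if any reference occurs as a substring of the prediction, else 0.0."""
    # Assumption: matching is case-sensitive and done on the raw strings.
    return float(any(reference in prediction for reference in references))
```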

prediction_type: Type | str = typing.Any¶

reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.StringContainmentOld(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['string_containment']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

ci_scores: List[str] = ['string_containment']¶

prediction_type: Type | str = typing.Any¶

reduction_map: Dict[str, List[str]] = {'mean': ['string_containment']}¶

class unitxt.metrics.StringContainmentRatio(data_classification_policy: List[str] = None, main_score: str = 'string_containment', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['string_containment'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['string_containment']}, reference_field: str = 'references', prediction_field: str = 'prediction', field: str = None)[source]¶

Bases: InstanceMetric

Metric that returns the ratio of values from a specific field contained in the prediction.

field¶

The field from the task_data that contains the values to be checked for containment.

Type: str

Example task that contains this metric:

Task(
    input_fields={"question": str},
    reference_fields={"entities": str},
    prediction_type=str,
    metrics=["string_containment_ratio[field=entities]"],
)
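The ratio this metric reports might be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, the values are taken to be a list of strings read from the configured field, and matching is assumed to be a case-sensitive substring test.

```python
from typing import List


def string_containment_ratio(prediction: str, values: List[str]) -> float:
    """Fraction of the given values that occur as substrings of the prediction."""
    if not values:
        raise ValueError("at least one value is needed to compute a ratio")
    contained = sum(1 for value in values if value in prediction)
    return contained / len(values)
```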

ci_scores: List[str] = ['string_containment']¶

prediction_type: Type | str = typing.Any¶

reduction_map: Dict[str, List[str]] = {'mean': ['string_containment']}¶

class unitxt.metrics.TokenOverlap(data_classification_policy: List[str] = None, main_score: str = 'f1', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['f1', 'precision', 'recall'], ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['f1', 'precision', 'recall']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

Computes token-level overlap F1, precision, and recall between texts.

Range: [0, 1] (higher is better) Splits texts into tokens and measures set-based overlap metrics.
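A minimal sketch of token-overlap scoring, in the style commonly used for SQuAD F1, is shown below. The function name, whitespace tokenization, and the multiset (Counter) intersection that counts repeated tokens are all assumptions for illustration; the library may tokenize or count differently.

```python
from collections import Counter
from typing import Dict


def token_overlap(prediction: str, reference: str) -> Dict[str, float]:
    """Precision, recall, and F1 over whitespace tokens shared by both texts."""
    prediction_tokens = prediction.split()
    reference_tokens = reference.split()
    # Multiset intersection: a token counts as often as it appears in both texts.
    common = Counter(prediction_tokens) & Counter(reference_tokens)
    num_common = sum(common.values())
    if num_common == 0:
        return {"f1": 0.0, "precision": 0.0, "recall": 0.0}
    precision = num_common / len(prediction_tokens)
    recall = num_common / len(reference_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"f1": f1, "precision": precision, "recall": recall}
```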

ci_scores: List[str] = ['f1', 'precision', 'recall']¶

prediction_type¶: alias of str

reduction_map: Dict[str, List[str]] = {'mean': ['f1', 'precision', 'recall']}¶

class unitxt.metrics.ToolCallKeyValueExtraction(data_classification_policy: List[str] = None, main_score: str = '', prediction_type: Union[Any, str] = <class 'unitxt.types.ToolCall'>, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: Union[List[str], Dict[str, str]] = [], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, metric: unitxt.base_metric.Metric = __required__)[source]¶

Bases: KeyValueExtraction

Metrics that formulate ToolCall evaluation as a Key Value Extraction task.

Each argument and each nested value is first flattened to a key-value pair.

{ arguments : { “name” : “John”, “address” : { “street” : “Main St”, “City” : “Smallville” } } }

becomes

argument.name = “John”
argument.address.street = “Main St”
argument.address.City = “Smallville”

Note that by default, if a parameter is a list of dictionaries, they are flattened with indexes:

{ arguments : { “addresses” : [ { “street” : “Main St”, “City” : “Smallville” }, { “street” : “Log St”, “City” : “BigCity” } ] } }

becomes

argument.addresses.0.street = “Main St”
argument.addresses.0.City = “Smallville”
argument.addresses.1.street = “Log St”
argument.addresses.1.City = “BigCity”

But if each dictionary in the list has a single unique key, that key is used instead of the index:

{ arguments : { “addresses” : [ { “home” : { “street” : “Main St”, “City” : “Smallville” } }, { “work” : { “street” : “Log St”, “City” : “BigCity” } } ] } }

becomes

argument.addresses.home.street = “Main St”
argument.addresses.home.City = “Smallville”
argument.addresses.work.street = “Log St”
argument.addresses.work.City = “BigCity”
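The flattening rules above can be sketched as a short recursive function. The function name, the dot separator, and the way the root prefix is passed in are assumptions for illustration, not the library's actual implementation.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str) -> Dict[str, Any]:
    """Flatten nested dicts and lists into dotted key paths under `prefix`."""
    flat: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}.{key}"))
    elif isinstance(value, list):
        single_key_dicts = all(isinstance(item, dict) and len(item) == 1 for item in value)
        names = [next(iter(item)) for item in value] if single_key_dicts else []
        if value and single_key_dicts and len(set(names)) == len(names):
            # Every item is a one-key dict with a distinct key: use the key, not the index.
            for item in value:
                name, child = next(iter(item.items()))
                flat.update(flatten(child, f"{prefix}.{name}"))
        else:
            for index, item in enumerate(value):
                flat.update(flatten(item, f"{prefix}.{index}"))
    else:
        flat[prefix] = value
    return flat
```

Once both the prediction and the references are flattened this way, comparing tool calls reduces to comparing two flat key-value mappings.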

prediction_type[source]¶: alias of ToolCall

class unitxt.metrics.ToolCallingMetric(data_classification_policy: List[str] = None, n_resamples: int = 1000, confidence_level: float = 0.95, ci_score_names: List[str] = None, return_confidence_interval: bool = True, ci_method: str = 'BCa', ci_paired: bool = True, main_score: str = 'exact_match', prediction_type: Union[Any, str] = <class 'unitxt.types.ToolCall'>, single_reference_per_prediction: bool = False, score_prefix: str = '', _requirements_list: Union[List[str], Dict[str, str]] = ['jsonschema-rs'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, reference_field: str = 'references', prediction_field: str = 'prediction', reduction: unitxt.metrics.AggregationReduction[~IntermediateType] = None)[source]¶

Bases: ReductionInstanceMetric[str, Dict[str, float]]

Compares each predicted tool call with a list of reference tool calls.

prediction_type[source]¶: alias of ToolCall

reduction: AggregationReduction[IntermediateType] = MeanReduction(__type__='mean_reduction', __title__=None, __description__=None, __tags__={}, __deprecated_msg__=None, data_classification_policy=None)¶

class unitxt.metrics.UnsortedListExactMatch(data_classification_policy: List[str] = None, main_score: str = 'unsorted_list_exact_match', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 1000, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = ['unsorted_list_exact_match'], ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, implemented_reductions: List[str] = ['mean', 'group_mean', 'max'], reduction_map: Dict[str, List[str]] = {'mean': ['unsorted_list_exact_match']}, reference_field: str = 'references', prediction_field: str = 'prediction')[source]¶

Bases: InstanceMetric

Measures exact match between prediction and reference lists, ignoring order.

Range: [0, 1] (higher is better) Returns 1.0 if sorted prediction equals sorted reference, 0.0 otherwise.
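The comparison described above amounts to sorting both lists before testing equality, as in this minimal sketch (the function name is hypothetical, and the elements are assumed to be mutually comparable, for example all strings).

```python
from typing import List


def unsorted_list_exact_match(prediction: List[str], reference: List[str]) -> float:
    """1.0 if both lists hold the same items with the same multiplicities."""
    return float(sorted(prediction) == sorted(reference))
```

Sorting, rather than converting to sets, means duplicates still matter: a prediction with one "a" does not match a reference with two.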

ci_scores: List[str] = ['unsorted_list_exact_match']¶

reduction_map: Dict[str, List[str]] = {'mean': ['unsorted_list_exact_match']}¶

class unitxt.metrics.UpdateStream(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, update: dict = __required__)[source]¶: Bases: InstanceOperator

class unitxt.metrics.WebsrcSquadF1(data_classification_policy: List[str] = None, main_score: str = 'websrc_squad_f1', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶

Bases: GlobalMetric

DOMAINS = ['auto', 'book', 'camera', 'game', 'jobs', 'movie', 'phone', 'restaurant', 'sports', 'university', 'hotel']¶

compute(references: List[List[str]], predictions: List[str], task_data: List[Dict]) → dict[source]¶: WebSRC SQuAD-style F1 metric.

prediction_type: Type | str = typing.Any¶

class unitxt.metrics.WeightedWinRateCorrelation(data_classification_policy: List[str] = None, main_score: str = 'spearman_corr', prediction_type: Any | str = typing.Any, single_reference_per_prediction: bool = False, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None)[source]¶: Bases: GlobalMetric

class unitxt.metrics.Wer(data_classification_policy: List[str] = None, main_score: str = 'wer', prediction_type: Union[Any, str] = <class 'str'>, single_reference_per_prediction: bool = True, score_prefix: str = '', n_resamples: int = 100, confidence_interval_calculation: bool = True, confidence_level: float = 0.95, ci_scores: List[str] = None, ci_method: str = 'BCa', _requirements_list: List[str] = ['jiwer'], requirements: Union[List[str], Dict[str, str]] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, hf_metric_name: str = 'wer', hf_main_score: str = None, scale: float = 1.0, scaled_fields: list = None, hf_compute_args: Dict[str, Any] = {}, hf_additional_input_fields: List = [], hf_additional_input_fields_pass_one_value: List = [])[source]¶

Bases: HuggingfaceMetric

Word Error Rate for speech recognition and text comparison.

Range: [0, ∞) (lower is better) Measures word-level edits normalized by reference length.

Reference: https://en.wikipedia.org/wiki/Word_error_rate
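For intuition, word error rate can be sketched as a word-level Levenshtein distance divided by the number of reference words. The function below is a hypothetical stand-alone illustration with whitespace tokenization; the class itself delegates to the HuggingFace 'wer' metric and lists jiwer as its requirement.

```python
def word_error_rate(prediction: str, reference: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    hypothesis_words = prediction.split()
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference prefix processed so far.
    previous = list(range(len(hypothesis_words) + 1))
    for i, reference_word in enumerate(reference_words, start=1):
        current = [i]
        for j, hypothesis_word in enumerate(hypothesis_words, start=1):
            substitution = previous[j - 1] + (reference_word != hypothesis_word)
            deletion = previous[j] + 1  # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference_words)
```

The last assertion-style case worth noting is a prediction much longer than the reference: insertions are unbounded, which is why the range has no upper limit.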

prediction_type¶: alias of str

unitxt.metrics.abstract_factory()[source]¶

unitxt.metrics.abstract_field()[source]¶

unitxt.metrics.get_index_or_default(lst, item, default=-1)[source]¶

unitxt.metrics.hf_evaluate_load(path: str, *args, **kwargs)[source]¶

unitxt.metrics.interpret_effect_size(x: float)[source]¶

Return a string rule-of-thumb interpretation of an effect size value, as defined by Cohen/Sawilowsky.

See “Effect size”; Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences; and Sawilowsky, S (2009). “New effect size rules of thumb”. Journal of Modern Applied Statistical Methods. 8 (2): 467-474.

The value is interpreted as:

- essentially 0 if |x| < 0.01
- very small if 0.01 <= |x| < 0.2
- small difference if 0.2 <= |x| < 0.5
- a medium difference if 0.5 <= |x| < 0.8
- a large difference if 0.8 <= |x| < 1.2
- a very large difference if 1.2 <= |x| < 2.0
- a huge difference if 2.0 <= |x|

Parameters: x – float effect size value
Returns: string interpretation
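The threshold table above can be sketched as a simple lookup. The function name is hypothetical, and the label strings are copied from the list above; the exact strings the library returns are an assumption.

```python
def interpret_effect_size_sketch(x: float) -> str:
    """Map |x| to a rule-of-thumb label using the thresholds listed above."""
    upper_bounds = [
        (0.01, "essentially 0"),
        (0.2, "very small"),
        (0.5, "small difference"),
        (0.8, "a medium difference"),
        (1.2, "a large difference"),
        (2.0, "a very large difference"),
    ]
    magnitude = abs(x)
    for upper_bound, label in upper_bounds:
        if magnitude < upper_bound:
            return label
    return "a huge difference"
```

Only the magnitude is interpreted, so a negative effect size receives the same label as its positive counterpart.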

unitxt.metrics.is_original_key(key)[source]¶

unitxt.metrics.mean_subgroup_score(subgroup_scores_dict: Dict[str, List], subgroup_types: List[str])[source]¶

Return the mean instance score for a subset (possibly a single type) of variants (not a comparison).

Parameters:

subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
subgroup_types – the keys (subgroup types) for which the average will be computed.

Returns:

float score

unitxt.metrics.nan_max(x)[source]¶

unitxt.metrics.nan_mean(x)[source]¶

unitxt.metrics.nan_std(x)[source]¶

unitxt.metrics.new_random_generator()[source]¶

unitxt.metrics.normalize_answer(s)[source]¶: Lowercase text and remove punctuation, articles, and extra whitespace.
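A normalization of this kind is commonly written as below, following the well-known SQuAD evaluation recipe. The function name is hypothetical, and the order of the steps and the article list (a, an, the) are assumptions rather than a copy of the library code.

```python
import re
import string


def normalize_answer_sketch(s: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    lowered = s.lower()
    punctuation = set(string.punctuation)
    without_punctuation = "".join(ch for ch in lowered if ch not in punctuation)
    without_articles = re.sub(r"\b(a|an|the)\b", " ", without_punctuation)
    return " ".join(without_articles.split())
```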

unitxt.metrics.normalized_cohens_h(subgroup_scores_dict: Dict[str, List], control_subgroup_types: List[str], comparison_subgroup_types: List[str], interpret=False)[source]¶

Cohen’s h effect size between two proportions, normalized to interval [-1,1].

Allows for a change-type metric when the baseline is 0 (where percentage change, and thus PDR, is undefined).

The Cohen’s h effect size between two proportions p1 and p2 is 2 * (arcsin(sqrt(p2)) - arcsin(sqrt(p1))). h lies in [-pi, pi], with +/-pi representing the largest increase (p1=0, p2=1) or decrease (p1=1, p2=0); h=0 means no change. Unlike percentage change, h is defined even if the baseline (p1) is 0. The scores are assumed to be in [0,1], either continuous or binary, so the average of a group of scores yields a proportion. The function calculates the change in the average of the comparison scores relative to the average of the control (baseline) scores, and rescales it from [-pi,pi] to [-1,1] for clarity, where +/-1 are the most extreme changes and 0 is no change.

Interpretation: the original unscaled Cohen’s h can be interpreted according to the function interpret_effect_size.

Thus, the rule for interpreting the normalized value is to use the same thresholds divided by pi:

- essentially 0 if |norm h| < 0.0031831
- very small if 0.0031831 <= |norm h| < 0.06366198
- small difference if 0.06366198 <= |norm h| < 0.15915494
- a medium difference if 0.15915494 <= |norm h| < 0.25464791
- a large difference if 0.25464791 <= |norm h| < 0.38197186
- a very large difference if 0.38197186 <= |norm h| < 0.63661977
- a huge difference if 0.63661977 <= |norm h|

Parameters:

subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
control_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the control (baseline) group
comparison_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the group to be compared to the control group.
interpret – boolean, whether to interpret the significance of the score or not

Returns:

float score between -1 and 1, and a string interpretation if interpret=True
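The formula in the description can be sketched directly on two lists of scores. The function below is a hypothetical simplification: it takes the control and comparison scores as plain lists instead of a subgroup dictionary, and it omits the optional interpretation string.

```python
import math
from typing import Sequence


def normalized_cohens_h_sketch(control_scores: Sequence[float], comparison_scores: Sequence[float]) -> float:
    """Cohen's h between the two mean scores, rescaled from [-pi, pi] to [-1, 1]."""
    if not control_scores or not comparison_scores:
        raise ValueError("both groups need at least one score")
    p1 = sum(control_scores) / len(control_scores)  # baseline proportion
    p2 = sum(comparison_scores) / len(comparison_scores)
    if not (0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0):
        raise ValueError("scores are assumed to lie in [0, 1]")
    h = 2 * (math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))
    return h / math.pi
```

Note that a baseline of all zeros poses no problem here, which is the case where a percentage change would be undefined.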

unitxt.metrics.normalized_hedges_g(subgroup_scores_dict: Dict[str, List[float]], control_subgroup_types: List[str], comparison_subgroup_types: List[str], interpret=False)[source]¶

Hedges’ g effect size between the means of two samples, normalized to the interval [-1,1]. Better than Cohen’s d for small sample sizes.

Takes into account the variances within the samples, not just the means.

Parameters:

subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
control_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the control (baseline) group
comparison_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the group to be compared to the control group.
interpret – boolean, whether to interpret the significance of the score or not

Returns:

float score between -1 and 1, and a string interpretation if interpret=True
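For reference, the textbook (unnormalized) Hedges' g can be sketched as Cohen's d with a pooled standard deviation, multiplied by the usual small-sample correction factor. The function name is hypothetical, the sign convention (comparison minus control) is an assumption, and the rescaling to [-1,1] is deliberately left out because the description above does not say how it is performed.

```python
import math
from statistics import mean, variance
from typing import Sequence


def hedges_g_sketch(control_scores: Sequence[float], comparison_scores: Sequence[float]) -> float:
    """Unnormalized Hedges' g: pooled-variance Cohen's d with small-sample correction."""
    n1, n2 = len(control_scores), len(comparison_scores)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two scores to estimate a variance")
    pooled_variance = (
        (n1 - 1) * variance(control_scores) + (n2 - 1) * variance(comparison_scores)
    ) / (n1 + n2 - 2)
    if pooled_variance == 0:
        raise ValueError("effect size is undefined when both groups have zero variance")
    cohens_d = (mean(comparison_scores) - mean(control_scores)) / math.sqrt(pooled_variance)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # common approximation of the bias correction
    return cohens_d * correction
```

The correction factor approaches 1 as the samples grow, so g and d differ mainly for small groups.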

unitxt.metrics.parse_string_types_instead_of_actual_objects(obj)[source]¶

unitxt.metrics.performance_drop_rate(subgroup_scores_dict: Dict[str, List], control_subgroup_types: List[str], comparison_subgroup_types: List[str])[source]¶

Percentage decrease of mean performance on test elements relative to that on a baseline (control).

From https://arxiv.org/pdf/2306.04528.pdf.

Parameters:

subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
control_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the control (baseline) group
comparison_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the group to be compared to the control group.

Returns:

Numeric PDR metric. Returns NaN if there is only one element (no test set) or if the first (control) mean is 0, in which case the percentage change is undefined; otherwise returns the calculated PDR.
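A performance drop rate of this kind is usually computed as one minus the ratio of the comparison mean to the control mean. The sketch below is hypothetical: it takes two plain lists instead of a subgroup dictionary, and it returns the drop as a fraction (0.5 for a 50% drop), which is an assumption about scaling.

```python
from statistics import mean
from typing import Sequence


def performance_drop_rate_sketch(control_scores: Sequence[float], comparison_scores: Sequence[float]) -> float:
    """1 - mean(comparison) / mean(control); NaN when the ratio is undefined."""
    if not control_scores or not comparison_scores:
        return float("nan")  # one of the groups is missing
    control_mean = mean(control_scores)
    if control_mean == 0:
        return float("nan")  # relative change from a zero baseline is undefined
    return 1 - mean(comparison_scores) / control_mean
```

A negative result means the comparison group scored higher than the control, that is, performance improved rather than dropped.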

unitxt.metrics.pytrec_eval_at_k(results, qrels, at_k, metric_name)[source]¶

unitxt.metrics.validate_subgroup_types(subgroup_scores_dict: Dict[str, List], control_subgroup_types: List[str], comparison_subgroup_types: List[str])[source]¶

Validate a dict of subgroup type instance score lists, and subgroup type lists.

Parameters:

subgroup_scores_dict – dict where keys are subgroup types and values are lists of instance scores.
control_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the control (baseline) group
comparison_subgroup_types – list of subgroup types (potential keys of subgroup_scores_dict) that are the group to be compared to the control group.

Returns:

dict with all NaN scores removed; control_subgroup_types and comparison_subgroup_types will have duplicate elements removed