unitxt.test_utils.card module¶
- unitxt.test_utils.card.load_examples_from_dataset_recipe(card, template_card_index, debug, **kwargs)[source]¶
- unitxt.test_utils.card.print_recipe_output(recipe, max_steps, num_examples, print_header, print_stream_size, streams=None)[source]¶
- unitxt.test_utils.card.test_card(card, debug=False, strict=True, test_exact_match_score_when_predictions_equal_references=True, test_full_mismatch_score_with_full_mismatch_prediction_values=True, exact_match_score=1.0, maximum_full_mismatch_score=0.0, full_mismatch_prediction_values=None, **kwargs)[source]¶
Tests a given card.
By default, the test goes over all templates defined in the card and generates sample outputs for each template. It also runs two tests on sample data. The first runs the metrics in the card with predictions that are equal to the references; the expected score in this case is typically 1. The second runs the metrics in the card with random predictions (selected from a fixed set of values); the expected score in this case is typically 0.
During the test, sample dataset instances, as well as the predictions/references, are displayed. It also shows the processed predictions and references after the template’s post processors are applied. This way it is possible to debug and verify that the inputs to the metrics are as expected.
- Parameters:
card : The Card object to be tested.
debug : A boolean value indicating whether to enable debug mode. In debug mode, the data processing pipeline is executed step by step, printing a representative output of each step. Default is False.
strict : A boolean value indicating whether to fail if scores do not match the expected ones. Default is True.
test_exact_match_score_when_predictions_equal_references : A boolean value indicating whether to test the exact match score when predictions equal references. Default is True.
test_full_mismatch_score_with_full_mismatch_prediction_values : A boolean value indicating whether to test the full mismatch score with full mismatch prediction values. The mismatched prediction values are specified in full_mismatch_prediction_values. Default is True.
exact_match_score : The expected score to be returned when predictions are equal to the gold references. Default is 1.0.
maximum_full_mismatch_score : The maximum score allowed to be returned when predictions are fully mismatched. Default is 0.0.
full_mismatch_prediction_values : An optional list of prediction values to use for testing full mismatches. Default is None. If not set, a default set of values is used: ["a1s", "bfsdf", "dgdfgs", "gfjgfh", "ghfjgh"].
**kwargs : Additional keyword arguments to be passed to the recipe.
Examples
# Test the templates with few shots
test_card(card, num_demos=1, demo_pool_size=10)

# Show the step-by-step processing of the data
test_card(card, debug=True)

# For some metrics (e.g. BertScore), random predictions do not produce a
# score of zero, so we disable this test
test_card(card, test_full_mismatch_score_with_full_mismatch_prediction_values=False)

# Alternatively, ensure the score on random predictions is at most 0.7
test_card(card, maximum_full_mismatch_score=0.7)

# Override the values used when checking that fully mismatched predictions get a score of 0
test_card(card, full_mismatch_prediction_values=["NA", "NONE"])
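A complete invocation might look like the sketch below. This is a minimal example, assuming a card is available in the local catalog ("cards.wnli" here is purely illustrative; any TaskCard object works) and that get_from_catalog is importable from the top-level unitxt package:

from unitxt import get_from_catalog
from unitxt.test_utils.card import test_card

# Fetch an existing TaskCard from the local catalog
# ("cards.wnli" is an illustrative catalog name).
card = get_from_catalog("cards.wnli")

# Run the default checks: sample generation for each template, the
# exact-match test (predictions equal to references, expected score 1.0),
# and the full-mismatch test (fixed nonsense predictions, expected score 0.0).
test_card(card)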