Examples
Here you will find complete coding samples showing how to perform different tasks using Unitxt. Each example comes with a self-contained Python file that you can run and later modify.
Basic Usage
Evaluate an existing dataset from the Unitxt catalog
Demonstrates how to evaluate an existing entailment dataset using Unitxt. Unitxt is used to load the dataset, generate the input to the model, run inference and evaluate the results.
Related documentation: Installation, WNLI dataset card in catalog, Relation template in catalog, Inference Engines.
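A minimal sketch of this flow (catalog asset names and keyword arguments follow recent Unitxt releases; exact signatures and the shape of the returned results may differ between versions):

    from unitxt import evaluate, load_dataset
    from unitxt.inference import HFPipelineBasedInferenceEngine

    # Load a small slice of the WNLI entailment card with a catalog template.
    dataset = load_dataset(
        card="cards.wnli",
        template="templates.classification.multi_class.relation.default",
        max_test_instances=20,
        split="test",
    )

    # Run inference with a small local HuggingFace model and score the outputs.
    model = HFPipelineBasedInferenceEngine(
        model_name="google/flan-t5-small", max_new_tokens=32
    )
    predictions = model.infer(dataset)

    results = evaluate(predictions=predictions, data=dataset)
    print(results.global_scores.summary)  # older versions: results[0]["score"]["global"]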
Evaluate a custom dataset
This example demonstrates how to evaluate a user question-answering (QA) dataset in a standalone file, using a user-defined task and template.
Related documentation: Add new dataset tutorial.
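A minimal sketch of such a standalone setup (a hedged illustration: the create_dataset helper and the Task/InputOutputTemplate arguments shown here follow recent Unitxt releases, and the toy data is hypothetical):

    from unitxt import create_dataset, evaluate
    from unitxt.blocks import InputOutputTemplate, Task

    # A toy QA dataset defined inline (hypothetical data).
    data = [
        {"question": "What is the capital of Texas?", "answer": "Austin"},
        {"question": "What is the color of the sky?", "answer": "Blue"},
    ]

    # The task declares the input/reference fields, prediction type and metrics.
    task = Task(
        input_fields={"question": str},
        reference_fields={"answer": str},
        prediction_type=str,
        metrics=["metrics.accuracy"],
    )

    # The template verbalizes each instance into a prompt and a target output.
    template = InputOutputTemplate(
        instruction="Answer the following question in one word.",
        input_format="{question}",
        output_format="{answer}",
        postprocessors=["processors.lower_case"],
    )

    dataset = create_dataset(task=task, template=template, test_set=data, split="test")
    predictions = ["austin", "blue"]  # e.g., produced by any inference engine
    results = evaluate(predictions=predictions, data=dataset)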
Evaluate a custom dataset - reusing existing catalog assets
This example demonstrates how to evaluate a user QA dataset using the predefined open QA task and templates. It also shows how to use preprocessing steps to align the raw input of the dataset with the predefined task fields.
Related documentation: Add new dataset tutorial, Open QA task in catalog, Open QA template in catalog, Inference Engines.
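A hedged sketch of the idea: wrap the raw data in a TaskCard and rename its columns to match the catalog task's fields. The dataset path and raw field names below are hypothetical, and the exact operator and template names may differ across Unitxt versions:

    from unitxt import load_dataset
    from unitxt.blocks import TaskCard
    from unitxt.loaders import LoadHF
    from unitxt.operators import Rename

    card = TaskCard(
        # Hypothetical HuggingFace dataset with columns "query" and "response".
        loader=LoadHF(path="some_org/some_qa_dataset"),
        preprocess_steps=[
            # Align raw column names with the fields of the open QA task.
            # Note: the open QA task expects a list of reference answers, so an
            # additional step wrapping a single answer in a list may be needed.
            Rename(field_to_field={"query": "question", "response": "answers"}),
        ],
        task="tasks.qa.open",
        templates=["templates.qa.open.title"],
    )

    dataset = load_dataset(card=card, template="templates.qa.open.title", split="test")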
Evaluate a custom dataset - with existing predictions
These examples demonstrate how to evaluate datasets for different tasks when predictions are already available and no inference is required.
Example code for classification task
Related documentation: Evaluating datasets.
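A minimal sketch for the classification case, assuming the same load_dataset/evaluate API as above and placeholder predictions produced elsewhere:

    from unitxt import evaluate, load_dataset

    # Build the dataset exactly as it would be fed to a model...
    dataset = load_dataset(
        card="cards.wnli",
        template="templates.classification.multi_class.relation.default",
        max_test_instances=10,
        split="test",
    )

    # ...but skip inference and plug in predictions that already exist.
    predictions = ["entailment"] * len(dataset)  # placeholder model outputs
    results = evaluate(predictions=predictions, data=dataset)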
Evaluate a Named Entity Recognition (NER) dataset
This example demonstrates how to evaluate a named entity recognition task. The ground truth entities are provided as spans within the provided texts, and the model is prompted to identify these entities. Classical f1_micro, f1_macro, and per-entity-type f1 metrics are reported.
Related documentation: Add new dataset tutorial, NER task in catalog, Inference Engines.
Evaluate API Call
This example demonstrates how to evaluate a text-to-API-call task. It receives as input an OpenAPI specification, a set of user textual requests, and corresponding reference answers formatted as CURL API calls. The model is expected to generate CURL API calls, and these are compared to the references. The model output is post-processed and split into components (e.g., URL, parameters), which are each compared to the references via F1 metrics using the standard key_value_extraction metric.
Related documentation: Key Value Extraction metric in catalog, Templates tutorial.
Evaluation use cases
Evaluate the impact of different templates and in-context learning demonstrations
This example demonstrates how different templates and the number of in-context learning examples impact the performance of a model on an entailment task. It also shows how to register assets into a local catalog and reuse them.
Related documentation: Templates tutorial, Formatting tutorial, Using the Catalog, Inference Engines.
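A hedged sketch of such a sweep, assuming the recipe parameters num_demos and demos_pool_size and the load_dataset/evaluate API of recent Unitxt releases (the second template name is only indicative):

    from unitxt import evaluate, load_dataset
    from unitxt.inference import HFPipelineBasedInferenceEngine

    model = HFPipelineBasedInferenceEngine(
        model_name="google/flan-t5-small", max_new_tokens=32
    )

    # Sweep over catalog templates and in-context demonstration counts.
    for template in [
        "templates.classification.multi_class.relation.default",
        "templates.classification.multi_class.relation.simple",  # indicative name
    ]:
        for num_demos in [0, 3]:
            dataset = load_dataset(
                card="cards.wnli",
                template=template,
                num_demos=num_demos,
                demos_pool_size=50,
                max_test_instances=20,
                split="test",
            )
            predictions = model.infer(dataset)
            results = evaluate(predictions=predictions, data=dataset)
            print(template, num_demos, results.global_scores.summary)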
Evaluate the impact of different formats and system prompts
This example demonstrates how different formats and system prompts affect the input provided to a Llama 3 chat model, and evaluates their impact on the obtained scores.
Related documentation: Formatting tutorial.
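A hedged sketch of the comparison, assuming the recipe's format and system_prompt parameters; the catalog asset names below are indicative and should be checked against the catalog of your installed version:

    from unitxt import evaluate, load_dataset
    from unitxt.inference import CrossProviderInferenceEngine

    model = CrossProviderInferenceEngine(
        model="llama-3-8b-instruct", provider="watsonx"
    )

    # Compare two format / system-prompt combinations on the same card.
    for format_name, system_prompt in [
        ("formats.empty", "system_prompts.empty"),
        ("formats.llama3_instruct", "system_prompts.models.llama2"),  # indicative names
    ]:
        dataset = load_dataset(
            card="cards.wnli",
            template="templates.classification.multi_class.relation.default",
            format=format_name,
            system_prompt=system_prompt,
            max_test_instances=20,
            split="test",
        )
        predictions = model.infer(dataset)
        results = evaluate(predictions=predictions, data=dataset)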
Evaluate the impact of different demonstration example selections
This example demonstrates how different methods of selecting the demonstrations in in-context learning affect the results. Three methods are considered: fixed selection of example demonstrations for all test instances, random selection of example demonstrations for each test instance, and choosing the demonstration examples most (lexically) similar to each test instance.
Related documentation: Formatting tutorial.
Evaluate a dataset with a pool of templates and a varying number of demonstrations
This example demonstrates how to evaluate a dataset using a pool of templates and a varying number of in-context learning demonstrations. It shows how to sample a template and specify the number of demonstrations for each instance from predefined lists.
Related documentation: Templates tutorial, Formatting tutorial, Using the Catalog, Inference Engines.
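A rough sketch, under the assumption (valid in recent Unitxt releases) that the recipe accepts a list of templates and a list of demonstration counts and samples from them per instance; the second template name is indicative:

    from unitxt import load_dataset

    # Per instance, a template is sampled from the pool and a number of
    # demonstrations is sampled from the given list.
    dataset = load_dataset(
        card="cards.wnli",
        template=[
            "templates.classification.multi_class.relation.default",
            "templates.key_val",  # indicative second template
        ],
        num_demos=[0, 1, 3],
        demos_pool_size=50,
        max_test_instances=20,
        split="test",
    )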
Long Context
This example explores the effect of long context in classification. It converts a standard multi-class classification dataset (sst2 sentiment classification), where single-sentence texts are classified one by one, to a dataset where multiple sentences are classified in a single LLM call. It compares the f1_micro of both approaches on two models. It uses serializers to verbalize an enumerated list of multiple sentences and labels.
Related documentation: Sst2 dataset card in catalog, Types and Serializers Guide.
Construct a benchmark of multiple datasets and obtain the final score
This example shows how to construct a benchmark that includes multiple datasets, each with a specific template. It demonstrates how to use these templates to evaluate the datasets and aggregate the results to obtain a final score. This approach provides a comprehensive evaluation across different tasks and datasets.
Related documentation: Benchmarks tutorial, Formatting tutorial, Using the Catalog, Inference Engines.
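A hedged sketch of composing such a benchmark, assuming the Benchmark and DatasetRecipe classes of recent Unitxt releases (older releases name the recipe StandardRecipe); loading and scoring details vary by version, so see the benchmarks tutorial for the recommended pattern:

    from unitxt.benchmark import Benchmark
    from unitxt.standard import DatasetRecipe

    # Each subset pairs a catalog card with its own template.
    benchmark = Benchmark(
        subsets={
            "wnli": DatasetRecipe(
                card="cards.wnli",
                template="templates.classification.multi_class.relation.default",
                max_test_instances=20,
            ),
            "rte": DatasetRecipe(
                card="cards.rte",
                template="templates.classification.multi_class.relation.default",
                max_test_instances=20,
            ),
        }
    )

    # Benchmark acts as a Unitxt data source; materialize its test split, run any
    # inference engine over it, and score with evaluate() to obtain per-subset
    # and aggregate results.
    test_dataset = list(benchmark()["test"])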
LLM as Judges
Using LLM as judge for direct comparison using a predefined criteria
This example demonstrates how to use LLM-as-a-Judge with a predefined criteria, in this case answer_relevance. The Unitxt catalog has more than 40 predefined criteria for direct evaluators.
Related documentation: Using LLM as a Judge in Unitxt.
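A rough, unverified sketch of wiring a predefined criterion into a judge metric via catalog strings; the specific judge path below is an assumption (browse the catalog's llm_as_judge.direct entries for the judges and criteria available in your version), and the guide linked above documents the exact API:

    # Catalog string for an LLM-as-a-Judge metric configured with the predefined
    # answer_relevance criterion (the judge path is an assumption; the criterion
    # name comes from the catalog's direct-evaluation criteria).
    judge_metric = (
        "metrics.llm_as_judge.direct.watsonx.llama3_3_70b"
        "[criteria=metrics.llm_as_judge.direct.criteria.answer_relevance,"
        " context_fields=[question]]"
    )

    # The string can then be used wherever Unitxt accepts metrics, e.g.
    # Task(..., metrics=[judge_metric]) or a card's task definition.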
Using LLM as judge for direct comparison using a custom criteria
The user can also specify a bespoke criteria that the judge model uses as a guide to evaluate the responses. This example demonstrates how to use LLM-as-a-Judge with a user-defined criteria. The criteria must have options and option_map.
Related documentation: Creating a custom criteria.
Evaluate an existing dataset using an LLM-as-a-Judge for direct comparison
This example demonstrates how to evaluate an existing QA dataset (squad) using the HuggingFace Datasets and Evaluate APIs, leveraging a predefined criteria for direct evaluation. Note that here we also showcase Unitxt's ability to evaluate the dataset on multiple criteria, namely answer_relevance, coherence, and conciseness.
Related documentation: End to end Direct example.
Using LLM as a judge for pairwise comparison using a predefined criteria
This example demonstrates how to use LLM-as-a-Judge for pairwise comparison using a predefined criteria from the catalog. The Unitxt catalog has 7 predefined criteria for pairwise evaluators. We also showcase that the criteria does not need to be the same across the entire dataset and that the framework can handle different criteria for each data point.
A second example demonstrates using LLM-as-a-Judge for pairwise comparison with a single predefined criteria for the entire dataset.
Evaluate an existing dataset using an LLM-as-a-Judge for pairwise comparison
This example demonstrates how to evaluate an existing QA dataset (squad) using the HuggingFace Datasets and Evaluate APIs, leveraging a predefined criteria for pairwise evaluation. Note that here we also showcase Unitxt's ability to evaluate the dataset on multiple criteria, namely answer_relevance, coherence, and conciseness.
Related documentation: End to end Pairwise example.
RAG
Evaluate RAG response generation
This example demonstrates how to use the standard Unitxt RAG response generation task. The response generation task is the following: Given a question and one or more context(s), generate an answer that is correct and faithful to the context(s). The example shows how to map the dataset input fields to the RAG response task fields and use the existing metrics to evaluate model results.
Related documentation: RAG Guide, Response generation task, Inference Engines.
Evaluate RAG End to End - with existing predictions
This example demonstrates how to evaluate an end-to-end RAG system, given that the RAG system outputs are available.
Related documentation: Evaluating datasets.
Multi-Modality
Evaluate Image-Text to Text Model
This example demonstrates how to evaluate an image-text to text model using Unitxt. The task involves generating text responses based on both image and text inputs. This is particularly useful for tasks like visual question answering (VQA) where the model needs to understand and reason about visual content to answer questions. The example shows how to:
Load a pre-trained image-text model (LLaVA in this case)
Prepare a dataset with image-text inputs
Run inference on the model
Evaluate the model's predictions
The code uses the document VQA dataset in English, applies a QA template with context, and formats it for the LLaVA model. It then selects a subset of the test data, generates predictions, and evaluates the results. This approach can be adapted for various image-text to text tasks, such as image captioning, visual reasoning, or multimodal dialogue systems.
Related documentation: Multi-Modality Guide, Inference Engines.
Evaluate Image-Text to Text Model With Different Templates
Evaluate Image-Text to Text Models with different templates and explore the sensitivity of the model to different textual variations.
Related documentation: Multi-Modality Guide, Inference Engines.
Evaluate Image Key Value Extraction task
This example demonstrates how to evaluate an image key value extraction task. It renders several images of given texts and then prompts a vision model to extract key value pairs from the images. This requires the vision model to understand the texts in the images and extract the relevant values. It computes overall F1 scores and F1 scores for each of the keys based on ground truth key value pairs. Note that the same code can be used for textual key value extraction, just by providing input texts instead of input images.
Related documentation: Key Value Extraction task in catalog, Multi-Modality Guide, Inference Engines.
Advanced topics
Custom Types and Serializers
This example shows how to define new data types, as well as how these types should be handled when serialized into text.
Related documentation: Types and Serializers Guide, Inference Engines.
Evaluate an existing dataset from the Unitxt catalog (No installation)
This example demonstrates how to evaluate an existing entailment dataset (wnli) using HuggingFace Datasets and Evaluate APIs, with no installation required.
Related documentation: Evaluating datasets, WNLI dataset card in catalog, Relation template in catalog, Inference Engines.
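A minimal sketch of the no-installation flow, using the unitxt/data dataset and unitxt/metric module published on the HuggingFace hub (the recipe string mirrors the catalog assets; placeholder predictions stand in for real model outputs):

    import evaluate
    from datasets import load_dataset

    # Load the Unitxt-processed WNLI data directly from the HuggingFace hub.
    dataset = load_dataset(
        "unitxt/data",
        "card=cards.wnli,template=templates.classification.multi_class.relation.default,max_test_instances=20",
        split="test",
        trust_remote_code=True,
    )

    # The Unitxt metric is likewise loaded as a HuggingFace Evaluate module.
    metric = evaluate.load("unitxt/metric")

    predictions = ["entailment"] * len(dataset)  # placeholder model outputs
    results = metric.compute(predictions=predictions, references=dataset)
    print(results[0]["score"]["global"])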