Examples

Here you will find complete coding samples showing how to perform different tasks using Unitxt. Each example comes with a self-contained Python file that you can run and later modify.

Basic Usage

Evaluate an existing dataset from the Unitxt catalog

Demonstrates how to evaluate an existing entailment dataset (WNLI) with Unitxt: loading the dataset, generating the model inputs, running inference, and evaluating the results.
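A minimal sketch of this flow, using the WNLI card and relation template referenced below (argument names and the exact way scores are accessed may differ between Unitxt versions):

    from unitxt import evaluate, load_dataset
    from unitxt.inference import HFPipelineBasedInferenceEngine

    # Load the WNLI entailment dataset, verbalized with a relation template.
    dataset = load_dataset(
        card="cards.wnli",
        template="templates.classification.multi_class.relation.default",
        loader_limit=20,
        split="test",
    )

    # Generate predictions with a small HuggingFace model.
    model = HFPipelineBasedInferenceEngine(
        model_name="google/flan-t5-small", max_new_tokens=32
    )
    predictions = model.infer(dataset)

    # Score the predictions against the references produced by the template.
    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])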

Example code

Related documentation: Installation, WNLI dataset card in catalog, Relation template in catalog, Inference Engines.

Evaluate a custom dataset

This example demonstrates how to evaluate a user question-answering (QA) dataset in a standalone file, using a user-defined task and template.
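A minimal sketch of such a standalone file, assuming a simple QA task with a single reference answer and rouge as the metric (all names here are illustrative):

    from unitxt import create_dataset, evaluate
    from unitxt.task import Task
    from unitxt.templates import InputOutputTemplate

    # A user-defined task: its input/reference fields, prediction type, and metrics.
    task = Task(
        input_fields={"question": str},
        reference_fields={"answer": str},
        prediction_type=str,
        metrics=["metrics.rouge"],
    )

    # A user-defined template that verbalizes the task fields into a prompt.
    template = InputOutputTemplate(
        instruction="Answer the following question.",
        input_format="Question: {question}",
        output_format="{answer}",
    )

    data = [
        {"question": "What is the capital of France?", "answer": "Paris"},
        {"question": "How many continents are there?", "answer": "Seven"},
    ]

    dataset = create_dataset(task=task, template=template, test_set=data, split="test")

    # Predictions would normally come from an inference engine; hard-coded here.
    predictions = ["Paris", "Seven"]
    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])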

Example code

Related documentation: Add new dataset tutorial.

Evaluate a custom dataset - reusing existing catalog assets

This example demonstrates how to evaluate a user QA dataset using the predefined Open QA task and templates. It also shows how to use preprocessing steps to align the dataset's raw input with the predefined task fields.
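A hedged sketch of this pattern: a TaskCard whose preprocessing steps rename the raw fields into those expected by the catalog's open-QA task (the loader, raw field names, and template name are illustrative assumptions):

    from unitxt import load_dataset
    from unitxt.card import TaskCard
    from unitxt.loaders import LoadFromDictionary
    from unitxt.operators import Rename

    # Raw data whose field names do not match the catalog task fields.
    raw_data = {
        "test": [
            {"query": "What is the capital of France?", "reply": ["Paris"]},
            {"query": "How many continents are there?", "reply": ["Seven"]},
        ]
    }

    card = TaskCard(
        loader=LoadFromDictionary(data=raw_data),
        # Align the raw field names with the fields expected by tasks.qa.open.
        preprocess_steps=[
            Rename(field_to_field={"query": "question", "reply": "answers"}),
        ],
        task="tasks.qa.open",
    )

    dataset = load_dataset(card=card, template="templates.qa.open.title", split="test")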

Example code

Related documentation: Add new dataset tutorial, Open QA task in catalog, Open QA template in catalog, Inference Engines.

Evaluate a custom dataset - with existing predictions

These examples demonstrate how to evaluate datasets of different tasks when predictions are already available and no inference is required.
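A minimal sketch of the QA variant: the dataset is loaded only to obtain references and metric configuration, and the externally produced predictions are passed straight to evaluate (the catalog card and template names are assumptions):

    from unitxt import evaluate, load_dataset

    # Load the data to obtain references and metrics; no inference engine is used.
    dataset = load_dataset(
        card="cards.squad",
        template="templates.qa.with_context.simple",
        loader_limit=4,
        split="test",
    )

    # Placeholder predictions; in practice these come from your own system,
    # one prediction per loaded instance.
    predictions = ["answer 1", "answer 2", "answer 3", "answer 4"]

    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])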

Example code for QA task

Example code for classification task

Related documentation: Evaluating datasets

Evaluate a Named Entity Recognition (NER) dataset

This example demonstrates how to evaluate a named entity recognition task. The ground truth entities are provided as spans within the input texts, and the model is prompted to identify these entities. Classical f1_micro, f1_macro, and per-entity-type f1 metrics are reported.

Example code

Related documentation: Add new dataset tutorial, NER task in catalog, Inference Engines.

Evaluate API Call

This example demonstrates how to evaluate a text-to-API-call task. It receives as input an OpenAPI specification, a set of textual user requests, and corresponding reference answers formatted as curl API calls. The model is expected to generate curl API calls, and these are compared to the references. The model output is post-processed and split into components (e.g., URL, parameters), each of which is compared to the references via F1 scores using the standard key_value_extraction metric.

Example code

Related documentation: Key Value Extraction metric in catalog, Templates tutorial.

Evaluation use cases

Evaluate the impact of different templates and in-context learning demonstrations

This example demonstrates how different templates and the number of in-context learning examples impact the performance of a model on an entailment task. It also shows how to register assets into a local catalog and reuse them.
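A hedged sketch of the core sweep, varying the template and the number of in-context demonstrations over the same card (the catalog names are examples; registering assets into a local catalog is shown in the full example):

    from unitxt import evaluate, load_dataset
    from unitxt.inference import HFPipelineBasedInferenceEngine

    model = HFPipelineBasedInferenceEngine(
        model_name="google/flan-t5-small", max_new_tokens=32
    )

    for template in [
        "templates.classification.multi_class.relation.default",
        "templates.key_val",
    ]:
        for num_demos in [1, 3]:
            dataset = load_dataset(
                card="cards.wnli",
                template=template,
                num_demos=num_demos,
                demos_pool_size=50,
                loader_limit=100,
                split="test",
            )
            predictions = model.infer(dataset)
            results = evaluate(predictions=predictions, data=dataset)
            print(template, num_demos, results[0]["score"]["global"]["score"])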

Example code

Related documentation: Templates tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

Evaluate the impact of different formats and system prompts

This example demonstrates how different formats and system prompts affect the input provided to a Llama 3 chat model and evaluates their impact on the obtained scores.
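A sketch of the comparison, rendering the same instances under two formats (the Llama 3 format name is an assumption about the catalog; the full example also runs inference and scores each variant):

    from unitxt import load_dataset

    for format_name in ["formats.empty", "formats.llama3_instruct"]:
        dataset = load_dataset(
            card="cards.wnli",
            template="templates.classification.multi_class.relation.default",
            format=format_name,
            system_prompt="system_prompts.empty",
            loader_limit=20,
            split="test",
        )
        # Inspect how the fully rendered prompt changes with the format.
        print(format_name, dataset[0]["source"])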

Example code

Related documentation: Formatting tutorial.

Evaluate the impact of different demonstration example selections

This example demonstrates how different methods of selecting the demonstrations in in-context learning affect the results. Three methods are considered: fixed selection of example demonstrations for all test instances, random selection of example demonstrations for each test instance, and choosing the demonstration examples most (lexically) similar to each test instance.

Example code

Related documentation: Formatting tutorial.

Evaluate a dataset with a pool of templates and a varying number of demonstrations

This example demonstrates how to evaluate a dataset using a pool of templates and a varying number of in-context learning demonstrations. It shows how to sample a template and specify the number of demonstrations for each instance from predefined lists.
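A sketch following this description, passing a list of templates and a list of demo counts so each instance samples from them (the template names are examples; support for list values follows this example and may vary across versions):

    from unitxt import load_dataset

    dataset = load_dataset(
        card="cards.wnli",
        # Each instance samples one template from the pool...
        template=[
            "templates.classification.multi_class.relation.default",
            "templates.key_val",
        ],
        # ...and one demonstration count from this list.
        num_demos=[0, 1, 3],
        demos_pool_size=50,
        loader_limit=100,
        split="test",
    )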

Example code

Related documentation: Templates tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

Long Context

This example explores the effect of long context in classification. It converts a standard multi-class classification dataset (SST2 sentiment classification), where single-sentence texts are classified one by one, into a dataset where multiple sentences are classified in a single LLM call. It compares the f1_micro of both approaches on two models. It uses serializers to verbalize an enumerated list of multiple sentences and labels.

Example code

Related documentation: SST2 dataset card in catalog, Types and Serializers Guide.

Construct a benchmark of multiple datasets and obtain the final score

This example shows how to construct a benchmark that includes multiple datasets, each with a specific template. It demonstrates how to use these templates to evaluate the datasets and aggregate the results to obtain a final score. This approach provides a comprehensive evaluation across different tasks and datasets.
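A hedged sketch of a small two-subset benchmark (the Benchmark and DatasetRecipe classes and the way the benchmark is materialized follow the Unitxt benchmark API as I understand it, and may differ between versions):

    from unitxt.benchmark import Benchmark
    from unitxt.standard import DatasetRecipe

    benchmark = Benchmark(
        max_samples_per_subset=10,
        subsets={
            "wnli": DatasetRecipe(
                card="cards.wnli",
                template="templates.classification.multi_class.relation.default",
            ),
            "rte": DatasetRecipe(
                card="cards.rte",
                template="templates.classification.multi_class.relation.default",
            ),
        },
    )

    # Materialize the benchmark's test split, then run inference and evaluate as in
    # the earlier examples; per-subset scores and an aggregated score are reported.
    test_dataset = list(benchmark()["test"])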

Example code

Related documentation: Benchmarks tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

LLM as Judges

Using LLM as judge for direct comparison using a predefined criteria

This example demonstrates how to use LLM-as-a-Judge with a predefined criteria, in this case answer_relevance. The unitxt catalog has more than 40 predefined criteria for direct evaluators.

Example code

Related documentation: Using LLM as a Judge in Unitxt

Using LLM as judge for direct comparison using a custom criteria

The user can also specify a bespoke criteria that the judge model uses as a guide to evaluate the responses. This example demonstrates how to use LLM-as-a-Judge with a user-defined criteria. The criteria must have options and option_map.

Example code

Related documentation: Creating a custom criteria

Evaluate an existing dataset using an LLM-as-a-Judge for direct comparison

This example demonstrates how to evaluate an existing QA dataset (SQuAD) using the HuggingFace Datasets and Evaluate APIs, leveraging predefined criteria for direct evaluation. Note that here we also showcase Unitxt's ability to evaluate the dataset on multiple criteria, namely answer_relevance, coherence, and conciseness.

Example code

Related documentation: End to end Direct example

Using LLM as a judge for pairwise comparison using a predefined criteria

This example demonstrates how to use LLM-as-a-Judge for pairwise comparison using a predefined criteria from the catalog. The unitxt catalog has 7 predefined criteria for pairwise evaluators. We also showcase that the criteria does not need to be the same across the entire dataset and that the framework can handle different criteria for each datapoint.

Example code

A second example demonstrates using LLM-as-a-Judge for pairwise comparison with a single predefined criteria for the entire dataset.

Example code

Evaluate an existing dataset using an LLM-as-a-Judge for pairwise comparison

This example demonstrates how to evaluate an existing QA dataset (SQuAD) using the HuggingFace Datasets and Evaluate APIs, leveraging predefined criteria for pairwise evaluation. Note that here we also showcase Unitxt's ability to evaluate the dataset on multiple criteria, namely answer_relevance, coherence, and conciseness.

Example code

Related documentation: End to end Pairwise example

RAG

Evaluate RAG response generation

This example demonstrates how to use the standard Unitxt RAG response generation task. The response generation task is the following: Given a question and one or more context(s), generate an answer that is correct and faithful to the context(s). The example shows how to map the dataset input fields to the RAG response task fields and use the existing metrics to evaluate model results.

Example code

Related documentation: RAG Guide, Response generation task, Inference Engines.

Evaluate RAG End to End - with existing predictions

This example demonstrates how to evaluate an end-to-end RAG system, given that the RAG system outputs are already available.

Example code

Related documentation: Evaluating datasets

Multi-Modality

Evaluate Image-Text to Text Model

This example demonstrates how to evaluate an image-text to text model using Unitxt. The task involves generating text responses based on both image and text inputs. This is particularly useful for tasks like visual question answering (VQA) where the model needs to understand and reason about visual content to answer questions. The example shows how to:

  1. Load a pre-trained image-text model (LLaVA in this case)

  2. Prepare a dataset with image-text inputs

  3. Run inference on the model

  4. Evaluate the model’s predictions

The code uses the document VQA dataset in English, applies a QA template with context, and formats it for the LLaVA model. It then selects a subset of the test data, generates predictions, and evaluates the results. This approach can be adapted for various image-text to text tasks, such as image captioning, visual reasoning, or multimodal dialogue systems.

Example code

Related documentation: Multi-Modality Guide, Inference Engines.

Evaluate Image-Text to Text Model With Different Templates

This example evaluates image-text to text models with different templates and explores the sensitivity of the model to textual variations.

Example code

Related documentation: Multi-Modality Guide, Inference Engines.

Evaluate Image Key Value Extraction task

This example demonstrates how to evaluate an image key value extraction task. It renders several images of given texts and then prompts a vision model to extract key value pairs from the images. This requires the vision model to understand the texts in the images and extract the relevant values. It computes overall F1 scores and per-key F1 scores based on the ground truth key value pairs. Note that the same code can be used for textual key value extraction, just by providing input texts instead of input images.

Example code

Related documentation: Key Value Extraction task in catalog, Multi-Modality Guide, Inference Engines.

Advanced topics

Custom Types and Serializers

This example shows how to define new data types as well as how these data types should be handled when serialized to text.

Example code

Related documentation: Types and Serializers Guide, Inference Engines.

Evaluate an existing dataset from the Unitxt catalog (No installation)

This example demonstrates how to evaluate an existing entailment dataset (wnli) using HuggingFace Datasets and Evaluate APIs, with no installation required.
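A minimal sketch of the no-install flow through the HuggingFace hub wrappers, i.e. the unitxt/data dataset and unitxt/metric metric (the recipe string mirrors the catalog assets listed below; the placeholder predictions stand in for real model output):

    import evaluate
    from datasets import load_dataset

    # The Unitxt recipe is passed as the dataset's configuration string.
    dataset = load_dataset(
        "unitxt/data",
        "card=cards.wnli,template=templates.classification.multi_class.relation.default,loader_limit=20",
        split="test",
        trust_remote_code=True,
    )

    # Placeholder predictions; in practice these come from your model.
    predictions = ["entailment"] * len(dataset)

    metric = evaluate.load("unitxt/metric", trust_remote_code=True)
    results = metric.compute(predictions=predictions, references=dataset)
    print(results[0]["score"]["global"])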

Example code

Related documentation: Evaluating datasets, WNLI dataset card in catalog, Relation template in catalog, Inference Engines.