Examples

Here you will find complete coding samples showing how to perform different tasks using Unitxt. Each example comes with a self-contained Python file that you can run and later modify.

Basic Usage

Evaluate an existing dataset from the Unitxt catalog

Demonstrates how to evaluate an existing entailment dataset (WNLI) with Unitxt: loading the dataset, generating the model inputs, running inference, and evaluating the results.
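A minimal sketch of this flow, using the WNLI card and relation template referenced below (argument names and the exact way scores are accessed may differ between Unitxt versions):

    from unitxt import evaluate, load_dataset
    from unitxt.inference import HFPipelineBasedInferenceEngine

    # Load the WNLI entailment dataset, verbalized with a relation template.
    dataset = load_dataset(
        card="cards.wnli",
        template="templates.classification.multi_class.relation.default",
        loader_limit=20,
        split="test",
    )

    # Generate predictions with a small HuggingFace model.
    model = HFPipelineBasedInferenceEngine(
        model_name="google/flan-t5-small", max_new_tokens=32
    )
    predictions = model.infer(dataset)

    # Score the predictions against the references produced by the template.
    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])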

Example code

Related documentation: Installation, WNLI dataset card in catalog, Relation template in catalog, Inference Engines.

Evaluate a custom dataset

This example demonstrates how to evaluate a user question-answering (QA) dataset in a standalone file, using a user-defined task and template.
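A minimal sketch of such a standalone file, assuming a simple QA task with a single reference answer and rouge as the metric (all names here are illustrative):

    from unitxt import create_dataset, evaluate
    from unitxt.task import Task
    from unitxt.templates import InputOutputTemplate

    # A user-defined task: its input/reference fields, prediction type, and metrics.
    task = Task(
        input_fields={"question": str},
        reference_fields={"answer": str},
        prediction_type=str,
        metrics=["metrics.rouge"],
    )

    # A user-defined template that verbalizes the task fields into a prompt.
    template = InputOutputTemplate(
        instruction="Answer the following question.",
        input_format="Question: {question}",
        output_format="{answer}",
    )

    data = [
        {"question": "What is the capital of France?", "answer": "Paris"},
        {"question": "How many continents are there?", "answer": "Seven"},
    ]

    dataset = create_dataset(task=task, template=template, test_set=data, split="test")

    # Predictions would normally come from an inference engine; hard-coded here.
    predictions = ["Paris", "Seven"]
    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])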

Example code

Related documentation: Add new dataset tutorial.

Evaluate a custom dataset - reusing existing catalog assets

This example demonstrates how to evaluate a user QA dataset using the predefined Open QA task and templates. It also shows how to use preprocessing steps to align the dataset's raw input with the predefined task fields.
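A hedged sketch of this pattern: a TaskCard whose preprocessing steps rename the raw fields into those expected by the catalog's open-QA task (the loader, raw field names, and template name are illustrative assumptions):

    from unitxt import load_dataset
    from unitxt.card import TaskCard
    from unitxt.loaders import LoadFromDictionary
    from unitxt.operators import Rename

    # Raw data whose field names do not match the catalog task fields.
    raw_data = {
        "test": [
            {"query": "What is the capital of France?", "reply": ["Paris"]},
            {"query": "How many continents are there?", "reply": ["Seven"]},
        ]
    }

    card = TaskCard(
        loader=LoadFromDictionary(data=raw_data),
        # Align the raw field names with the fields expected by tasks.qa.open.
        preprocess_steps=[
            Rename(field_to_field={"query": "question", "reply": "answers"}),
        ],
        task="tasks.qa.open",
    )

    dataset = load_dataset(card=card, template="templates.qa.open.title", split="test")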

Example code

Related documentation: Add new dataset tutorial, Open QA task in catalog, Open QA template in catalog, Inference Engines.

Evaluate a custom dataset - with existing predictions

These examples demonstrate how to evaluate datasets of different tasks when predictions are already available and no inference is required.
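A minimal sketch of the QA variant: the dataset is loaded only to obtain references and metric configuration, and the externally produced predictions are passed straight to evaluate (the catalog card and template names are assumptions):

    from unitxt import evaluate, load_dataset

    # Load the data to obtain references and metrics; no inference engine is used.
    dataset = load_dataset(
        card="cards.squad",
        template="templates.qa.with_context.simple",
        loader_limit=4,
        split="test",
    )

    # Placeholder predictions; in practice these come from your own system,
    # one prediction per loaded instance.
    predictions = ["answer 1", "answer 2", "answer 3", "answer 4"]

    results = evaluate(predictions=predictions, data=dataset)
    print(results[0]["score"]["global"])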

Example code for QA task

Example code for classification task

Related documentation: Evaluating datasets

Evaluate a Named Entity Recognition (NER) dataset

This example demonstrates how to evaluate a named entity recognition task. The ground truth entities are provided as spans within the input texts, and the model is prompted to identify these entities. Classical f1_micro, f1_macro, and per-entity-type f1 metrics are reported.

Example code

Related documentation: Add new dataset tutorial, NER task in catalog, Inference Engines.

Evaluate API Call

This example demonstrates how to evaluate a text-to-API-call task. It receives as input an OpenAPI specification, a set of textual user requests, and corresponding reference answers formatted as curl API calls. The model is expected to generate curl API calls, and these are compared to the references. The model output is post-processed and split into components (e.g., URL, parameters), each of which is compared to the references via F1 scores using the standard key_value_extraction metric.

Example code

Related documentation: Key Value Extraction metric in catalog, Templates tutorial.

Evaluation use cases

Evaluate the impact of different templates and in-context learning demonstrations

This example demonstrates how different templates and the number of in-context learning examples impact the performance of a model on an entailment task. It also shows how to register assets into a local catalog and reuse them.
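A hedged sketch of the core sweep, varying the template and the number of in-context demonstrations over the same card (the catalog names are examples; registering assets into a local catalog is shown in the full example):

    from unitxt import evaluate, load_dataset
    from unitxt.inference import HFPipelineBasedInferenceEngine

    model = HFPipelineBasedInferenceEngine(
        model_name="google/flan-t5-small", max_new_tokens=32
    )

    for template in [
        "templates.classification.multi_class.relation.default",
        "templates.key_val",
    ]:
        for num_demos in [1, 3]:
            dataset = load_dataset(
                card="cards.wnli",
                template=template,
                num_demos=num_demos,
                demos_pool_size=50,
                loader_limit=100,
                split="test",
            )
            predictions = model.infer(dataset)
            results = evaluate(predictions=predictions, data=dataset)
            print(template, num_demos, results[0]["score"]["global"]["score"])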

Example code

Related documentation: Templates tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

Evaluate the impact of different formats and system prompts

This example demonstrates how different formats and system prompts affect the input provided to a Llama 3 chat model and evaluates their impact on the obtained scores.
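A sketch of the comparison, rendering the same instances under two formats (the Llama 3 format name is an assumption about the catalog; the full example also runs inference and scores each variant):

    from unitxt import load_dataset

    for format_name in ["formats.empty", "formats.llama3_instruct"]:
        dataset = load_dataset(
            card="cards.wnli",
            template="templates.classification.multi_class.relation.default",
            format=format_name,
            system_prompt="system_prompts.empty",
            loader_limit=20,
            split="test",
        )
        # Inspect how the fully rendered prompt changes with the format.
        print(format_name, dataset[0]["source"])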

Example code

Related documentation: Formatting tutorial.

Evaluate the impact of different demonstration example selections

This example demonstrates how different methods of selecting the demonstrations in in-context learning affect the results. Three methods are considered: fixed selection of example demonstrations for all test instances, random selection of example demonstrations for each test instance, and choosing the demonstration examples most (lexically) similar to each test instance.

Example code

Related documentation: Formatting tutorial.

Evaluate a dataset with a pool of templates and a varying number of demonstrations

This example demonstrates how to evaluate a dataset using a pool of templates and a varying number of in-context learning demonstrations. It shows how to sample a template and specify the number of demonstrations for each instance from predefined lists.
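A sketch following this description, passing a list of templates and a list of demo counts so each instance samples from them (the template names are examples; support for list values follows this example and may vary across versions):

    from unitxt import load_dataset

    dataset = load_dataset(
        card="cards.wnli",
        # Each instance samples one template from the pool...
        template=[
            "templates.classification.multi_class.relation.default",
            "templates.key_val",
        ],
        # ...and one demonstration count from this list.
        num_demos=[0, 1, 3],
        demos_pool_size=50,
        loader_limit=100,
        split="test",
    )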

Example code

Related documentation: Templates tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

Long Context

This example explores the effect of long context in classification. It converts a standard multi-class classification dataset (SST2 sentiment classification), where single-sentence texts are classified one by one, into a dataset where multiple sentences are classified in a single LLM call. It compares the f1_micro of both approaches on two models. It uses serializers to verbalize an enumerated list of multiple sentences and labels.

Example code

Related documentation: SST2 dataset card in catalog, Types and Serializers Guide.

Construct a benchmark of multiple datasets and obtain the final score

This example shows how to construct a benchmark that includes multiple datasets, each with a specific template. It demonstrates how to use these templates to evaluate the datasets and aggregate the results to obtain a final score. This approach provides a comprehensive evaluation across different tasks and datasets.
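A hedged sketch of a small two-subset benchmark (the Benchmark and DatasetRecipe classes and the way the benchmark is materialized follow the Unitxt benchmark API as I understand it, and may differ between versions):

    from unitxt.benchmark import Benchmark
    from unitxt.standard import DatasetRecipe

    benchmark = Benchmark(
        max_samples_per_subset=10,
        subsets={
            "wnli": DatasetRecipe(
                card="cards.wnli",
                template="templates.classification.multi_class.relation.default",
            ),
            "rte": DatasetRecipe(
                card="cards.rte",
                template="templates.classification.multi_class.relation.default",
            ),
        },
    )

    # Materialize the benchmark's test split, then run inference and evaluate as in
    # the earlier examples; per-subset scores and an aggregated score are reported.
    test_dataset = list(benchmark()["test"])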

Example code

Related documentation: Benchmarks tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

LLM as Judges

Using LLM as judge for direct comparison using a predefined criteria

This example demonstrates how to use LLM-as-a-Judge with a predefined criteria, in this case answer_relevance. The unitxt catalog has more than 40 predefined criteria for direct evaluators.

Example code

Related documentation: Using LLM as a Judge in Unitxt

Using LLM as judge for direct comparison using a custom criteria

The user can also specify a bespoke criteria that the judge model uses as a guide to evaluate the responses. This example demonstrates how to use LLM-as-a-Judge with a user-defined criteria. The criteria must have options and option_map.

Example code

Related documentation: Creating a custom criteria

Evaluate an existing dataset using an LLM-as-a-Judge for direct comparison

This example demonstrates how to evaluate an existing QA dataset (SQuAD) using the HuggingFace Datasets and Evaluate APIs, leveraging predefined criteria for direct evaluation. Note that here we also showcase Unitxt's ability to evaluate the dataset on multiple criteria, namely answer_relevance, coherence, and conciseness.

Example code

Related documentation: End to end Direct example

Using LLM as a judge for pairwise comparison using a predefined criteria

This example demonstrates how to use LLM-as-a-Judge for pairwise comparison using a predefined criteria from the catalog. The unitxt catalog has 7 predefined criteria for pairwise evaluators. We also showcase that the criteria does not need to be the same across the entire dataset and that the framework can handle different criteria for each datapoint.

Example code

A second example demonstrates using LLM-as-a-Judge for pairwise comparison with a single predefined criteria for the entire dataset.

Example code

Evaluate an existing dataset using an LLM-as-a-Judge for pairwise comparison

This example demonstrates how to evaluate an existing QA dataset (SQuAD) using the HuggingFace Datasets and Evaluate APIs, leveraging predefined criteria for pairwise evaluation. Note that here we also showcase Unitxt's ability to evaluate the dataset on multiple criteria, namely answer_relevance, coherence, and conciseness.

Example code

Related documentation: End to end Pairwise example

RAG

Evaluate RAG response generation

This example demonstrates how to use the standard Unitxt RAG response generation task. The response generation task is the following: Given a question and one or more context(s), generate an answer that is correct and faithful to the context(s). The example shows how to map the dataset input fields to the RAG response task fields and use the existing metrics to evaluate model results.

Example code

Related documentation: RAG Guide, Response generation task, Inference Engines.

Evaluate RAG End to End - with existing predictions

This example demonstrates how to evaluate an end-to-end RAG system, given that the RAG system outputs are already available.

Example code

Related documentation: Evaluating datasets

Multi-Modality

Evaluate Image-Text to Text Model

This example demonstrates how to evaluate an image-text to text model using Unitxt. The task involves generating text responses based on both image and text inputs. This is particularly useful for tasks like visual question answering (VQA) where the model needs to understand and reason about visual content to answer questions. The example shows how to:

  1. Load a pre-trained image-text model (LLaVA in this case)

  2. Prepare a dataset with image-text inputs

  3. Run inference on the model

  4. Evaluate the model’s predictions

The code uses the document VQA dataset in English, applies a QA template with context, and formats it for the LLaVA model. It then selects a subset of the test data, generates predictions, and evaluates the results. This approach can be adapted for various image-text to text tasks, such as image captioning, visual reasoning, or multimodal dialogue systems.

Example code

Related documentation: Multi-Modality Guide, Inference Engines.

Evaluate Image-Text to Text Model With Different Templates

This example evaluates image-text to text models with different templates and explores the sensitivity of the model to textual variations.

Example code

Related documentation: Multi-Modality Guide, Inference Engines.

Evaluate Image Key Value Extraction task

This example demonstrates how to evaluate an image key value extraction task. It renders several images of given texts and then prompts a vision model to extract key value pairs from the images. This requires the vision model to understand the texts in the images and extract the relevant values. It computes overall F1 scores and per-key F1 scores based on the ground truth key value pairs. Note that the same code can be used for textual key value extraction, just by providing input texts instead of input images.

Example code

Related documentation: Key Value Extraction task in catalog, Multi-Modality Guide, Inference Engines.

Advanced topics

Custom Types and Serializers

This example shows how to define new data types as well as how these data types should be handled when serialized to text.

Example code

Related documentation: Types and Serializers Guide, Inference Engines.

Evaluate an existing dataset from the Unitxt catalog (No installation)

This example demonstrates how to evaluate an existing entailment dataset (wnli) using HuggingFace Datasets and Evaluate APIs, with no installation required.
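A minimal sketch of the no-install flow through the HuggingFace hub wrappers, i.e. the unitxt/data dataset and unitxt/metric metric (the recipe string mirrors the catalog assets listed below; the placeholder predictions stand in for real model output):

    import evaluate
    from datasets import load_dataset

    # The Unitxt recipe is passed as the dataset's configuration string.
    dataset = load_dataset(
        "unitxt/data",
        "card=cards.wnli,template=templates.classification.multi_class.relation.default,loader_limit=20",
        split="test",
        trust_remote_code=True,
    )

    # Placeholder predictions; in practice these come from your model.
    predictions = ["entailment"] * len(dataset)

    metric = evaluate.load("unitxt/metric", trust_remote_code=True)
    results = metric.compute(predictions=predictions, references=dataset)
    print(results[0]["score"]["global"])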

Example code

Related documentation: Evaluating datasets, WNLI dataset card in catalog, Relation template in catalog, Inference Engines.