Examples

Here you will find complete coding samples showing how to perform different tasks using Unitxt. Each example comes with a self-contained Python file that you can run and later modify.

Basic Usage

Evaluate an existing dataset from the Unitxt catalog (No installation)

This example demonstrates how to evaluate an existing entailment dataset (wnli) using the HuggingFace Datasets and Evaluate APIs, with no Unitxt installation required.
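
The core pattern in that example is short; below is a minimal sketch of it (the constant placeholder predictions and the instance limit are illustrative, not taken from the example file):

```python
from datasets import load_dataset
from evaluate import load

# Load a Unitxt-prepared version of WNLI directly through HuggingFace Datasets.
# The card and template names refer to assets in the Unitxt catalog.
dataset = load_dataset(
    "unitxt/data",
    "card=cards.wnli,template=templates.classification.multi_class.relation.default,max_test_instances=20",
    split="test",
    trust_remote_code=True,
)

# Produce predictions with any model you like (constant placeholder below),
# then score them with the Unitxt metric bundle attached to the dataset.
predictions = ["entailment" for _ in dataset]  # replace with real model outputs
metric = load("unitxt/metric", trust_remote_code=True)
results = metric.compute(predictions=predictions, references=dataset)
print(results[0]["score"]["global"])  # aggregate scores are attached to every instance
```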

Example code

Related documentation: Evaluating datasets, WNLI dataset card in catalog, Relation template in catalog, Inference Engines.

Evaluate an existing dataset from the Unitxt catalog (with Unitxt installation)

This example demonstrates how to evaluate an existing entailment dataset (wnli) using the native Unitxt APIs. This approach is faster than using the HuggingFace APIs.
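
A minimal sketch of the native flow, using a small HuggingFace pipeline engine for predictions (the model choice and instance limit are illustrative):

```python
from unitxt import evaluate, load_dataset
from unitxt.inference import HFPipelineBasedInferenceEngine

# Build the same WNLI recipe through the native API; no HuggingFace dataset wrapper involved.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
    max_test_instances=20,
)
test_data = dataset["test"]

# Any inference engine works here; a small local pipeline keeps the sketch self-contained.
model = HFPipelineBasedInferenceEngine(model_name="google/flan-t5-small", max_new_tokens=32)
predictions = model.infer(test_data)

results = evaluate(predictions=predictions, data=test_data)
print(results[0]["score"]["global"])  # aggregate scores are attached to every instance
```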

Example code

Related documentation: Installation, WNLI dataset card in catalog, Relation template in catalog, Inference Engines.

Evaluate a custom dataset

This example demonstrates how to evaluate a user-provided question-answering (QA) dataset in a standalone file, using a user-defined task and template.
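
A minimal sketch of the standalone flow, assuming the create_dataset helper available in recent Unitxt releases; the data, metric choice, and template wording below are illustrative:

```python
from unitxt import evaluate
from unitxt.api import create_dataset
from unitxt.blocks import InputOutputTemplate, Task

# A tiny in-memory QA dataset (illustrative content).
data = [
    {"question": "What is the capital of Canada?", "answer": "Ottawa"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]

# The task declares the input/reference fields, the prediction type, and the metrics.
task = Task(
    input_fields={"question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],
)

# The template verbalizes each instance into a prompt and a gold output.
template = InputOutputTemplate(
    instruction="Answer the question in a single word or phrase.",
    input_format="Question: {question}",
    output_format="{answer}",
)

dataset = create_dataset(task=task, template=template, test_set=data, split="test")

predictions = ["Ottawa", "Christopher Marlowe"]  # replace with real model outputs
results = evaluate(predictions=predictions, data=dataset)
```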

Example code

Related documentation: Add new dataset tutorial.

Evaluate a custom dataset - reusing existing catalog assets

This example demonstrates how to evaluate a user QA dataset using the predefined Open QA task and templates. It also shows how to use preprocessing steps to align the raw input of the dataset with the predefined task fields.
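
A hedged sketch of the field-mapping idea; the raw field names are invented for illustration, and the open-QA task and template identifiers are assumptions to verify against the catalog:

```python
from unitxt import evaluate, load_dataset
from unitxt.blocks import TaskCard
from unitxt.loaders import LoadFromDictionary
from unitxt.operators import Rename  # called RenameFields in older Unitxt releases

# Raw user data whose field names do not match the predefined open-QA task schema.
raw_data = {
    "test": [
        {"query": "What is the capital of Canada?", "gold_answers": ["Ottawa"]},
        {"query": "Who wrote Hamlet?", "gold_answers": ["William Shakespeare"]},
    ]
}

# The card wires together a loader, preprocessing that renames the raw fields into
# the fields expected by the catalog task, and the catalog task itself.
card = TaskCard(
    loader=LoadFromDictionary(data=raw_data),
    preprocess_steps=[
        Rename(field_to_field={"query": "question", "gold_answers": "answers"}),
    ],
    task="tasks.qa.open",
)

# Template name is a placeholder; pick any open-QA template from the catalog.
dataset = load_dataset(card=card, template="templates.qa.open.title", split="test")

predictions = ["Ottawa", "Christopher Marlowe"]  # replace with real model outputs
results = evaluate(predictions=predictions, data=dataset)
```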

Example code

Related documentation: Add new dataset tutorial, Open QA task in catalog, Open QA template in catalog, Inference Engines.

Evaluation use cases

Evaluate the impact of different templates and in-context learning demonstrations

This example demonstrates how different templates and the number of in-context learning examples impact the performance of a model on an entailment task. It also shows how to register assets into a local catalog and reuse them.
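
Roughly, the sweep looks like the following sketch (the second template name and the model are placeholders chosen for illustration; the example file additionally registers its own assets in a local catalog):

```python
from unitxt import evaluate, load_dataset
from unitxt.inference import HFPipelineBasedInferenceEngine

model = HFPipelineBasedInferenceEngine(model_name="google/flan-t5-small", max_new_tokens=32)

# Sweep over template choices and the number of in-context demonstrations,
# keeping the rest of the recipe fixed, and compare the resulting scores.
for template in [
    "templates.classification.multi_class.relation.default",
    "templates.key_val",  # placeholder: any other compatible catalog template
]:
    for num_demos in [1, 3]:
        dataset = load_dataset(
            card="cards.wnli",
            template=template,
            num_demos=num_demos,
            demos_pool_size=50,
            max_test_instances=20,
        )["test"]
        predictions = model.infer(dataset)
        results = evaluate(predictions=predictions, data=dataset)
        print(template, num_demos, results[0]["score"]["global"]["score"])
```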

Example code

Related documentation: Templates tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

Evaluate the impact of different formats and system prompts

This example demonstrates how different formats and system prompts affect the input provided to a Llama 3 chat model and evaluates their impact on the resulting scores.
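
A sketch of the comparison; the llama3 format name is assumed to exist in the catalog, and the system-prompt texts are illustrative:

```python
from unitxt import load_dataset
from unitxt.system_prompts import TextualSystemPrompt

# Render the same card and template under different formats and system prompts;
# printing the generated "source" field shows exactly what the model would receive.
for fmt in ["formats.empty", "formats.llama3_instruct"]:  # second name assumed from the catalog
    for system_prompt in [
        TextualSystemPrompt(text="You are a helpful assistant."),
        TextualSystemPrompt(text="Answer with the label only, without explanations."),
    ]:
        dataset = load_dataset(
            card="cards.wnli",
            template="templates.classification.multi_class.relation.default",
            format=fmt,
            system_prompt=system_prompt,
            num_demos=1,
            demos_pool_size=20,
            max_test_instances=2,
        )["test"]
        print(f"--- format={fmt} ---")
        print(dataset[0]["source"])  # the fully formatted prompt
```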

Example code

Related documentation: Formatting tutorial.

Evaluate the impact of different demonstration example selections

This example demonstrates how different methods of selecting in-context learning demonstrations affect the results. Three methods are compared: a fixed selection of demonstrations shared by all test instances, a random selection of demonstrations for each test instance, and selection of the demonstrations most (lexically) similar to each test instance.

Example code

Related documentation: Formatting tutorial.

Evaluate a dataset with a pool of templates and a varying number of demonstrations

This example demonstrates how to evaluate a dataset using a pool of templates and a varying number of in-context learning demonstrations. It shows how to sample a template and a number of demonstrations for each instance from predefined lists.
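
As far as I recall, recent Unitxt releases support this by accepting lists for the template and num_demos parameters, sampling one value per instance; a hedged sketch (verify the list-valued parameters against your version):

```python
from unitxt import load_dataset

# A pool of templates and a pool of demonstration counts; one of each is
# sampled for every instance (list-valued parameters assumed to be supported).
dataset = load_dataset(
    card="cards.wnli",
    template=[
        "templates.classification.multi_class.relation.default",
        "templates.key_val",  # placeholder second template
    ],
    num_demos=[0, 1, 3],
    demos_pool_size=50,
    max_test_instances=20,
)["test"]

print(dataset[0]["source"])  # inspect how a sampled template and demo count were rendered
```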

Example code

Related documentation: Templates tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

Long Context

This example explores the effect of long context in classification. It converts a standard multi-class classification dataset (sst2 sentiment classification), in which single-sentence texts are classified one by one, into a dataset in which multiple sentences are classified in a single LLM call. It compares the f1_micro score of the two approaches on two models, and uses serializers to verbalize an enumerated list of multiple sentences and labels.

Example code

Related documentation: SST2 dataset card in catalog, Types and Serializers Guide.

Construct a benchmark of multiple datasets and obtain the final score

This example shows how to construct a benchmark that includes multiple datasets, each with a specific template. It demonstrates how to use these templates to evaluate the datasets and aggregate the results to obtain a final score. This approach provides a comprehensive evaluation across different tasks and datasets.

Example code

Related documentation: Benchmarks tutorial, Formatting tutorial, Using the Catalog, Inference Engines.

LLM as Judges

Evaluate an existing dataset using a predefined LLM as judge

This example demonstrates how to evaluate an existing QA dataset (squad) using the HuggingFace Datasets and Evaluate APIs and leveraging a predefined LLM as a judge metric.

Example code

Related documentation: Evaluating datasets, LLM as a Judge Metrics Guide, Inference Engines.

Evaluate a custom dataset using a custom LLM as Judge

This example demonstrates how to evaluate a user-provided question-answering (QA) dataset in a standalone file, using a user-defined task and template. In addition, it shows how to define an LLM as a judge metric, specify the template it uses to produce the input to the judge, and select the judge model and platform.
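
A hedged sketch of the judge-metric part only; the LLMAsJudge field values and the catalog template name are assumptions, and the linked example file is authoritative:

```python
from unitxt.inference import HFPipelineBasedInferenceEngine
from unitxt.llm_as_judge import LLMAsJudge

# The judge is an inference engine plus a template that turns the original question
# and the candidate answer into a judging prompt.
judge_engine = HFPipelineBasedInferenceEngine(
    model_name="google/flan-t5-large", max_new_tokens=32
)

judge_metric = LLMAsJudge(
    inference_model=judge_engine,
    # Catalog name assumed: any single-turn rating assessment template should fit.
    template="templates.response_assessment.rating.mt_bench_single_turn",
    task="rating.single_turn",
    main_score="llm_judge_score",
)

# The resulting metric can then be attached to the user-defined task (or card)
# wherever a conventional metric would be listed.
```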

Example code

Related documentation: LLM as a Judge Metrics Guide.

Evaluate an existing dataset from the catalog comparing two custom LLM as judges

This example demonstrates how to evaluate a document summarization dataset by defining an LLM as a judge metric, specifying the template it uses to produce the input to the judge, and selecting the judge model and platform. The example adds two LLM judges, one that uses the ground truth (references) from the dataset and one that does not.

Example code

Related documentation: LLM as a Judge Metrics Guide.

Evaluate the quality of an LLM as judge

This example demonstrates how to evaluate an LLM as judge by checking its scores against the gold references of a dataset. It checks whether the judge consistently prefers correct outputs over clearly wrong ones. Note that checking the ability of the LLM as judge to discern finer differences between partially correct answers requires more refined tests and corresponding labeled data. The example shows that an 8B Llama-based judge is not a good judge for a summarization task, while the 70B model performs much better.

Example code

Related documentation: LLM as a Judge Metrics Guide, Inference Engines.

Evaluate your model on the Arena Hard benchmark using a custom LLMaJ

This example demonstrates how to evaluate a user model on the Arena Hard benchmark, using an LLM as a judge (LLMaJ) other than GPT-4.

Example code

Related documentation: Evaluate a Model on Arena Hard Benchmark, Inference Engines.

Evaluate a judge model's performance on the Arena Hard Benchmark

This example demonstrates how to evaluate the capability of a user model to act as a judge on the Arena Hard benchmark. The model is evaluated on how well its judgments correlate with the judgments of GPT-4 on the benchmark.

Example code

Related documentation: Evaluate a Model on Arena Hard Benchmark, Inference Engines.

Evaluate using an ensemble of LLM as a judge metrics

This example demonstrates how to create a metric that is an ensemble of LLM as a judge metrics. It shows how to ensemble two judges that use different templates.

Example code

Related documentation: LLM as a Judge Metrics Guide, Inference Engines.

Evaluate predictions of models using a pre-trained ensemble of LLM as judges

This example demonstrates how to use a pre-trained ensemble model or an off-the-shelf LLM as judge to assess the multi-turn conversation quality of models on a set of pre-defined metrics:

Topicality: the model response contains only information that is related to and helpful for the user inquiry. Example code: https://github.com/IBM/unitxt/blob/main/examples/evaluate_ensemble_judge.py

Groundedness: every substantial claim in the model response is derivable from the content of the document. Example code: https://github.com/IBM/unitxt/blob/main/examples/evaluate_grounded_ensemble_judge.py

IDK: does the model response say “I don’t know”? Example code: https://github.com/IBM/unitxt/blob/main/examples/evaluate_idk_judge.py

Related documentation: LLM as a Judge Metrics Guide, Inference Engines.

RAG

Evaluate RAG response generation

This example demonstrates how to use the standard Unitxt RAG response generation task. The task is: given a question and one or more contexts, generate an answer that is correct and faithful to the contexts. The example shows how to map the dataset input fields to the RAG response generation task fields and how to use the existing metrics to evaluate model results.

Example code

Related documentation: RAG Guide, Response generation task, Inference Engines.

Multi-Modality

Evaluate Image-Text to Text Model

This example demonstrates how to evaluate an image-text to text model using Unitxt. The task involves generating text responses based on both image and text inputs. This is particularly useful for tasks like visual question answering (VQA) where the model needs to understand and reason about visual content to answer questions. The example shows how to:

  1. Load a pre-trained image-text model (LLaVA in this case)

  2. Prepare a dataset with image-text inputs

  3. Run inference on the model

  4. Evaluate the model’s predictions

The code uses the document VQA dataset in English, applies a QA template with context, and formats it for the LLaVA model. It then selects a subset of the test data, generates predictions, and evaluates the results. This approach can be adapted for various image-text to text tasks, such as image captioning, visual reasoning, or multimodal dialogue systems.
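
A rough sketch of that flow; the inference-engine class name and the catalog asset names are recollections that should be treated as assumptions (the linked example file is authoritative):

```python
from unitxt import evaluate, load_dataset
from unitxt.inference import HFLlavaInferenceEngine  # assumed name of the local LLaVA engine

# Assumed catalog assets: an English document-VQA card and a QA-with-context template.
dataset = load_dataset(
    card="cards.doc_vqa.en",
    template="templates.qa.with_context.title",
    format="formats.chat_api",
    loader_limit=10,
    split="test",
)

model = HFLlavaInferenceEngine(
    model_name="llava-hf/llava-interleave-qwen-0.5b-hf",
    max_new_tokens=32,
)

predictions = model.infer(dataset)
results = evaluate(predictions=predictions, data=dataset)
print(results[0]["score"]["global"])
```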

Example code

Related documentation: Multi-Modality Guide, Inference Engines.

Evaluate Image-Text to Text Model With Different Templates

This example evaluates image-text to text models with different templates and explores the sensitivity of the model to textual variations.

Example code

Related documentation: Multi-Modality Guide, Inference Engines.

Types and Serializers

Custom Types and Serializers

This example shows how to define new data types, as well as how these types should be serialized into text when processed.

Example code

Related documentation: Types and Serializers Guide, Inference Engines.