Note

To use this tutorial, you need to install unitxt with the following command:

pip install unitxt

Adding Datasets

This guide will assist you in adding or using your new dataset in unitxt.

The information needed for loading your data will be defined in TaskCard class:

card = TaskCard(
    # will be defined in the rest of this guide
)

Loading The Raw Data

To load data from an external source, use a loader. For example, to load the wmt16 translation dataset from the HuggingFace hub:

loader=LoadHF(path="wmt16", name="de-en"),

More loaders for different sources are available in the loaders section.

The Task

Your data usually corresponds to a task like translation, sentiment classification, or summarization. To ensure compatibility and processing into textual training examples, define your task schema:

task=FormTask(
    inputs=["text", "source_language", "target_language"], # str, str, str
    outputs=["translation"], # str
    metrics=["metrics.bleu"],
),

We have predefined several tasks in catalog.tasks.

If a cataloged task fits your use case, call it by name:

task='tasks.tanslation.directed',

The Preprocessing pipeline

The preprocessing pipeline consists of operations to prepare your data according to the task’s schema.

For example, prepare the dataset for translation task:

...
preprocess_steps=[
    CopyFields( # copy the fields to prepare the fields required by the task schema
        field_to_field=[
            ["translation/en", "text"],
            ["translation/de", "translation"],
        ],
        use_query=True,
    ),
    AddFields( # add new fields required by the task schema
        fields={
            "source_language": "english",
            "target_language": "deutch",
        }
    ),
]

For more built-in operators read operators.

Most data can be normalized to the task schema using built-in operators, ensuring your data is processed with verified high-standard streaming code.

For custom operators, refer to the adding operator guide.

The Template

Templates convert data points into a model-friendly textual form. If using a predefined task, choose from the corresponding templates available in catalog.templates.

Alternively define your custom templates:

..
templates=TemplatesList([
    InputOutputTemplate(
        input_format="Translate this sentence from {source_language} to {target_language}: {text}.",
        output_format='{translation}',
    ),
])

Testing your card

Once your card is ready you can test it:

from unitxt.card import TaskCard
from unitxt.loaders import LoadHF
from unitxt.operators import CopyFields, AddFields
from unitxt.test_utils.card import test_card

 card = TaskCard(
    loader=LoadHF(path="wmt16", name="de-en"),
    preprocess_steps=[
        CopyFields( # copy the fields to prepare the fields required by the task schema
            field_to_field=[
                ["translation/en", "text"],
                ["translation/de", "translation"],
            ],
            use_query=True,
        ),
        AddFields( # add new fields required by the task schema
            fields={
                "source_language": "english",
                "target_language": "deutch",
            }
        ),
    ],
    task="tasks.tanslation.directed",
    templates="tasks.tanslation.directed.all"
)

test_card(card)

Adding to the catalog

Once your card is ready and tested you can add it to the catalog.

from unitxt import add_to_catalog

add_to_catalog(card, 'cards.wmt.en_de')

In the same way you can save also your custom templates and tasks.

Note

By default, a new artifact will be added to a local catalog stored in the library directory. To use a different catalog, use the catalog_path argument.

In order to load automatically from your new catalog remember to register your new catalog by unitxt.register_catalog(‘my_catalog’) or by setting the UNITXT_ARTIFACTORIES environment variable to include your catalog.

Putting it all together!

Now everything is ready to use the data! we use standard ICL recipe to load it:

from unitxt.standard import StandardRecipe
from unitxt import load_dataset

recipe = StandardRecipe(
    card='cards.wmt.en_de',
    num_demos=3, # The number of demonstrations for in-context learning
    demos_pool_size=100 # The size of the demonstration pool from which to sample the 5 demonstrations
)

dataset = load_dataset(recipe)

Or even simpler with hugginface datasets:

from datasets import load_dataset

dataset = load_dataset('unitxt/data', 'card=cards.wmt.en_de,num_demos=5,demos_pool_size=100,instruction_item=0')

And the same results as before will be obtained.