Note
To use this tutorial, you need to install unitxt with the following command:
pip install unitxt
Adding Datasets
This guide will assist you in adding or using your new dataset in unitxt.
The information needed to load and process your data is defined in a TaskCard object:
card = TaskCard(
# will be defined in the rest of this guide
)
Loading The Raw Data
To load data from an external source, use a loader. For example, to load the wmt16 translation dataset from the HuggingFace hub:
loader=LoadHF(path="wmt16", name="de-en"),
More loaders for different sources are available in the loaders section.
The Task
Your data usually corresponds to a task like translation, sentiment classification, or summarization. To ensure compatibility and processing into textual training examples, define your task schema:
task=FormTask(
inputs=["text", "source_language", "target_language"], # str, str, str
outputs=["translation"], # str
metrics=["metrics.bleu"],
),
We have predefined several tasks in catalog.tasks.
If a cataloged task fits your use case, call it by name:
task='tasks.translation.directed',
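To make the idea of a task schema concrete, here is a minimal plain-Python sketch of the kind of check a schema implies. It is an illustration only and not how unitxt validates instances: the `schema` dict and the `schema_problems` helper are hypothetical names, and the field types follow the `# str` comments in the example above.

```python
# Hypothetical schema: field name -> expected Python type (taken from the "# str" comments).
schema = {
    "text": str,
    "source_language": str,
    "target_language": str,
    "translation": str,
}

def schema_problems(instance, schema):
    """Return a list of human-readable problems; an empty list means the instance fits."""
    problems = []
    for name, expected in schema.items():
        if name not in instance:
            problems.append(f"missing field: {name}")
        elif not isinstance(instance[name], expected):
            problems.append(f"wrong type for {name}: {type(instance[name]).__name__}")
    return problems

# A raw wmt16-style record: "translation" is a nested dict, not a string.
raw = {"translation": {"en": "Hello", "de": "Hallo"}}

# The same record after it has been reshaped to the schema.
ready = {
    "text": "Hello",
    "source_language": "english",
    "target_language": "german",
    "translation": "Hallo",
}

print(schema_problems(raw, schema))
print(schema_problems(ready, schema))
```

The raw record fails on all four fields, which is why a card needs preprocessing steps before the task can consume the data.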
The Preprocessing pipeline
The preprocessing pipeline consists of operations to prepare your data according to the task’s schema.
For example, to prepare the wmt16 dataset for the translation task:
...
preprocess_steps=[
CopyFields( # copy the fields to prepare the fields required by the task schema
field_to_field=[
["translation/en", "text"],
["translation/de", "translation"],
],
use_query=True,
),
AddFields( # add new fields required by the task schema
fields={
"source_language": "english",
"target_language": "german",
}
),
]
For more built-in operators, see the operators section.
Most data can be normalized to the task schema using built-in operators, ensuring your data is processed with verified high-standard streaming code.
For custom operators, refer to the adding operator guide.
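The effect of the two preprocessing steps on a single record can be sketched in plain Python. This is an illustration under stated assumptions, not the unitxt implementation: `get_by_path`, `copy_fields`, and `add_fields` are hypothetical helpers, and the sketch assumes that a `/` in a field name addresses a nested key, as the `translation/en` paths suggest.

```python
import copy

def get_by_path(instance, path):
    """Follow a 'a/b/c' style path through nested dicts."""
    value = instance
    for key in path.split("/"):
        value = value[key]
    return value

def copy_fields(instance, field_to_field):
    """Copy each source path to a top-level target field, leaving the input untouched."""
    result = copy.deepcopy(instance)
    for source, target in field_to_field:
        # Always read from the original instance so that overwriting "translation"
        # cannot break a later read of "translation/...".
        result[target] = get_by_path(instance, source)
    return result

def add_fields(instance, fields):
    """Return a new instance with constant fields added."""
    return {**instance, **fields}

raw = {"translation": {"en": "Hello", "de": "Hallo"}}

step1 = copy_fields(raw, [["translation/en", "text"], ["translation/de", "translation"]])
step2 = add_fields(step1, {"source_language": "english", "target_language": "german"})

print(step2)
```

After both steps the record is flat and contains exactly the four fields the translation task expects.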
The Template
Templates convert data points into a model-friendly textual form. If using a predefined task, choose from the corresponding templates available in catalog.templates.
Alternatively, define your own templates:
...
templates=TemplatesList([
InputOutputTemplate(
input_format="Translate this sentence from {source_language} to {target_language}: {text}.",
output_format='{translation}',
),
])
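What such a template produces for one instance can be approximated with ordinary Python string formatting. The sketch below is not the InputOutputTemplate implementation: the `render` helper and the `source` and `target` output keys are assumptions chosen for illustration.

```python
input_format = "Translate this sentence from {source_language} to {target_language}: {text}."
output_format = "{translation}"

def render(instance, input_format, output_format):
    """Fill both format strings from the instance's fields (hypothetical helper)."""
    return {
        "source": input_format.format(**instance),  # text shown to the model
        "target": output_format.format(**instance),  # expected model output
    }

instance = {
    "text": "Hello",
    "source_language": "english",
    "target_language": "german",
    "translation": "Hallo",
}

rendered = render(instance, input_format, output_format)
print(rendered["source"])
print(rendered["target"])
```

Every placeholder in the format strings must name a field defined by the task schema; otherwise the formatting step fails with a KeyError in this sketch.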
Testing your card
Once your card is ready you can test it:
from unitxt.card import TaskCard
from unitxt.loaders import LoadHF
from unitxt.operators import CopyFields, AddFields
from unitxt.test_utils.card import test_card
card = TaskCard(
loader=LoadHF(path="wmt16", name="de-en"),
preprocess_steps=[
CopyFields( # copy the fields to prepare the fields required by the task schema
field_to_field=[
["translation/en", "text"],
["translation/de", "translation"],
],
use_query=True,
),
AddFields( # add new fields required by the task schema
fields={
"source_language": "english",
"target_language": "german",
}
),
],
task="tasks.translation.directed",
templates="templates.translation.directed.all"
)
test_card(card)
Adding to the catalog
Once your card is ready and tested you can add it to the catalog.
from unitxt import add_to_catalog
add_to_catalog(card, 'cards.wmt.en_de')
You can save your custom templates and tasks in the same way.
Note
By default, a new artifact will be added to a local catalog stored in the library directory. To use a different catalog, use the catalog_path argument.
To load artifacts from your new catalog automatically, register it with unitxt.register_catalog('my_catalog') or set the UNITXT_ARTIFACTORIES environment variable to include your catalog.
Putting it all together!
Now everything is ready to use the data! We use the standard in-context learning (ICL) recipe to load it:
from unitxt.standard import StandardRecipe
from unitxt import load_dataset
recipe = StandardRecipe(
card='cards.wmt.en_de',
num_demos=5, # The number of demonstrations for in-context learning
demos_pool_size=100 # The size of the demonstration pool from which to sample the 5 demonstrations
)
dataset = load_dataset(recipe)
Or, even simpler, with HuggingFace datasets:
from datasets import load_dataset
dataset = load_dataset('unitxt/data', 'card=cards.wmt.en_de,num_demos=5,demos_pool_size=100,instruction_item=0')
This yields the same dataset as before.
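The relationship between the comma-separated query string and the recipe arguments, and between num_demos and demos_pool_size, can be sketched in plain Python. This is an illustration only: `parse_query` and `sample_demos` are hypothetical helpers, and the sketch assumes that the pool is simply the first demos_pool_size training instances, which may differ from what unitxt actually does.

```python
import random

def parse_query(query):
    """Split 'k1=v1,k2=v2' into a dict, converting integer-looking values to int."""
    params = {}
    for pair in query.split(","):
        key, value = pair.split("=", 1)
        params[key] = int(value) if value.isdigit() else value
    return params

def sample_demos(train_instances, num_demos, demos_pool_size, seed=0):
    """Pick num_demos distinct demonstrations from a pool of demos_pool_size instances."""
    pool = train_instances[:demos_pool_size]  # assumption: the pool is the first N instances
    return random.Random(seed).sample(pool, num_demos)

params = parse_query("card=cards.wmt.en_de,num_demos=5,demos_pool_size=100,instruction_item=0")
print(params)

# Stand-in training data; real instances would be processed dataset records.
train = [f"example {i}" for i in range(1000)]
demos = sample_demos(train, params["num_demos"], params["demos_pool_size"])
print(len(demos))
```

Each key in the query string corresponds to a keyword argument of the recipe, so the two loading styles describe the same configuration.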