Note
To use this tutorial, you need to install unitxt.
Adding Datasets ✨¶
This guide will assist you in adding or using your new dataset in unitxt.
The information needed to load your data is defined in a TaskCard object:
card = TaskCard(
# will be defined in the rest of this guide
)
Loading The Raw Data¶
To load data from an external source, use a loader. For example, to load the wmt16 translation dataset from the HuggingFace hub:
loader=LoadHF(path="wmt16", name="de-en"),
More loaders for different sources are available in the loaders section.
The Task¶
Your data usually corresponds to a task like translation, sentiment classification, or summarization. To ensure your data is compatible and can be processed into textual training examples, define your task schema:
task=FormTask(
inputs=["text", "source_language", "target_language"], # str, str, str
outputs=["translation"], # str
metrics=["metrics.bleu"],
),
We have predefined several tasks in the catalog’s Tasks section.
If a cataloged task fits your use case, call it by name:
task='tasks.translation.directed',
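To make the idea of a task schema concrete, here is a minimal plain-Python sketch that checks whether a record carries every field the task declares. It does not use unitxt: the `missing_fields` helper and the sample record are hypothetical, and the library's own validation may work differently.

```python
# Field names taken from the FormTask definition above.
TASK_INPUTS = ["text", "source_language", "target_language"]
TASK_OUTPUTS = ["translation"]


def missing_fields(record, inputs, outputs):
    """Return the task fields that are absent from a record (hypothetical helper)."""
    return [name for name in inputs + outputs if name not in record]


# A record as it might look after preprocessing (illustrative values).
record = {
    "text": "Good morning",
    "source_language": "english",
    "target_language": "german",
    "translation": "Guten Morgen",
}

print(missing_fields(record, TASK_INPUTS, TASK_OUTPUTS))  # []
print(missing_fields({"text": "Good morning"}, TASK_INPUTS, TASK_OUTPUTS))
# ['source_language', 'target_language', 'translation']
```

A record with missing fields is a sign that the preprocessing pipeline described in the next section still needs a step.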
The Preprocessing pipeline¶
The preprocessing pipeline consists of operations to prepare your data according to the task’s schema.
For example, to prepare the wmt16 dataset for the translation task:
...
preprocess_steps=[
CopyFields( # copy the fields to prepare the fields required by the task schema
field_to_field=[
["translation/en", "text"],
["translation/de", "translation"],
],
),
AddFields( # add new fields required by the task schema
fields={
"source_language": "english",
"target_language": "german",
}
),
]
For more built-in operators, see the operators section.
Most data can be normalized to the task schema using built-in operators, ensuring your data is processed with verified high-standard streaming code.
For custom operators, refer to the adding operator guide.
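The effect of the two preprocessing steps above can be sketched in plain Python. This is only an illustration of the data flow, not how unitxt implements its operators: the helpers `copy_fields` and `add_fields` are hypothetical stand-ins, and the raw record assumes that both sentences are nested under a `translation` field, as the paths `translation/en` and `translation/de` suggest.

```python
import copy


def get_by_path(record, path):
    """Follow a slash-separated path such as 'translation/en' into nested dicts."""
    value = record
    for key in path.split("/"):
        value = value[key]
    return value


def copy_fields(record, field_to_field):
    """Hypothetical stand-in for CopyFields: copy each source path to a new field."""
    result = copy.deepcopy(record)
    for source_path, target_field in field_to_field:
        result[target_field] = get_by_path(record, source_path)
    return result


def add_fields(record, fields):
    """Hypothetical stand-in for AddFields: attach constant fields to the record."""
    return {**record, **fields}


# Assumed shape of one raw record (illustrative sentences).
raw = {"translation": {"en": "Good morning", "de": "Guten Morgen"}}

step1 = copy_fields(
    raw,
    [["translation/en", "text"], ["translation/de", "translation"]],
)
processed = add_fields(
    step1,
    {"source_language": "english", "target_language": "german"},
)

print(processed["text"])         # Good morning
print(processed["translation"])  # Guten Morgen
```

After both steps the record contains all four fields named in the task schema, alongside the original nested `translation/en` and `translation/de` values.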
The Template¶
Templates convert data points into a model-friendly textual form. If using a predefined task, choose from the corresponding templates available in the catalog’s Templates section.
Note
See the comprehensive guide on templates for more template features.
Alternatively, define your own custom templates:
..
templates=TemplatesList([
InputOutputTemplate(
input_format="Translate this sentence from {source_language} to {target_language}: {text}.",
output_format='{translation}',
),
])
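As a rough illustration of what the custom template above produces, the format strings can be filled in with Python's built-in `str.format`. This sketch assumes the placeholders are replaced by the task fields of the same name; the `render` helper and the `source`/`target` key names are hypothetical, and unitxt's actual rendering may add further processing.

```python
INPUT_FORMAT = (
    "Translate this sentence from {source_language} to {target_language}: {text}."
)
OUTPUT_FORMAT = "{translation}"


def render(instance):
    """Fill both format strings from one processed record (hypothetical helper)."""
    return {
        "source": INPUT_FORMAT.format(**instance),
        "target": OUTPUT_FORMAT.format(**instance),
    }


instance = {
    "text": "Good morning",
    "source_language": "english",
    "target_language": "german",
    "translation": "Guten Morgen",
}

rendered = render(instance)
print(rendered["source"])
# Translate this sentence from english to german: Good morning.
print(rendered["target"])
# Guten Morgen
```

Note that the input format ends with a period after `{text}`, so a sentence that already ends with a period would be rendered with two.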
Testing your card¶
Once your card is ready, you can test it:
from unitxt.card import TaskCard
from unitxt.loaders import LoadHF
from unitxt.operators import CopyFields, AddFields
from unitxt.test_utils.card import test_card
card = TaskCard(
loader=LoadHF(path="wmt16", name="de-en"),
preprocess_steps=[
CopyFields( # copy the fields to prepare the fields required by the task schema
field_to_field=[
["translation/en", "text"],
["translation/de", "translation"],
],
),
AddFields( # add new fields required by the task schema
fields={
"source_language": "english",
"target_language": "german",
}
),
],
task="tasks.translation.directed",
templates="templates.translation.directed.all"
)
test_card(card)
Adding to the catalog¶
Once your card is ready and tested, you can add it to the catalog:
from unitxt import add_to_catalog
add_to_catalog(card, 'cards.wmt.en_de')
You can save your custom templates and tasks in the same way.
Note
By default, a new artifact will be added to a local catalog stored in the library directory. To use a different catalog, use the catalog_path argument.
To load artifacts automatically from your new catalog, register it with unitxt.register_catalog('my_catalog') or set the UNITXT_ARTIFACTORIES environment variable to include your catalog.
Putting it all together!¶
Now everything is ready to use the data! We use the standard ICL (in-context learning) recipe to load it:
from unitxt.standard import StandardRecipe
from unitxt import load_dataset
recipe = StandardRecipe(
card='cards.wmt.en_de',
num_demos=5, # The number of demonstrations for in-context learning
demos_pool_size=100 # The size of the demonstration pool from which to sample the 5 demonstrations
)
dataset = load_dataset(recipe)
Or, even simpler, with HuggingFace datasets:
from datasets import load_dataset
dataset = load_dataset('unitxt/data', 'card=cards.wmt.en_de,num_demos=5,demos_pool_size=100,instruction_item=0')
This produces the same results as before.
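The roles of num_demos and demos_pool_size can be illustrated with a small plain-Python sketch: a fixed-size pool of rendered examples is set aside, and a few of them are sampled as demonstrations and placed before the instance to be answered. This is an assumption-laden illustration, not the recipe's actual code: the `build_prompt` helper, the seed, the prompt layout, and the placeholder examples are all hypothetical.

```python
import random


def build_prompt(instance, pool, num_demos, seed=0):
    """Sample demonstrations from the pool and prepend them to the instance."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    demos = rng.sample(pool, num_demos)  # sampling without replacement
    blocks = [f"{demo['source']}\n{demo['target']}" for demo in demos]
    blocks.append(instance["source"])  # the instance itself has no target yet
    return "\n\n".join(blocks), demos


# A pool of 100 placeholder examples, standing in for demos_pool_size=100.
pool = [
    {"source": f"Translate sentence {i}", "target": f"Translation {i}"}
    for i in range(100)
]

prompt, demos = build_prompt({"source": "Translate sentence 100"}, pool, num_demos=5)
print(len(demos))  # 5
```

A larger pool gives more variety in the demonstrations each instance sees, while num_demos controls how long each prompt becomes.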