Types and Serializers

Unitxt offers several mechanisms for serializing data into text. One of them is the Type-Serializers mechanism, which assigns a serialization method to each type. For example, consider the following types, defined with Python's typing module:

from typing import NewType, TypedDict, Union, Literal, List, Any

Text = NewType("Text", str)
Number = NewType("Number", Union[float, int])

class Turn(TypedDict):
    role: Literal["system", "user", "agent"]
    content: Text

Dialog = NewType("Dialog", List[Turn])

class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]

For each type, we can assign different serialization methods specific to that type. This enables us to “plug and play” different serialization methods and modify our data’s textual representation accordingly.
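To make the type definitions above concrete, here is a minimal sketch in plain Python of values that conform to them. It relies only on the standard typing module, and the sample values are made up for illustration. Note that NewType and TypedDict add no runtime wrapper, so a Dialog is an ordinary list of dicts.

```python
from typing import Any, List, Literal, NewType, TypedDict

Text = NewType("Text", str)

class Turn(TypedDict):
    role: Literal["system", "user", "agent"]
    content: Text

Dialog = NewType("Dialog", List[Turn])

class Table(TypedDict):
    header: List[str]
    rows: List[List[Any]]

# Illustrative values (not taken from the Unitxt documentation).
dialog = Dialog([
    Turn(role="user", content=Text("Hello")),
    Turn(role="agent", content=Text("Hi, how can I help?")),
])
table = Table(header=["city", "population"], rows=[["Paris", 2100000]])

# At runtime these are plain built-in containers.
print(type(dialog).__name__)     # list
print(type(dialog[0]).__name__)  # dict
print(type(table).__name__)      # dict
```

Because the values carry no type information of their own at runtime, the serializer has to be associated with the declared type rather than with the Python class of the value.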

Registering the Types

First, we need to register each of the types defined above that we want Unitxt to support.

from unitxt.type_utils import register_type

register_type(Text)
register_type(Number)
register_type(Turn)
register_type(Dialog)
register_type(Table)

Defining a Serializer for a Type

Once the types are registered, we can define serializers for those types. For example, consider creating a serializer for the Dialog type:

from typing import Any, Dict

from unitxt.serializers import SingleTypeSerializer

class DialogSerializer(SingleTypeSerializer):
    serialized_type = Dialog

    def serialize(self, value: Dialog, instance: Dict[str, Any]) -> str:
        # Render each turn as "role: content" and join the turns with newlines
        return "\n".join(f"{turn['role']}: {turn['content']}" for turn in value)
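The formatting logic inside serialize can be tried on its own, without Unitxt. The following sketch applies the same join expression to a plain list of turns. The helper name format_dialog and the sample dialog are hypothetical and used only for illustration.

```python
from typing import Dict, List

def format_dialog(turns: List[Dict[str, str]]) -> str:
    # Same expression as in DialogSerializer.serialize:
    # one "role: content" line per turn, joined by newlines.
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in turns)

# Made-up sample data.
sample = [
    {"role": "user", "content": "Where is the station?"},
    {"role": "agent", "content": "Two blocks north."},
]

text = format_dialog(sample)
print(text)
# user: Where is the station?
# agent: Two blocks north.
```

An empty dialog yields an empty string, since joining zero lines produces no text.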

Using the New Serializer

To use the new serializer, we need to do two things:

1. Ensure our task supports this type.
2. Add the serializer to the data loading recipe.

Using the New Type in a Task

Once the new type is registered, we can create a task that requires this type:

from unitxt.task import Task

dialog_summarization_task = Task(
    input_fields={"dialog": Dialog},
    reference_fields={"summary": str},
    prediction_type=str,
    metrics=["metrics.rouge"],
)

Loading Data with the Serializer

Once the task is defined with the type, we can use the serializer when loading the data.

Given this standalone card:

from unitxt.card import TaskCard
from unitxt.loaders import LoadFromDictionary

data = {
    "test": [
        {
            "dialog": [
                {"role": "user", "content": "What is the time?"},
                {"role": "system", "content": "4:13 PM"},
            ],
            "summary": "User asked for the time and got an answer."
        }
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data),
    task=dialog_summarization_task,
)

We can load the data with the serializer using:

from unitxt import load_dataset
from unitxt.templates import InputOutputTemplate

dataset = load_dataset(
    card=card,
    template=InputOutputTemplate(
        instruction="Summarize the following dialog.",
        input_format="{dialog}",
        output_format="{summary}",
    ),
    serializer=DialogSerializer(),
)

Now, if you print the input of the first instance with print(dataset["test"][0]["source"]), you will get:

Summarize the following dialog.
user: What is the time?
system: 4:13 PM

Adding a Serializer to a Template

Another option is to set a default serializer on a template. The template then needs a serializer for every type we want to support, so we use a multi-type serializer that wraps the individual serializers together.

from unitxt.serializers import (
    MultiTypeSerializer, ImageSerializer, TableSerializer, DialogSerializer, ListSerializer,
)

serializer = MultiTypeSerializer(
    serializers=[
        ImageSerializer(),
        TableSerializer(),
        DialogSerializer(),
        ListSerializer(),
    ]
)

Now we can pass the multi-type serializer to the template:

from unitxt.templates import InputOutputTemplate

InputOutputTemplate(
    instruction="Summarize the following dialog.",
    input_format="{dialog}",
    output_format="{summary}",
    serializer=serializer
)

Important: Serializers are tried in the order in which they are listed, on a “first come, first served” basis. This means that if you place the ListSerializer before the DialogSerializer, the ListSerializer will serialize the dialog, because a Dialog is also a List and therefore matches the type requirement of the ListSerializer.
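The effect of ordering can be illustrated with a small stand-in written in plain Python. This is not Unitxt's implementation: the function first_match_serialize, the predicates, and the (predicate, function) pairs are all hypothetical and only mimic the rule that the first matching serializer wins.

```python
from typing import Any, Callable, List, Tuple

# A rule pairs a type check with a function that renders the value as text.
Rule = Tuple[Callable[[Any], bool], Callable[[Any], str]]

def is_list(value: Any) -> bool:
    return isinstance(value, list)

def is_dialog(value: Any) -> bool:
    # In this sketch, a dialog is a list of dicts with "role" and "content" keys.
    return isinstance(value, list) and all(
        isinstance(turn, dict) and {"role", "content"} <= turn.keys()
        for turn in value
    )

def serialize_list(value: List[Any]) -> str:
    return ", ".join(str(item) for item in value)

def serialize_dialog(value: List[dict]) -> str:
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in value)

def first_match_serialize(value: Any, rules: List[Rule]) -> str:
    # Try the rules in order and use the first one whose check passes.
    for matches, serialize in rules:
        if matches(value):
            return serialize(value)
    return str(value)

dialog = [{"role": "user", "content": "Hi"}]

dialog_first = [(is_dialog, serialize_dialog), (is_list, serialize_list)]
list_first = [(is_list, serialize_list), (is_dialog, serialize_dialog)]

print(first_match_serialize(dialog, dialog_first))  # user: Hi
print(first_match_serialize(dialog, list_first))    # {'role': 'user', 'content': 'Hi'}
```

With the list rule first, the dialog is rendered by the generic list logic and the role/content formatting is never reached, which is why more specific serializers belong earlier in the list.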