unitxt.dialog_operators module¶
Dialog Serializers.
Dialog serializers are the way to take dialog data and turn it into text that can be fed to the model.
The format of the dialog is:
dialog = [
{"user": "hello", "system": "hi"},
{"user": "kkk", "system": ""},
{"user": "kkk", "system": ""},
]
- class unitxt.dialog_operators.SerializeDialog(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | NoneType = None, to_field: str | NoneType = None, field_to_field: List[List[str]] | Dict[str, str] | NoneType = None, use_query: bool | NoneType = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, format: unitxt.formats.SystemFormat = None, last_response_to_field: str | NoneType = None, context_field: str | NoneType = None, context_separator: str = ' ', slice_first_and_last_turns_format: bool = True)[source]¶
Bases:
InstanceFieldOperator
Serializes dialog data for feeding into a model.
This class takes structured dialog data and converts it into a text format according to a specified template. It allows for the inclusion or exclusion of system responses and can operate on a per-turn basis or aggregate the entire dialog.
- Parameters:
field (str) – The field in the input data that contains the dialog.
to_field (Optional[str]) – The field in the output data where the serialized dialog will be stored.
last_user_turn_to_field (Optional[str]) – Field to store the last user turn.
last_system_turn_to_field (Optional[str]) – Field to store the last system turn.
context_field (Optional[str]) – Field that contains additional context to be prepended to the dialog.
- class unitxt.dialog_operators.SerializeOpenAiFormatDialog(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | NoneType = None, to_field: str | NoneType = None, field_to_field: List[List[str]] | Dict[str, str] | NoneType = None, use_query: bool | NoneType = None, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, not_exist_do_nothing: bool = False, format: unitxt.formats.SystemFormat = None, last_response_to_field: str | NoneType = None, context_field: str | NoneType = None, context_separator: str = ' ', slice_first_and_last_turns_format: bool = True, is_last_turn_user_only: bool = True)[source]¶
Bases:
SerializeDialog
Serializes dialog data for feeding into a model.
This class takes structured dialog data in the OpenAi format, and converts it into a text format according to a specified template. It allows for the inclusion or exclusion of system responses and can operate on a per-turn basis or aggregate the entire dialog.
- Parameters:
field (str) – The field in the input data that contains the dialog.
to_field (Optional[str]) – The field in the output data where the serialized dialog will be stored.
last_user_turn_to_field (Optional[str]) – Field to store the last user turn.
last_system_turn_to_field (Optional[str]) – Field to store the last system turn.
context_field (Optional[str]) – Field that contains additional context to be prepended to the dialog.
- static merge_dialog_entries(dialog: List[Dict[str, str]]) List[Dict[str, str]] [source]¶
Merges consecutive dialog entries with the same role.
- Parameters:
dialog (List[Dict[str, str]]) – The input dialog list where each dictionary has a ‘role’ and ‘content’.
- Returns:
A new list where consecutive entries with the same role are merged.
- Return type:
List[Dict[str, str]]
- transform_dialog_to_standard_format(dialog: List[Dict[str, str]]) List[Dict[str, str]] [source]¶
Transforms a dialog from OpenAI format to a simplified format.
Each dictionary contains ‘user’ and ‘system’ keys with their respective contents. Consecutive entries with the same role are merged. Entries with invalid roles raise an error.
- Parameters:
dialog (List[Dict[str, str]]) – The input dialog in OpenAI format.
- Returns:
The transformed dialog.
- Return type:
List[Dict[str, str]]
- Raises:
ValueError – If an invalid role is detected.
- static validate_openai_dialog_format(dialog: List[Dict[str, str]]) None [source]¶
Validates that the given dialog follows the correct OpenAI format.
The function checks that: 1. The dialog is a list of dictionaries. 2. Each dictionary contains the keys ‘role’ and ‘content’. 3. The ‘role’ value is either ‘user’ or ‘assistant’. 4. Both ‘role’ and ‘content’ values are strings. 5. The first ‘role’ is ‘user’
If the dialog does not conform to the expected format, a descriptive ValueError is raised indicating the issue.
- Parameters:
dialog (List[Dict[str, str]]) – The dialog to validate.
- Raises:
ValueError – If the dialog does not meet the format requirements.