unitxt.struct_data_operators module

This section describes unitxt operators for structured data.

These operators specialize in handling structured data such as tables, triples, and key-value pairs.

For tables, the expected input format is:

{
    "header": ["col1", "col2"],
    "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
}

For triples, the expected input format is:

[["subject1", "relation1", "object1"], ["subject1", "relation2", "object2"]]

For key-value pairs, the expected input format is:

{"key1": "value1", "key2": value2, "key3": "value3"}
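The three input shapes above can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper names `is_table` and `is_triples` are hypothetical and not part of unitxt, and the requirement that every row has as many cells as the header is an assumption rather than something this page states.

```python
from typing import Any


def is_table(obj: Any) -> bool:
    """True if obj looks like {"header": [...], "rows": [[...], ...]}."""
    if not isinstance(obj, dict) or not isinstance(obj.get("header"), list):
        return False
    rows = obj.get("rows")
    # Assumption: each row is a list with one cell per header column.
    return isinstance(rows, list) and all(
        isinstance(row, list) and len(row) == len(obj["header"]) for row in rows
    )


def is_triples(obj: Any) -> bool:
    """True if obj is a list of [subject, relation, object] lists."""
    return isinstance(obj, list) and all(
        isinstance(triple, list) and len(triple) == 3 for triple in obj
    )


table = {"header": ["col1", "col2"], "rows": [["row11", "row12"], ["row21", "row22"]]}
triples = [["subject1", "relation1", "object1"], ["subject1", "relation2", "object2"]]
pairs = {"key1": "value1", "key2": 2, "key3": "value3"}

print(is_table(table), is_triples(triples), isinstance(pairs, dict))
```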

class unitxt.struct_data_operators.ConvertTableColNamesToSequential(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: FieldOperator

Replaces actual table column names with static sequential names like col_0, col_1,…

Sample input:

{
    "header": ["name", "age"],
    "rows": [["Alex", 21], ["Donald", 34]]
}

Sample output:

{
    "header": ["col_0", "col_1"],
    "rows": [["Alex", 21], ["Donald", 34]]
}
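The renaming shown in the sample can be reproduced in plain Python. This is a minimal sketch of the documented behavior, not the operator's actual code; the function name `to_sequential_col_names` is hypothetical.

```python
def to_sequential_col_names(table: dict) -> dict:
    """Return a new table dict whose header is col_0, col_1, ...

    The rows list is reused as-is (not copied); only the header changes.
    """
    return {
        "header": [f"col_{i}" for i in range(len(table["header"]))],
        "rows": table["rows"],
    }


table = {"header": ["name", "age"], "rows": [["Alex", 21], ["Donald", 34]]}
renamed = to_sequential_col_names(table)
print(renamed)
```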

class unitxt.struct_data_operators.DumpJson(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: FieldOperator

class unitxt.struct_data_operators.ListToKeyValPairs(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, fields: List[str], to_field: str)

Bases: InstanceOperator

Maps a list of keys and a list of values into key:value pairs.

Sample input in expected format:

{"keys": ["name", "age", "sex"], "values": ["Alex", 31, "M"]}

Sample output:

{"name": "Alex", "age": 31, "sex": "M"}
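The mapping in the sample amounts to zipping the two lists together. The sketch below assumes that `fields` names the keys field first and the values field second, and that the result is written to `to_field`; both are assumptions drawn from the signature, and `list_to_key_val_pairs` is a hypothetical helper, not the library's code.

```python
def list_to_key_val_pairs(instance: dict, fields: list, to_field: str) -> dict:
    """Zip the keys list and the values list of an instance into one dict."""
    # Assumption: fields[0] holds the keys, fields[1] holds the values.
    keys_field, values_field = fields
    instance[to_field] = dict(zip(instance[keys_field], instance[values_field]))
    return instance


instance = {"keys": ["name", "age", "sex"], "values": ["Alex", 31, "M"]}
result = list_to_key_val_pairs(instance, fields=["keys", "values"], to_field="kvpairs")
print(result["kvpairs"])
```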

class unitxt.struct_data_operators.LoadJson(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, failure_value: Any = None, allow_failure: bool = False)

Bases: FieldOperator

class unitxt.struct_data_operators.MapHTMLTableToJSON(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = ['bs4'], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: FieldOperator

Converts a table in HTML format to the basic JSON table format.

JSON format:

{
    "header": ["col1", "col2"],
    "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
}
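To illustrate the kind of conversion described, here is a sketch that uses the standard-library `html.parser` instead of `bs4` (which the operator lists as a requirement). It is a swapped-in technique, not the operator's implementation. It assumes the first row containing `<th>` cells is the header and that every cell value is kept as a stripped string; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableParserSketch(HTMLParser):
    """Collects <th>/<td> text into a header list and a list of rows."""

    def __init__(self):
        super().__init__()
        self.header, self.rows = [], []
        self._row, self._cell, self._is_header = None, None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._is_header = [], False
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []
            self._is_header = self._is_header or tag == "th"

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Assumption: the first row with <th> cells is the header.
            if self._is_header and not self.header:
                self.header = self._row
            else:
                self.rows.append(self._row)
            self._row = None


def html_table_to_json(html: str) -> dict:
    parser = TableParserSketch()
    parser.feed(html)
    return {"header": parser.header, "rows": parser.rows}


html = (
    "<table><tr><th>name</th><th>age</th></tr>"
    "<tr><td>Alex</td><td>26</td></tr><tr><td>Raj</td><td>34</td></tr></table>"
)
converted = html_table_to_json(html)
print(converted)
```

Note that cell values come out as strings in this sketch; any type conversion would be a separate step.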

class unitxt.struct_data_operators.MapTableListsToStdTableJSON(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: FieldOperator

Converts a table in lists format to the basic JSON table format.

JSON format:

{
    "header": ["col1", "col2"],
    "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
}
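The page does not spell out the lists format. The sketch below assumes it is a list of lists whose first inner list is the header and whose remaining lists are the rows. That layout is an assumption, and `lists_to_std_table` is a hypothetical name.

```python
def lists_to_std_table(table_lists: list) -> dict:
    """Split a list of lists into the header/rows dict described above."""
    # Assumption: the first inner list is the header, the rest are data rows.
    header, *rows = table_lists
    return {"header": header, "rows": rows}


table_lists = [["col1", "col2"], ["row11", "row12"], ["row21", "row22"]]
std_table = lists_to_std_table(table_lists)
print(std_table)
```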

class unitxt.struct_data_operators.SerializeKeyValPairs(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: FieldOperator

Serializes key-value pairs into a flat sequence.

Sample input in expected format:

{"name": "Alex", "age": 31, "sex": "M"}

Sample output:

name is Alex, age is 31, sex is M
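The sample output can be reproduced with a single join. This is a sketch of the documented format only; `serialize_key_val_pairs` is a hypothetical helper name.

```python
def serialize_key_val_pairs(pairs: dict) -> str:
    """Render each pair as "<key> is <value>" and join the pairs with ", "."""
    return ", ".join(f"{key} is {value}" for key, value in pairs.items())


serialized = serialize_key_val_pairs({"name": "Alex", "age": 31, "sex": "M"})
print(serialized)  # name is Alex, age is 31, sex is M
```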

class unitxt.struct_data_operators.SerializeTable(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: ABC, FieldOperator

A table serializer converts a given table into a flat sequence with special symbols.

The output format depends on the chosen serializer. This abstract class defines the structure that any concrete table serializer should follow.
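The abstract-base pattern described here can be sketched with `abc`. Everything in this example is hypothetical: the class names, the `process_value` and `serialize_table` method names, and the tab-separated format are chosen for illustration and are not confirmed by this page.

```python
from abc import ABC, abstractmethod


class TableSerializerSketch(ABC):
    """Hypothetical stand-in for an abstract table serializer."""

    def process_value(self, table: dict) -> str:
        # Shared entry point: delegate to the concrete serialization.
        return self.serialize_table(table)

    @abstractmethod
    def serialize_table(self, table: dict) -> str:
        """Turn {"header": [...], "rows": [...]} into a flat string."""


class TabSeparatedSketch(TableSerializerSketch):
    """One concrete format: header line, then one tab-separated line per row."""

    def serialize_table(self, table: dict) -> str:
        lines = [table["header"], *table["rows"]]
        return "\n".join("\t".join(str(cell) for cell in line) for line in lines)


table = {"header": ["name", "age"], "rows": [["Alex", 26]]}
flat = TabSeparatedSketch().process_value(table)
print(flat)
```

Because `serialize_table` is abstract, the base class cannot be instantiated directly; each concrete serializer only has to supply its own format.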

class unitxt.struct_data_operators.SerializeTableAsDFLoader(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: SerializeTable

DFLoader Table Serializer.

A serializer that formats the table as a pandas DataFrame code snippet.

Format (sample):

pd.DataFrame({
    "name" : ["Alex", "Diana", "Donald"],
    "age" : [26, 34, 39]
}, index=[0,1,2])
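A string in the sampled shape can be built column by column without importing pandas. This is a sketch only: the exact whitespace and line breaks are assumptions, `serialize_as_df_loader` is a hypothetical name, and `json.dumps` is used to get double-quoted strings, which assumes the cells are strings or numbers.

```python
import json


def serialize_as_df_loader(table: dict) -> str:
    """Render a header/rows table as a pd.DataFrame({...}) code snippet."""
    header, rows = table["header"], table["rows"]
    columns = [
        # One line per column: "name" : [values of that column]
        f"    {json.dumps(name)} : {json.dumps([row[i] for row in rows])}"
        for i, name in enumerate(header)
    ]
    index = ",".join(str(i) for i in range(len(rows)))
    return "pd.DataFrame({\n" + ",\n".join(columns) + "\n}, index=[" + index + "])"


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Diana", 34], ["Donald", 39]]}
snippet = serialize_as_df_loader(table)
print(snippet)
```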

class unitxt.struct_data_operators.SerializeTableAsIndexedRowMajor(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: SerializeTable

Indexed Row Major Table Serializer.

A commonly used row-major serialization format.

Format: col : col1 | col2 | col3 row 1 : val1 | val2 | val3 row 2 : val1 | …
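The format line above can be produced as follows. This sketch assumes rows are numbered from 1 and that the header and row segments are joined by single spaces, as the format line suggests; `serialize_indexed_row_major` is a hypothetical name.

```python
def serialize_indexed_row_major(table: dict) -> str:
    """Render "col : ..." followed by "row N : ..." segments on one line."""
    parts = ["col : " + " | ".join(str(col) for col in table["header"])]
    for number, row in enumerate(table["rows"], start=1):
        parts.append(f"row {number} : " + " | ".join(str(value) for value in row))
    # Assumption: segments are separated by a single space.
    return " ".join(parts)


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Diana", 34]]}
row_major = serialize_indexed_row_major(table)
print(row_major)  # col : name | age row 1 : Alex | 26 row 2 : Diana | 34
```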

class unitxt.struct_data_operators.SerializeTableAsJson(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: SerializeTable

JSON Table Serializer.

A serializer based on the JSON format.

Format (sample):

{
    "0":{"name":"Alex","age":26},
    "1":{"name":"Diana","age":34},
    "2":{"name":"Donald","age":39}
}
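The sampled structure maps each row index (as a string) to a column-to-value record. A sketch with the standard `json` module follows; the compact separators are an assumption made to match the sample's spacing, and `serialize_as_json` is a hypothetical name.

```python
import json


def serialize_as_json(table: dict) -> str:
    """Map each row index to a {column: value} record and dump it as JSON."""
    records = {
        str(index): dict(zip(table["header"], row))
        for index, row in enumerate(table["rows"])
    }
    # Compact separators, assumed here to mirror the sample's spacing.
    return json.dumps(records, separators=(",", ":"))


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Diana", 34], ["Donald", 39]]}
as_json = serialize_as_json(table)
print(as_json)
```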

class unitxt.struct_data_operators.SerializeTableAsMarkdown(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: SerializeTable

Markdown Table Serializer.

The Markdown table format is used primarily on GitHub.

Format:

|col1|col2|col3|
|---|---|---|
|A|4|1|
|I|2|1|
…
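A table in this Markdown shape can be assembled line by line. The sketch below assumes one line per row with no padding around cell values; `serialize_as_markdown` is a hypothetical name, not the serializer's actual code.

```python
def serialize_as_markdown(table: dict) -> str:
    """Render header, separator, and data rows as pipe-delimited lines."""

    def line(cells) -> str:
        return "|" + "|".join(str(cell) for cell in cells) + "|"

    lines = [line(table["header"]), line(["---"] * len(table["header"]))]
    lines += [line(row) for row in table["rows"]]
    return "\n".join(lines)


table = {"header": ["col1", "col2", "col3"], "rows": [["A", 4, 1], ["I", 2, 1]]}
markdown = serialize_as_markdown(table)
print(markdown)
```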

class unitxt.struct_data_operators.SerializeTableRowAsList(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, fields: str, to_field: str, max_cell_length: int | None = None)

Bases: InstanceOperator

Serializes a table row as list.

Parameters:

fields (str) –
to_field (str) –
max_cell_length (int) –

class unitxt.struct_data_operators.SerializeTableRowAsText(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, fields: str, to_field: str, max_cell_length: int | None = None)

Bases: InstanceOperator

Serializes a table row as text.

Parameters:

fields (str) –
to_field (str) –
max_cell_length (int) –

class unitxt.struct_data_operators.SerializeTriples(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: FieldOperator

Serializes triples into a flat sequence.

Sample input in expected format:

[["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"], ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"]]

Sample output:

First Clearing : LOCATION : On NYS 52 1 Mi. Youngsville | On NYS 52 1 Mi. Youngsville : CITY_OR_TOWN : Callicoon, New York
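The sample output joins the parts of each triple with " : " and the triples with " | ". A minimal sketch of that format, with the hypothetical helper name `serialize_triples`:

```python
def serialize_triples(triples: list) -> str:
    """Join triple parts with " : " and whole triples with " | "."""
    return " | ".join(
        " : ".join(str(part) for part in triple) for triple in triples
    )


triples = [
    ["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"],
    ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"],
]
flat_triples = serialize_triples(triples)
print(flat_triples)
```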

class unitxt.struct_data_operators.ShuffleTableColumns(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: FieldOperator

Shuffles the table columns randomly.

Sample input:

{
    "header": ["name", "age"],
    "rows": [["Alex", 26], ["Raj", 34], ["Donald", 39]]
}

Sample output:

{
    "header": ["age", "name"],
    "rows": [[26, "Alex"], [34, "Raj"], [39, "Donald"]]
}
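Shuffling columns means applying one random permutation to the header and to every row, so each value stays under its column name. The sketch below is an illustration under that reading; the function name and the explicit `rng` argument are hypothetical and were added so the example is reproducible.

```python
import random


def shuffle_table_columns(table: dict, rng: random.Random) -> dict:
    """Apply the same random column order to the header and to each row."""
    order = list(range(len(table["header"])))
    rng.shuffle(order)
    return {
        "header": [table["header"][i] for i in order],
        "rows": [[row[i] for i in order] for row in table["rows"]],
    }


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Raj", 34], ["Donald", 39]]}
shuffled_cols = shuffle_table_columns(table, random.Random(0))
print(shuffled_cols)
```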

class unitxt.struct_data_operators.ShuffleTableRows(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False)

Bases: FieldOperator

Shuffles the input table rows randomly.

Sample input:

{
    "header": ["name", "age"],
    "rows": [["Alex", 26], ["Raj", 34], ["Donald", 39]]
}

Sample output:

{
    "header": ["name", "age"],
    "rows": [["Donald", 39], ["Raj", 34], ["Alex", 26]]
}
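Row shuffling leaves the header untouched and reorders only the rows. A minimal sketch, with a hypothetical function name and an explicit `rng` argument added for reproducibility:

```python
import random


def shuffle_table_rows(table: dict, rng: random.Random) -> dict:
    """Return a table with the same header and its rows in random order."""
    rows = list(table["rows"])  # copy so the input table is not mutated
    rng.shuffle(rows)
    return {"header": table["header"], "rows": rows}


table = {"header": ["name", "age"], "rows": [["Alex", 26], ["Raj", 34], ["Donald", 39]]}
shuffled_rows = shuffle_table_rows(table, random.Random(0))
print(shuffled_rows)
```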

class unitxt.struct_data_operators.TruncateTableCells(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, max_length: int = 15, table: str = None, text_output: str | None = None)

Bases: InstanceOperator

Limit the maximum length of cell values in a table to reduce the overall length.

Parameters:

max_length (int) – Maximum length of a cell value (default: 15).

For tasks that produce a cell value as the answer, truncating a cell value should be replicated by truncating the corresponding answer as well. This has been addressed in the implementation.
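The point about keeping the answer in sync with a truncated cell can be illustrated as follows. This is a sketch under stated assumptions, not the operator's code: truncation is done by plain character slicing, the answer is updated only when it equals the full cell text, and the function and field names are hypothetical.

```python
def truncate_cells(instance: dict, table_field: str, answer_field: str, max_length: int = 15) -> dict:
    """Cut long cells to max_length and apply the same cut to a matching answer."""
    answer = instance.get(answer_field)
    for row in instance[table_field]["rows"]:
        for i, cell in enumerate(row):
            text = str(cell)
            if len(text) > max_length:
                truncated = text[:max_length]  # assumption: simple slicing
                if answer == text:
                    # Keep the expected answer consistent with the table.
                    instance[answer_field] = truncated
                row[i] = truncated
    return instance


instance = {
    "table": {"header": ["city", "note"], "rows": [["Callicoon, New York", "ok"]]},
    "answer": "Callicoon, New York",
}
truncated_instance = truncate_cells(instance, "table", "answer", max_length=9)
print(truncated_instance)
```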

class unitxt.struct_data_operators.TruncateTableRows(__tags__: Dict[str, str] = {}, data_classification_policy: List[str] = None, caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, field: str | None = None, to_field: str | None = None, field_to_field: List[List[str]] | Dict[str, str] | None = None, use_query: bool, process_every_value: bool = False, get_default: Any = None, not_exist_ok: bool = False, rows_to_keep: int = 10)

Bases: FieldOperator

Limits the table to a specified number of rows by randomly selecting the excess rows to remove.

Parameters:

rows_to_keep (int) – Number of rows to keep (default: 10).
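Random row truncation can be sketched by sampling which row indices survive. Whether the surviving rows keep their original relative order is not stated on this page; the sketch assumes they do. The function name and the `rng` argument are hypothetical.

```python
import random


def truncate_table_rows(table: dict, rows_to_keep: int = 10, rng: random.Random = None) -> dict:
    """Keep at most rows_to_keep rows, dropping the excess at random."""
    if rng is None:
        rng = random.Random()
    rows = table["rows"]
    if len(rows) <= rows_to_keep:
        return {"header": table["header"], "rows": list(rows)}
    # Assumption: surviving rows stay in their original relative order.
    keep = sorted(rng.sample(range(len(rows)), rows_to_keep))
    return {"header": table["header"], "rows": [rows[i] for i in keep]}


table = {"header": ["id"], "rows": [[i] for i in range(25)]}
shortened = truncate_table_rows(table, rows_to_keep=10, rng=random.Random(0))
print(len(shortened["rows"]))  # 10
```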

unitxt.struct_data_operators.truncate_cell(cell_value, max_len)