unitxt.span_lableing_operators module

class unitxt.span_lableing_operators.IobExtractor(__tags__: ~typing.Dict[str, str] = {}, caching: bool = None, apply_to_streams: ~typing.List[str] = None, dont_apply_to_streams: ~typing.List[str] = None, labels: ~typing.List[str], begin_labels: ~typing.List[str], inside_labels: ~typing.List[str], outside_label: int)

Bases: StreamInstanceOperator

Extracts entity spans from token sequences tagged with the Inside-Outside-Beginning (IOB) convention. It groups consecutive tokens into entities based on their IOB tags and maps each entity to one of the configured labels, such as Person, Organization, or Location.

labels

A list of entity type labels, e.g., ["Person", "Organization", "Location"].

Type:

List[str]

begin_labels

A list of labels indicating the beginning of an entity, e.g., ["B-PER", "B-ORG", "B-LOC"].

Type:

List[str]

inside_labels

A list of labels indicating the continuation of an entity, e.g., ["I-PER", "I-ORG", "I-LOC"].

Type:

List[str]

outside_label

The label indicating tokens outside of any entity, typically "O".

Type:

str

The extraction process identifies spans of text corresponding to entities and labels them according to their entity type. Each span is annotated with a start and end character offset, the entity text, and the corresponding label.
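The grouping step described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `extract_iob_spans` is hypothetical, and it assumes that `begin_labels[i]` and `inside_labels[i]` both correspond to `labels[i]`, that tokens are joined by single spaces, and that `end` is an exclusive character offset into that joined string. Any tag that is neither a begin tag nor a matching inside tag is treated as outside.

```python
from typing import Dict, List


def extract_iob_spans(
    tokens: List[str],
    tags: List[str],
    labels: List[str],
    begin_labels: List[str],
    inside_labels: List[str],
) -> List[Dict[str, object]]:
    """Hypothetical helper: group IOB-tagged tokens into labeled character spans."""
    # Assumption: begin_labels[i] and inside_labels[i] both map to labels[i].
    begin_to_label = dict(zip(begin_labels, labels))
    inside_to_label = dict(zip(inside_labels, labels))

    spans = []
    current = None  # span being built, or None when outside an entity
    offset = 0  # character offset of the current token in " ".join(tokens)

    for token, tag in zip(tokens, tags):
        end = offset + len(token)
        if tag in begin_to_label:
            # A begin tag always opens a new span, closing any open one.
            if current is not None:
                spans.append(current)
            current = {
                "start": offset,
                "end": end,
                "text": token,
                "label": begin_to_label[tag],
            }
        elif current is not None and inside_to_label.get(tag) == current["label"]:
            # An inside tag of the same entity type extends the open span.
            current["end"] = end
            current["text"] += " " + token
        else:
            # Outside tag, or an inside tag with no matching open span.
            if current is not None:
                spans.append(current)
            current = None
        offset = end + 1  # skip the single joining space

    if current is not None:
        spans.append(current)
    return spans


spans = extract_iob_spans(
    tokens=["John", "Doe", "works", "at", "OpenAI"],
    tags=["B-PER", "I-PER", "O", "O", "B-ORG"],
    labels=["Person", "Organization", "Location"],
    begin_labels=["B-PER", "B-ORG", "B-LOC"],
    inside_labels=["I-PER", "I-ORG", "I-LOC"],
)
print(spans)
# [{'start': 0, 'end': 8, 'text': 'John Doe', 'label': 'Person'},
#  {'start': 18, 'end': 24, 'text': 'OpenAI', 'label': 'Organization'}]
```

Closing the open span on every begin tag keeps two adjacent entities of the same type separate, which is the distinction the B- prefix exists to make.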

Example of instantiation and usage:

```python
from unitxt.span_lableing_operators import IobExtractor

operator = IobExtractor(
    labels=["Person", "Organization", "Location"],
    begin_labels=["B-PER", "B-ORG", "B-LOC"],
    inside_labels=["I-PER", "I-ORG", "I-LOC"],
    outside_label="O",
)

# One IOB tag per token: "works" and "at" lie outside any entity.
instance = {
    "labels": ["B-PER", "I-PER", "O", "O", "B-ORG"],
    "tokens": ["John", "Doe", "works", "at", "OpenAI"],
}
processed_instance = operator.process(instance)
print(processed_instance["spans"])
# Output: [{'start': 0, 'end': 8, 'text': 'John Doe', 'label': 'Person'}, …]
```

For more details on the IOB tagging convention, see: https://en.wikipedia.org/wiki/Inside-outside-beginning_(tagging)