unitxt.span_lableing_operators module
- class unitxt.span_lableing_operators.IobExtractor(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, labels: List[str] = __required__, begin_labels: List[str] = __required__, inside_labels: List[str] = __required__, outside_label: int = __required__)
Bases: InstanceOperator
A class designed to extract entities from sequences of text using the Inside-Outside-Beginning (IOB) tagging convention. It identifies entities based on IOB tags and categorizes them into predefined labels such as Person, Organization, and Location.
- labels
A list of entity type labels, e.g., [“Person”, “Organization”, “Location”].
- Type:
List[str]
- begin_labels
A list of labels indicating the beginning of an entity, e.g., [“B-PER”, “B-ORG”, “B-LOC”].
- Type:
List[str]
- inside_labels
A list of labels indicating the continuation of an entity, e.g., [“I-PER”, “I-ORG”, “I-LOC”].
- Type:
List[str]
- outside_label
The label indicating tokens outside of any entity, typically “O”.
- Type:
str
The extraction process identifies spans of text corresponding to entities and labels them according to their entity type. Each span is annotated with a start and end character offset, the entity text, and the corresponding label.
Example of instantiation and usage:
```python
from unitxt.span_lableing_operators import IobExtractor

operator = IobExtractor(
    labels=["Person", "Organization", "Location"],
    begin_labels=["B-PER", "B-ORG", "B-LOC"],
    inside_labels=["I-PER", "I-ORG", "I-LOC"],
    outside_label="O",
)

# One IOB tag per token: "John Doe" is a Person, "OpenAI" an Organization.
instance = {
    "labels": ["B-PER", "I-PER", "O", "O", "B-ORG"],
    "tokens": ["John", "Doe", "works", "at", "OpenAI"],
}
processed_instance = operator.process(instance)
print(processed_instance["spans"])
# Output: [{'start': 0, 'end': 8, 'text': 'John Doe', 'label': 'Person'}, …]
```
For more details on the IOB tagging convention, see: https://en.wikipedia.org/wiki/Inside-outside-beginning_(tagging)
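The span-grouping logic described above can be sketched in plain Python. This is a minimal illustration and not unitxt's implementation: the function name `extract_iob_spans` is hypothetical, it assumes tokens are joined by single spaces when computing character offsets, and it treats any tag that is neither a begin tag nor a matching inside tag as outside.

```python
from typing import Dict, List


def extract_iob_spans(
    tokens: List[str],
    tags: List[str],
    labels: List[str],
    begin_labels: List[str],
    inside_labels: List[str],
) -> List[Dict[str, object]]:
    """Hypothetical helper: group IOB-tagged tokens into labeled character spans."""
    # Map each begin/inside tag to its entity type by position.
    begin_to_label = dict(zip(begin_labels, labels))
    inside_to_label = dict(zip(inside_labels, labels))

    spans: List[Dict[str, object]] = []
    current = None  # span being built, or None when outside any entity
    offset = 0  # start offset of the current token in " ".join(tokens)

    for token, tag in zip(tokens, tags):
        end = offset + len(token)
        if tag in begin_to_label:
            # A begin tag closes any open span and opens a new one.
            if current is not None:
                spans.append(current)
            current = {"start": offset, "end": end, "label": begin_to_label[tag]}
        elif current is not None and inside_to_label.get(tag) == current["label"]:
            # An inside tag of the same entity type extends the open span.
            current["end"] = end
        else:
            # Outside tag (or a stray inside tag): close any open span.
            if current is not None:
                spans.append(current)
            current = None
        offset = end + 1  # assumption: exactly one space between tokens

    if current is not None:
        spans.append(current)

    # Attach the covered text once all offsets are known.
    text = " ".join(tokens)
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


spans = extract_iob_spans(
    tokens=["John", "Doe", "works", "at", "OpenAI"],
    tags=["B-PER", "I-PER", "O", "O", "B-ORG"],
    labels=["Person", "Organization", "Location"],
    begin_labels=["B-PER", "B-ORG", "B-LOC"],
    inside_labels=["I-PER", "I-ORG", "I-LOC"],
)
print(spans)
# [{'start': 0, 'end': 8, 'label': 'Person', 'text': 'John Doe'},
#  {'start': 18, 'end': 24, 'label': 'Organization', 'text': 'OpenAI'}]
```

Pairing `begin_labels` and `inside_labels` with `labels` by position is a design choice of this sketch; it mirrors the ordering used in the example lists above.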