unitxt.span_lableing_operators module
- class unitxt.span_lableing_operators.IobExtractor(data_classification_policy: List[str] = None, _requirements_list: List[str] | Dict[str, str] = [], requirements: List[str] | Dict[str, str] = [], caching: bool = None, apply_to_streams: List[str] = None, dont_apply_to_streams: List[str] = None, labels: List[str] = __required__, begin_labels: List[str] = __required__, inside_labels: List[str] = __required__, outside_label: int = __required__)
Bases: InstanceOperator
An operator that extracts entities from token sequences tagged with the Inside-Outside-Beginning (IOB) convention. It groups tokens into entities based on their IOB tags and assigns each entity one of the predefined labels, such as Person, Organization, or Location.
- Parameters:
labels (List[str]) – A list of entity type labels, e.g., ["Person", "Organization", "Location"].
begin_labels (List[str]) – A list of labels indicating the beginning of an entity, e.g., ["B-PER", "B-ORG", "B-LOC"].
inside_labels (List[str]) – A list of labels indicating the continuation of an entity, e.g., ["I-PER", "I-ORG", "I-LOC"].
outside_label (str) – The label indicating tokens outside of any entity, typically "O".
The extraction process identifies spans of text corresponding to entities and labels them according to their entity type. Each span is annotated with a start and end character offset, the entity text, and the corresponding label.
Example of instantiation and usage:
from unitxt.span_lableing_operators import IobExtractor

operator = IobExtractor(
    labels=["Person", "Organization", "Location"],
    begin_labels=["B-PER", "B-ORG", "B-LOC"],
    inside_labels=["I-PER", "I-ORG", "I-LOC"],
    outside_label="O",
)

instance = {
    # One IOB tag per token: "works" and "at" are outside any entity,
    # and "OpenAI" begins an Organization entity.
    "labels": ["B-PER", "I-PER", "O", "O", "B-ORG"],
    "tokens": ["John", "Doe", "works", "at", "OpenAI"],
}
processed_instance = operator.process(instance)
print(processed_instance["spans"])
# Output: [{'start': 0, 'end': 8, 'text': 'John Doe', 'label': 'Person'}, ...]
For more details on the IOB tagging convention, see: https://en.wikipedia.org/wiki/Inside-outside-beginning_(tagging)
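To illustrate the span extraction described above, the following is a minimal standalone sketch of IOB decoding in plain Python. It is not the unitxt implementation: the helper name `extract_iob_spans` is hypothetical, and the sketch assumes that tokens are joined by single spaces to compute character offsets, that `begin_labels[i]` and `inside_labels[i]` both correspond to `labels[i]`, and that any tag not continuing an open entity (including a stray inside tag) is treated as outside.

```python
from typing import Dict, List, Optional


def extract_iob_spans(
    tokens: List[str],
    tags: List[str],
    labels: List[str],
    begin_labels: List[str],
    inside_labels: List[str],
) -> List[Dict[str, object]]:
    """Hypothetical helper: group IOB-tagged tokens into character spans."""
    # Assumption: the three label lists are aligned by index.
    begin_to_label = dict(zip(begin_labels, labels))
    inside_to_label = dict(zip(inside_labels, labels))

    raw_spans: List[list] = []     # finished [start, end, label] triples
    current: Optional[list] = None  # the span currently being built
    offset = 0                      # character offset of the current token

    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag in begin_to_label:
            # A begin tag closes any open span and opens a new one.
            if current is not None:
                raw_spans.append(current)
            current = [start, end, begin_to_label[tag]]
        elif (
            tag in inside_to_label
            and current is not None
            and current[2] == inside_to_label[tag]
        ):
            # A matching inside tag extends the open span.
            current[1] = end
        else:
            # Outside tag, stray inside tag, or unknown tag: close the span.
            if current is not None:
                raw_spans.append(current)
                current = None
        offset = end + 1  # assumption: tokens are separated by one space

    if current is not None:
        raw_spans.append(current)

    text = " ".join(tokens)
    return [
        {"start": s, "end": e, "text": text[s:e], "label": label}
        for s, e, label in raw_spans
    ]


spans = extract_iob_spans(
    tokens=["John", "Doe", "works", "at", "OpenAI"],
    tags=["B-PER", "I-PER", "O", "O", "B-ORG"],
    labels=["Person", "Organization", "Location"],
    begin_labels=["B-PER", "B-ORG", "B-LOC"],
    inside_labels=["I-PER", "I-ORG", "I-LOC"],
)
print(spans)
```

With these inputs the sketch yields two spans: "John Doe" (characters 0 to 8, Person) and "OpenAI" (characters 18 to 24, Organization). The actual operator may compute offsets differently, for example from the original instance text rather than from space-joined tokens.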