Multi-Modality¶
Note
This tutorial requires a Unitxt installation.
Introduction¶
This tutorial explores multi-modality processing with Unitxt, focusing on handling image-text-to-text datasets and creating an evaluation and inference pipeline. By the end, you’ll be equipped to process complex multi-modal data efficiently.
Part 1: Understanding Image-Text to Text Tasks¶
Image-text to text tasks involve providing a model with a combination of text and images and expecting a textual answer. These tasks are increasingly relevant in modern AI applications.
Tutorial Overview¶
We’ll create an image-text to text evaluation pipeline using Unitxt, concentrating on a document visual question answering (DocVQA) task. This task involves asking questions about images and generating textual answers.
Part 2: Data Preparation¶
Creating a Unitxt DataCard¶
Our first step is to prepare the data using a Unitxt DataCard. If this is your first time adding a DataCard, we recommend reading the Adding Datasets Tutorial.
Dataset Selection¶
We’ll use the doc_vqa
dataset from Hugging Face, formatting it for a question-answering task. Specifically, we’ll use the tasks.qa.with_context.abstractive
task from the Unitxt Catalog.
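Before writing the card, it helps to peek at the raw data. The sketch below is illustrative only; the field names ("image", "qa", "en") are assumptions inferred from the preprocessing steps used in the card later in this tutorial:
# Illustrative only: stream one record from the raw dataset to inspect its schema.
from datasets import load_dataset as hf_load_dataset

raw = hf_load_dataset("cmarkea/doc-vqa", split="train", streaming=True)
first = next(iter(raw))
print(first.keys())           # expected to include "image" and "qa"
print(first["qa"]["en"][:1])  # a list of {"question": ..., "answer": ...} pairs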
DataCard Implementation¶
Our goal in the DataCard is to adjust the data, as it comes from Hugging Face, to the task schema.
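Concretely, the fields the card needs to produce can be sketched as follows (an illustrative shape inferred from the preprocessing steps below, not an authoritative schema dump; the example values are hypothetical):
# Approximate shape of one preprocessed instance (illustrative):
# {
#     "context": ...,                  # the document image, rendered into text by ImageToText
#     "context_type": "image",
#     "question": "What is the invoice number?",  # hypothetical example value
#     "answers": ["145678"],                       # a list of reference answers
# }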
Create a Python file named doc_vqa.py
and implement the DataCard as follows:
from unitxt.blocks import LoadHF, Set, TaskCard
from unitxt.catalog import add_to_catalog
from unitxt.collections_operators import Explode, Wrap
from unitxt.image_operators import ImageToText
from unitxt.operators import Copy
from unitxt.test_utils.card import test_card
card = TaskCard(
    loader=LoadHF(path="cmarkea/doc-vqa"),
    preprocess_steps=[
        "splitters.small_no_dev",
        # Each record holds a list of English QA pairs; explode it into one instance per pair
        Explode(field="qa/en", to_field="pair"),
        Copy(field="pair/question", to_field="question"),
        Copy(field="pair/answer", to_field="answers"),
        # The task expects a list of reference answers
        Wrap(field="answers", inside="list"),
        Set(fields={"context_type": "image"}),
        # Render the image into the textual context so it travels with the prompt
        ImageToText(field="image", to_field="context"),
    ],
    task="tasks.qa.with_context.abstractive",
    templates="templates.qa.with_context.all",
)
The ImageToText Operator¶
The ImageToText
operator is a key component that integrates the image into the text, allowing inference engines to process both elements simultaneously.
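Conceptually, it replaces the image value with a textual stand-in inside the prompt while keeping the actual image attached to the instance, so the inference engine receives both. The sketch below is only an illustration of that idea; the exact placeholder format and media bookkeeping are Unitxt internals:
# Illustrative sketch only -- not the literal Unitxt internals.
# Before ImageToText(field="image", to_field="context"):
#     {"image": <PIL image>, "question": "...", ...}
# After (assumed shape):
#     {"context": "<image placeholder>",    # textual stand-in inside the prompt
#      "media": {"images": [<PIL image>]},  # the real image travels with the instance
#      "question": "...", ...}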
Testing and Catalog Addition¶
Test the card and add it to the catalog:
test_card(card)
add_to_catalog(card, "cards.doc_vqa.en", overwrite=True)
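Running the file (e.g. python doc_vqa.py) lets test_card verify that the card loads and processes a sample of the data against the task schema, and add_to_catalog then registers the card locally under cards.doc_vqa.en, so it can be referenced by name in the next part.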
Part 3: Inference and Evaluation¶
With our data prepared, we can now test model performance.
Pipeline Setup¶
Set up the inference and evaluation pipeline:
from unitxt.api import evaluate, load_dataset
from unitxt.inference import HFLlavaInferenceEngine
# Initialize the inference model
model = HFLlavaInferenceEngine(
model_name="llava-hf/llava-interleave-qwen-0.5b-hf", max_new_tokens=32
)
# Load and prepare the dataset
dataset = load_dataset(
card="cards.doc_vqa.en",
template="templates.qa.with_context.title",
format="formats.models.llava_interleave",
loader_limit=30,
split="test"
)
# Select a subset for testing
dataset = dataset.select(range(5))
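Before running inference, it can help to print the first formatted instance; in the processed dataset, the source field holds the rendered prompt and the target field holds the reference answer:
# Sanity check: inspect the first formatted example before running the model
print(dataset[0]["source"])  # the rendered prompt, including the image placeholder
print(dataset[0]["target"])  # the reference answer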
Executing Inference and Evaluation¶
Run the model and evaluate the results:
# Perform inference
predictions = model(dataset)
# Evaluate the predictions
results = evaluate(predictions=predictions, data=dataset)
# Print the results
print(results.global_scores.summary)
print(results.instance_scores.summary)
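Beyond the aggregate scores, a quick qualitative check is to compare individual predictions with their references. A minimal sketch, assuming the dataset and predictions variables from above are still in scope:
# Compare raw model outputs with the reference answers, instance by instance
for instance, prediction in zip(dataset, predictions):
    print("Prediction:", prediction)
    print("References:", instance["references"])
    print("-" * 40)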
Conclusion¶
You have now successfully implemented an image-text to text evaluation pipeline with Unitxt. This tool enables the processing of complex multi-modal data, opening up new possibilities for AI applications.
We encourage you to explore further by experimenting with different datasets, models, and tasks to fully leverage Unitxt’s capabilities in multi-modal processing.