
Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking.
Why Unitxt?
Unitxt was built by IBM Research to host a large, maintainable collection of evaluation assets. If you care about robust evaluation that lasts, then Unitxt is for you.
The Unitxt catalog is a one-stop shop of well-documented assets for constructing robust evaluation pipelines, including task instructions, data loaders, data type serializers, and inference engines.
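As a quick illustration, catalog assets can be combined directly when loading a dataset. The sketch below is a minimal example, not part of the walkthrough that follows; the specific card and template identifiers (cards.squad, templates.qa.with_context.simple) and the loader_limit value are assumptions that may differ in your installed catalog version.

from unitxt import load_dataset

# Assemble a dataset from catalog assets: the card bundles a data loader and
# task definition, the template renders each instance into a prompt.
# Identifiers below are illustrative; check the catalog for exact names.
dataset = load_dataset(
    card="cards.squad",
    template="templates.qa.with_context.simple",
    loader_limit=100,  # keep the example small
    split="test",
)

# Each instance carries a fully rendered prompt in its "source" field.
print(dataset[0]["source"])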
End-to-end evaluation made simple
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine

# Question-answer dataset
data = [
    {"question": "What is the capital of Texas?", "answer": "Austin"},
    {"question": "What is the color of the sky?", "answer": "Blue"},
]

# Define the task and evaluation metric
task = Task(
    input_fields={"question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],
)

# Create a template to format inputs and outputs
template = InputOutputTemplate(
    instruction="Answer the following question.",
    input_format="{question}",
    output_format="{answer}",
    postprocessors=["processors.lower_case"],
)

# Prepare the dataset
dataset = create_dataset(
    task=task,
    template=template,
    format="formats.chat_api",
    test_set=data,
    split="test",
)

# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)

# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)

# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)
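The inference engine is interchangeable: the same dataset, prediction, and evaluation calls work with remote providers as well as local Hugging Face models. The following sketch assumes that CrossProviderInferenceEngine is available in your installed Unitxt version and that the model and provider names shown are valid for your account; treat them as placeholders.

from unitxt.inference import CrossProviderInferenceEngine

# Assumed example: route generation through a hosted provider instead of a
# local model. Model and provider identifiers below are placeholders.
model = CrossProviderInferenceEngine(
    model="llama-3-8b-instruct", provider="watsonx"
)

# The rest of the pipeline is unchanged.
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)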