Concepts
Unitxt Data Processing Workflow
In the dynamic world of large language models, data introduction methods for processing are diverse. This includes using prompts, specific instructions, pre-loaded examples, or different textual layouts and templates. The key to managing these methods effectively is a standardized approach for communication and sharing data processing strategies, facilitating replicability and adaptability in varied data scenarios.
Unitxt addresses this need with a comprehensive data-to-model processing pipeline. This pipeline comprises various components, each essential in standardizing and streamlining the data processing communication.
Components of the Workflow
Data Point The foundational component, representing raw, unprocessed data awaiting transformation.
Task Acts as a schema for a specific skill in the data, like question answering. It sets the structure and format for data processing, providing a blueprint for operations.
Preprocessing Pipeline A series of operations that modify raw data to align with the task’s schema, preparing it for more detailed processing.
Template Here, the preprocessed data undergoes creative processing. The template, guided by the task schema, shapes the data into a query and response format, readying it for application.
Query and Response Resulting from the template’s processing, these components represent the input (query) and expected output (response), tailored according to the task’s requirements.
Instruction A singular, clear directive outlining the approach to process the query to achieve the desired response, guiding effective data usage.
Demonstration Practical examples of Query and Response pairs, illustrating the execution of the task and the data processing as per the task schema.
Format This component organizes all others into a structured and coherent whole, ensuring the data is ready for its intended application or analysis.
The Unitxt Data Processing Workflow, with its comprehensive components, aims to simplify the transition from raw data to structured, meaningful outputs, enhancing clarity and efficiency in data processing.