Open Australian Legal Qa
cards.rag.response_generation.train.open_australian_legal_qa
type: TaskCard
loader:
  type: LoadHF
  path: umarbutler/open-australian-legal-qa
preprocess_steps:
  - type: SplitRandomMix
    mix:
      train: train[0.5]
      validation: train[0.2]
      test: train[0.3]
  - type: Shuffle
  - type: Copy
    field_to_field:
      source/text: contexts
      answer: reference_answers
      source/citation: contexts_ids
  - type: ListFieldValues
    fields:
      - reference_answers
    to_field: reference_answers
  - type: ListFieldValues
    fields:
      - contexts
    to_field: contexts
  - type: ListFieldValues
    fields:
      - contexts_ids
    to_field: contexts_ids
task: tasks.rag.response_generation
templates:
  default: templates.rag.response_generation.please_respond_chat
Explanation about TaskCard
TaskCard delineates the phases in transforming the source dataset into model input, and specifies the metrics for evaluation of model output.
- Attributes:
loader: specifies the source address and the loading operator that can access that source and transform it into a unitxt multistream.
preprocess_steps: list of unitxt operators to process the data source into model input.
task: specifies the fields (of the already (pre)processed instance) making the inputs, the fields making the outputs, and the metrics to be used for evaluating the model output.
templates: format strings to be applied on the input fields (specified by the task) and the output fields. The template also carries the instructions and the list of postprocessing steps, to be applied to the model output.
Explanation about SplitRandomMix
Splits a multistream into new streams (splits), whose names, source input stream, and number of instances are specified by arg "mix".
The keys of arg "mix" are the names of the new streams; the values are of the form "name-of-source-stream[percentage-of-source-stream]". Each input instance, of any input stream, is selected exactly once for inclusion in any of the output streams.
Examples: When processing a multistream made of two streams whose names are "train" and "test", by SplitRandomMix(mix = { "train": "train[99%]", "validation": "train[1%]", "test": "test" }) the output is a multistream, whose three streams are named "train", "validation", and "test". Output stream "train" is made of randomly selected 99% of the instances of input stream "train", output stream "validation" is made of the remaining 1% instances of input "train", and output stream "test" is made of the whole of input stream "test".
When processing the above input multistream by SplitRandomMix(mix = { "train": "train[50%]+test[0.1]", "validation": "train[50%]+test[0.2]", "test": "test[0.7]" }) the output is a multistream, whose three streams are named "train", "validation", and "test". Output stream "train" is made of randomly selected 50% of the instances of input stream "train" + randomly selected 0.1 (i.e., 10%) of the instances of input stream "test". Output stream "validation" is made of the remaining 50% instances of input "train" + randomly selected 0.2 (i.e., 20%) of the original instances of input "test" that were not selected for output "train", and output stream "test" is made of the remaining instances of input "test".
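To make the disjoint-split behaviour concrete, here is a minimal pure-Python sketch of the mix used by this card (train[0.5], validation[0.2], test[0.3], all drawn from the input "train" stream). It is an illustration only, not unitxt's implementation: the helper name split_by_fractions, the seed parameter, and the toy instances are assumptions made for this example.

```python
import random


def split_by_fractions(instances, fractions, seed=0):
    """Randomly partition `instances` into disjoint named splits.

    `fractions` maps an output split name to the fraction of the input it
    receives, e.g. {"train": 0.5, "validation": 0.2, "test": 0.3}.
    Hypothetical helper for illustration; not part of unitxt.
    """
    if sum(fractions.values()) > 1.0 + 1e-9:
        raise ValueError("fractions must not sum to more than 1")

    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)

    splits = {}
    start = 0
    cumulative = 0.0
    for name, fraction in fractions.items():
        # Cut consecutive slices of the shuffled list, so every instance
        # lands in at most one output split.
        cumulative += fraction
        end = round(cumulative * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    return splits


# Toy stand-in for the input "train" stream.
instances = [{"id": i} for i in range(10)]
splits = split_by_fractions(
    instances, {"train": 0.5, "validation": 0.2, "test": 0.3}
)
sizes = {name: len(items) for name, items in splits.items()}
print(sizes)
```

With ten toy instances the three output splits hold 5, 2, and 3 instances, and no instance appears in more than one split.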
Explanation about Shuffle
Shuffles the order of instances in each page of a stream.
- Args (of superclass):
page_size (int): The size of each page in the stream. Defaults to 1000.
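Page-wise shuffling can be pictured with the following sketch. It assumes that a "page" is simply a consecutive chunk of up to page_size instances; the function shuffle_in_pages and its seed parameter are hypothetical names for this example and do not come from unitxt.

```python
import random
from itertools import islice


def shuffle_in_pages(stream, page_size=1000, seed=0):
    """Yield the instances of `stream`, shuffled within consecutive pages.

    Only one page is held in memory at a time, so instances never move
    across page boundaries. Illustrative sketch, not unitxt's code.
    """
    rng = random.Random(seed)
    iterator = iter(stream)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        rng.shuffle(page)
        yield from page


# With page_size=4, items 0-3 stay in the first page, 4-7 in the second,
# and 8-9 in the last (shorter) page.
shuffled = list(shuffle_in_pages(range(10), page_size=4))
print(shuffled)
```

A consequence of this design is that a stream shorter than page_size is shuffled as a whole, while a longer stream is only shuffled locally.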
Explanation about LoadHF
Loads datasets from the HuggingFace Hub.
It supports loading with or without streaming, and it can filter datasets upon loading.
- Args:
path: The path or identifier of the dataset on the HuggingFace Hub.
name: An optional dataset name.
data_dir: Optional directory to store downloaded data.
split: Optional specification of which split to load.
data_files: Optional specification of particular data files to load.
revision: Optional. The revision of the dataset. Often the commit id. Use in case you want to set the dataset version.
streaming: Bool indicating if streaming should be used.
filtering_lambda: A lambda function for filtering the data after loading.
num_proc: Optional integer to specify the number of processes to use for parallel dataset loading.
- Example:
Loading glue's mrpc dataset
load_hf = LoadHF(path='glue', name='mrpc')
Explanation about Copy
Copies values from specified fields to specified fields.
- Args (of parent class):
field_to_field (Union[List[List], Dict[str, str]]): A list of lists, where each sublist contains the source field and the destination field, or a dictionary mapping source fields to destination fields.
- Examples:
An input instance {"a": 2, "b": 3}, when processed by Copy(field_to_field={"a": "b"}), would yield {"a": 2, "b": 2}, and when processed by Copy(field_to_field={"a": "c"}), would yield {"a": 2, "b": 3, "c": 2}.
With field names containing "/", we can also copy inside the field: Copy(field="a/0", to_field="a") would process instance {"a": [1, 3]} into {"a": 1}.
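The examples above, and the field_to_field mapping used in this card, can be mimicked with a short sketch. It assumes that "/" separates dictionary keys or list indices in the source path and, for simplicity, only supports top-level destination fields. The helpers get_path and copy_fields and the sample record are made up for illustration; they are not unitxt functions or real dataset rows.

```python
import copy


def get_path(instance, path):
    """Resolve a '/'-separated path such as 'source/text' or 'a/0'."""
    value = instance
    for key in path.split("/"):
        # A list is indexed by position, a dict by key.
        value = value[int(key)] if isinstance(value, list) else value[key]
    return value


def copy_fields(instance, field_to_field):
    """Return a copy of `instance` with each source path copied to its
    (top-level) destination field. Illustrative sketch only."""
    result = copy.deepcopy(instance)
    for source, target in field_to_field.items():
        result[target] = copy.deepcopy(get_path(result, source))
    return result


# Made-up record shaped the way the card's paths suggest (nested "source").
raw = {
    "source": {"text": "example passage", "citation": "example citation"},
    "answer": "example answer",
}
mapped = copy_fields(
    raw,
    {
        "source/text": "contexts",
        "answer": "reference_answers",
        "source/citation": "contexts_ids",
    },
)
print(mapped["contexts"], mapped["reference_answers"], mapped["contexts_ids"])
```

The original fields are left in place; the copies only add (or overwrite) the destination fields.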
Explanation about ListFieldValues
Concatenates values of multiple fields into a list, and assigns it to a new field.
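In this card each ListFieldValues step names a single field and writes back to the same field, which appears to have the effect of wrapping a single value in a one-element list. The sketch below illustrates that reading; list_field_values is a hypothetical helper written for this example, not the unitxt operator itself.

```python
def list_field_values(instance, fields, to_field):
    """Collect the values of `fields` into a list stored under `to_field`.

    Illustrative sketch only; returns a new dict and leaves the input intact.
    """
    result = dict(instance)
    result[to_field] = [instance[name] for name in fields]
    return result


# Single-field case, as used in the card: the value becomes a 1-element list.
wrapped = list_field_values(
    {"reference_answers": "example answer"},
    fields=["reference_answers"],
    to_field="reference_answers",
)

# Multi-field case: values are gathered in the order the fields are listed.
combined = list_field_values({"a": 1, "b": 2}, fields=["a", "b"], to_field="ab")
print(wrapped, combined)
```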
References: templates.rag.response_generation.please_respond_chat, tasks.rag.response_generation