@AmrMKayid Here is the complete PR.
Closed it by accident!
data/project_from_psrc.py
Outdated
| @@ -0,0 +1,227 @@ | |||
| import os | |||
I'm curious why the name is project_from_psrc.py? 👀
Project from promptsource. If it's not clear we can rename it as project_from_promptsource.py. But I like the psrc short form. :D
I think project_from_promptsource.py is better, let's change it pls
Co-authored-by: Amr Kayid <amrmkayid@gmail.com>
…:for-ai/instruct-multilingual into sbmaruf/project_instruct_data_using_psrc
data/project_from_psrc.py
Outdated
| executor.map( | ||
| export_dataset_func, | ||
| [ | ||
| prompted_sample_gen_io[0] | ||
| for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list | ||
| ], # dataset_output_dir | ||
| [ | ||
| prompted_sample_gen_io[1] | ||
| for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list | ||
| ], # dataset_name_or_path | ||
| [ | ||
| prompted_sample_gen_io[2] | ||
| for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list | ||
| ], # dataset_config | ||
| [ | ||
| prompted_sample_gen_io[3] | ||
| for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list | ||
| ], # psrc_prompt_template_signature | ||
| [ | ||
| prompted_sample_gen_io[4] | ||
| for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list | ||
| ], # prompt_template | ||
| [ | ||
| prompted_sample_gen_io[5] | ||
| for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list | ||
| ], # dataset | ||
| [ | ||
| prompted_sample_gen_io[6] | ||
| for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list | ||
| ], # args.add_source_metadata | ||
| [ | ||
| prompted_sample_gen_io[7] | ||
| for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list | ||
| ], # args.highlight_variables | ||
| ), | ||
| total=len(args.dataset_name_or_paths), | ||
| ): |
can we please change this? 🥺
| executor.map(export_dataset_func, zip(*prompted_sample_gen_io_tuple_list)), | |
| total=len(args.dataset_name_or_paths), | |
| ): |
I didn't want to write map because it was difficult to debug.
I think it should be executor.map(export_dataset_func, *zip(*prompted_sample_gen_io_tuple_list)),
Can you recheck?
Already updated that in the code.
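For anyone following the `zip` discussion above, here is a minimal sketch of why the extra `*` matters. `export_stub` and `task_args` are hypothetical stand-ins (three parameters instead of the eight in the PR), and a `ThreadPoolExecutor` is assumed only to keep the example small; the script may use a different executor.

```python
from concurrent.futures import ThreadPoolExecutor


def export_stub(output_dir, name, config):
    # Hypothetical stand-in for export_dataset_func, reduced to three parameters.
    return f"{output_dir}/{name}_{config}.jsonl"


# One tuple of arguments per task, shaped like prompted_sample_gen_io_tuple_list.
task_args = [
    ("out", "glue", "cola"),
    ("out", "glue", "sst2"),
]


def run_unpacked(tasks):
    # zip(*tasks) transposes rows into columns; the outer * hands each column
    # to executor.map as a separate iterable, one per function parameter.
    with ThreadPoolExecutor(max_workers=2) as executor:
        return list(executor.map(export_stub, *zip(*tasks)))


def run_single_iterable(tasks):
    # Without the outer *, executor.map receives ONE iterable of column tuples,
    # so every call gets a single positional argument and fails.
    with ThreadPoolExecutor(max_workers=2) as executor:
        try:
            return list(executor.map(export_stub, zip(*tasks)))
        except TypeError as err:
            return err


if __name__ == "__main__":
    print(run_unpacked(task_args))
    print(repr(run_single_iterable(task_args)))
```

The failure in the second variant only surfaces when the results are consumed, which is probably part of why `map` felt hard to debug here.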
| def xp3_export_dataset( | ||
| dataset_output_dir: str, | ||
| dataset_name: str, | ||
| dataset_config: str, | ||
| psrc_prompt_template_signature: str, | ||
| prompt_template: Type[Template], | ||
| dataset: Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset], | ||
| add_source_metadata: bool = False, | ||
| highlight_variables: bool = False, | ||
| lang: str = 'en' | ||
| ) -> str: | ||
| """ | ||
| Given a `hf-dataset` (arg: dataset) and a prompt template (arg: prompt_template), | ||
| project/transform samples from all the splits of dataset (arg: dataset) into an instruction format and | ||
| writes in the disk (arg: dataset_output_dir) | ||
|
|
||
| Args: | ||
| dataset_output_dir (str): Path to the output directory where data will be saved. | ||
| dataset_name (str): Name of the hf-dataset. | ||
| dataset_config (str): Name of the hf-dataset config. | ||
| psrc_prompt_template_signature (str): Name of the dataset & dataset-config for which prompts are written. | ||
| prompt_template (Type[Template]): Transformation/projection module that will take a sample from arg:dataset and transform it to an instruction. | ||
| dataset (Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]): huggingface dataset that will be transformed into an instruction dataset. | ||
| add_source_metadata (bool = False): If True, all the data columns from arg:dataset will be saved as meta information with the instruction dataset. | ||
| highlight_variables (bool = False): If True, prompt tokens and dataset tokens will be highlighted differently. This metadata will be saved as `highlighted_source` & `highlighted_target`. | ||
| lang (str = 'en'): language name of the dataset | ||
| """ |
@sbmaruf what is the difference between this method and export_dataset? I can see that both are very similar.
At first, I wrote export_dataset where I exported all possible metadata while projecting data with templates. xp3_export_dataset doesn't export all possible metadata and strictly follows xP3 format. Please note that the xP3 projection doesn't contain any metadata.
Feature
A sample script to run the code.
Output folder structure
Output folder of the above run.
tree $SRC_DATA_FOLDER/glue/
train, validation, test (huggingface datasets split name).
glue_cola.editing.jsonl means the dataset is "glue", the dataset config is "cola" and the prompt name is "editing".
… glue_cola.editing.jsonl) is a prompted sample.
Output Format
A sample json data in the jsonl file,
The definition of each of the keys in the data,
… jsonl file contains json data which has a unique id within the jsonl file. (datatype: string/int)
… psrc_prompt_template_signature there could be many prompt templates. prompt_name refers to each of those prompt templates. (datatype: string)
… sentence. We save those data here. (datatype: from huggingface data source)
… label. We save those data here. (datatype: from huggingface data source)
Note: Different datasets may have a different number of "src_meta_*" keys. It depends on the original huggingface dataset columns.
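Because the number of "src_meta_*" keys varies per dataset, a consumer of these files might separate them generically instead of hard-coding column names. The following is only a sketch: `read_jsonl`, `split_source_metadata`, and the sample record (including the key names `id`, `src_meta_sentence` and `src_meta_label`) are hypothetical and not taken from the PR's code.

```python
import io
import json


def read_jsonl(handle):
    # Yield one parsed json object per non-empty line of a jsonl stream.
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_source_metadata(record):
    # Separate the "src_meta_*" keys (original dataset columns) from the rest.
    meta = {k: v for k, v in record.items() if k.startswith("src_meta_")}
    rest = {k: v for k, v in record.items() if not k.startswith("src_meta_")}
    return rest, meta


# Hypothetical one-line jsonl content; the real files may use other keys.
sample_file = io.StringIO(
    '{"id": 0, "prompt_name": "editing", '
    '"src_meta_sentence": "A sentence.", "src_meta_label": 1}\n'
)

records = list(read_jsonl(sample_file))
rest, meta = split_source_metadata(records[0])

if __name__ == "__main__":
    print(rest)
    print(meta)
```

Matching on the key prefix keeps the reader independent of which columns a given huggingface dataset happens to have.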