sbmaruf/project instruct data using psrc by sbmaruf · Pull Request #1 · Cohere-Labs-Community/instruct-multilingual

sbmaruf · 2023-03-08T20:00:51Z

Feature

Project/transform structured dataset to an instruction dataset.
Multiprocessing generation

A sample script to run the code.

DUMP_FOLDER='' # set this to your desired output path
SRC_DATA_FOLDER=$DUMP_FOLDER/projection_from_psrc
mkdir -p $SRC_DATA_FOLDER
mkdir -p $SRC_DATA_FOLDER/cache

python data/project_from_psrc.py \
--dataset-name-or-paths glue glue glue glue glue \
--dataset-configs cola sst2 mrpc qqp stsb \
--prompt-templates-configs None None None None None \
--cache-dir $SRC_DATA_FOLDER/cache \
--output-dir $SRC_DATA_FOLDER \
--highlight-variables \
--add-source-metadata \
--num-proc 16

Output folder structure

Output folder of the above run, as listed by tree $SRC_DATA_FOLDER/glue/:

├── cola
│   ├── test
│   │   ├── glue_cola.editing.jsonl
│   │   ├── glue_cola.Following_sentence_acceptable.jsonl
│   │   ├── glue_cola.is_this_correct.jsonl
│   │   ├── glue_cola.Make_sense_yes_no.jsonl
│   │   └── glue_cola.Previous_sentence_acceptable.jsonl
│   ├── train
│   │   ├── glue_cola.editing.jsonl
│   │   ├── glue_cola.Following_sentence_acceptable.jsonl
│   │   ├── glue_cola.is_this_correct.jsonl
│   │   ├── glue_cola.Make_sense_yes_no.jsonl
│   │   └── glue_cola.Previous_sentence_acceptable.jsonl
│   └── validation
│       ├── glue_cola.editing.jsonl
│       ├── glue_cola.Following_sentence_acceptable.jsonl
│       ├── glue_cola.is_this_correct.jsonl
│       ├── glue_cola.Make_sense_yes_no.jsonl
│       └── glue_cola.Previous_sentence_acceptable.jsonl
├── mrpc
│   ├── test
│   │   ├── glue_mrpc.equivalent.jsonl
│   │   ├── glue_mrpc.generate_paraphrase.jsonl
│   │   ├── glue_mrpc.generate_sentence.jsonl
│   │   ├── glue_mrpc.paraphrase.jsonl
│   │   ├── glue_mrpc.replace.jsonl
│   │   ├── glue_mrpc.same_thing.jsonl
│   │   └── glue_mrpc.want_to_know.jsonl
│   ├── train
│   │   ├── glue_mrpc.equivalent.jsonl
│   │   ├── glue_mrpc.generate_paraphrase.jsonl
│   │   ├── glue_mrpc.generate_sentence.jsonl
│   │   ├── glue_mrpc.paraphrase.jsonl
│   │   ├── glue_mrpc.replace.jsonl
│   │   ├── glue_mrpc.same_thing.jsonl
│   │   └── glue_mrpc.want_to_know.jsonl
│   └── validation
│       ├── glue_mrpc.equivalent.jsonl
│       ├── glue_mrpc.generate_paraphrase.jsonl
│       ├── glue_mrpc.generate_sentence.jsonl
│       ├── glue_mrpc.paraphrase.jsonl
│       ├── glue_mrpc.replace.jsonl
│       ├── glue_mrpc.same_thing.jsonl
│       └── glue_mrpc.want_to_know.jsonl
├── qqp
│   ├── test
│   │   ├── glue_qqp.answer.jsonl
│   │   ├── glue_qqp.duplicate.jsonl
│   │   ├── glue_qqp.duplicate_or_not.jsonl
│   │   ├── glue_qqp.meaning.jsonl
│   │   ├── glue_qqp.quora.jsonl
│   │   └── glue_qqp.same_thing.jsonl
│   ├── train
│   │   ├── glue_qqp.answer.jsonl
│   │   ├── glue_qqp.duplicate.jsonl
│   │   ├── glue_qqp.duplicate_or_not.jsonl
│   │   ├── glue_qqp.meaning.jsonl
│   │   ├── glue_qqp.quora.jsonl
│   │   └── glue_qqp.same_thing.jsonl
│   └── validation
│       ├── glue_qqp.answer.jsonl
│       ├── glue_qqp.duplicate.jsonl
│       ├── glue_qqp.duplicate_or_not.jsonl
│       ├── glue_qqp.meaning.jsonl
│       ├── glue_qqp.quora.jsonl
│       └── glue_qqp.same_thing.jsonl
├── sst2
│   ├── test
│   │   ├── glue_sst2.following_positive_negative.jsonl
│   │   ├── glue_sst2.happy_or_mad.jsonl
│   │   ├── glue_sst2.positive_negative_after.jsonl
│   │   ├── glue_sst2.review.jsonl
│   │   └── glue_sst2.said.jsonl
│   ├── train
│   │   ├── glue_sst2.following_positive_negative.jsonl
│   │   ├── glue_sst2.happy_or_mad.jsonl
│   │   ├── glue_sst2.positive_negative_after.jsonl
│   │   ├── glue_sst2.review.jsonl
│   │   └── glue_sst2.said.jsonl
│   └── validation
│       ├── glue_sst2.following_positive_negative.jsonl
│       ├── glue_sst2.happy_or_mad.jsonl
│       ├── glue_sst2.positive_negative_after.jsonl
│       ├── glue_sst2.review.jsonl
│       └── glue_sst2.said.jsonl
└── stsb
    ├── test
    │   ├── glue_stsb.examples.jsonl
    │   ├── glue_stsb.rank.jsonl
    │   ├── glue_stsb.rate.jsonl
    │   ├── glue_stsb.score.jsonl
    │   └── glue_stsb.similarity.jsonl
    ├── train
    │   ├── glue_stsb.examples.jsonl
    │   ├── glue_stsb.rank.jsonl
    │   ├── glue_stsb.rate.jsonl
    │   ├── glue_stsb.score.jsonl
    │   └── glue_stsb.similarity.jsonl
    └── validation
        ├── glue_stsb.examples.jsonl
        ├── glue_stsb.rank.jsonl
        ├── glue_stsb.rate.jsonl
        ├── glue_stsb.score.jsonl
        └── glue_stsb.similarity.jsonl

Each task has three folders, train, validation, and test, named after the Hugging Face datasets splits.
Each file in a split folder is the dataset projected with one prompt template.
The file name glue_cola.editing.jsonl means the dataset is "glue", the dataset config is "cola", and the prompt name is "editing".
Each line of a file (e.g., glue_cola.editing.jsonl) is one prompted sample.
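
The naming convention above can be turned back into its parts when consuming the output. The following is a minimal sketch, not part of this PR: `ProjectedFile` and `parse_projected_path` are hypothetical names, and it assumes the `<config>/<split>/<dataset>_<config>.<prompt>.jsonl` layout shown in the tree above.

```python
from pathlib import Path
from typing import NamedTuple


class ProjectedFile(NamedTuple):
    """Hypothetical container for the parts encoded in an output path."""
    dataset_name: str
    dataset_config: str
    prompt_name: str
    split: str


def parse_projected_path(path: str) -> ProjectedFile:
    """Split '<config>/<split>/<dataset>_<config>.<prompt>.jsonl' into its parts."""
    p = Path(path)
    if not p.name.endswith(".jsonl"):
        raise ValueError(f"not a jsonl file: {path}")
    split = p.parent.name          # e.g. "train"
    config = p.parent.parent.name  # e.g. "cola"
    stem = p.name[: -len(".jsonl")]  # e.g. "glue_cola.editing"
    prefix, sep, prompt_name = stem.partition(".")
    suffix = "_" + config
    if not sep or not prefix.endswith(suffix):
        raise ValueError(f"unexpected file name: {p.name}")
    # The config is known from the folder name, so strip "_<config>"
    # from the prefix to recover the dataset name.
    dataset_name = prefix[: -len(suffix)]
    return ProjectedFile(dataset_name, config, prompt_name, split)


info = parse_projected_path("glue/cola/train/glue_cola.editing.jsonl")
print(info.dataset_name, info.dataset_config, info.prompt_name, info.split)
```

Reading the config from the folder name rather than the file name avoids guessing where the dataset name ends when either part contains underscores.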

Output Format

A sample JSON record from a jsonl file:

{"id": 0, 
"source": "I'm copy-editing a story for publication. It has the following sentence in it:\nOur friends won't buy this analysis, let alone the next one we propose.\nDoes this sentence make sense and is it grammatically correct? Please answer yes or no.", 
"target": "yes", 
"psrc_prompt_template_signature": "glue/cola", 
"prompt_name": "editing", 
"prompt_answer_choice_list": ["no", "yes"], 
"dataset_name": "glue", 
"dataset_config": "cola", 
"split": "train", 
"metrics": ["Accuracy"], 
"original_task": true, 
"choices_in_prompt": true, 
"languages": ["en"], 
"highlighted_source": "I'm copy-editing a story for publication. It has the following sentence in it:\n<span style='color: #F08080'>Our friends won't buy this analysis, let alone the next one we propose.</span>\nDoes this sentence make sense and is it grammatically correct? Please answer <span style='color: #F08080'>yes or no</span>.", 
"highlighted_target": "<span style='color: #F08080'>yes</span>", 
"src_meta_sentence": "Our friends won't buy this analysis, let alone the next one we propose.", 
"src_meta_label": 1, 
"src_meta_idx": 0
}

The definition of each key in the data:

id: A unique id for the sample. Each line of the jsonl file is a JSON object whose id is unique within that file. (datatype: string/int)
source: Projected input for the language model. This is the instruction. (datatype: string)
target: Projected output for the language model. This is the gold response. (datatype: string)
psrc_prompt_template_signature: Prompt template signature from the promptsource repository. A set of prompt templates is usually written for a task (e.g., glue/cola, glue/mrpc), and this signature usually refers to that task. (datatype: string)
prompt_name: Name of the individual prompt template. There can be many prompt templates under one psrc_prompt_template_signature; prompt_name identifies one of them. (datatype: string)
prompt_answer_choice_list: Names of all potential outcomes. This field is often empty, especially for generative tasks; only categorical tasks have it (e.g., [yes, no], [True, False], [A, B, C, D]). (datatype: list of strings)
dataset_name: Name of the Hugging Face dataset. (datatype: string)
dataset_config: Subset name of the Hugging Face dataset. (datatype: string)
split: Split name (e.g., train, validation, test). (datatype: string)
metrics: Metrics used to evaluate the response. (datatype: list of strings)
original_task: Whether the prompted sample (source, target) corresponds to the original task the dataset was created for. (datatype: True/False)
choices_in_prompt: Whether the answer choices are stated in the prompt itself, following the promptsource template metadata. (datatype: True/False)
languages: The languages of the prompt template (not the dataset). (datatype: list of strings)
highlighted_source: The source with highlighting that marks which input tokens come from the prompt template and which come from the original dataset. This can be used to differentiate prompt tokens from input tokens. (datatype: string)
highlighted_target: The target with the same highlighting applied to the response/output tokens. This can be used to differentiate prompt tokens from dataset tokens. (datatype: string)
src_meta_sentence: The original Hugging Face dataset has a column named sentence; its value is saved here. (datatype: from huggingface data source)
src_meta_label: The original Hugging Face dataset has a column named label; its value is saved here. (datatype: from huggingface data source)
src_meta_idx: The index of the sample in the original Hugging Face dataset. (datatype: from huggingface data source)

Note: Different datasets may have a different number of "src_meta_*" keys. It depends on the original huggingface dataset columns.
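
As an illustration of how these keys might be consumed, here is a minimal sketch that is not part of this PR. The helper names (`read_jsonl`, `source_metadata`, `highlighted_spans`, `strip_highlight`) are hypothetical, and the span pattern assumes the `<span style='color: ...'>` markup seen in the sample record above; other outputs may use different markup.

```python
import json
import re
from typing import Any, Dict, Iterator, List

# Assumed markup, taken from the sample record: <span style='color: #F08080'>...</span>
SPAN_RE = re.compile(r"<span style='color: [^']*'>(.*?)</span>", re.DOTALL)


def read_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one prompted sample per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def source_metadata(record: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the src_meta_* keys, whose number varies per dataset."""
    prefix = "src_meta_"
    return {k[len(prefix):]: v for k, v in record.items() if k.startswith(prefix)}


def highlighted_spans(text: str) -> List[str]:
    """Return the substrings wrapped in highlight spans."""
    return SPAN_RE.findall(text)


def strip_highlight(text: str) -> str:
    """Remove the highlight markup, leaving the plain text."""
    return SPAN_RE.sub(r"\1", text)


record = {
    "target": "yes",
    "highlighted_target": "<span style='color: #F08080'>yes</span>",
    "src_meta_label": 1,
    "src_meta_idx": 0,
}
print(source_metadata(record))
print(highlighted_spans(record["highlighted_target"]))
print(strip_highlight(record["highlighted_target"]) == record["target"])
```

Stripping the markup from `highlighted_source` should give back `source`, which makes a cheap consistency check when `--highlight-variables` is enabled.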

sbmaruf · 2023-03-08T20:10:57Z

@AmrMKayid Here is the complete PR.

sbmaruf · 2023-03-08T20:11:32Z

Closed it by accident!

data/project_from_psrc.py

README.md

AmrMKayid · 2023-03-09T14:12:00Z

data/project_from_psrc.py

@@ -0,0 +1,227 @@
+import os


I'm curious why the name is project_from_psrc.py ? 👀

Project from promptsource. If it's not clear we can rename it as project_from_promptsource.py. But I like the psrc short form. :D

I think project_from_promptsource.py is better, let's change it pls

Co-authored-by: Amr Kayid <amrmkayid@gmail.com>

…:for-ai/instruct-multilingual into sbmaruf/project_instruct_data_using_psrc

AmrMKayid · 2023-04-27T19:44:58Z

data/project_from_psrc.py

+			executor.map(
+				export_dataset_func,
+				[
+					prompted_sample_gen_io[0]
+					for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
+				],  # dataset_output_dir
+				[
+					prompted_sample_gen_io[1]
+					for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
+				],  # dataset_name_or_path
+				[
+					prompted_sample_gen_io[2]
+					for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
+				],  # dataset_config
+				[
+					prompted_sample_gen_io[3]
+					for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
+				],  # psrc_prompt_template_signature
+				[
+					prompted_sample_gen_io[4]
+					for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
+				],  # prompt_template
+				[
+					prompted_sample_gen_io[5]
+					for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
+				],  # dataset
+				[
+					prompted_sample_gen_io[6]
+					for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
+				],  # args.add_source_metadata
+				[
+					prompted_sample_gen_io[7]
+					for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
+				],  # args.highlight_variables
+			),
+			total=len(args.dataset_name_or_paths),
+		):


can we please change this? 🥺

Suggested change

executor.map(export_dataset_func, zip(*prompted_sample_gen_io_tuple_list)),
total=len(args.dataset_name_or_paths),
):

I didn't want to write map because it was difficult to debug.
I think it should be executor.map(export_dataset_func, *zip(*prompted_sample_gen_io_tuple_list)),
Can you recheck?
Already updated that in the code.
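
The difference between the two spellings discussed above can be checked in isolation. This is a minimal sketch under stated assumptions: `export` is a three-argument stand-in for `export_dataset_func` (the real function takes eight arguments), `jobs` stands in for `prompted_sample_gen_io_tuple_list`, and `ThreadPoolExecutor` is used only to keep the example self-contained; the executor used in the PR is not shown in this hunk.

```python
from concurrent.futures import ThreadPoolExecutor


def export(output_dir: str, name: str, config: str) -> str:
    """Stand-in for export_dataset_func, which takes more arguments."""
    return f"{output_dir}/{name}/{config}"


# One tuple of arguments per job, like prompted_sample_gen_io_tuple_list.
jobs = [("out", "glue", "cola"), ("out", "glue", "sst2"), ("out", "glue", "mrpc")]

# zip(*jobs) transposes rows into columns: one tuple per parameter.
columns = list(zip(*jobs))

with ThreadPoolExecutor(max_workers=2) as executor:
    # The leading * passes each column as a separate iterable, so map calls
    # export(output_dir, name, config) once per job.
    results = list(executor.map(export, *zip(*jobs)))

    # Without the leading *, map receives a single iterable of columns and
    # calls export(column) with one argument; the TypeError surfaces when
    # the results are consumed.
    try:
        list(executor.map(export, zip(*jobs)))
        unstarred_error = None
    except TypeError as exc:
        unstarred_error = exc

print(results)
print(type(unstarred_error).__name__)
```

Because `executor.map` takes one iterable per positional parameter, `*zip(*rows)` is the compact equivalent of the eight list comprehensions in the original hunk.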

AmrMKayid · 2023-04-27T19:51:09Z

data/project_from_psrc.py

+def xp3_export_dataset(
+	dataset_output_dir: str,
+	dataset_name: str,
+	dataset_config: str,
+	psrc_prompt_template_signature: str,
+	prompt_template: Type[Template],
+	dataset: Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset],
+	add_source_metadata: bool = False,
+	highlight_variables: bool = False,
+	lang: str = 'en'
+) -> str:
+	"""
+	Given a `hf-dataset` (arg: dataset) and a prompt template (arg: prompt_template),
+	project/transform samples from all the splits of dataset (arg: dataset) into an instruction format and
+	writes in the disk (arg: dataset_output_dir)
+
+	Args:
+		dataset_output_dir (str): Path to the output directory where data will be saved.
+		dataset_name (str): Name of the hf-dataset.
+		dataset_config (str): Name of the hf-dataset config.
+		psrc_prompt_template_signature (str): Name of the dataset & dataset-config for which prompts are written for.
+		prompt_template (Type[Template]): Transformation/projection module that will take a sample from arg:dataset and transform it to an instruction.
+		dataset (Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]): huggingface dataset that will be transformed into an instruction dataset.
+		add_source_metadata (bool = False): If True, all the data columns from the arg:dataset will be saved as meta information with the instruction dataset.
+		highlight_variables (bool = False): If True, prompt tokens and dataset tokens will be highlighted differently. This metadata will be saved as `highlighted_source` & `highlighted_target`.
+		lang (str = 'en'): language name of the dataset
+	"""


@sbmaruf what is the difference between this method and export_dataset? I can see that both are very similar.

At first, I wrote export_dataset where I exported all possible metadata while projecting data with templates. xp3_export_dataset doesn't export all possible metadata and strictly follows xP3 format. Please note that the xP3 projection doesn't contain any metadata.

AmrMKayid

Left a few comments for code quality improvements; otherwise LGTM, thanks @sbmaruf!

sbmaruf added 5 commits March 9, 2023 03:10

promptsource dependency added

0ac6fe4

Project nlp dataset to a instruction dataset using promptsource.

7ffc88b

sample bash script for running data/project_from_psrc.py

fb3fbe7

update readme

73c2610

sync naming of argument and module var

f14a3d1

sbmaruf requested a review from AmrMKayid March 8, 2023 20:10

sbmaruf closed this Mar 8, 2023

sbmaruf reopened this Mar 8, 2023

Kowasaki approved these changes Mar 8, 2023

fix: typo & --help

93f2bfe