Commit 3cebe25

feat: add instructions to reproduce synthesis of DART-Math-Prop2Diff
1 parent: 0f3e45d

2 files changed (+36, -2 lines)


README.md (26 additions, 1 deletion)

```diff
@@ -402,7 +402,7 @@ baseline** in the paper, just set

 <summary>

-The off-the-shelf command to reproduce the data synthesis of the Vanilla
+The off-the-shelf command to reproduce the synthesis of the Vanilla
 Rejection Tuning (VRT) baseline in the paper
 </summary>

@@ -419,6 +419,31 @@ CUDA_VISIBLE_DEVICES="0" python pipeline/gen.py \

 </details>

+<details>
+
+<summary>
+
+Sorry that it still needs some manual effort to reproduce the data
+synthesis of `DART-Math-Prop2Diff`. For now, please follow the
+instructions in the paper
+</summary>
+
+1. Calculate the "fail rate" (`1 - pass_rate`) for each query in the MATH
+   and GSM8K training sets (see the `pass_rate` field of the query
+   information in
+   [MATH](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-math-query-info)
+   and
+   [GSM8K](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-gsm8k-query-info)).
+2. Calculate the target number of correct responses for each query in
+   the final training set. Note that we try to ensure at least one
+   correct response for each query in the `DART-Math` datasets, which
+   you can implement by rounding **up** when calculating the response
+   number for each query.
+3. Sample responses for each query until the target number of correct
+   ones is met (a target proportional to the query's "fail rate").
+
+</details>
+
 After the synthesis, you can use the [curation
 script](pipeline/curate.py) to curate the final dataset.
```

nbs/index.ipynb (10 additions, 1 deletion)

````diff
@@ -404,7 +404,7 @@
 "To reproduce the data synthesis of the **Vanilla Rejection Tuning (VRT) baseline** in the paper, just set `--max_n_trials 52 --min_n_corrects 0`.\n",
 "\n",
 "<details>\n",
-"<summary>The off-the-shelf command to reproduce the data synthesis of the Vanilla Rejection Tuning (VRT) baseline in the paper</summary>\n",
+"<summary>The off-the-shelf command to reproduce the synthesis of the Vanilla Rejection Tuning (VRT) baseline in the paper</summary>\n",
 "```shell\n",
 "CUDA_VISIBLE_DEVICES=\"0\" python pipeline/gen.py \\\n",
 " --gen_save_path \"data/res/dart-math-uniform.jsonl\" \\\n",
@@ -418,6 +418,15 @@
 "\n",
 "</details>\n",
 "\n",
+"<details>\n",
+"<summary>Sorry that it still needs some manual effort to reproduce the data synthesis of `DART-Math-Prop2Diff`. For now, please follow the instructions in the paper</summary>\n",
+"\n",
+"1. Calculate the \"fail rate\" (`1 - pass_rate`) for each query in the MATH and GSM8K training sets (see the `pass_rate` field of the query information in [MATH](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-math-query-info) and [GSM8K](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-gsm8k-query-info)).\n",
+"2. Calculate the target number of correct responses for each query in the final training set. Note that we try to ensure at least one correct response for each query in the `DART-Math` datasets, which you can implement by rounding **up** when calculating the response number for each query.\n",
+"3. Sample responses for each query until the target number of correct ones is met (a target proportional to the query's \"fail rate\").\n",
+"\n",
+"</details>\n",
+"\n",
 "After the synthesis, you can use the [curation script](pipeline/curate.py) to curate the final dataset.\n"
 ]
 },
````
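Step 3 of the instructions above (sampling responses for each query until its target number of correct ones is collected) can be sketched as a small loop. `mock_generate` and `is_correct` are hypothetical stand-ins for the actual model sampling and answer checking (cf. `pipeline/gen.py`), used here only to make the sketch self-contained:

```python
import random

def is_correct(response: str, reference: str) -> bool:
    # Hypothetical checker: a real pipeline would extract and compare
    # the final answer rather than match the raw string suffix.
    return response.endswith(reference)

def mock_generate(reference: str) -> str:
    # Hypothetical generator standing in for actual model sampling;
    # it returns a correct or an incorrect response at random.
    return random.choice(
        [f"... so the answer is {reference}", "... so the answer is 0"]
    )

def sample_until_target(reference: str, n_corrects: int,
                        max_trials: int = 10_000) -> list[str]:
    """Sample responses until `n_corrects` correct ones are collected,
    giving up after `max_trials` attempts."""
    corrects: list[str] = []
    for _ in range(max_trials):
        if len(corrects) >= n_corrects:
            break
        response = mock_generate(reference)
        if is_correct(response, reference):
            corrects.append(response)
    return corrects

random.seed(0)  # deterministic for the example
responses = sample_until_target("42", n_corrects=3)
```

The `max_trials` cap plays the same role as the `--max_n_trials` flag shown for the VRT baseline above: it bounds the cost on queries whose correct answer the model rarely (or never) produces.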
