Scripts to collect evaluation results and create LaTeX table

mseeger · mseeger · commit 5eb59b0bffe7 · 2026-05-16T20:31:05.000+02:00
diff --git a/ai_dev/create_result_table.md b/ai_dev/create_result_table.md
@@ -0,0 +1,131 @@
+# Write `create_result_table.py` script
+
+## Initial prompt
+
+I need a Python script which collects evaluation results for a number of datasets
+and cases, computes averages, and writes LaTeX code for a table containing these
+average results.
+
+* I started a script in `~/git/keys_values/keys_values/scripts/create_result_table.py`.
+  Extend this file by adding the `main` function.
+* For each `(dataset, case), dataset in datasets,
+  case in [x[0] for x in cases]`, there is a CSV
+  file `base_path / dataset / case / EVAL_METRICS_ALL_FILENAME`.
+  The result file has columns `idx, task, sub_exact_match`. The `sub_exact_match`
+  column contains metric values: average them over different `idx` for the same
+  `task`.
+* The result table has one column for each `dataset`, one row for each `case`.
+* A cell can have several rows, one for each distinct `task` value contained in
+  the result file for `(dataset, case)`. These `task` values can be different for
+  every cell.
+* The rows in a cell for `(dataset, case)` should be
+  f"{short_task}:{avg_sub_exact_match:.4f}", where `avg_sub_exact_match` is the
+  average `sub_exact_match` value for some `task`, and `short_task = "fin"` if
+  `task == "final"`, `short_task = "120"` if `task == "step-000120"` (the last 3
+  digits).
+* If there is no result file for a combination `(dataset, case)`, the corresponding
+  cell is left empty.
+* Label the columns with the `dataset` name, stripping off "helmet_"
+* Label the rows with `x[1]` for the tuple `x` in `cases`.
+* The `main` function should write LaTeX code for the table into the file
+  `result_path`.
+
+### Answer
+
+Done. The main function:
+
+* Derives base_path from result_path.parent
+* For each `(dataset, case)` pair, reads the CSV if it exists, groups by task, averages sub_exact_match, and formats each row as short_task:avg
+* Maps task names via _short_task: "final" → "fin", others → last 3 digits
+* Uses `\makecell{...}` for cells with multiple task rows (requires \usepackage{makecell} in the LaTeX preamble)
+* Writes the full `tabular` environment to `result_path`
+
+
+## Comment
+
+The column and row labels may contain "_", which you need to change to "\_" in LaTeX. Could you please fix this?
+
+### Answer
+
+Fixed — underscores in both column and row labels are now escaped as \_.
+
+
+## Comment
+
+I'd like to have a few changes:
+
+* Rows (for `case` values) should be separated by horizontal lines. The top row
+  with labels should be separated from the others by a double horizontal line, and
+  some space below this double line.
+* For the column labels, please also strip away f"_{dataset_size}" from the right.
+* The multi-row content of a cell should not be centered, but aligned on top, the
+  same for the row labels.
+* The different rows in a cell should be aligned. Both values before and after ":"
+  should be aligned on the left. Also, the row with "fin" should always be last, and
+  others should be sorted by the number before ":"
+* Please use a smaller fontsize for text in cells than for the row and column labels.
+
+### Answer
+
+Here's what changed:
+
+* Row separators: each data row is followed by \hline; the header is separated by \hline\hline with \noalign{\smallskip} below it for extra space.
+* Column label stripping: uses rsplit("_", 1)[0] to remove the trailing _{dataset_size} suffix before also removing the helmet_ prefix.
+* Top alignment: row labels use \makecell[lt]{...} (top-left); multi-row cells use \begin{tabular}[t]{...} (top-aligned).
+* Cell alignment: multi-row cells render as a nested tabular with two columns (@{}l@{}l@{}), so the short task keys and the :value parts are left-aligned independently.
+* Sort order: non-fin entries are sorted numerically by their 3-digit suffix; fin is always last.
+* Font size: cell content is wrapped in {\small ...} while labels remain at normal size.
+
+
+## Comment
+
+Just a few extra changes:
+
+* Add some space above and below each horizontal line.
+* Multiply each average value (number after ":") by 100, so they become percentages.
+  The format should be f"{avg_sub_exact_match:.2f}"
+* For the rows in cells, insert a small space before and after the ":"
+
+### Answer
+
+Three changes made:
+
+* Spacing around \hline: every rule is now wrapped with \noalign{\smallskip} on both sides; the double rule after the header keeps the same pattern.
+* Percentages: values are multiplied by 100 and formatted as :.2f.
+* Space around `:`: single-entry cells use " : ", and multi-row cells use { : } (braces prevent LaTeX from treating it as a column separator inside the nested tabular).
+
+
+## Comment
+
+One more change would be great.
+
+Each cell has rows such as "fin : 45.10" or "100 : 9.95".
+
+* I'd like the first values (before ":") to be left-aligned, the second values
+  (after ":") to be right-aligned.
+* Can you ensure that these alignments extend across cells? So that ALL first
+  values in all cells of a column are left-aligned, and ALL second values in all
+  cells of a column are right-aligned?
+
+### Answer
+
+The approach:
+
+* Each dataset column is now two outer-table columns (l for task, r for value), with the header spanned via \multicolumn{2}{c}{...}.
+* Each (case, dataset) cell's task rows become separate outer-table rows, so l/r alignment is shared across all cells in the same column.
+* Case labels spanning multiple sub-rows use \multirow{n}{*}{...} (requires \usepackage{multirow}).
+* The : separator is rendered as \,:\, (thin spaces) inside each value cell, keeping it visually between the two aligned columns.
+
+
+## Comment
+
+* Now, rows "fin : 45.10" are formatted by grouping "fin" and ":45.10". Can you
+  change this so that the grouping is "fin:" and "45.10", so the space is between
+  ":" and the right value?
+* Also, please shrink the amount of space between first and second values in a cell.
+  It should be quite a bit less than the space between different columns.
+
+### Answer
+
+The : moves to the end of the left cell (st:) and the right cell starts with \! (a negative thin space), pulling the number close to the colon while the normal inter-column gap still separates adjacent dataset pairs.
+
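
The requirements collected in the prompt log above (average `sub_exact_match` per `task`, shorten the task name, sort numeric tasks with `fin` last, print percentages with two decimals) can be sketched with the standard library alone. This is an illustrative sketch, not the committed script, which uses pandas. The names `short_task`, `cell_rows`, and `SAMPLE` are hypothetical, and the sample values are made up for the example.

```python
import csv
import io
from collections import defaultdict


def short_task(task: str) -> str:
    # "final" -> "fin"; otherwise keep the last three characters,
    # e.g. "step-000120" -> "120".
    return "fin" if task == "final" else task[-3:]


def cell_rows(csv_text: str) -> list[str]:
    """Return the text rows of one table cell for one (dataset, case) file."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in csv.DictReader(io.StringIO(csv_text)):
        sums[record["task"]] += float(record["sub_exact_match"])
        counts[record["task"]] += 1
    entries = [(short_task(task), sums[task] / counts[task]) for task in sums]
    # Numeric tasks in ascending order, "fin" always last.
    entries.sort(key=lambda e: (e[0] == "fin", 0 if e[0] == "fin" else int(e[0])))
    # Averages are shown as percentages with two decimals.
    return [f"{st}: {100 * avg:.2f}" for st, avg in entries]


# Made-up sample data with the columns named in the prompt.
SAMPLE = """idx,task,sub_exact_match
0,final,0.5
1,final,0.25
0,step-000120,0.0
1,step-000120,1.0
0,step-000040,0.125
"""

rows = cell_rows(SAMPLE)
print(rows)
```

For the sample data this yields one row per task, ordered `040`, `120`, `fin`.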
diff --git a/keys_values/data/helmet.py b/keys_values/data/helmet.py
@@ -116,12 +116,14 @@ def state_dict(self) -> Dict[str, torch.Tensor]:
 
     def load_state_dict(self, state_dict: Dict[str, torch.Tensor]):
         super().load_state_dict(state_dict)
-        train_ind = state_dict.get("train_target_choice")
-        val_ind = state_dict.get("val_target_choice")
-        test_ind = state_dict.get("test_target_choice")
-        self.train_target_choice = None if train_ind is None else train_ind.tolist()
-        self.val_target_choice = None if val_ind is None else val_ind.tolist()
-        self.test_target_choice = None if test_ind is None else test_ind.tolist()
+        train_choice = state_dict.get("train_target_choice")
+        val_choice = state_dict.get("val_target_choice")
+        test_choice = state_dict.get("test_target_choice")
+        self.train_target_choice = (
+            None if train_choice is None else train_choice.tolist()
+        )
+        self.val_target_choice = None if val_choice is None else val_choice.tolist()
+        self.test_target_choice = None if test_choice is None else test_choice.tolist()
 
 
 class Helmet(SequenceLengthFilteredDataModule):
diff --git a/keys_values/data/module.py b/keys_values/data/module.py
@@ -314,8 +314,8 @@ def setup(self, stage: str = "") -> None:
                 f"Split development set into training ({len(train_ind)}) and validation ({len(val_ind)})"
             )
         else:
-            train_ind = self.training_state.train_ind
-            val_ind = self.training_state.val_ind
+            train_ind = self.training_state.train_data_index
+            val_ind = self.training_state.val_data_index
             train_data = Subset(data, train_ind)
             val_data = Subset(data, val_ind)
             print(
diff --git a/keys_values/finetune/longcontext_full.py b/keys_values/finetune/longcontext_full.py
@@ -1875,8 +1875,8 @@ def validate(
         if k >= eval.max_iters:
             break
         batch = batch_transform(batch)
-        sum_loss += model(batch[INPUT_IDS_NAME], batch["targets"]).mean().item()
         num_entries += 1
+        sum_loss += model(batch[INPUT_IDS_NAME], batch["targets"]).mean().item()
     model.train()
     return sum_loss / num_entries, num_entries
 
@@ -1939,7 +1939,10 @@ def generate_example(
         raise IndexError("model.gpt_model must have KV caches assigned")
     model.eval()
 
-    max_returned_tokens = len(encoded) + eval.max_new_tokens
+    max_returned_tokens = eval.max_new_tokens
+    if max_returned_tokens is None:
+        max_returned_tokens = 50
+    max_returned_tokens += len(encoded)
 
     if max_returned_tokens < gpt_model.max_seq_length:
         output = generate(
diff --git a/keys_values/finetune/utils.py b/keys_values/finetune/utils.py
@@ -166,7 +166,7 @@ def validate_args(train: TrainArgs, eval: EvalArgs) -> None:
                 issues.append(
                     f"{__file__} doesn't support the {name!r} argument. This is set in {args}"
                 )
-    required = [(train, ["epochs"]), (eval, ["max_new_tokens"])]
+    required = [(train, ["epochs"])]
     for args, names in required:
         for name in names:
             if getattr(args, name) is None:
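
Taken together, the `generate_example` and `validate_args` hunks above make `eval.max_new_tokens` optional: when it is `None`, generation falls back to 50 new tokens. A minimal sketch of that fallback, assuming a hypothetical helper `max_returned_tokens` and constant `DEFAULT_MAX_NEW_TOKENS` (neither appears in the commit):

```python
from typing import Optional

# Fallback taken from the diff above; the constant name is hypothetical.
DEFAULT_MAX_NEW_TOKENS = 50


def max_returned_tokens(prompt_length: int, max_new_tokens: Optional[int]) -> int:
    """Total token budget: prompt tokens plus newly generated tokens."""
    if max_new_tokens is None:
        max_new_tokens = DEFAULT_MAX_NEW_TOKENS
    return prompt_length + max_new_tokens


print(max_returned_tokens(10, None))  # fallback applies
print(max_returned_tokens(10, 5))     # explicit value is used
```

Naming the fallback as a constant would keep the default in one place if other call sites need it later.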
diff --git a/keys_values/scripts/collect_eval_results.py b/keys_values/scripts/collect_eval_results.py
@@ -18,6 +18,10 @@
 
 from keys_values.evaluation.tasks import EvaluationTasks
 
+EVAL_METRICS_ALL_FILENAME = "eval_metrics_all.csv"
+
+SWEEP_TAR_FILENAME = "eval_metrics_transfer_{dataset_size}.tgz"
+
 
 def main(
     out_dir: Path,
@@ -53,17 +57,22 @@ def main(
         print(f"    {column_names[-1]} = {(sum_vals / num_vals):.3f}")
 
     print(f"Total number of records: {len(all_data)}")
-    combined_path = out_dir / "eval_metrics_all.csv"
-    with open(combined_path, "w") as fp:
-        writer = csv.writer(fp, delimiter=",")
-        writer.writerow(column_names)
-        for row in sorted(all_data, key=lambda x: (x[1], int(x[0]))):
-            writer.writerow(row)
+    if all_data:
+        combined_path = out_dir / EVAL_METRICS_ALL_FILENAME
+        with open(combined_path, "w") as fp:
+            writer = csv.writer(fp, delimiter=",")
+            writer.writerow(column_names)
+            for row in sorted(all_data, key=lambda x: (x[1], int(x[0]))):
+                writer.writerow(row)
 
 
 if __name__ == "__main__":
     base_path = Path.home() / "out/finetune/neurips_exp/lora/qwen3_4b"
-    dataset_size = "64k"
+
+    mode = "collect"
+    # mode = "sweep"
+    # dataset_size = "64k"
+    dataset_size = "128k"
     datasets = [
         f"helmet_nq_{dataset_size}",
         f"helmet_trivia_qa_{dataset_size}",
@@ -74,12 +83,32 @@ def main(
         "lr_4gpu_cs2048_lr5",
         "h2o_4gpu_cs2048_lr5",
         "slr_4gpu_cs2048_lr5",
-        #    "qh2o_4gpu_cs2048_lr5",
-        #    "h2onorm_4gpu_cs2048_lr5",
-        #    "qh2onorm_4gpu_cs2048_lr5",
+        # "qh2o_4gpu_cs2048_lr5",
+        # "h2onorm_4gpu_cs2048_lr5",
+        # "qh2onorm_4gpu_cs2048_lr5",
+        # "lr_4gpu_cs1024_lr5",
+        # "h2o_4gpu_cs1024_lr5",
     ]
     model_type = "lora"
-    for dataset, case in product(datasets, cases):
-        out_dir = base_path / dataset / case
-        if out_dir.exists():
-            main(out_dir, model_type)
+    if mode == "collect":
+        for dataset, case in product(datasets, cases):
+            out_dir = base_path / dataset / case
+            if out_dir.exists():
+                main(out_dir, model_type)
+            else:
+                print(f"\nResults for {dataset}/{case} do not exist")
+    elif mode == "sweep":
+        names = []
+        for dataset, case in product(datasets, cases):
+            name = "/".join((dataset, case, EVAL_METRICS_ALL_FILENAME))
+            if (base_path / name).exists():
+                names.append(name)
+        print(
+            f"\nCollected {len(names)} result files. Run at {base_path}:\n"
+            + "tar cfz "
+            + SWEEP_TAR_FILENAME.format(dataset_size=dataset_size)
+            + " "
+            + " ".join(names)
+        )
+    else:
+        raise NotImplementedError(f"Unknown mode: {mode}")
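
The `sweep` mode above only prints a `tar cfz` command for the user to run. As an alternative, and not what the commit does, the archive could be written directly with Python's `tarfile` module. The sketch below assumes a hypothetical helper `sweep_to_archive` and runs against a temporary directory with made-up file content.

```python
import tarfile
import tempfile
from pathlib import Path

EVAL_METRICS_ALL_FILENAME = "eval_metrics_all.csv"


def sweep_to_archive(base_path: Path, names: list[str], archive_path: Path) -> Path:
    """Pack the result files into a gzip-compressed tar archive.

    `names` are paths relative to `base_path`; they are stored under the
    same relative names, matching what running `tar cfz` inside
    `base_path` would store.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in names:
            tar.add(base_path / name, arcname=name)
    return archive_path


def demo() -> list[str]:
    # Build a tiny directory tree with one made-up result file.
    with tempfile.TemporaryDirectory() as tmp:
        base_path = Path(tmp)
        name = "/".join(("helmet_nq_128k", "lr_4gpu_cs2048_lr5", EVAL_METRICS_ALL_FILENAME))
        (base_path / name).parent.mkdir(parents=True)
        (base_path / name).write_text("idx,task,sub_exact_match\n0,final,1.0\n")
        archive = sweep_to_archive(
            base_path, [name], base_path / "eval_metrics_transfer_128k.tgz"
        )
        with tarfile.open(archive, "r:gz") as tar:
            return tar.getnames()


archived_names = demo()
print(archived_names)
```

Writing the archive from Python avoids shell quoting issues if a path ever contains spaces, at the cost of no longer showing the user the exact command.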
diff --git a/keys_values/scripts/create_result_table.py b/keys_values/scripts/create_result_table.py
@@ -0,0 +1,114 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License").
+# You may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from pathlib import Path
+
+import pandas as pd
+
+EVAL_METRICS_ALL_FILENAME = "eval_metrics_all.csv"
+
+
+def _short_task(task: str) -> str:
+    return "fin" if task == "final" else task[-3:]
+
+
+def _sort_entries(entries):
+    non_fin = sorted(
+        [(st, v) for st, v in entries if st != "fin"], key=lambda x: int(x[0])
+    )
+    return non_fin + [(st, v) for st, v in entries if st == "fin"]
+
+
+def main(datasets, cases, result_path):
+    base_path = result_path.parent
+    col_labels = [
+        d.removeprefix("helmet_").rsplit("_", 1)[0].replace("_", r"\_")
+        for d in datasets
+    ]
+    case_labels = [x[1].replace("_", r"\_") for x in cases]
+
+    # table[i][j] = sorted list of (short_task, avg_value) tuples (empty if no file)
+    table = []
+    for case_key, _ in cases:
+        row = []
+        for dataset in datasets:
+            csv_path = base_path / dataset / case_key / EVAL_METRICS_ALL_FILENAME
+            if not csv_path.exists():
+                row.append([])
+            else:
+                df = pd.read_csv(csv_path)
+                avg = df.groupby("task")["sub_exact_match"].mean()
+                row.append(_sort_entries([(_short_task(t), v) for t, v in avg.items()]))
+        table.append(row)
+
+    # Each dataset gets 2 sub-columns (l for task, r for value) for cross-cell alignment.
+    N = len(datasets)
+    col_spec = "l" + "lr" * N
+    tex_lines = [
+        r"\begin{tabular}{" + col_spec + "}",
+        r"\noalign{\smallskip}\hline\noalign{\smallskip}",
+        " & ".join([""] + [r"\multicolumn{2}{c}{" + lbl + "}" for lbl in col_labels])
+        + r" \\",
+        r"\noalign{\smallskip}\hline\hline\noalign{\smallskip}",
+    ]
+    for i, case_label in enumerate(case_labels):
+        row_entries = table[i]
+        max_rows = max((len(e) for e in row_entries), default=0)
+        max_rows = max(max_rows, 1)
+        for k in range(max_rows):
+            if k == 0 and max_rows > 1:
+                label_cell = r"\multirow{" + str(max_rows) + r"}{*}{" + case_label + "}"
+            elif k == 0:
+                label_cell = case_label
+            else:
+                label_cell = ""
+            cells = [label_cell]
+            for entries in row_entries:
+                if k < len(entries):
+                    st, v = entries[k]
+                    cells.append(r"{\small " + st + r":}")
+                    cells.append(r"{\small\!" + f"{v * 100:.2f}" + "}")
+                else:
+                    cells.append("")
+                    cells.append("")
+            tex_lines.append(" & ".join(cells) + r" \\")
+        tex_lines.append(r"\noalign{\smallskip}\hline\noalign{\smallskip}")
+    tex_lines.append(r"\end{tabular}")
+
+    result_path.write_text("\n".join(tex_lines) + "\n")
+
+
+if __name__ == "__main__":
+    base_path = Path.home() / "out/finetune/neurips_exp/lora/qwen3_4b"
+
+    # dataset_size = "64k"
+    dataset_size = "128k"
+    datasets = [
+        f"helmet_nq_{dataset_size}",
+        f"helmet_trivia_qa_{dataset_size}",
+        f"helmet_hotpot_qa_{dataset_size}",
+        f"helmet_pop_qa_{dataset_size}",
+    ]
+    cases = [
+        ("lr_4gpu_cs2048_lr5", "lr_2048"),
+        ("h2o_4gpu_cs2048_lr5", "h2o_2048"),
+        ("slr_4gpu_cs2048_lr5", "slr_2048"),
+        #     ("qh2o_4gpu_cs2048_lr5", "qh2o_2048"),
+        #     ("h2onorm_4gpu_cs2048_lr5", "h2onorm_2048"),
+        #     ("qh2onorm_4gpu_cs2048_lr5", "qh2onorm_2048"),
+        #     ("lr_4gpu_cs1024_lr5", "lr_1024"),
+        #     ("h2o_4gpu_cs1024_lr5", "h2o_1024"),
+    ]
+    result_path = base_path / f"results_{dataset_size}.tex"
+
+    main(datasets, cases, result_path)
diff --git a/keys_values/scripts/dataset_token_counts.py b/keys_values/scripts/dataset_token_counts.py
diff --git a/keys_values/utils.py b/keys_values/utils.py
