Commit 5eb59b0

Scripts to collect evaluation results and create LaTeX table
1 parent e7bb0f6 commit 5eb59b0

9 files changed

Lines changed: 308 additions & 25 deletions

ai_dev/create_result_table.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Write `create_result_table.py` script

## Initial prompt

I need a Python script which collects evaluation results for a number of datasets
and cases, computes averages, and writes LaTeX code for a table containing these
average results.

* I started a script in `~/git/keys_values/keys_values/scripts/create_result_table.py`.
  Extend this file by adding the `main` function.
* For each `(dataset, case)` with `dataset in datasets` and
  `case in [x[0] for x in cases]`, there is a CSV
  file `base_path / dataset / case / EVAL_METRICS_ALL_FILENAME`.
  The result file has columns `idx, task, sub_exact_match`. The `sub_exact_match`
  column contains metric values: average them over different `idx` for the same
  `task`.
* The result table has one column for each `dataset`, one row for each `case`.
* A cell can have several rows, one for each distinct `task` value contained in
  the result file for `(dataset, case)`. These `task` values can be different for
  every cell.
* The rows in a cell for `(dataset, case)` should be
  `f"{short_task}:{avg_sub_exact_match:.4f}"`, where `avg_sub_exact_match` is the
  average `sub_exact_match` value for some `task`, and `short_task = "fin"` if
  `task == "final"`, `short_task = "120"` if `task == "step-000120"` (the last 3
  digits).
* If there is no result file for a combination `(dataset, case)`, the corresponding
  cell is left empty.
* Label the columns with the `dataset` name, stripping off "helmet_".
* Label the rows with `x[1]` for the tuple `x` in `cases`.
* The `main` function should write LaTeX code for the table into the file
  `result_path`.

### Answer

Done. The `main` function:

* Derives `base_path` from `result_path.parent`
* For each `(dataset, case)` pair, reads the CSV if it exists, groups by `task`, averages `sub_exact_match`, and formats each row as `short_task:avg`
* Maps task names via `_short_task`: "final" → "fin", others → last 3 digits
* Uses `\makecell{...}` for cells with multiple task rows (requires `\usepackage{makecell}` in the LaTeX preamble)
* Writes the full `tabular` environment to `result_path`
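
The per-task averaging described above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the committed code (which uses pandas): the helper name `average_by_task` and the inline CSV sample are hypothetical.

```python
import csv
import io
from collections import defaultdict


def short_task(task: str) -> str:
    # "final" -> "fin"; otherwise keep the last 3 characters,
    # e.g. "step-000120" -> "120"
    return "fin" if task == "final" else task[-3:]


def average_by_task(csv_text: str) -> dict:
    """Average `sub_exact_match` over all `idx` rows sharing a `task`."""
    values = defaultdict(list)
    for record in csv.DictReader(io.StringIO(csv_text)):
        values[record["task"]].append(float(record["sub_exact_match"]))
    return {short_task(task): sum(v) / len(v) for task, v in values.items()}


# Hypothetical sample using the `idx, task, sub_exact_match` columns
sample = """idx,task,sub_exact_match
0,final,1.0
1,final,0.0
0,step-000120,0.5
1,step-000120,0.25
"""
averages = average_by_task(sample)
for key, avg in averages.items():
    print(f"{key}:{avg:.4f}")
```

Each printed line has the `short_task:avg` shape requested in the prompt; a missing result file would simply yield no entries for that cell.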


## Comment

The column and row labels may contain "_", which you need to change to `\_` in LaTeX. Could you please fix this?

### Answer

Fixed: underscores in both column and row labels are now escaped as `\_`.
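
A minimal sketch of that escaping step, assuming labels are plain strings; the helper name `escape_underscores` is hypothetical, and the raw string avoids Python treating the backslash as an escape.

```python
def escape_underscores(label: str) -> str:
    # LaTeX treats a bare "_" as the subscript operator in math mode and
    # rejects it in text mode, so emit "\_" instead.
    return label.replace("_", r"\_")


escaped = escape_underscores("h2o_2048")
print(escaped)
```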


## Comment

I'd like to have a few changes:

* Rows (for `case` values) should be separated by horizontal lines. The top row
  with labels should be separated from the others by a double horizontal line, and
  some space below this double line.
* For the column labels, please also strip away `f"_{dataset_size}"` from the right.
* The multi-row content of a cell should not be centered, but aligned on top, the
  same for the row labels.
* The different rows in a cell should be aligned. Both values before and after ":"
  should be aligned on the left. Also, the row with "fin" should always be last, and
  others should be sorted by the number before ":".
* Please use a smaller font size for text in cells than for the row and column labels.

### Answer

Here's what changed:

* Row separators: each data row is followed by `\hline`; the header is separated by `\hline\hline` with `\noalign{\smallskip}` below it for extra space.
* Column label stripping: uses `rsplit("_", 1)[0]` to remove the trailing `_{dataset_size}` suffix, in addition to removing the `helmet_` prefix.
* Top alignment: row labels use `\makecell[lt]{...}` (top-left); multi-row cells use `\begin{tabular}[t]{...}` (top-aligned).
* Cell alignment: multi-row cells render as a nested tabular with two columns (`@{}l@{}l@{}`), so the short task keys and the `:value` parts are left-aligned independently.
* Sort order: non-fin entries are sorted numerically by their 3-digit suffix; fin is always last.
* Font size: cell content is wrapped in `{\small ...}` while labels remain at normal size.
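
The label stripping and the sort order from the list above can be sketched as follows. The helper names and the sample values are hypothetical; `str.removeprefix` assumes Python 3.9 or newer.

```python
def column_label(dataset: str) -> str:
    # Drop the "helmet_" prefix and the trailing "_{dataset_size}" suffix
    return dataset.removeprefix("helmet_").rsplit("_", 1)[0]


def sort_entries(entries):
    # Numeric keys in ascending order, with "fin" always last
    numeric = sorted(
        (e for e in entries if e[0] != "fin"), key=lambda e: int(e[0])
    )
    return numeric + [e for e in entries if e[0] == "fin"]


label = column_label("helmet_trivia_qa_128k")
ordered = sort_entries([("fin", 0.451), ("120", 0.31), ("040", 0.0995)])
print(label, [key for key, _ in ordered])
```

Sorting on `int(key)` rather than on the string keeps the order numeric even if the keys were ever not zero-padded.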


## Comment

Just a few extra changes:

* Add some space above and below each horizontal line.
* Multiply each average value (number after ":") by 100, so they become percentages.
  The format should be `f"{avg_sub_exact_match:.2f}"`.
* For the rows in cells, insert a small space before and after the ":".

### Answer

Three changes made:

* Spacing around `\hline`: every rule is now wrapped with `\noalign{\smallskip}` on both sides; the double rule after the header keeps the same pattern.
* Percentages: values are multiplied by 100 and formatted as `:.2f`.
* Space around `:`: single-entry cells use `" : "`, and multi-row cells use `{ : }` (braces prevent LaTeX from treating it as a column separator inside the nested tabular).
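
As a rough illustration of the percentage formatting and the padded rule, under the assumption that averages are fractions in [0, 1]; the names `RULE` and `as_percent` are made up for this sketch.

```python
# A horizontal rule with a small vertical skip above and below it
RULE = r"\noalign{\smallskip}\hline\noalign{\smallskip}"


def as_percent(avg: float) -> str:
    # Scale the fraction to a percentage and keep two decimals
    return f"{avg * 100:.2f}"


single_entry_cell = "fin : " + as_percent(0.451)
print(single_entry_cell)
print(RULE)
```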


## Comment

One more change would be great.

Each cell has rows such as "fin : 45.10" or "100 : 9.95".

* I'd like the first values (before ":") to be left-aligned, the second values
  (after ":") to be right-aligned.
* Can you ensure that these alignments extend across cells? So that ALL first
  values in all cells of a column are left-aligned, and ALL second values in all
  cells of a column are right-aligned?

### Answer

The approach:

* Each dataset column is now two outer-table columns (`l` for task, `r` for value), with the header spanned via `\multicolumn{2}{c}{...}`.
* Each (case, dataset) cell's task rows become separate outer-table rows, so l/r alignment is shared across all cells in the same column.
* Case labels spanning multiple sub-rows use `\multirow{n}{*}{...}` (requires `\usepackage{multirow}`).
* The `:` separator is rendered as `\,:\,` (thin spaces) inside each value cell, keeping it visually between the two aligned columns.
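
The two-columns-per-dataset layout can be sketched like this. It is a simplified illustration of the approach described in the answer above, not the committed `main` function; the helper names and the sample values are hypothetical.

```python
def column_spec(num_datasets: int) -> str:
    # One label column, then an (l, r) pair per dataset
    return "l" + "lr" * num_datasets


def header(labels):
    # Each dataset label spans its two sub-columns
    spans = [r"\multicolumn{2}{c}{" + lbl + "}" for lbl in labels]
    return " & ".join([""] + spans) + r" \\"


def case_rows(case_label, cells):
    """cells: one list of (task, value_str) pairs per dataset column."""
    n_rows = max([len(c) for c in cells] + [1])
    lines = []
    for k in range(n_rows):
        if k > 0:
            parts = [""]  # label only appears in the first sub-row
        elif n_rows > 1:
            parts = [r"\multirow{" + str(n_rows) + r"}{*}{" + case_label + "}"]
        else:
            parts = [case_label]
        for entries in cells:
            if k < len(entries):
                task, value = entries[k]
                parts += [task, r"\,:\," + value]
            else:
                parts += ["", ""]  # pad cells that have fewer task rows
        lines.append(" & ".join(parts) + r" \\")
    return lines


rows = case_rows(
    r"lr\_2048", [[("100", "9.95"), ("fin", "45.10")], [("fin", "40.00")]]
)
print(column_spec(2))
print(header(["nq", r"trivia\_qa"]))
print("\n".join(rows))
```

Because every task row is a row of the outer table, the `l` and `r` column types apply to all cells of a dataset column at once, which is what gives the cross-cell alignment.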


## Comment

* Now, rows "fin : 45.10" are formatted by grouping "fin" and ":45.10". Can you
  change this so that the grouping is "fin:" and "45.10", so the space is between
  ":" and the right value?
* Also, please shrink the amount of space between first and second values in a cell.
  It should be quite a bit less than the space between different columns.

### Answer

The `:` moves to the end of the left cell (`st:`) and the right cell starts with `\!` (a negative thin space), pulling the number close to the colon while the normal inter-column gap still separates adjacent dataset pairs.
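
A small sketch of the resulting cell pair, mirroring the string concatenation used in the script further down; the function name `aligned_pair` is hypothetical.

```python
def aligned_pair(short_task: str, avg: float):
    # Left cell carries the key and the colon; right cell starts with a
    # negative thin space so the number sits close to the colon.
    left = r"{\small " + short_task + ":}"
    right = r"{\small\!" + f"{avg * 100:.2f}" + "}"
    return left, right


left, right = aligned_pair("fin", 0.451)
print(left + " & " + right)
```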

keys_values/data/helmet.py

Lines changed: 8 additions & 6 deletions
@@ -116,12 +116,14 @@ def state_dict(self) -> Dict[str, torch.Tensor]:
 
     def load_state_dict(self, state_dict: Dict[str, torch.Tensor]):
         super().load_state_dict(state_dict)
-        train_ind = state_dict.get("train_target_choice")
-        val_ind = state_dict.get("val_target_choice")
-        test_ind = state_dict.get("test_target_choice")
-        self.train_target_choice = None if train_ind is None else train_ind.tolist()
-        self.val_target_choice = None if val_ind is None else val_ind.tolist()
-        self.test_target_choice = None if test_ind is None else test_ind.tolist()
+        train_choice = state_dict.get("train_target_choice")
+        val_choice = state_dict.get("val_target_choice")
+        test_choice = state_dict.get("test_target_choice")
+        self.train_target_choice = (
+            None if train_choice is None else train_choice.tolist()
+        )
+        self.val_target_choice = None if val_choice is None else val_choice.tolist()
+        self.test_target_choice = None if test_choice is None else test_choice.tolist()
 
 
 class Helmet(SequenceLengthFilteredDataModule):

keys_values/data/module.py

Lines changed: 2 additions & 2 deletions
@@ -314,8 +314,8 @@ def setup(self, stage: str = "") -> None:
                 f"Split development set into training ({len(train_ind)}) and validation ({len(val_ind)})"
             )
         else:
-            train_ind = self.training_state.train_ind
-            val_ind = self.training_state.val_ind
+            train_ind = self.training_state.train_data_index
+            val_ind = self.training_state.val_data_index
         train_data = Subset(data, train_ind)
         val_data = Subset(data, val_ind)
         print(

keys_values/finetune/longcontext_full.py

Lines changed: 5 additions & 2 deletions
@@ -1875,8 +1875,8 @@ def validate(
         if k >= eval.max_iters:
             break
         batch = batch_transform(batch)
-        sum_loss += model(batch[INPUT_IDS_NAME], batch["targets"]).mean().item()
         num_entries += 1
+        sum_loss += model(batch[INPUT_IDS_NAME], batch["targets"]).mean().item()
     model.train()
     return sum_loss / num_entries, num_entries
 
@@ -1939,7 +1939,10 @@ def generate_example(
         raise IndexError("model.gpt_model must have KV caches assigned")
     model.eval()
 
-    max_returned_tokens = len(encoded) + eval.max_new_tokens
+    max_returned_tokens = eval.max_new_tokens
+    if max_returned_tokens is None:
+        max_returned_tokens = 50
+    max_returned_tokens += len(encoded)
 
     if max_returned_tokens < gpt_model.max_seq_length:
         output = generate(

keys_values/finetune/utils.py

Lines changed: 1 addition & 1 deletion
@@ -166,7 +166,7 @@ def validate_args(train: TrainArgs, eval: EvalArgs) -> None:
                 issues.append(
                     f"{__file__} doesn't support the {name!r} argument. This is set in {args}"
                 )
-    required = [(train, ["epochs"]), (eval, ["max_new_tokens"])]
+    required = [(train, ["epochs"])]
     for args, names in required:
         for name in names:
             if getattr(args, name) is None:

keys_values/scripts/collect_eval_results.py

Lines changed: 43 additions & 14 deletions
@@ -18,6 +18,10 @@
 
 from keys_values.evaluation.tasks import EvaluationTasks
 
+EVAL_METRICS_ALL_FILENAME = "eval_metrics_all.csv"
+
+SWEEP_TAR_FILENAME = "eval_metrics_transfer_{dataset_size}.tgz"
+
 
 def main(
     out_dir: Path,
@@ -53,17 +57,22 @@ def main(
         print(f" {column_names[-1]} = {(sum_vals / num_vals):.3f}")
 
     print(f"Total number of records: {len(all_data)}")
-    combined_path = out_dir / "eval_metrics_all.csv"
-    with open(combined_path, "w") as fp:
-        writer = csv.writer(fp, delimiter=",")
-        writer.writerow(column_names)
-        for row in sorted(all_data, key=lambda x: (x[1], int(x[0]))):
-            writer.writerow(row)
+    if all_data:
+        combined_path = out_dir / EVAL_METRICS_ALL_FILENAME
+        with open(combined_path, "w") as fp:
+            writer = csv.writer(fp, delimiter=",")
+            writer.writerow(column_names)
+            for row in sorted(all_data, key=lambda x: (x[1], int(x[0]))):
+                writer.writerow(row)
 
 
 if __name__ == "__main__":
     base_path = Path.home() / "out/finetune/neurips_exp/lora/qwen3_4b"
-    dataset_size = "64k"
+
+    mode = "collect"
+    # mode = "sweep"
+    # dataset_size = "64k"
+    dataset_size = "128k"
     datasets = [
         f"helmet_nq_{dataset_size}",
         f"helmet_trivia_qa_{dataset_size}",
@@ -74,12 +83,32 @@ def main(
         "lr_4gpu_cs2048_lr5",
         "h2o_4gpu_cs2048_lr5",
         "slr_4gpu_cs2048_lr5",
-        # "qh2o_4gpu_cs2048_lr5",
-        # "h2onorm_4gpu_cs2048_lr5",
-        # "qh2onorm_4gpu_cs2048_lr5",
+        # "qh2o_4gpu_cs2048_lr5",
+        # "h2onorm_4gpu_cs2048_lr5",
+        # "qh2onorm_4gpu_cs2048_lr5",
+        # "lr_4gpu_cs1024_lr5",
+        # "h2o_4gpu_cs1024_lr5",
     ]
     model_type = "lora"
-    for dataset, case in product(datasets, cases):
-        out_dir = base_path / dataset / case
-        if out_dir.exists():
-            main(out_dir, model_type)
+    if mode == "collect":
+        for dataset, case in product(datasets, cases):
+            out_dir = base_path / dataset / case
+            if out_dir.exists():
+                main(out_dir, model_type)
+            else:
+                print(f"\nResults for {dataset}/{case} do not exist")
+    elif mode == "sweep":
+        names = []
+        for dataset, case in product(datasets, cases):
+            name = "/".join((dataset, case, EVAL_METRICS_ALL_FILENAME))
+            if (base_path / name).exists():
+                names.append(name)
+        print(
+            f"\nCollected {len(names)} result files. Run at {base_path}:\n"
+            + "tar cfz "
+            + SWEEP_TAR_FILENAME.format(dataset_size=dataset_size)
+            + " "
+            + " ".join(names)
+        )
+    else:
+        raise NotImplementedError(f"Unknown mode: {mode}")

keys_values/scripts/create_result_table.py

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path

import pandas as pd

EVAL_METRICS_ALL_FILENAME = "eval_metrics_all.csv"


def _short_task(task: str) -> str:
    return "fin" if task == "final" else task[-3:]


def _sort_entries(entries):
    non_fin = sorted(
        [(st, v) for st, v in entries if st != "fin"], key=lambda x: int(x[0])
    )
    return non_fin + [(st, v) for st, v in entries if st == "fin"]


def main(datasets, cases, result_path):
    base_path = result_path.parent
    col_labels = [
        d.removeprefix("helmet_").rsplit("_", 1)[0].replace("_", r"\_")
        for d in datasets
    ]
    case_labels = [x[1].replace("_", r"\_") for x in cases]

    # table[i][j] = sorted list of (short_task, avg_value) tuples (empty if no file)
    table = []
    for case_key, _ in cases:
        row = []
        for dataset in datasets:
            csv_path = base_path / dataset / case_key / EVAL_METRICS_ALL_FILENAME
            if not csv_path.exists():
                row.append([])
            else:
                df = pd.read_csv(csv_path)
                avg = df.groupby("task")["sub_exact_match"].mean()
                row.append(_sort_entries([(_short_task(t), v) for t, v in avg.items()]))
        table.append(row)

    # Each dataset gets 2 sub-columns (l for task, r for value) for cross-cell alignment.
    N = len(datasets)
    col_spec = "l" + "lr" * N
    tex_lines = [
        r"\begin{tabular}{" + col_spec + "}",
        r"\noalign{\smallskip}\hline\noalign{\smallskip}",
        " & ".join([""] + [r"\multicolumn{2}{c}{" + lbl + "}" for lbl in col_labels])
        + r" \\",
        r"\noalign{\smallskip}\hline\hline\noalign{\smallskip}",
    ]
    for i, case_label in enumerate(case_labels):
        row_entries = table[i]
        max_rows = max((len(e) for e in row_entries), default=0)
        max_rows = max(max_rows, 1)
        for k in range(max_rows):
            if k == 0 and max_rows > 1:
                label_cell = r"\multirow{" + str(max_rows) + r"}{*}{" + case_label + "}"
            elif k == 0:
                label_cell = case_label
            else:
                label_cell = ""
            cells = [label_cell]
            for entries in row_entries:
                if k < len(entries):
                    st, v = entries[k]
                    cells.append(r"{\small " + st + r":}")
                    cells.append(r"{\small\!" + f"{v * 100:.2f}" + "}")
                else:
                    cells.append("")
                    cells.append("")
            tex_lines.append(" & ".join(cells) + r" \\")
        tex_lines.append(r"\noalign{\smallskip}\hline\noalign{\smallskip}")
    tex_lines.append(r"\end{tabular}")

    result_path.write_text("\n".join(tex_lines) + "\n")


if __name__ == "__main__":
    base_path = Path.home() / "out/finetune/neurips_exp/lora/qwen3_4b"

    # dataset_size = "64k"
    dataset_size = "128k"
    datasets = [
        f"helmet_nq_{dataset_size}",
        f"helmet_trivia_qa_{dataset_size}",
        f"helmet_hotpot_qa_{dataset_size}",
        f"helmet_pop_qa_{dataset_size}",
    ]
    cases = [
        ("lr_4gpu_cs2048_lr5", "lr_2048"),
        ("h2o_4gpu_cs2048_lr5", "h2o_2048"),
        ("slr_4gpu_cs2048_lr5", "slr_2048"),
        # ("qh2o_4gpu_cs2048_lr5", "qh2o_2048"),
        # ("h2onorm_4gpu_cs2048_lr5", "h2onorm_2048"),
        # ("qh2onorm_4gpu_cs2048_lr5", "qh2onorm_2048"),
        # ("lr_4gpu_cs1024_lr5", "lr_1024"),
        # ("h2o_4gpu_cs1024_lr5", "h2o_1024"),
    ]
    result_path = base_path / f"results_{dataset_size}.tex"

    main(datasets, cases, result_path)
File renamed without changes.
