fix: recover flattened benchmark tables in markdown output #262
StevenVincentOne wants to merge 1 commit into opendataloader-project:main
Steven Vincent seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 397d908308
    List<String> metrics = new ArrayList<>();
    int j = i + 1;
    while (j < tokens.length && metrics.size() < metricCount && NUMERIC_TOKEN_RE.matcher(tokens[j]).matches()) {
        metrics.add(tokens[j]);
        j += 1;
Parse full model labels before consuming metric tokens
`recoverBenchmarkRowsFromText` treats a row label as exactly one token and then immediately consumes numeric tokens as metrics. Flattened tables with multi-token model names (for example `DeepSeek V3` or `Llama 3 70B`) are therefore misparsed: the label is truncated and numeric parts of the name are shifted into metric columns. Because the recovered table is then emitted as Markdown, this silently corrupts benchmark results instead of preserving the original text.
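One possible way to handle multi-token labels, sketched under assumptions rather than taken from the PR: treat only the last `metricCount` tokens of a numeric run as metrics and fold everything before them into the label. The class `BenchmarkRowSketch`, the method `parseRows`, and the regex used for `NUMERIC_TOKEN_RE` are all hypothetical stand-ins; the real pattern and method signature in the PR may differ. The sketch assumes `metricCount >= 1` and declines recovery (returns an empty list) when the stream does not fit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

class BenchmarkRowSketch {
    // Assumed stand-in for the PR's NUMERIC_TOKEN_RE; the real pattern may differ.
    static final Pattern NUMERIC_TOKEN_RE = Pattern.compile("-?\\d+(\\.\\d+)?%?");

    static final class Row {
        final String label;
        final List<String> metrics;

        Row(String label, List<String> metrics) {
            this.label = label;
            this.metrics = metrics;
        }
    }

    // Splits a flattened token stream into rows of (label, metricCount metrics).
    static List<Row> parseRows(String[] tokens, int metricCount) {
        List<Row> rows = new ArrayList<>();
        List<String> label = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            // Find the end of the maximal numeric run starting at i.
            int j = i;
            while (j < tokens.length && NUMERIC_TOKEN_RE.matcher(tokens[j]).matches()) {
                j++;
            }
            int run = j - i;
            if (run == 0) {
                // Non-numeric token: always part of the label.
                label.add(tokens[i]);
                i++;
            } else if (run < metricCount) {
                // Too short to be a metric row, e.g. the "3" in "Llama 3 70B".
                for (; i < j; i++) {
                    label.add(tokens[i]);
                }
            } else {
                // Only the last metricCount tokens of the run are metrics.
                int metricsStart = j - metricCount;
                for (; i < metricsStart; i++) {
                    label.add(tokens[i]);
                }
                if (label.isEmpty()) {
                    return Collections.emptyList(); // no label: decline recovery
                }
                List<String> metrics = new ArrayList<>();
                for (; i < j; i++) {
                    metrics.add(tokens[i]);
                }
                rows.add(new Row(String.join(" ", label), metrics));
                label = new ArrayList<>();
            }
        }
        if (!label.isEmpty()) {
            return Collections.emptyList(); // trailing label without metrics
        }
        return rows;
    }
}
```

Declining recovery on any mismatch keeps the original text intact, which matches the concern above about silently corrupting results. A label that is itself followed by extra numeric tokens beyond `metricCount` remains ambiguous; this sketch resolves it by assigning the leading numerics to the label.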
    if (consumed > 0) {
        i += consumed - 1;
        continue;
Do not skip non-table objects when advancing consumed range
After a recovery succeeds, the page loop advances by `consumed`, but `consumed` is derived from the absolute index of the last recovered text node, while the recovery scan explicitly allows multiple intervening non-text nodes. On mixed-content pages, supported objects such as images, formulas, and lists that fall between recovered text paragraphs are skipped entirely and never written, so content is lost whenever flattened table text is interleaved with other objects.
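A minimal sketch of one way to avoid the skip, assuming the recovery step can report exactly which indices it consumed instead of a single count. Everything here is hypothetical and only illustrates the idea: `PageObject`, `recoverTableIndices`, `writePage`, and the `row:` prefix used to mark table fragments are not from the PR.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ConsumedIndicesSketch {
    // Hypothetical page object: kind is "text", "image", "formula", ...
    static final class PageObject {
        final String kind;
        final String content;

        PageObject(String kind, String content) {
            this.kind = kind;
            this.content = content;
        }
    }

    // Hypothetical recovery: collects indices of text nodes that look like
    // table fragments, stepping over non-text nodes without consuming them.
    static Set<Integer> recoverTableIndices(List<PageObject> objs, int start) {
        Set<Integer> indices = new LinkedHashSet<>();
        for (int k = start; k < objs.size(); k++) {
            PageObject o = objs.get(k);
            if (!o.kind.equals("text")) {
                continue; // allowed in between, but not consumed
            }
            if (!o.content.startsWith("row:")) {
                break; // first ordinary paragraph ends the table
            }
            indices.add(k);
        }
        // Require at least two fragments before treating it as a table.
        return indices.size() >= 2 ? indices : new LinkedHashSet<Integer>();
    }

    // Page loop that skips only the indices recovery actually consumed.
    static List<String> writePage(List<PageObject> objs) {
        List<String> out = new ArrayList<>();
        Set<Integer> consumed = new HashSet<>();
        for (int i = 0; i < objs.size(); i++) {
            if (consumed.contains(i)) {
                continue;
            }
            PageObject o = objs.get(i);
            if (o.kind.equals("text")) {
                Set<Integer> recovered = recoverTableIndices(objs, i);
                if (!recovered.isEmpty()) {
                    out.add("TABLE(" + recovered.size() + " fragments)");
                    consumed.addAll(recovered);
                    continue;
                }
            }
            out.add(o.kind + ":" + o.content);
        }
        return out;
    }
}
```

With this shape, an image sitting between two recovered paragraphs is still written, although it lands after the recovered table rather than inside it. Whether that ordering is acceptable, or whether recovery should instead stop at the first non-text node, is a design choice for the PR author.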
Summary
Closes #259.