fix: recover flattened benchmark tables in markdown output #262

Open
StevenVincentOne wants to merge 1 commit into opendataloader-project:main from StevenVincentOne:fix/deepseek-table-recovery

@StevenVincentOne (Contributor) commented Mar 6, 2026

Summary

  • recover DeepSeek-style flattened benchmark tables that are emitted as paragraph runs instead of native table structures
  • handle caption-first, metric-first, and benchmark pre-header variants while preventing prose bleed-through and keeping fallback behavior conservative
  • add focused MarkdownGeneratorTest coverage for duplicate metric headers, partial row handling, percent values, and multi-row flattened recovery cases

Closes #259.
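
For readers unfamiliar with the failure mode, here is a minimal sketch of the final rendering step with a conservative fallback. It is not code from this PR: the class `FlattenedTableSketch`, the method `toMarkdown`, and the placeholder names `ModelX`, `MetricA`, and `MetricB` are all hypothetical, and the values are illustrative only. The sketch assumes the rows have already been recovered from the paragraph runs and refuses to emit a table if any row has the wrong width.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not taken from the PR: renders recovered rows as a Markdown table.
class FlattenedTableSketch {
    // Returns the Markdown table, or null so the caller can fall back to the original paragraphs.
    static String toMarkdown(List<String> header, List<List<String>> rows) {
        if (header.isEmpty() || rows.isEmpty()) {
            return null;
        }
        // Conservative check: every recovered row must have exactly one cell per header column.
        for (List<String> row : rows) {
            if (row.size() != header.size()) {
                return null;
            }
        }
        StringBuilder sb = new StringBuilder();
        sb.append("| ").append(String.join(" | ", header)).append(" |\n");
        sb.append("| ").append(String.join(" | ", Collections.nCopies(header.size(), "---"))).append(" |\n");
        for (List<String> row : rows) {
            sb.append("| ").append(String.join(" | ", row)).append(" |\n");
        }
        return sb.toString();
    }
}
```

Returning `null` rather than a partial table keeps the fallback explicit: the caller writes the original text unchanged whenever recovery is not clean.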

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Steven Vincent seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 397d908308

Comment on lines +942 to +946
List<String> metrics = new ArrayList<>();
int j = i + 1;
while (j < tokens.length && metrics.size() < metricCount && NUMERIC_TOKEN_RE.matcher(tokens[j]).matches()) {
    metrics.add(tokens[j]);
    j += 1;

P1: Parse full model labels before consuming metric tokens

recoverBenchmarkRowsFromText treats a row label as exactly one token and then immediately consumes numeric tokens as metrics, so flattened tables with multi-token model names (for example DeepSeek V3 or Llama 3 70B) get misparsed: the label is truncated and numeric parts of the name are shifted into metric columns. Because the recovered table is then emitted as Markdown, this silently corrupts benchmark results instead of preserving the original text.
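
One possible way to address this, sketched under assumptions rather than taken from the PR: split each row from the right, so the last `metricCount` tokens must be numeric and everything before them becomes the label. The class `RowLabelSketch`, the method `splitRow`, and the `NUMERIC` pattern are hypothetical (the PR's actual `NUMERIC_TOKEN_RE` is not shown in this excerpt), and the sketch assumes the tokens of a single row have already been isolated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the PR's implementation.
class RowLabelSketch {
    // Assumed numeric pattern: optional sign, digits, optional decimals, optional percent sign.
    static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?%?");

    // Returns [label, metric1, ..., metricN], or null if the row does not
    // end with exactly metricCount numeric tokens preceded by a label.
    static List<String> splitRow(String[] tokens, int metricCount) {
        int labelEnd = tokens.length - metricCount;
        if (metricCount <= 0 || labelEnd < 1) {
            return null; // no room for a label: leave the text untouched
        }
        for (int k = labelEnd; k < tokens.length; k++) {
            if (!NUMERIC.matcher(tokens[k]).matches()) {
                return null; // trailing tokens are not all metrics
            }
        }
        List<String> row = new ArrayList<>();
        // The label keeps every leading token, including numeric parts of a model name.
        row.add(String.join(" ", Arrays.copyOfRange(tokens, 0, labelEnd)));
        row.addAll(Arrays.asList(tokens).subList(labelEnd, tokens.length));
        return row;
    }
}
```

With illustrative values, `splitRow("Llama 3 70B 82.0 79.5".split(" "), 2)` keeps `Llama 3 70B` as the label instead of shifting `3` into a metric column. This does not resolve rows that are concatenated into one token stream; that case would still need a row-boundary heuristic.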

Comment on lines +95 to +97
if (consumed > 0) {
    i += consumed - 1;
    continue;

P2: Do not skip non-table objects when advancing consumed range

After a recovery succeeds, the page loop advances by consumed, but consumed is based on the absolute index of the last recovered text node while recovery scanning explicitly allows multiple intervening non-text nodes. In mixed-content pages, supported objects like images/formulas/lists that fall between recovered text paragraphs are skipped entirely and never written, causing content loss whenever flattened table text is interleaved with other objects.
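
A hedged sketch of one alternative: track the exact indices that recovery folded into the table and skip only those, instead of jumping over a contiguous range. The class `ConsumedRangeSketch`, the method `writePage`, and the string-based page model are hypothetical simplifications and do not reflect the project's real object types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the PR's implementation.
class ConsumedRangeSketch {
    // objects:    page contents in reading order (simplified to strings here)
    // tableStart: index of the first text node folded into the recovered table
    // consumed:   indices of every text node the recovery actually used
    static List<String> writePage(List<String> objects, int tableStart, Set<Integer> consumed) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < objects.size(); i++) {
            if (i == tableStart) {
                out.add("<recovered table>"); // emit the table once, at its first node
            }
            if (consumed.contains(i)) {
                continue; // skip only nodes that became table cells
            }
            out.add(objects.get(i)); // images, formulas, lists in between are still written
        }
        return out;
    }
}
```

In this model an image sitting between two recovered paragraphs is written after the table rather than dropped. Whether it should appear before or after the table is a separate design decision that this sketch does not settle.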

Development

Successfully merging this pull request may close these issues.

Recover flattened benchmark tables emitted as paragraph runs
