fix: recover flattened benchmark tables in markdown output by StevenVincentOne · Pull Request #262 · opendataloader-project/opendataloader-pdf

StevenVincentOne · 2026-03-06T03:04:01Z

Summary

recover DeepSeek-style flattened benchmark tables that are emitted as paragraph runs instead of native table structures
handle caption-first, metric-first, and benchmark pre-header variants while preventing prose bleed-through and keeping fallback behavior conservative
add focused MarkdownGeneratorTest coverage for duplicate metric headers, partial row handling, percent values, and multi-row flattened recovery cases

Closes #259.

CLAassistant · 2026-03-06T03:04:26Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Steven Vincent does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 397d908308

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-06T03:09:01Z

...opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java

+            List<String> metrics = new ArrayList<>();
+            int j = i + 1;
+            while (j < tokens.length && metrics.size() < metricCount && NUMERIC_TOKEN_RE.matcher(tokens[j]).matches()) {
+                metrics.add(tokens[j]);
+                j += 1;


Parse full model labels before consuming metric tokens

recoverBenchmarkRowsFromText treats a row label as exactly one token and then immediately consumes numeric tokens as metrics, so flattened tables with multi-token model names (for example DeepSeek V3 or Llama 3 70B) get misparsed: the label is truncated and numeric parts of the name are shifted into metric columns. Because the recovered table is then emitted as Markdown, this silently corrupts benchmark results instead of preserving the original text.
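One way to address this concern is sketched below. It is not the PR's code: the class name `BenchmarkRowSketch`, the method `parseRows`, and the numeric pattern are all assumptions made for illustration, since the review does not show the actual `NUMERIC_TOKEN_RE` or the full body of `recoverBenchmarkRowsFromText`. The idea is to treat the label as every token up to the point where exactly `metricCount` numeric tokens close out the row, and to recover nothing when the stream does not split cleanly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class BenchmarkRowSketch {
    // Assumed pattern; the PR's actual NUMERIC_TOKEN_RE is not shown in the review.
    private static final Pattern NUMERIC = Pattern.compile("-?\\d+(\\.\\d+)?%?");

    private static boolean isNumeric(String token) {
        return NUMERIC.matcher(token).matches();
    }

    /**
     * Splits a flattened token stream into rows of [label, metric1, ..., metricN].
     * Returns an empty list when any row cannot be parsed, so the caller can
     * fall back to emitting the original text unchanged.
     */
    static List<List<String>> parseRows(String[] tokens, int metricCount) {
        List<List<String>> rows = new ArrayList<>();
        if (metricCount <= 0) {
            return rows;
        }
        int i = 0;
        while (i < tokens.length) {
            int metricStart = findMetricStart(tokens, i, metricCount);
            if (metricStart < 0) {
                return new ArrayList<>(); // ambiguous input: recover nothing
            }
            List<String> row = new ArrayList<>();
            // The label is every token between the row start and the metric run.
            row.add(String.join(" ", Arrays.copyOfRange(tokens, i, metricStart)));
            row.addAll(Arrays.asList(tokens).subList(metricStart, metricStart + metricCount));
            rows.add(row);
            i = metricStart + metricCount;
        }
        return rows;
    }

    /**
     * Finds the first index after rowStart where exactly metricCount numeric
     * tokens are followed by a non-numeric token or the end of the stream.
     */
    private static int findMetricStart(String[] tokens, int rowStart, int metricCount) {
        if (isNumeric(tokens[rowStart])) {
            return -1; // a label must begin with a non-numeric token
        }
        for (int j = rowStart + 1; j + metricCount <= tokens.length; j++) {
            boolean allNumeric = true;
            for (int k = j; k < j + metricCount; k++) {
                if (!isNumeric(tokens[k])) {
                    allNumeric = false;
                    break;
                }
            }
            int after = j + metricCount;
            boolean runEnds = after == tokens.length || !isNumeric(tokens[after]);
            if (allNumeric && runEnds) {
                return j;
            }
        }
        return -1;
    }
}
```

With `metricCount = 2`, the stream `DeepSeek V3 88.5 90.2 Llama 3 70B 85.1 86.0` splits into the labels `DeepSeek V3` and `Llama 3 70B`, because the `3` in the second name is followed by the non-numeric `70B` and so cannot start a two-metric run. This heuristic remains ambiguous when a label ends in a number and a metric is missing, which is why the sketch returns an empty result in unparseable cases.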


chatgpt-codex-connector · 2026-03-06T03:09:01Z

...opendataloader-pdf-core/src/main/java/org/opendataloader/pdf/markdown/MarkdownGenerator.java

+                    if (consumed > 0) {
+                        i += consumed - 1;
+                        continue;


Do not skip non-table objects when advancing consumed range

After a recovery succeeds, the page loop advances by consumed, but consumed is based on the absolute index of the last recovered text node while recovery scanning explicitly allows multiple intervening non-text nodes. In mixed-content pages, supported objects like images/formulas/lists that fall between recovered text paragraphs are skipped entirely and never written, causing content loss whenever flattened table text is interleaved with other objects.
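The skipping problem described above can be avoided by tracking which indices were actually recovered instead of jumping ahead by a count. The following is a minimal sketch under stated assumptions: `Node`, `recoverTableIndices`, and the `row:` marker are hypothetical stand-ins, because the project's real content types and recovery logic are not shown in this review.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ConsumedRangeSketch {
    /** Minimal stand-in for a page object; the project's real types are not shown here. */
    static final class Node {
        final boolean text;
        final String value;

        Node(boolean text, String value) {
            this.text = text;
            this.value = value;
        }
    }

    /**
     * Hypothetical recovery step: collects the indices of text nodes that
     * belong to the flattened table, stepping over intervening non-text nodes
     * without claiming them.
     */
    static List<Integer> recoverTableIndices(List<Node> nodes, int start) {
        List<Integer> indices = new ArrayList<>();
        for (int j = start; j < nodes.size(); j++) {
            Node n = nodes.get(j);
            if (!n.text) {
                continue; // image, formula, or list: not part of the table
            }
            if (!n.value.startsWith("row:")) {
                break; // the first ordinary paragraph ends the table
            }
            indices.add(j);
        }
        return indices;
    }

    /** Page loop that skips only the recovered indices, never a whole range. */
    static List<String> render(List<Node> nodes) {
        List<String> out = new ArrayList<>();
        Set<Integer> consumed = new HashSet<>();
        for (int i = 0; i < nodes.size(); i++) {
            if (consumed.contains(i)) {
                continue; // already emitted as part of a recovered table
            }
            Node n = nodes.get(i);
            if (n.text && n.value.startsWith("row:")) {
                List<Integer> indices = recoverTableIndices(nodes, i);
                consumed.addAll(indices);
                out.add("table(" + indices.size() + " rows)");
                continue;
            }
            out.add(n.value); // non-table objects are always written
        }
        return out;
    }
}
```

In this sketch an image sitting between two recovered paragraphs is still written, placed after the table. Whether interleaved objects should come before or after the recovered table is a design choice the review comment leaves open.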


fix: recover flattened benchmark tables from DeepSeek outputs

397d908

StevenVincentOne requested review from LonelyMidoriya, MaximPlusov, bundolee and hhjojojo as code owners March 6, 2026 03:04

chatgpt-codex-connector bot reviewed Mar 6, 2026

StevenVincentOne wants to merge 1 commit into opendataloader-project:main from StevenVincentOne:fix/deepseek-table-recovery


2 participants
