
Conversation

HuangruiChu

Fixes #2307, following the instruction in #2307 (comment).
Add a pre-scan step to detect the true last non-empty row and column (including merged cells), then use those bounds to limit all subsequent scans. This will avoid iterating over millions of empty cells and dramatically improve performance on large, sparse sheets.
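
A minimal sketch of the pre-scan idea, assuming an openpyxl worksheet (the backend is built on openpyxl). The helper name _find_true_data_bounds comes from this PR, but the body below is illustrative, not the actual patch:

```python
from openpyxl.worksheet.worksheet import Worksheet


def _find_true_data_bounds(ws: Worksheet) -> tuple[int, int]:
    """Return (last_row, last_col) of the real content in the sheet."""
    last_row = last_col = 0
    # One pass over the sheet, tracking the last cell that holds a value.
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is not None:
                last_row = max(last_row, cell.row)
                last_col = max(last_col, cell.column)
    # Merged ranges can extend past the last value-holding cell, because
    # only the top-left cell of a merged range stores the value.
    for rng in ws.merged_cells.ranges:
        last_row = max(last_row, rng.max_row)
        last_col = max(last_col, rng.max_col)
    return last_row, last_col
```

The pre-scan itself visits every cell once, so it is not free; the payoff is that every later scan can stop at these bounds instead of at ws.max_row and ws.max_column, which stray formatting can inflate far beyond the actual data.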

Issue resolved by this Pull Request:
Resolves #2307

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Fix docling-project#2307, following the instruction of docling-project#2307 (comment).

Signed-off-by: Richard (Huangrui) Chu <[email protected]>
Contributor

github-actions bot commented Oct 7, 2025

DCO Check Passed

Thanks @HuangruiChu, all your commits are properly signed off. 🎉


dosubot bot commented Oct 7, 2025

Related Documentation

Checked 2 published document(s). No updates required.



mergify bot commented Oct 7, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@HuangruiChu HuangruiChu changed the title Update msexcel_backend.py fix: add a pre-scan step to detect the true last non-empty row/column and limit the scan range accordingly Oct 8, 2025
@HuangruiChu HuangruiChu marked this pull request as draft October 8, 2025 13:22
Fix error

Signed-off-by: Richard (Huangrui) Chu <[email protected]>
@HuangruiChu HuangruiChu marked this pull request as ready for review October 8, 2025 13:26
@HuangruiChu HuangruiChu closed this Oct 8, 2025
@HuangruiChu
Author

Need to fix the "Ruff formatter ... Failed" check.

@HuangruiChu HuangruiChu reopened this Oct 8, 2025
Signed-off-by: Richard (Huangrui) Chu <[email protected]>

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 4.54545% with 21 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/backend/msexcel_backend.py | 4.54% | 21 Missing ⚠️ |


@cau-git
Contributor

cau-git commented Oct 10, 2025

@HuangruiChu thanks for your contribution. Can you please restore the test_backend_msexcel unit test, which was apparently deleted in this PR?

Contributor

@ceberam ceberam left a comment


Thanks @HuangruiChu for your contribution. Here are some comments:

  • Did you test and have any evidence that this PR solves the intended issue #2307? I tested the PR against large spreadsheets with empty cells, and the processing time was more than double that of the current implementation.
  • The new function _find_true_data_bounds is computationally expensive, but since it limits the scan range in _find_data_tables, using it may pay off overall. However, it should also be leveraged in _find_table_bottom and _find_table_right, which still have some unbounded scans; this could explain the increase in processing time in this PR (see the sketch after this list).
  • Restore test_backend_msexcel.py so that the regression tests on this backend pass, as pointed out by @cau-git.
  • Remove unnecessary comments like # Example integration in MsExcelDocumentBackend._find_data_tables
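
A hypothetical sketch of the bounded scan the reviewer suggests; the real signatures of _find_table_bottom and _find_table_right in msexcel_backend.py may differ:

```python
def _find_table_bottom(ws, start_row: int, col: int, last_row: int) -> int:
    """Walk down from start_row until the first empty cell or the bound."""
    row = start_row
    # Stop at last_row (from _find_true_data_bounds) rather than at
    # ws.max_row, so sparse sheets with stray formatting far below the
    # data do not force a walk over millions of empty cells.
    while row <= last_row and ws.cell(row=row, column=col).value is not None:
        row += 1
    return row - 1
```

_find_table_right would be bounded by last_col in the same way.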

@HuangruiChu
Author

Thank you for your reply. Sorry, I mistakenly deleted the contents of "test_backend_msexcel.py"; I was planning to add a test for "_find_true_data_bounds" there.

  1. "it should also be leveraged in _find_table_bottom and _find_table_right": thank you for the reminder; I am working on this update.
  2. "Remove some unnecessary comments": sure, I will remove them.
  3. "Did you test and have any evidence that this PR solves the intended issue Large spreadsheets - very slow processing #2307?": I created a version of "test-01.xlsx" where max_row is close to 1M. The processing time for the whole datasheet1 is around 2.0 seconds (a sketch of how such a file can be built follows).
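
A rough sketch of one way such a stress-test file can be produced with openpyxl; the file paths and the exact row index are assumptions, not the author's actual script:

```python
from openpyxl import load_workbook

wb = load_workbook("test-01.xlsx")
ws = wb.active
# Merely accessing a far-away cell makes openpyxl materialize it, so
# ws.max_row then reports ~1M rows while the real content stays small
# and sparse, which is the sheet shape that triggers issue #2307.
ws.cell(row=1_000_000, column=1)
wb.save("test-01-large.xlsx")
```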

@dolfim-ibm dolfim-ibm assigned ceberam and unassigned PeterStaar-IBM Oct 13, 2025
