
Conversation

HuangruiChu

Fixes #2307, following the instruction in #2307 (comment).
Add a pre-scan step to detect the true last non-empty row and column (including merged cells), then use those bounds to limit all subsequent scans. This will avoid iterating over millions of empty cells and dramatically improve performance on large, sparse sheets.
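
A minimal sketch of the pre-scan idea, assuming an openpyxl worksheet (the backend is built on openpyxl). The helper name _find_true_data_bounds comes from this PR, but the body below is illustrative, not the actual patch:

```python
from openpyxl.worksheet.worksheet import Worksheet


def _find_true_data_bounds(ws: Worksheet) -> tuple[int, int]:
    """Return (last_row, last_col) of the real content in the sheet."""
    last_row = last_col = 0
    # One pass over the sheet, tracking the last cell that holds a value.
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is not None:
                last_row = max(last_row, cell.row)
                last_col = max(last_col, cell.column)
    # Merged ranges can extend past the last value-holding cell, because
    # only the top-left cell of a merged range stores the value.
    for rng in ws.merged_cells.ranges:
        last_row = max(last_row, rng.max_row)
        last_col = max(last_col, rng.max_col)
    return last_row, last_col
```

The pre-scan itself visits every cell once, so it is not free; the payoff is that every later scan can stop at these bounds instead of at ws.max_row and ws.max_column, which stray formatting can inflate far beyond the actual data.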

Issue resolved by this Pull Request:
Resolves #2307

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Fix docling-project#2307, following the instruction of docling-project#2307 (comment).

Signed-off-by: Richard (Huangrui) Chu <[email protected]>
Contributor

github-actions bot commented Oct 7, 2025

DCO Check Passed

Thanks @HuangruiChu, all your commits are properly signed off. 🎉


dosubot bot commented Oct 7, 2025

Related Documentation

Checked 2 published document(s). No updates required.



mergify bot commented Oct 7, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@HuangruiChu HuangruiChu changed the title Update msexcel_backend.py fix: add a pre-scan step to detect the true last non-empty row/column and limit the scan range accordingly Oct 8, 2025
@HuangruiChu HuangruiChu marked this pull request as draft October 8, 2025 13:22
Fix error

Signed-off-by: Richard (Huangrui) Chu <[email protected]>
@HuangruiChu HuangruiChu marked this pull request as ready for review October 8, 2025 13:26
@HuangruiChu HuangruiChu closed this Oct 8, 2025
@HuangruiChu
Author

Need to fix the "Ruff formatter ... Failed" check.

@HuangruiChu HuangruiChu reopened this Oct 8, 2025
Signed-off-by: Richard (Huangrui) Chu <[email protected]>

codecov bot commented Oct 8, 2025

Codecov Report

❌ Patch coverage is 4.54545% with 21 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| docling/backend/msexcel_backend.py | 4.54% | 21 Missing ⚠️ |


@cau-git
Contributor

cau-git commented Oct 10, 2025

@HuangruiChu thanks for your contribution. Can you please restore the test_backend_msexcel unit test, which was apparently deleted in this PR?

Contributor

@ceberam ceberam left a comment


Thanks @HuangruiChu for your contribution. Here are some comments:

  • Did you test and have any evidence that this PR solves the intended issue #2307? I tested the PR against large spreadsheets with empty cells, and the processing time was more than double that of the current implementation.
  • The new function _find_true_data_bounds is computationally expensive, but since it limits the scan range in _find_data_tables, using it may pay off overall. However, it should also be leveraged in _find_table_bottom and _find_table_right, which still have some unbounded scans; this could explain the increase in processing time in this PR (see the sketch after this list).
  • Restore test_backend_msexcel.py so that the regression tests on this backend pass, as pointed out by @cau-git.
  • Remove unnecessary comments like # Example integration in MsExcelDocumentBackend._find_data_tables
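
A hypothetical sketch of the bounded scan the reviewer suggests; the real signatures of _find_table_bottom and _find_table_right in msexcel_backend.py may differ:

```python
def _find_table_bottom(ws, start_row: int, col: int, last_row: int) -> int:
    """Walk down from start_row until the first empty cell or the bound."""
    row = start_row
    # Stop at last_row (from _find_true_data_bounds) rather than at
    # ws.max_row, so sparse sheets with stray formatting far below the
    # data do not force a walk over millions of empty cells.
    while row <= last_row and ws.cell(row=row, column=col).value is not None:
        row += 1
    return row - 1
```

_find_table_right would be bounded by last_col in the same way.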

@HuangruiChu
Author

Thank you for your reply. Sorry, I mistakenly deleted the contents of "test_backend_msexcel.py"; I was planning to add a test for "_find_true_data_bounds" there.

  1. "it should also be leveraged in _find_table_bottom and _find_table_right": thank you for the reminder; I am working on this update.
  2. "Remove some unnecessary comments": sure, I will remove them.
  3. "Did you test and have any evidence that this PR solves the intended issue Large spreadsheets - very slow processing #2307?": I created a version of "test-01.xlsx" where max_row is close to 1M. The processing time for the whole datasheet1 is around 2.0 seconds (a sketch of how such a file can be built follows).
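
A rough sketch of one way such a stress-test file can be produced with openpyxl; the file paths and the exact row index are assumptions, not the author's actual script:

```python
from openpyxl import load_workbook

wb = load_workbook("test-01.xlsx")
ws = wb.active
# Merely accessing a far-away cell makes openpyxl materialize it, so
# ws.max_row then reports ~1M rows while the real content stays small
# and sparse, which is the sheet shape that triggers issue #2307.
ws.cell(row=1_000_000, column=1)
wb.save("test-01-large.xlsx")
```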

@dolfim-ibm dolfim-ibm assigned ceberam and unassigned PeterStaar-IBM Oct 13, 2025
