-
Notifications
You must be signed in to change notification settings - Fork 2.9k
fix: add a pre-scan step to detect the true last non-empty row/column and limit the scan range accordingly #2404
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Fix docling-project#2307, Follow the instruction of docling-project#2307 (comment). Signed-off-by: Richard (Huangrui) Chu <[email protected]>
✅ DCO Check Passed Thanks @HuangruiChu, all your commits are properly signed off. 🎉 |
Related Documentation Checked 2 published document(s). No updates required. You have 5 draft document(s). Publish docs to keep them always up-to-date |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Fix error Signed-off-by: Richard (Huangrui) Chu <[email protected]>
Need to fix the "Ruff formatter...........................................................Failed". |
Signed-off-by: Richard (Huangrui) Chu <[email protected]>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…[email protected]>) Signed-off-by: Richard (Huangrui) Chu <[email protected]>
@HuangruiChu thanks for your contribution. Can you please restore the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @HuangruiChu for your contribution. Here are some comments:
- Did you test and have any evidence that this PR solves the intended issue #2307 ? I tested the PR against large spreadsheets with empty cells and the processing time was more than double compared to the current implementation.
- The new function
_find_true_data_bounds
is computationally expensive but it limits the scan range in_find_data_tables
and overall it may pay off using it. However, it should also be leveraged in_find_table_bottom
and_find_table_right
, which still have some unbounded scans. This could explain the increase of processing time in this PR. - Restore the
test_backend_msexcel.py
to ensure regression tests on this backend pass, as pointed out by @cau-git . - Remove some unnecessary comments like # Example integration in MsExcelDocumentBackend._find_data_tables
Thank you for your reply. Sorry I mistakenly deleted the content of the "test_backend_msexcel.py". I was planning to add a test for "_find_true_data_bounds" under "test_backend_msexcel.py".
|
Fix #2307, Follow the instruction of #2307 (comment).
Add a pre-scan step to detect the true last non-empty row and column (including merged cells), then use those bounds to limit all subsequent scans. This will avoid iterating over millions of empty cells and dramatically improve performance on large, sparse sheets.
Issue resolved by this Pull Request:
Resolves #2307
Checklist: