Description
I profiled conversion of an XLSX file with ~1M rows, and it took over 3 minutes to process. The processing appears to be entirely CPU-bound on the client side, which won't scale.
Findings
openpyxl reported the Data sheet's dimension as A1:BK1048501 with max_row=1,048,501; this is an inflated range (a quick way to reproduce the check is shown after these findings).
Docling spent 163.35s in find_tables_in_sheet across 12 sheets, the vast majority of it on the "Data" sheet.
Hotspot timings and call counts (across sheets):
_find_table_bounds: 24.00s over 320 calls
_find_table_bottom: 12.67s over 320 calls
_find_table_right: 11.26s over 320 calls
_find_images_in_sheet: 0.00s (no images)
Total backend convert time: ~184s; the stall was in iterating the "Data" sheet's huge used range.
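
For reference, here is a minimal way to reproduce the dimension check outside Docling; the file path is a placeholder, and "Data" is the sheet from the profile above:

```python
# Reproduce the inflated-dimension observation with plain openpyxl.
import openpyxl

wb = openpyxl.load_workbook("large_file.xlsx", read_only=True)  # placeholder path
ws = wb["Data"]

# In read-only mode the reported dimension comes from the file's stored
# metadata, so it can extend far past the last row that actually holds data.
print(ws.calculate_dimension())  # reported A1:BK1048501 for this file
print(ws.max_row)                # reported 1048501
wb.close()
```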
Conclusion
Root cause: openpyxl sees the Data sheet as having 1,048,501 rows, so Docling's table discovery scans that massive range. That is why "Processing sheet: Data" was observed to hang for so long.
If there's any optimization that could be made for large files like this, it would be super helpful; one possible mitigation is sketched below.
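
I'm not sure what the right fix inside Docling is, but here is a minimal sketch of one mitigation, assuming read-only loading is acceptable and that a single streaming pass over the sheet is much cheaper than the per-cell table scan. The function effective_bounds is hypothetical, not part of Docling or openpyxl; openpyxl's ReadOnlyWorksheet.reset_dimensions() is another option, though it may not help when the trailing empty rows physically exist in the sheet XML.

```python
# Sketch: trim an inflated used range before scanning for tables.
# Assumptions: read-only loading is acceptable, and one streaming pass over the
# sheet (values_only) is cheap compared to the per-cell table discovery.
# effective_bounds is illustrative, not Docling's actual API.
import openpyxl


def effective_bounds(path: str, sheet_name: str) -> tuple[int, int]:
    """Return (last_non_empty_row, last_non_empty_col) for one sheet."""
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name]
        max_row = 0
        max_col = 0
        for row_idx, row in enumerate(ws.iter_rows(values_only=True), start=1):
            # Track the right-most non-empty cell in this row, if any.
            last_col = 0
            for col_idx, value in enumerate(row, start=1):
                if value is not None:
                    last_col = col_idx
            if last_col:
                max_row = row_idx
                max_col = max(max_col, last_col)
        return max_row, max_col
    finally:
        wb.close()


if __name__ == "__main__":
    # Placeholder path; "Data" is the sheet from the profile above.
    rows, cols = effective_bounds("large_file.xlsx", "Data")
    print(f"Effective range: {rows} rows x {cols} cols")
```

Table discovery could then be clamped to these bounds instead of the declared A1:BK1048501 range, which should avoid walking roughly a million empty rows.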