
Large spreadsheets - very slow processing #2307

Description

@devinbost

I ran some profiling on an XLSX file with roughly 1M rows, and it took over 3 minutes to process. The work appears to be entirely CPU-bound on the client side, which won't scale.
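For reference, the timings below came from wrapping a single conversion in cProfile, roughly like this sketch (the file path is a placeholder for the ~1M-row workbook):

```python
import cProfile
import pstats

from docling.document_converter import DocumentConverter

# Profile one end-to-end conversion of the large workbook.
# "large.xlsx" is a placeholder for the ~1M-row file described here.
profiler = cProfile.Profile()
profiler.enable()

result = DocumentConverter().convert("large.xlsx")

profiler.disable()

# Sorting by cumulative time surfaces hotspots such as find_tables_in_sheet.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```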

Findings
- openpyxl reported the Data sheet dimension as A1:BK1048501 (max_row = 1,048,501); this is an inflated range (see the check after this list).
- Docling spent 163.35s in find_tables_in_sheet across 12 sheets; the vast majority of that time is on “Data”.
- Hotspot counts (across sheets):
  - _find_table_bounds: 24.00s over 320 calls
  - _find_table_bottom: 12.67s over 320 calls
  - _find_table_right: 11.26s over 320 calls
  - _find_images_in_sheet: 0.00s (no images)
- Total backend convert: ~184s; the stall was in iterating the “Data” sheet’s huge used range.
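The inflated range is easy to confirm directly with openpyxl (a quick check, assuming the sheet declares a dimension element, as it did here):

```python
from openpyxl import load_workbook

# Placeholder path for the workbook described above.
wb = load_workbook("large.xlsx", read_only=True)
ws = wb["Data"]

# In read-only mode this reports the dimension declared in the file,
# which is what the table-discovery scan ends up trusting.
print("declared dimension:", ws.calculate_dimension())  # e.g. A1:BK1048501
print("max_row:", ws.max_row, "max_column:", ws.max_column)

wb.close()
```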
Conclusion
Root cause: openpyxl sees the Data sheet as having 1,048,501 rows, so Docling’s table discovery scans that massive range. This explains the long stall at the “Processing sheet: Data” log message.

If there's any optimization we can make for large files like this, that would be super helpful.
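One possible direction (a sketch only, not Docling code; the helper name below is made up for illustration): compute the true used range by streaming the sheet once in openpyxl's read-only mode, ignoring the bogus declared dimension, and restrict table discovery to that bound.

```python
from openpyxl import load_workbook

def true_used_range(path: str, sheet_name: str) -> tuple[int, int]:
    """Return (last_row, last_col) that actually contain values.

    Streams the sheet once with values_only=True, so memory stays flat
    even for ~1M-row sheets. Assumes trailing rows/columns in the
    declared dimension may be empty padding.
    """
    wb = load_workbook(path, read_only=True)
    try:
        ws = wb[sheet_name]
        # Ignore the (possibly inflated) declared dimension and let
        # openpyxl size the sheet from the rows that really exist.
        ws.reset_dimensions()
        last_row = 0
        last_col = 0
        for row_idx, row in enumerate(ws.iter_rows(values_only=True), start=1):
            for col_idx, value in enumerate(row, start=1):
                if value is not None:
                    last_row = row_idx
                    last_col = max(last_col, col_idx)
        return last_row, last_col
    finally:
        wb.close()

# Table discovery could then be limited to A1:(last_col, last_row)
# instead of the declared A1:BK1048501 range.
print(true_used_range("large.xlsx", "Data"))
```

Even a single streaming pass like this should be much cheaper than repeatedly probing cells across the declared 1,048,501-row range in _find_table_bounds, _find_table_bottom, and _find_table_right.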

Labels: bug (Something isn't working), xlsx (issue related to xlsx backend)
