Description
I profiled conversion of an XLSX file with ~1M rows, and it took over 3 minutes to process. The processing appears to be entirely CPU-bound on the client side, which won't scale.
Findings
openpyxl reported the Data sheet's dimension as A1:BK1048501 with max_row=1,048,501; this is an inflated range (a quick way to reproduce the check is shown after these findings).
Docling spent 163.35s in find_tables_in_sheet across 12 sheets, the vast majority of it on the "Data" sheet.
Hotspot timings and call counts (across sheets):
_find_table_bounds: 24.00s over 320 calls
_find_table_bottom: 12.67s over 320 calls
_find_table_right: 11.26s over 320 calls
_find_images_in_sheet: 0.00s (no images)
Total backend convert time: ~184s; the stall was in iterating the "Data" sheet's huge used range.
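
For reference, here is a minimal way to reproduce the dimension check outside Docling; the file path is a placeholder, and "Data" is the sheet from the profile above:

```python
# Reproduce the inflated-dimension observation with plain openpyxl.
import openpyxl

wb = openpyxl.load_workbook("large_file.xlsx", read_only=True)  # placeholder path
ws = wb["Data"]

# In read-only mode the reported dimension comes from the file's stored
# metadata, so it can extend far past the last row that actually holds data.
print(ws.calculate_dimension())  # reported A1:BK1048501 for this file
print(ws.max_row)                # reported 1048501
wb.close()
```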
Conclusion
Root cause: openpyxl sees the Data sheet as having 1,048,501 rows, so Docling's table discovery scans that massive range. That is why "Processing sheet: Data" was observed to hang for so long.
If there's any optimization that could be made for large files like this, it would be super helpful; one possible mitigation is sketched below.
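
I'm not sure what the right fix inside Docling is, but here is a minimal sketch of one mitigation, assuming read-only loading is acceptable and that a single streaming pass over the sheet is much cheaper than the per-cell table scan. The function effective_bounds is hypothetical, not part of Docling or openpyxl; openpyxl's ReadOnlyWorksheet.reset_dimensions() is another option, though it may not help when the trailing empty rows physically exist in the sheet XML.

```python
# Sketch: trim an inflated used range before scanning for tables.
# Assumptions: read-only loading is acceptable, and one streaming pass over the
# sheet (values_only) is cheap compared to the per-cell table discovery.
# effective_bounds is illustrative, not Docling's actual API.
import openpyxl


def effective_bounds(path: str, sheet_name: str) -> tuple[int, int]:
    """Return (last_non_empty_row, last_non_empty_col) for one sheet."""
    wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name]
        max_row = 0
        max_col = 0
        for row_idx, row in enumerate(ws.iter_rows(values_only=True), start=1):
            # Track the right-most non-empty cell in this row, if any.
            last_col = 0
            for col_idx, value in enumerate(row, start=1):
                if value is not None:
                    last_col = col_idx
            if last_col:
                max_row = row_idx
                max_col = max(max_col, last_col)
        return max_row, max_col
    finally:
        wb.close()


if __name__ == "__main__":
    # Placeholder path; "Data" is the sheet from the profile above.
    rows, cols = effective_bounds("large_file.xlsx", "Data")
    print(f"Effective range: {rows} rows x {cols} cols")
```

Table discovery could then be clamped to these bounds instead of the declared A1:BK1048501 range, which should avoid walking roughly a million empty rows.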