Description
Summary
A DOCX with a large number of sections combined with a large number of paragraphs triggers an O(N^2) process to determine the paragraphs in each section. This greatly slows partitioning of these documents.
To Reproduce
The original document in this case had 470 sections and 31,723 paragraphs. Partitioning took in excess of 40 minutes.
Expected behavior
Partitioning time should be O(N) with the number of paragraphs.
Diagnosis
The upstream python-docx
SectBlockIterator
algorithm is effectively O(N^2) when called once for each section.
https://github.com/python-openxml/python-docx/blob/master/src/docx/oxml/section.py#L437
It gets all the block items before the section end and then subtracts all the block items before the prior section end. This is effectively N^2 with the number of block-items (paragraphs/tables) in the document.
Work out an algorithm to get only the block items between the prior section end and the current section end. Something roughly like:
section_p = Section._sectPr
prior_section_p = section_p.xpath(./[prior-sibling]::w:sectPr)
block_item = prior_section_p
while block_item := block_item.next():
yield block_item
if block_item = section_p:
break
There will be some edge-cases to work out in tests because of w:p/w:sectPr
and w:sectPr
diversity between body and document-end w:sectPr
s.