bug(docx): O(N^2) time on large section+paragraph count

**Summary**
A DOCX with a large number of sections combined with a large number of paragraphs triggers an O(N^2) process to determine the paragraphs in each section. This greatly slows partitioning of these documents.

**To Reproduce**
The original document in this case had 470 sections and 31,723 paragraphs. Partitioning took in excess of 40 minutes.

**Expected behavior**
Partitioning time should be O(N) with the number of paragraphs.

**Diagnosis**
The upstream `python-docx` `SectBlockIterator` algorithm is effectively O(N^2) when called once for each section.
https://github.com/python-openxml/python-docx/blob/master/src/docx/oxml/section.py#L437

It gets all the block items before the section end and then subtracts all the block items before the prior section end. This is effectively N^2 with the number of block-items (paragraphs/tables) in the document.

Work out an algorithm to get only the block items between the prior section end and the current section end. Something roughly like:
```python
section_p = Section._sectPr
prior_section_p = section_p.xpath(./[prior-sibling]::w:sectPr)
block_item = prior_section_p
while block_item := block_item.next():
    yield block_item
    if block_item = section_p:
        break
```
There will be some edge-cases to work out in tests because of `w:p/w:sectPr` and `w:sectPr` diversity between body and document-end `w:sectPr`s.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bug(docx): O(N^2) time on large section+paragraph count #3592

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

bug(docx): O(N^2) time on large section+paragraph count #3592

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions