Skip to content

bug(docx): O(N^2) time on large section+paragraph count #3592

Open
@scanny

Description

@scanny

Summary
A DOCX with a large number of sections combined with a large number of paragraphs triggers an O(N^2) process to determine the paragraphs in each section. This greatly slows partitioning of these documents.

To Reproduce
The original document in this case had 470 sections and 31,723 paragraphs. Partitioning took in excess of 40 minutes.

Expected behavior
Partitioning time should be O(N) with the number of paragraphs.

Diagnosis
The upstream python-docx SectBlockIterator algorithm is effectively O(N^2) when called once for each section.
https://github.com/python-openxml/python-docx/blob/master/src/docx/oxml/section.py#L437

It gets all the block items before the section end and then subtracts all the block items before the prior section end. This is effectively N^2 with the number of block-items (paragraphs/tables) in the document.

Work out an algorithm to get only the block items between the prior section end and the current section end. Something roughly like:

section_p = Section._sectPr
prior_section_p = section_p.xpath(./[prior-sibling]::w:sectPr)
block_item = prior_section_p
while block_item := block_item.next():
    yield block_item
    if block_item = section_p:
        break

There will be some edge-cases to work out in tests because of w:p/w:sectPr and w:sectPr diversity between body and document-end w:sectPrs.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingdocxRelated to Microsoft Word (.docx) file format

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions