docx: partitioner finds text nested in revision-marks #1821

Open
@scanny

Description

Currently DOCX content nested in revision-marks is skipped when partitioning a .docx file.

Add an "accept-all-revisions" step before partitioning to bring the document to the state most likely intended by the author, such that inserted or modified text is included and deleted text is not.

Additional context
Microsoft Word has features that support document review and revision. An author can turn on the "Track Changes" option and send the document to an editor (a person); any changes the editor makes are then clearly marked as suggested revisions. The revisions can be accepted or rejected individually or as a group.

Until they are accepted or rejected, these revisions cause the affected text to be "nested" in revision-mark elements like <w:ins> and <w:del> in the document XML. python-docx skips that text because it sits "beneath" the level at which it goes looking for paragraphs, runs, etc. Further, the expected behavior is not immediately obvious: simply including all of that text would surface deletions as well as insertions, and could duplicate moved text or place it in the wrong location.
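To make the nesting concrete, here is a minimal sketch using only the standard library's `xml.etree.ElementTree`. The paragraph XML is a hypothetical, hand-written fragment, and the direct-child lookup is an illustration of the behavior described above, not python-docx's actual code.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical paragraph: with Track Changes on, "quick " was deleted
# and "slow " was inserted in its place.
TRACKED_PARAGRAPH = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>The </w:t></w:r>'
    '<w:del w:id="1" w:author="ed"><w:r><w:delText>quick </w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="ed"><w:r><w:t>slow </w:t></w:r></w:ins>'
    '<w:r><w:t>fox</w:t></w:r>'
    '</w:p>'
)

paragraph = ET.fromstring(TRACKED_PARAGRAPH)

# Only runs that are direct children of the paragraph: revised text is missed.
direct_text = "".join(t.text or "" for t in paragraph.findall("w:r/w:t", NS))

# Every text-bearing descendant: deleted text comes back along with inserted text.
naive_text = "".join(
    node.text or ""
    for node in paragraph.iter()
    if node.tag in (f"{{{W}}}t", f"{{{W}}}delText")
)

print(repr(direct_text))  # 'The fox'
print(repr(naive_text))   # 'The quick slow fox'
```

Neither result is the "The slow fox" an author would most likely expect, which is the gap an accept-all-revisions step is meant to close.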

The common solution to this problem is to add an "accept all revisions" step before processing. That step removes all revision-mark "container" envelopes, keeping the text in w:ins(ert) elements and discarding the text in w:del(ete) elements, etc. This is a reasonable assumption of the author's intent because, by default, this is how the text in the document appears if you forget to accept revisions and turn off the Track-Changes option.
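A minimal sketch of such a step, again with the standard library only. The function name `accept_all_revisions` and the sample XML are hypothetical, and the sketch makes simplifying assumptions: it only handles the `w:ins`, `w:del`, `w:moveTo` and `w:moveFrom` containers, and it does not merge paragraphs whose paragraph mark was deleted or process formatting-change records such as `w:rPrChange`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Containers whose content is kept (the envelope is removed, children are promoted).
UNWRAP = {f"{{{W}}}ins", f"{{{W}}}moveTo"}
# Containers whose content is discarded entirely.
DROP = {f"{{{W}}}del", f"{{{W}}}moveFrom"}


def accept_all_revisions(element):
    """Accept tracked insertions/deletions beneath `element`, in place."""
    accepted = []
    for child in list(element):
        if child.tag in DROP:
            continue  # deleted or moved-away content disappears
        accept_all_revisions(child)  # handle revisions nested deeper first
        if child.tag in UNWRAP:
            accepted.extend(list(child))  # promote children to the wrapper's position
        else:
            accepted.append(child)
    element[:] = accepted


# Hypothetical paragraph with one deletion and one insertion.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>The </w:t></w:r>'
    '<w:del w:id="1" w:author="ed"><w:r><w:delText>quick </w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="ed"><w:r><w:t>slow </w:t></w:r></w:ins>'
    '<w:r><w:t>fox</w:t></w:r>'
    '</w:p>'
)

para = ET.fromstring(SAMPLE)
accept_all_revisions(para)

# After the pass, every run is a direct child of the paragraph again.
accepted_text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
print(repr(accepted_text))  # 'The slow fox'
```

In a real partitioner this pass would run over the document body XML before paragraphs are extracted; how it hooks into python-docx's element tree is left open here.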

Labels: docx, enhancement
