Description
Currently, content nested inside revision marks is skipped when partitioning a .docx file.
Add an "accept-all-revisions" step before partitioning to bring the document to the state most likely intended by the author, such that inserted or modified text is included and deleted text is not.
Additional context
Microsoft Word has features that support document review and revision. An author can turn on the "Track Changes" option, send the document to an editor (person) and then any changes made by the editor are clearly marked as suggested revisions. The revisions can be accepted or rejected individually or as a group.
These revisions, when not yet accepted, cause the affected text to be "nested" in revision-mark elements like <w:ins> and <w:del> in the document XML. This causes that text to be skipped by python-docx because it is "beneath" the level at which it looks for paragraphs etc. Further, it's not immediately obvious what the expected behavior should be, because simply including all of that text would show not only insertions but also deletions, and moved text could appear twice or in the wrong location.
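To make the nesting concrete, here is a minimal sketch using only the standard library rather than python-docx itself. The paragraph XML is a hand-written example, and direct_run_text and all_text are hypothetical helpers: the first only approximates a reader that looks at runs sitting directly under the paragraph, and the second shows what naively collecting every text node would produce.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS

# Hand-written example: "brown" was deleted and "red" was inserted by a reviewer.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W_NS}">
  <w:r><w:t xml:space="preserve">The quick </w:t></w:r>
  <w:del w:id="1" w:author="Editor"><w:r><w:delText>brown</w:delText></w:r></w:del>
  <w:ins w:id="2" w:author="Editor"><w:r><w:t>red</w:t></w:r></w:ins>
  <w:r><w:t xml:space="preserve"> fox</w:t></w:r>
</w:p>
""".strip()


def direct_run_text(p):
    """Hypothetical helper: text of runs that are immediate children of the paragraph."""
    return "".join(
        t.text or "" for r in p.findall(W + "r") for t in r.findall(W + "t")
    )


def all_text(p):
    """Hypothetical helper: every w:t and w:delText node, regardless of nesting."""
    wanted = {W + "t", W + "delText"}
    return "".join(el.text or "" for el in p.iter() if el.tag in wanted)


p = ET.fromstring(PARAGRAPH_XML)
print(repr(direct_run_text(p)))  # 'The quick  fox' (the revised word is missing)
print(repr(all_text(p)))         # 'The quick brownred fox' (deleted text leaks in)
```

Neither result matches what the reviewer intended ("The quick red fox"), which is why the revisions need to be resolved before the text is extracted.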
The common solution to this problem is to add an "Accept all revisions" step before processing. This removes all revision-mark "container" elements, keeping the text in w:ins(ert) elements and removing the text in w:del(ete) elements, etc. This is a reasonable assumption of the author's intent because, by default, this is how the text in the document appears if you forget to accept revisions and turn off the Track-Changes option.
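A minimal sketch of such a step, again using only the standard library, might look like the following. accept_all_revisions is a hypothetical helper, not an existing python-docx or unstructured function. Treating w:moveTo like w:ins and w:moveFrom like w:del is an assumption about how moved text should be resolved, and the sketch does not handle formatting-change records or deleted paragraph marks (which would require merging paragraphs).

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS

# Assumption: wrappers whose content survives acceptance vs. wrappers dropped whole.
UNWRAP = {W + "ins", W + "moveTo"}
DROP = {W + "del", W + "moveFrom"}


def accept_all_revisions(element):
    """Hypothetical helper: rewrite `element` in place as if revisions were accepted."""
    kept = []
    for child in list(element):
        if child.tag in DROP:
            continue  # deleted (or moved-away) content is discarded
        accept_all_revisions(child)  # resolve nested revision marks first
        if child.tag in UNWRAP:
            kept.extend(list(child))  # promote the wrapper's children
        else:
            kept.append(child)
    # Replace the children without touching the element's own attributes.
    for child in list(element):
        element.remove(child)
    element.extend(kept)


REVISED_XML = f"""
<w:p xmlns:w="{W_NS}">
  <w:r><w:t xml:space="preserve">The quick </w:t></w:r>
  <w:del w:id="1" w:author="Editor"><w:r><w:delText>brown</w:delText></w:r></w:del>
  <w:ins w:id="2" w:author="Editor"><w:r><w:t>red</w:t></w:r></w:ins>
  <w:r><w:t xml:space="preserve"> fox</w:t></w:r>
</w:p>
""".strip()

p = ET.fromstring(REVISED_XML)
accept_all_revisions(p)

# After acceptance every run is a direct child of the paragraph again.
text = "".join(t.text or "" for r in p.findall(W + "r") for t in r.findall(W + "t"))
print(text)  # The quick red fox
```

In a real implementation the same transformation would be applied to the document's XML before partitioning, so that the existing paragraph and run handling sees the inserted text at the level it already expects.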