fix: updated caption references in add_node_items#514
fix: updated caption references in add_node_items#514lukesan48 wants to merge 5 commits intodocling-project:mainfrom
Conversation
Signed-off-by: luke <lukesan48@gmail.com>
|
✅ DCO Check Passed Thanks @lukesan48, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
docling_core/types/doc/document.py
Outdated
| captions = getattr(item, "captions", None) | ||
| if captions: |
There was a problem hiding this comment.
instead of using getattr and hasattr, we prefer using the isinstance(item, ...)
There was a problem hiding this comment.
Thanks for the review! I've updated the logic to use isinstance(item, FloatingItem) as requested.
I also added a new test case (test_add_node_items_updates_captions) to the test suite.
Ready for another look when you have a moment!
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Signed-off-by: luke <lukesan48@gmail.com>
Signed-off-by: luke <lukesan48@gmail.com>
ceberam
left a comment
There was a problem hiding this comment.
Thanks @lukesan48 for catching this bug!
The fix solves the issue 🏆
However, I realize that FloatingItem has also other fields that would lead to the same issue: references, footnotes, and maybe comments from its parent too.
It would be helpful if you could address those cases in this PR. Otherwise, we can merge it as it is and create a new one.
…: references, footnotes, and comments Signed-off-by: luke <lukesan48@gmail.com>
|
Hi @ceberam thanks for pointing that out! I went ahead and addressed the same issue for |
ceberam
left a comment
There was a problem hiding this comment.
Thanks @lukesan48 for addressing those extra cases.
I just had some minor style comments that would be good to tackle too.
| if isinstance(item, DocItem): | ||
| if item.comments: | ||
| if isinstance(item_copy, DocItem): | ||
| item_copy.comments = self._copy_and_reindex_refs(item.comments, doc=doc, parent_ref=parent_ref) |
There was a problem hiding this comment.
For readability, can we keep it more compact? (this one and the other similar patterns)
| if isinstance(item, DocItem): | |
| if item.comments: | |
| if isinstance(item_copy, DocItem): | |
| item_copy.comments = self._copy_and_reindex_refs(item.comments, doc=doc, parent_ref=parent_ref) | |
| if isinstance(item, DocItem) and item.comments and isinstance(item_copy, DocItem): | |
| item_copy.comments = self._copy_and_reindex_refs(item.comments, doc=doc, parent_ref=parent_ref) |
| """Helper to copy referenced items and return their new indices | ||
|
|
||
| :param ref_list: list[Any]: The list of references (e.g., captions, footnotes, comments) to be copied | ||
| :param doc: "DoclingDocument": The document from which the NodeItems are taken | ||
| :param parent_ref: RefItem: The reference of the parent item in the current document where copies will be appended to | ||
|
|
||
| :returns: list[Any]: A new list of references pointing to the newly appended items in the current document |
There was a problem hiding this comment.
Even though we have not been consistent in the past, we want to stick to the google docstring conventions at least on new code, as we specify it on pyproject.toml
|
|
||
| return new_refs | ||
|
|
||
| def _copy_and_reindex_refs(self, ref_list: list[Any], doc: "DoclingDocument", parent_ref: RefItem) -> list[Any]: |
There was a problem hiding this comment.
Fine grain comment: can we be more precise with the type hints? Would list[Any] rather be list[NodeItem] ?
There was a problem hiding this comment.
Because item.comments expects a list[FineRef] while other fields use list[RefItem], MyPy throws a list invariance error if I type-hint the helper method strictly as list[RefItem] or list[FineRef]. Would you prefer I keep the type hint as list[Any] to handle both cases or split this into two separate helper functions?
There was a problem hiding this comment.
@ceberam Just a gentle ping on this when you have a moment.
If list[Any] is a bit too loose for the project's typing standards, I can implement a generic TypeVar (e.g., T_Ref = TypeVar("T_Ref", bound="RefItem") alongside Sequence[T_Ref]). This approach should satisfy MyPy's strict list invariance rules while keeping the exact typing intact for both FineRef and RefItem outputs.
Let me know if you'd like me to push that update, or if you're comfortable moving forward with it as-is!
Description
Fixes a bug where a
TableItemcaption reference was not correctly updated during document-to-document node transfers. The index was carried over exactly instead of updating.When using
add_node_itemsto copy elements (like a Table) from a source document to a destination document, the caption text was being copied, but the parent element's internalcaptionslist still pointed to the index in the original documen. This can result in broken references orIndexErrorin the new document.Related Issues
Attempt to fix docling-project/docling#2298
Changes
_append_item_copiesto recursively resolve, copy, and re-index captions for FloatingItems.getattr/setattrpattern to ensure compatibility withNodeItembase class and satisfy MyPy type checking.Verification
Verified with a reproduction script extracting a table from a large document:
#/texts/0(new doc index).