fix: updated caption references in add_node_items #514

Open
lukesan48 wants to merge 5 commits into docling-project:main from lukesan48:fix/issue-2298-caption-ref

Conversation

@lukesan48

Description

Fixes a bug where a TableItem caption reference was not updated during document-to-document node transfers. The source document's index was carried over unchanged instead of being updated to the caption's position in the destination document.

When using add_node_items to copy elements (like a Table) from a source document to a destination document, the caption text was being copied, but the parent element's internal captions list still pointed to the index in the original document. This can result in broken references or an IndexError in the new document.

Related Issues

Attempt to fix docling-project/docling#2298

Changes

  • Modified _append_item_copies to recursively resolve, copy, and re-index captions for FloatingItems.
  • Used getattr/setattr pattern to ensure compatibility with NodeItem base class and satisfy MyPy type checking.

Verification

Verified with a reproduction script extracting a table from a large document:

  • Before: New Table Caption Ref pointed to out-of-bounds index (original doc index).
  • After: New Table Caption Ref correctly points to #/texts/0 (new doc index).
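The before/after behaviour can be illustrated with a minimal, self-contained sketch. It does not use docling-core: `Doc`, `Table`, `copy_table_naive`, and `copy_table_reindexed` are hypothetical stand-ins, and only the `#/texts/N` reference format is taken from the notes above.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Caption references are pointer-like strings such as "#/texts/7".
    captions: list[str] = field(default_factory=list)


@dataclass
class Doc:
    texts: list[str] = field(default_factory=list)
    tables: list[Table] = field(default_factory=list)

    def resolve(self, ref: str) -> str:
        # "#/texts/7" -> self.texts[7]
        return self.texts[int(ref.rsplit("/", 1)[1])]


def copy_table_naive(src: Doc, dst: Doc, table: Table) -> Table:
    # Buggy variant: caption texts are copied, but the refs are carried over as-is.
    for ref in table.captions:
        dst.texts.append(src.resolve(ref))
    table_copy = Table(captions=list(table.captions))
    dst.tables.append(table_copy)
    return table_copy


def copy_table_reindexed(src: Doc, dst: Doc, table: Table) -> Table:
    # Fixed variant: each copied caption gets a ref to its new position in dst.
    new_refs = []
    for ref in table.captions:
        dst.texts.append(src.resolve(ref))
        new_refs.append(f"#/texts/{len(dst.texts) - 1}")
    table_copy = Table(captions=new_refs)
    dst.tables.append(table_copy)
    return table_copy


# A source document where the caption sits at index 7.
src = Doc(texts=[f"paragraph {i}" for i in range(7)] + ["Table 1: results"])
table = Table(captions=["#/texts/7"])
src.tables.append(table)

dst_naive = Doc()
naive = copy_table_naive(src, dst_naive, table)

dst_fixed = Doc()
fixed = copy_table_reindexed(src, dst_fixed, table)
```

In this sketch, resolving `naive.captions[0]` against `dst_naive` raises an IndexError because the destination holds a single text, while `fixed.captions[0]` resolves to the copied caption.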

Signed-off-by: luke <lukesan48@gmail.com>
@github-actions
Contributor

github-actions bot commented Feb 13, 2026

DCO Check Passed

Thanks @lukesan48, all your commits are properly signed off. 🎉

@dosubot

dosubot bot commented Feb 13, 2026

Related Documentation

Checked 14 published document(s) in 1 knowledge base(s). No updates required.

@mergify

mergify bot commented Feb 13, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Comment on lines +4566 to +4567
captions = getattr(item, "captions", None)
if captions:
Member

Instead of using getattr and hasattr, we prefer isinstance(item, ...) checks.
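As a rough illustration of the two styles, here is a sketch with hypothetical stand-in classes (not docling-core's own `NodeItem` and `FloatingItem`). With the isinstance check, a type checker can narrow `item` to the subclass, so the attribute access is verified statically rather than looked up by name at runtime.

```python
class NodeItem:
    """Stand-in base class without a captions attribute."""


class FloatingItem(NodeItem):
    """Stand-in subclass that carries caption references."""

    def __init__(self, captions: list[str]) -> None:
        self.captions = captions


def captions_via_getattr(item: NodeItem) -> list[str]:
    # String-based lookup: works, but the attribute name is not type-checked.
    return getattr(item, "captions", None) or []


def captions_via_isinstance(item: NodeItem) -> list[str]:
    # Explicit type check: inside the branch, item is known to be a FloatingItem.
    if isinstance(item, FloatingItem):
        return item.captions
    return []
```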

Author

Thanks for the review! I've updated the logic to use isinstance(item, FloatingItem) as requested.

I also added a new test case (test_add_node_items_updates_captions) to the test suite.

Ready for another look when you have a moment!

@codecov

codecov bot commented Feb 13, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 3 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling_core/types/doc/document.py | 88.88% | 3 Missing ⚠️

Signed-off-by: luke <lukesan48@gmail.com>
Signed-off-by: luke <lukesan48@gmail.com>
@ceberam self-requested a review February 20, 2026 20:01
Member

@ceberam left a comment

Thanks @lukesan48 for catching this bug!
The fix solves the issue 🏆

However, I realize that FloatingItem also has other fields that would lead to the same issue: references, footnotes, and maybe comments from its parent too.

It would be helpful if you could address those cases in this PR. Otherwise, we can merge it as it is and create a new one.

@lukesan48
Author

Hi @ceberam, thanks for pointing that out! I went ahead and addressed the same issue for the FloatingItem fields references, footnotes, and comments in this PR. Ready for another review whenever you get a chance!

Member

@ceberam left a comment

Thanks @lukesan48 for addressing those extra cases.
I just had some minor style comments that would be good to tackle too.

Comment on lines +4566 to +4569
if isinstance(item, DocItem):
    if item.comments:
        if isinstance(item_copy, DocItem):
            item_copy.comments = self._copy_and_reindex_refs(item.comments, doc=doc, parent_ref=parent_ref)
Member

For readability, can we keep it more compact? (this one and the other similar patterns)

Suggested change
- if isinstance(item, DocItem):
-     if item.comments:
-         if isinstance(item_copy, DocItem):
-             item_copy.comments = self._copy_and_reindex_refs(item.comments, doc=doc, parent_ref=parent_ref)
+ if isinstance(item, DocItem) and item.comments and isinstance(item_copy, DocItem):
+     item_copy.comments = self._copy_and_reindex_refs(item.comments, doc=doc, parent_ref=parent_ref)

Comment on lines +4606 to +4612
"""Helper to copy referenced items and return their new indices

:param ref_list: list[Any]: The list of references (e.g., captions, footnotes, comments) to be copied
:param doc: "DoclingDocument": The document from which the NodeItems are taken
:param parent_ref: RefItem: The reference of the parent item in the current document where copies will be appended to

:returns: list[Any]: A new list of references pointing to the newly appended items in the current document
Member

Even though we have not been consistent in the past, we want to stick to the Google docstring conventions, at least in new code, as specified in pyproject.toml.
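For reference, the docstring quoted above might look roughly like this in Google style. This is only a sketch of the layout: the function is written as a standalone stub with loose `Any` hints, and its body is deliberately omitted because the actual implementation is not shown in this thread.

```python
from typing import Any


def copy_and_reindex_refs(ref_list: list[Any], doc: Any, parent_ref: Any) -> list[Any]:
    """Copy referenced items and return references to the new copies.

    Args:
        ref_list: The references (e.g., captions, footnotes, comments) to copy.
        doc: The document from which the node items are taken.
        parent_ref: The reference of the parent item in the current document
            under which the copies are appended.

    Returns:
        A new list of references pointing to the newly appended items in the
        current document.
    """
    # Stub only: this sketch illustrates the docstring layout, not the logic.
    raise NotImplementedError
```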


return new_refs

def _copy_and_reindex_refs(self, ref_list: list[Any], doc: "DoclingDocument", parent_ref: RefItem) -> list[Any]:
Member

Fine-grained comment: can we be more precise with the type hints? Should list[Any] rather be list[NodeItem]?

Author

Because item.comments expects a list[FineRef] while other fields use list[RefItem], MyPy throws a list invariance error if I type-hint the helper method strictly as list[RefItem] or list[FineRef]. Would you prefer I keep the type hint as list[Any] to handle both cases or split this into two separate helper functions?

Author

@ceberam Just a gentle ping on this when you have a moment.

If list[Any] is a bit too loose for the project's typing standards, I can implement a generic TypeVar (e.g., T_Ref = TypeVar("T_Ref", bound="RefItem") alongside Sequence[T_Ref]). This approach should satisfy MyPy's strict list invariance rules while keeping the exact typing intact for both FineRef and RefItem outputs.

Let me know if you'd like me to push that update, or if you're comfortable moving forward with it as-is!
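A minimal sketch of the TypeVar idea, under stated assumptions: the `RefItem` and `FineRef` classes below are simplified stand-ins (with `FineRef` assumed to subclass `RefItem`), and `reindex_refs` with its `offset` parameter is a hypothetical helper that only shifts indices rather than copying items between documents.

```python
import copy
from typing import Sequence, TypeVar


class RefItem:
    """Stand-in reference holding a pointer string such as "#/texts/5"."""

    def __init__(self, cref: str) -> None:
        self.cref = cref


class FineRef(RefItem):
    """Stand-in subclass (assumed relationship, for illustration only)."""


# Bound TypeVar: the helper accepts any RefItem subtype and returns the same subtype.
T_Ref = TypeVar("T_Ref", bound=RefItem)


def reindex_refs(refs: Sequence[T_Ref], offset: int) -> list[T_Ref]:
    """Return copies of refs with their trailing index shifted by offset."""
    out: list[T_Ref] = []
    for ref in refs:
        prefix, idx = ref.cref.rsplit("/", 1)
        new_ref = copy.copy(ref)  # shallow copy keeps the concrete type
        new_ref.cref = f"{prefix}/{int(idx) + offset}"
        out.append(new_ref)
    return out


plain = reindex_refs([RefItem("#/texts/7")], offset=-7)
fine = reindex_refs([FineRef("#/texts/5")], offset=-5)
```

Because `Sequence` is covariant and the return type is tied to `T_Ref`, a call with a list of `FineRef` is typed as returning a list of `FineRef`, and a call with a list of `RefItem` as returning a list of `RefItem`, without resorting to `Any`.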

Development

Successfully merging this pull request may close these issues.

Add_node_items doesn't update caption reference

3 participants