Skip to content

feat: add page_no to TableCell#467

Open
ryyhan wants to merge 1 commit intodocling-project:mainfrom
ryyhan:fix/issue-410-add-page-no-to-table-cell
Open

feat: add page_no to TableCell#467
ryyhan wants to merge 1 commit intodocling-project:mainfrom
ryyhan:fix/issue-410-add-page-no-to-table-cell

Conversation

@ryyhan
Copy link

@ryyhan ryyhan commented Jan 8, 2026

Description

Resolves #410.

Adds page_no attribute to the TableCell class to support cross-page table data as requested. This allows upstream parsers to store page location information for individual cells in multi-page tables.

Changes

  • Modified TableCell in docling_core/types/doc/document.py to include optional page_no.
  • Updated regression test gold file test/data/docling_document/unit/TableItem.yaml.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • New and existing unit tests pass locally with my changes

@github-actions
Copy link
Contributor

github-actions bot commented Jan 8, 2026

DCO Check Passed

Thanks @ryyhan, all your commits are properly signed off. 🎉

@dosubot
Copy link

dosubot bot commented Jan 8, 2026

Related Documentation

Checked 6 published document(s) in 1 knowledge base(s). No updates required.

How did I do? Any feedback?  Join Discord

@mergify
Copy link

mergify bot commented Jan 8, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@ceberam
Copy link
Member

ceberam commented Jan 9, 2026

@ryyhan Thanks for this PR. I think it makes sense to keep track of the page number (when available) in table cells.
Could you please check that the PR passes the code and style checks?
Simply run uv run pre-commit run --all-files. You may also want to install pre-commit locally with uv run pre-commit install to set up the git hook scripts.

@ryyhan ryyhan force-pushed the fix/issue-410-add-page-no-to-table-cell branch from 54fd775 to 67365e1 Compare January 9, 2026 11:35
@codecov
Copy link

codecov bot commented Jan 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

ceberam
ceberam previously approved these changes Jan 9, 2026
Copy link
Member

@ceberam ceberam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ryyhan
Copy link
Author

ryyhan commented Jan 9, 2026

LGTM

Thank you for the review and approval, @ceberam! I'm glad the changes look good.

@ryyhan
Copy link
Author

ryyhan commented Jan 12, 2026

@dolfim-ibm / @PeterStaar-IBM could you also please take a look when you have a moment? We need a second approval for the test updates to merge this.

@PeterStaar-IBM
Copy link
Member

@ryyhan I am not sure about this, it looks like a patch instead of a proper solution

@ryyhan
Copy link
Author

ryyhan commented Jan 16, 2026

@ryyhan I am not sure about this, it looks like a patch instead of a proper solution

Hi @PeterStaar-IBM, thank you for your review.

I appreciate your feedback regarding the implementation. To address your concern about the current approach and to ensure better structural alignment with the rest of the codebase (similar to DocItem), I would like to propose the following changes:

  1. Introduce prov: Instead of adding a standalone page_no field, I propose adding prov: list[ProvenanceItem] to TableCell. This would strictly follow the standard pattern used elsewhere for tracking page numbers and location.
  2. Computed bbox: To avoid data redundancy (having both prov and bbox store coordinates) while maintaining backward compatibility, I suggest converting bbox into a property that dynamically calculates the enclosing bounding box from the prov list. This ensures a single source of truth.

Could you please let me know if this aligns with the solution you envisioned? I would be happy to proceed with this refactor if you agree.

@cau-git
Copy link
Member

cau-git commented Jan 16, 2026

@ryyhan Having a prov: List[ProvenanceItem] on TableCell would indeed be a proper way. However to ensure backward-compatibility, it would require to introduce a getter/setter for bbox as a property that syncs with the provenance. Lots of code we don't control is reading from and assigning to the bbox field. You can try that but it will be delicate to test, so this might take a bit of time for us to validate.

@ryyhan
Copy link
Author

ryyhan commented Jan 18, 2026

@ryyhan Having a prov: List[ProvenanceItem] on TableCell would indeed be a proper way. However to ensure backward-compatibility, it would require to introduce a getter/setter for bbox as a property that syncs with the provenance. Lots of code we don't control is reading from and assigning to the bbox field. You can try that but it will be delicate to test, so this might take a bit of time for us to validate.

Thanks @cau-git for the guidance.

I understand the backward compatibility requirement. I will implement prov as the source of truth and add a bbox property with both a getter (calculating the union of provenance bboxes) and a setter (updating the provenance to reflect the new box) to ensure existing code continues to work seamlessly.

I will also include comprehensive tests to verify that reading/writing bbox behaves exactly as expected. I'll proceed with this implementation now.

…erty

Signed-off-by: ryyhan <dayel.rehan@gmail.com>
@ryyhan ryyhan force-pushed the fix/issue-410-add-page-no-to-table-cell branch from 613e0f8 to d4c7bc8 Compare January 18, 2026 13:42
@ryyhan
Copy link
Author

ryyhan commented Jan 18, 2026

@cau-git, I have implemented the changes as discussed to ensure full backward compatibility.

  • Replaced the direct bbox field with a Python property (with both getter and setter).
  • Added prov: list[ProvenanceItem] as the source of truth for location data.
  • The bbox property now dynamically syncs with prov:
    • Getter: Returns the enclosing bbox of the prov items (or falls back to a private _bbox for legacy data).
    • Setter: Updates the prov item's bbox if it exists, ensuring writes to cell.bbox are reflected in the provenance.
  • Overrode init to handle legacy initialization (e.g., TableCell(bbox=...)), ensuring it correctly populates the internal state.

I've verified this locally with tests covering legacy creation, new creation, and setter updates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bee] need page_no on TableCell to support cross page table data

4 participants