Skip to content

Feat: Preserve multi-page chunk metadata and document page stats#159

Open
sqhyz55 wants to merge 6 commits into
xorbitsai:mainfrom
sqhyz55:feature/chunk-multipage-metadata
Open

Feat: Preserve multi-page chunk metadata and document page stats#159
sqhyz55 wants to merge 6 commits into
xorbitsai:mainfrom
sqhyz55:feature/chunk-multipage-metadata

Conversation

@sqhyz55
Copy link
Copy Markdown
Collaborator

@sqhyz55 sqhyz55 commented Mar 16, 2026

No description provided.

@sqhyz55 sqhyz55 requested a review from rogercloud March 16, 2026 09:51
@sqhyz55 sqhyz55 changed the title Preserve multi-page chunk metadata and document page stats Feat: Preserve multi-page chunk metadata and document page stats Mar 16, 2026
@sqhyz55 sqhyz55 force-pushed the feature/chunk-multipage-metadata branch from 3d65051 to 7d6e52d Compare April 7, 2026 10:14
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Preserve multi-page chunk metadata and document page stats

Found 3 critical, 7 major, and 10 minor issues. All verified with live reproduction against the PR code.

Critical (must fix before merge)

  • C1: metadata["positions"] key collision destroys PDF bounding box data. The deepdoc PDF parser already uses metadata["positions"] for bounding box visualization data ([[page_num, col_id, left, right, top, bottom], ...]). This PR overwrites it with page numbers [1, 2, 3]. Fix: use a different key (e.g., page_positions or spanning_pages).

  • C2: Shared metadata dict mutation. _create_chunk_record stores a reference (not a copy) to the source paragraph's metadata dict. When overlapping windows share the same source paragraph, _apply_positions_to_record mutates the shared dict, causing the last window's positions to overwrite all earlier ones. Verified: chunk with only page-1 content incorrectly gets positions=[1, 2]. Fix: metadata = dict((source_paragraph or {}).get("metadata", {})).

  • C3: Header paragraph page lost in markdown strategy. _split_by_headers_with_positions sets current_start_pos to after the header line, but the header IS included in the section text. The header paragraph's interval doesn't overlap the section range, so its page number is excluded. Verified: # Title on page 1 + content on page 2 produces positions=[2] (page 1 lost).

Major

  • M1: _validate_positions_record crashes with TypeError: unhashable type: 'list' on existing bounding-box positions. Currently masked by C1.
  • M2: _split_by_headers_with_positions duplicates ~80 lines from _split_by_headers.
  • M3: 8+ new functions with zero dedicated tests.
  • M4: Performance tests with hard timing assertions are inherently flaky on CI.
  • M5: apply_fixed_size_strategy has no positions tracking.
  • M6: Markdown strategy discards per-window positions; all chunks in a section get identical section-level positions.
  • M7: Validation runs on every chunk but only logs at debug -- effectively a no-op.

Minor (see inline comments)

m1-m10: _derived injected after validation, page_number=0 silently overridden, source paragraphs permanently mutated, Chinese test comments, misleading variable name, double text join, inconsistent thread-safety docs, page 0 silently dropped, loose type hint, inconsistent logging levels.

Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py Outdated
sqhyz55 added a commit to sqhyz55/xagent that referenced this pull request May 11, 2026
Critical fixes:
- C1: rename metadata["positions"] → metadata["spanning_pages"] to avoid
  collision with deepdoc PDF parser's bounding-box positions
- C2: shallow-copy metadata in _create_chunk_record so overlapping windows
  do not mutate shared source paragraph dicts
- C3: fix _split_by_headers_with_positions start_pos to include the header
  line itself, preventing header page number loss

Major fixes:
- M1: resolved by C1 key separation + isinstance guard
- M2: _split_by_headers now delegates to _split_by_headers_with_positions,
  eliminating ~80 lines of duplicated logic
- M4: performance tests marked @pytest.mark.slow with relaxed thresholds
- M5: apply_fixed_size_strategy now tracks spanning_pages via
  _window_with_overlap_and_metadata
- M6: markdown strategy computes per-window spanning_pages using global
  paragraph intervals instead of section-level page sets; overlap-aware
  window offset calculation prevents page attribution drift
- M7: remove no-op _validate_spanning_pages_record call from hot path

Minor fixes (m1–m10):
- _derived page_stats injected inside _write_parse_to_db, not caller params
- page_number=0 handled correctly (>= 0, is None check)
- current_line_number → current_char_offset
- single _join_paragraphs_with_metadata call replaces double join
- remove per-function thread-safety docs
- type hint Optional[List[int]] for spanning_pages parameter
@sqhyz55 sqhyz55 force-pushed the feature/chunk-multipage-metadata branch from c13d857 to c145e10 Compare May 11, 2026 07:05
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings from re-review. These are separate from the already-resolved earlier threads.

Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/parse/parse_document.py Outdated
Comment thread CHANGELOG.md Outdated
sqhyz55 added a commit to sqhyz55/xagent that referenced this pull request May 12, 2026
- Deep-copy metadata in _create_chunk_record to isolate nested objects
- Add per-window spanning_pages for custom-separator markdown path
- Restore \n\n paragraph separators in fixed-size chunking
- Normalise PyPDFLoader 0-based page numbers to 1-based
- Unify page number filters to >= 1 across chunk and parse paths
- Fix CHANGELOG metadata key: positions → spanning_pages
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional validated follow-up findings from re-review.

Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py Outdated
Comment thread src/xagent/core/tools/core/RAG_tools/parse/parse_document.py
Comment thread tests/core/tools/core/RAG_tools/chunk/test_chunk_strategies.py Outdated
Copy link
Copy Markdown
Collaborator

@rogercloud rogercloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional validated review findings from the follow-up pass.

Comment thread src/xagent/core/tools/core/RAG_tools/chunk/chunk_strategies.py
Comment thread src/xagent/core/tools/core/RAG_tools/parse/parse_document.py Outdated
sqhyz55 added 6 commits May 19, 2026 11:47
Implement comprehensive improvements to chunk metadata handling and performance:

- Preserve multi-page chunk origins via metadata["positions"] list
- Derive page_number from first page in positions when missing
- Add document-level page statistics (page_count, page_numbers) to parse params

- Replace O(n*m) character-level metadata with O(n) interval mapping
- Reduce memory footprint from O(n) to O(k) where k << n
- Optimize paragraph collection with direct iteration instead of generators

- Fix interval overlap check logic (was checking same condition twice)
- Fix position tracking in apply_markdown_strategy with accurate ranges
- Add section_end validation to prevent empty range queries

- Add _validate_positions_record() for metadata consistency checks
- Add fallback logic in apply_markdown_strategy for error resilience
- Add thread-safety and determinism documentation

- Add performance benchmark tests for large documents (100+ pages)
- Add multi-page positions validation tests
- All 152 related tests passing

- 1000-page documents processed in <2 seconds (previously much slower)
- Memory usage reduced by ~80% for large documents
- Backward compatible with existing chunk data
Critical fixes:
- C1: rename metadata["positions"] → metadata["spanning_pages"] to avoid
  collision with deepdoc PDF parser's bounding-box positions
- C2: shallow-copy metadata in _create_chunk_record so overlapping windows
  do not mutate shared source paragraph dicts
- C3: fix _split_by_headers_with_positions start_pos to include the header
  line itself, preventing header page number loss

Major fixes:
- M1: resolved by C1 key separation + isinstance guard
- M2: _split_by_headers now delegates to _split_by_headers_with_positions,
  eliminating ~80 lines of duplicated logic
- M4: performance tests marked @pytest.mark.slow with relaxed thresholds
- M5: apply_fixed_size_strategy now tracks spanning_pages via
  _window_with_overlap_and_metadata
- M6: markdown strategy computes per-window spanning_pages using global
  paragraph intervals instead of section-level page sets; overlap-aware
  window offset calculation prevents page attribution drift
- M7: remove no-op _validate_spanning_pages_record call from hot path

Minor fixes (m1–m10):
- _derived page_stats injected inside _write_parse_to_db, not caller params
- page_number=0 handled correctly (>= 0, is None check)
- current_line_number → current_char_offset
- single _join_paragraphs_with_metadata call replaces double join
- remove per-function thread-safety docs
- type hint Optional[List[int]] for spanning_pages parameter
- Deep-copy metadata in _create_chunk_record to isolate nested objects
- Add per-window spanning_pages for custom-separator markdown path
- Restore \n\n paragraph separators in fixed-size chunking
- Normalise PyPDFLoader 0-based page numbers to 1-based
- Unify page number filters to >= 1 across chunk and parse paths
- Fix CHANGELOG metadata key: positions → spanning_pages
Extract DeepDoc bbox and page_number merging into utils/paragraph_page_utils
so chunking and parse page_stats stay consistent. Skip _derived only when
validating persisted parse params. Replace wall-clock chunk tests with
correctness checks, add contract tests, and run spanning_pages validation
after apply (warnings only).
…ages on re-window

Reject system-only _derived in user parse requests; validate persisted params_json
via a separate path. Strip _derived from parse hash and parser kwargs. Merge
precomputed spanning_pages in _window_with_overlap_and_metadata to fix markdown
secondary windowing attribution.
@sqhyz55 sqhyz55 force-pushed the feature/chunk-multipage-metadata branch from a70831a to 53dcd44 Compare May 19, 2026 03:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants