Commit c145e10
fix(chunk): address all 20 PR #159 review comments
Critical fixes:
- C1: rename metadata["positions"] → metadata["spanning_pages"] to avoid
  collision with deepdoc PDF parser's bounding-box positions
- C2: shallow-copy metadata in _create_chunk_record so overlapping windows
  do not mutate shared source paragraph dicts
- C3: fix _split_by_headers_with_positions start_pos to include the header
  line itself, preventing header page number loss
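The aliasing problem behind C2 can be illustrated with a minimal sketch. The function `create_chunk_record` below is a hypothetical stand-in, not the real `_create_chunk_record`, and the record layout is an assumption made for illustration only.

```python
from typing import Any, Dict


def create_chunk_record(text: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for _create_chunk_record (layout is assumed)."""
    # dict(metadata) is a shallow copy: top-level keys become independent,
    # while nested objects are still shared with the source dict.
    chunk_meta = dict(metadata)
    return {"text": text, "metadata": chunk_meta}


# One source paragraph dict feeds two overlapping windows.
source_meta = {"page_number": 3}
first = create_chunk_record("window one", source_meta)
second = create_chunk_record("window two", source_meta)

# Writing a per-window key no longer leaks into the source or the sibling.
first["metadata"]["spanning_pages"] = [3, 4]
```

A shallow copy is enough when each window only adds or replaces top-level keys. If a window mutated a nested list in place, `copy.deepcopy` would be needed instead.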
Major fixes:
- M1: resolved by C1 key separation + isinstance guard
- M2: _split_by_headers now delegates to _split_by_headers_with_positions,
  eliminating ~80 lines of duplicated logic
- M4: performance tests marked @pytest.mark.slow with relaxed thresholds
- M5: apply_fixed_size_strategy now tracks spanning_pages via
  _window_with_overlap_and_metadata
- M6: markdown strategy computes per-window spanning_pages using global
  paragraph intervals instead of section-level page sets; overlap-aware
  window offset calculation prevents page attribution drift
- M7: remove no-op _validate_spanning_pages_record call from hot path
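The idea in M6 can be sketched roughly as follows. The interval layout `(start, end, page)`, the helper names, and the window parameters are all assumptions for illustration; the actual markdown strategy may differ.

```python
from typing import List, Tuple

# Assumed layout: (start_offset, end_offset_exclusive, page_number),
# with offsets measured against the whole joined text, not per section.
ParagraphInterval = Tuple[int, int, int]


def window_bounds(total_len: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of fixed-size windows with overlap."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than window size")
    # A start is only needed if it adds text past the previous window's end.
    starts = range(0, max(total_len - overlap, 1), step)
    return [(s, min(s + size, total_len)) for s in starts]


def spanning_pages(intervals: List[ParagraphInterval], start: int, end: int) -> List[int]:
    """Pages of every paragraph whose interval intersects [start, end)."""
    pages = {page for p_start, p_end, page in intervals if p_start < end and p_end > start}
    return sorted(pages)


intervals = [(0, 10, 1), (10, 20, 2), (20, 30, 3)]
per_window = [spanning_pages(intervals, s, e) for s, e in window_bounds(30, 12, 4)]
# per_window == [[1, 2], [1, 2], [2, 3], [3]]
```

Because each window start already accounts for the overlap (`step = size - overlap`), a window is attributed only to pages it actually touches, which is the drift the commit message describes.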
Minor fixes (m1–m10):
- _derived page_stats injected inside _write_parse_to_db, not caller params
- page_number=0 handled correctly (>= 0, is None check)
- current_line_number → current_char_offset
- single _join_paragraphs_with_metadata call replaces double join
- remove per-function thread-safety docs
- type hint Optional[List[int]] for spanning_pages parameter
1 parent cb51c1d commit c145e10
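The page_number=0 item among the minor fixes usually comes down to replacing a truthiness test with an explicit check. A minimal sketch, assuming a hypothetical helper `has_valid_page` that is not part of the commit:

```python
from typing import Optional


def has_valid_page(page_number: Optional[int]) -> bool:
    """Hypothetical helper: treat 0 as a real page, reject None and negatives."""
    # `if page_number:` would wrongly drop page 0, because 0 is falsy.
    return page_number is not None and page_number >= 0
```

Under this assumption a zero-indexed first page survives, while missing or negative values are still rejected.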
3 files changed
Lines changed: 357 additions & 215 deletions
File tree
- src/xagent/core/tools/core/RAG_tools
  - chunk
  - parse
- tests/core/tools/core/RAG_tools/chunk