Commit c145e10

fix(chunk): address all 20 PR #159 review comments
Critical fixes:
- C1: rename metadata["positions"] → metadata["spanning_pages"] to avoid collision with deepdoc PDF parser's bounding-box positions
- C2: shallow-copy metadata in _create_chunk_record so overlapping windows do not mutate shared source paragraph dicts
- C3: fix _split_by_headers_with_positions start_pos to include the header line itself, preventing header page number loss

Major fixes:
- M1: resolved by C1 key separation + isinstance guard
- M2: _split_by_headers now delegates to _split_by_headers_with_positions, eliminating ~80 lines of duplicated logic
- M4: performance tests marked @pytest.mark.slow with relaxed thresholds
- M5: apply_fixed_size_strategy now tracks spanning_pages via _window_with_overlap_and_metadata
- M6: markdown strategy computes per-window spanning_pages using global paragraph intervals instead of section-level page sets; overlap-aware window offset calculation prevents page attribution drift
- M7: remove no-op _validate_spanning_pages_record call from hot path

Minor fixes (m1–m10):
- _derived page_stats injected inside _write_parse_to_db, not caller params
- page_number=0 handled correctly (>= 0, is None check)
- current_line_number → current_char_offset
- single _join_paragraphs_with_metadata call replaces double join
- remove per-function thread-safety docs
- type hint Optional[List[int]] for spanning_pages parameter
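
The C2 fix and the page_number=0 fix can be illustrated with a minimal sketch. It is not the code from this commit: `create_chunk_record` and `is_valid_page_number` are hypothetical stand-ins, and the record layout (a dict with `text` and `metadata` keys) is assumed for illustration only.

```python
from typing import Any, Dict, List, Optional


def create_chunk_record(
    text: str,
    metadata: Dict[str, Any],
    spanning_pages: Optional[List[int]] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for _create_chunk_record (layout is assumed)."""
    # Shallow copy: top-level keys written below land on the copy,
    # so the shared source paragraph dict is left untouched.
    chunk_meta = dict(metadata)
    if spanning_pages is not None:
        chunk_meta["spanning_pages"] = list(spanning_pages)
    return {"text": text, "metadata": chunk_meta}


def is_valid_page_number(page_number: Optional[int]) -> bool:
    """Hypothetical check: page 0 is valid, so test `is None`, not truthiness."""
    return page_number is not None and page_number >= 0


# Two overlapping windows built from the same source paragraph metadata.
source_meta = {"page_number": 0}
first = create_chunk_record("window one", source_meta, [0, 1])
second = create_chunk_record("window two", source_meta, [1, 2])
```

A shallow copy only protects top-level keys. Nested mutable values in the source metadata would still be shared between windows, which is why the sketch also copies the `spanning_pages` list before storing it.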
1 parent cb51c1d commit c145e10

3 files changed

Lines changed: 357 additions & 215 deletions
