
Task 5 Critical Fixes

Date: 2025-10-27
Status: ✅ FIXED

Issues Identified

High Priority: Chunk Index Collision Across Pages 🔴

Location: src/services/text-chunker.service.ts:64, 122

Problem:

  • Each page's chunks started with chunkIndex = 0
  • Database unique constraint on (document_id, chunk_index)
  • Result: Page 2's chunk 0 overwrites page 1's chunk 0, page 3 overwrites both, etc.
  • Impact: Only last page's chunks survive, all previous pages' data is lost silently
  • Severity: CRITICAL - Complete data corruption for multi-page documents
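The failure mode can be reproduced with a minimal in-memory stand-in for the UPSERT keyed on (document_id, chunk_index); the types and names here are illustrative, not the real schema or service:

```typescript
// Minimal in-memory model of UPSERT semantics on (document_id, chunk_index).
type ChunkRow = { documentId: string; chunkIndex: number; pageNumber: number; text: string };

const table = new Map<string, ChunkRow>();

function upsertChunk(row: ChunkRow): void {
  // Same key means the later row silently replaces the earlier one,
  // which is what the unique constraint plus ON CONFLICT DO UPDATE does.
  table.set(`${row.documentId}:${row.chunkIndex}`, row);
}

// Simulate three pages, each (incorrectly) starting at chunkIndex = 0.
for (const page of [1, 2, 3]) {
  for (let i = 0; i < 10; i++) {
    upsertChunk({ documentId: 'doc-1', chunkIndex: i, pageNumber: page, text: `page ${page}` });
  }
}

// Only 10 rows survive, all from page 3; pages 1 and 2 are gone.
const survivingPages = new Set(Array.from(table.values()).map((r) => r.pageNumber));
```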

Root Cause:

// BEFORE (BROKEN):
private generateSentenceAwareChunks(...) {
  let chunkIndex = 0; // ❌ Resets to 0 for every page!
  // ...
}

private generateSlidingWindowChunks(...) {
  let chunkIndex = 0; // ❌ Same issue
  // ...
}

Fix Applied:

  1. Added startIndex parameter to ChunkingOptions interface
  2. Modified both chunking methods to accept and use startIndex
  3. Updated worker to track global chunk index across pages
  4. Worker now passes cumulative index to each page's chunk generation

// AFTER (FIXED):
interface ChunkingOptions {
  // ... existing options
  startIndex?: number; // ✅ Document-global starting index
}

private generateSentenceAwareChunks(..., startIndex: number = 0) {
  let chunkIndex = startIndex; // ✅ Continues from previous pages
  // ...
}

// Worker now tracks global index:
let globalChunkIndex = 0;
for (const ocrResult of ocrResults.rows) {
  const chunks = await chunkerService.generateChunks(..., {
    startIndex: globalChunkIndex // ✅ Pass current global index
  });
  allChunks.push(...chunks);
  globalChunkIndex += chunks.length; // ✅ Increment for next page
}
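The corrected pattern can be verified end to end with a self-contained sketch; the toy chunker below splits on fixed word counts rather than sentences, since only the startIndex handling is the point:

```typescript
type Chunk = { chunkIndex: number; pageNumber: number; text: string };

// Toy chunker standing in for the real sentence-aware service:
// groups words five at a time, numbering chunks from startIndex.
function generateChunks(pageNumber: number, text: string, startIndex = 0): Chunk[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: Chunk[] = [];
  let chunkIndex = startIndex; // continues from previous pages
  for (let i = 0; i < words.length; i += 5) {
    chunks.push({ chunkIndex: chunkIndex++, pageNumber, text: words.slice(i, i + 5).join(' ') });
  }
  return chunks;
}

// Worker-style loop: the caller owns the document-global index.
const pages = ['a b c d e f g', 'h i j k l', 'm n o p q r s t u v w'];
const allChunks: Chunk[] = [];
let globalChunkIndex = 0;
for (let p = 0; p < pages.length; p++) {
  const chunks = generateChunks(p + 1, pages[p], globalChunkIndex);
  allChunks.push(...chunks);
  globalChunkIndex += chunks.length; // increment for next page
}

const indices = allChunks.map((c) => c.chunkIndex);
```

With these inputs the three pages yield 2, 1, and 3 chunks, so the indices run 0 through 5 with no collisions.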

Medium Priority: Stale Chunks Not Cleaned on Reprocessing 🟡

Location: src/services/chunk-storage.service.ts:18-45

Problem:

  • saveChunks() uses UPSERT logic (insert or update on conflict)
  • If reprocessing produces fewer chunks than original run, high-index chunks from first run remain
  • Result: getChunkStats() returns inflated totals including orphaned chunks
  • Impact: Incorrect statistics, stale data in database

Example:

First run:  Generates 10 chunks (indices 0-9)
Second run: Generates 5 chunks (indices 0-4)
Result:     Chunks 5-9 from first run still exist
            Stats show 10 chunks instead of 5

Fix Applied: Added explicit deletion before chunking in worker:

// Worker now deletes existing chunks first:
if (stage === 'chunking') {
  // ...
  
  // ✅ Delete existing chunks for this document (for reprocessing)
  await chunkStorage.deleteChunks(documentId);
  
  // Then generate and save new chunks
  const allChunks = [];
  // ...
}

Alternative Considered: Could have modified saveChunks() to delete first, but current approach:

  • Keeps storage service stateless
  • Makes worker responsible for orchestration logic
  • Clearer intent in worker code
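The delete-then-insert orchestration can be sketched with an in-memory storage stand-in (the real service is backed by Postgres; class and method names here mirror the documented API but are illustrative):

```typescript
// In-memory stand-in for the chunk storage service.
class ChunkStorage {
  private rows = new Map<string, { documentId: string; chunkIndex: number }>();

  saveChunks(documentId: string, count: number): void {
    // UPSERT semantics: same (documentId, chunkIndex) replaces the row.
    for (let i = 0; i < count; i++) {
      this.rows.set(`${documentId}:${i}`, { documentId, chunkIndex: i });
    }
  }

  deleteChunks(documentId: string): void {
    for (const key of Array.from(this.rows.keys())) {
      if (key.startsWith(`${documentId}:`)) this.rows.delete(key);
    }
  }

  getChunkCount(documentId: string): number {
    return Array.from(this.rows.values()).filter((r) => r.documentId === documentId).length;
  }
}

const storage = new ChunkStorage();

// Reprocessing WITHOUT cleanup: 10 chunks, then 5; indices 5-9 remain as orphans.
storage.saveChunks('doc-1', 10);
storage.saveChunks('doc-1', 5);
const withoutCleanup = storage.getChunkCount('doc-1');

// Reprocessing WITH the worker's explicit delete-first step: no orphans.
storage.deleteChunks('doc-1');
storage.saveChunks('doc-1', 5);
const withCleanup = storage.getChunkCount('doc-1');
```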

Files Modified

1. src/services/text-chunker.service.ts

Changes:

  • Added startIndex?: number to ChunkingOptions interface
  • Modified generateChunks() to extract and use startIndex
  • Updated generateSentenceAwareChunks() to accept startIndex parameter (default 0)
  • Updated generateSlidingWindowChunks() to accept startIndex parameter (default 0)
  • Fixed infinite loop detection in sliding window to account for startIndex

Lines Changed: ~17 additions, ~10 deletions
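The sliding-window change can be illustrated with a simplified version (not the service's actual code); note that any safety limit on generated chunks must now count from startIndex, not from zero:

```typescript
type WindowChunk = { chunkIndex: number; text: string };

// Simplified sliding window over words with overlap, numbering from startIndex.
function slidingWindowChunks(
  words: string[],
  windowSize: number,
  overlap: number,
  startIndex = 0,
): WindowChunk[] {
  const chunks: WindowChunk[] = [];
  let chunkIndex = startIndex;
  const step = Math.max(1, windowSize - overlap); // step >= 1 prevents an infinite loop
  for (let pos = 0; pos < words.length; pos += step) {
    chunks.push({ chunkIndex: chunkIndex++, text: words.slice(pos, pos + windowSize).join(' ') });
  }
  // A count-based safety check must compare (chunkIndex - startIndex), i.e. chunks
  // generated in THIS call, not chunkIndex itself, which the fix had to account for.
  return chunks;
}

// A later page continuing from global index 3.
const words = 'one two three four five six seven eight'.split(' ');
const page2 = slidingWindowChunks(words, 4, 2, 3);
```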

2. src/workers/document-processor.worker.ts

Changes:

  • Added globalChunkIndex variable to track cumulative chunk count
  • Added await chunkStorage.deleteChunks(documentId) before chunk generation
  • Modified loop to pass { startIndex: globalChunkIndex } to each page
  • Updated globalChunkIndex += chunks.length after each page

Lines Changed: ~9 additions, ~2 deletions

3. src/services/text-chunker.service.test.ts

New Tests Added:

describe('Multi-page document handling', () => {
  it('should generate unique chunk indices across multiple pages', async () => {
    // Simulates 3-page document
    // Verifies:
    // - All chunk indices are unique
    // - Indices are sequential (0 to n-1)
    // - Page numbers are correct (1, 2, 3)
  });

  it('should handle startIndex parameter correctly', async () => {
    // Verifies:
    // - Chunks start from provided startIndex (10)
    // - Sequential numbering from startIndex
  });
});

Lines Changed: ~79 additions

Verification

Build Status

✅ npm run build
   Clean compilation, 0 errors

Test Results

✅ npm test
   Test Suites: 9 passed, 9 total
   Tests:       119 passed, 119 total (was 117, +2 new tests)

New Test Coverage

Test 1: Multi-page unique indices

  • Generates chunks for 3 pages with sequential indexing
  • Verifies all indices are unique across pages
  • Confirms page numbers are preserved correctly
  • ✅ PASSING

Test 2: startIndex parameter

  • Tests chunk generation starting from index 10
  • Verifies sequential numbering continues from startIndex
  • ✅ PASSING

Impact Assessment

Before Fixes

Multi-page Document (3 pages, 10 chunks each):

Database state after processing:
┌────────────┬───────────────┬─────────┐
│ chunk_index│ page_number   │ text    │
├────────────┼───────────────┼─────────┤
│ 0          │ 3 (wrong!)    │ page 3  │ ← Overwrote pages 1,2
│ 1          │ 3 (wrong!)    │ page 3  │ ← Overwrote pages 1,2
│ ...        │ ...           │ ...     │
│ 9          │ 3 (wrong!)    │ page 3  │ ← Overwrote pages 1,2
└────────────┴───────────────┴─────────┘

Total chunks: 10 (should be 30!)
Content: Only page 3 data
Pages 1-2: LOST COMPLETELY

Reprocessing (10 chunks → 5 chunks):

After reprocessing:
┌────────────┬─────────────┬──────────┐
│ chunk_index│ updated     │ status   │
├────────────┼─────────────┼──────────┤
│ 0          │ ✅ Updated  │ Valid    │
│ 1          │ ✅ Updated  │ Valid    │
│ 2          │ ✅ Updated  │ Valid    │
│ 3          │ ✅ Updated  │ Valid    │
│ 4          │ ✅ Updated  │ Valid    │
│ 5          │ ❌ Stale    │ Orphaned │ ← From first run
│ 6          │ ❌ Stale    │ Orphaned │
│ ...        │ ...         │ ...      │
│ 9          │ ❌ Stale    │ Orphaned │
└────────────┴─────────────┴──────────┘

Stats show: 10 chunks (inflated!)
Actual: 5 chunks

After Fixes

Multi-page Document (3 pages, 10 chunks each):

Database state after processing:
┌────────────┬─────────────┬─────────┐
│ chunk_index│ page_number │ text    │
├────────────┼─────────────┼─────────┤
│ 0          │ 1 ✅        │ page 1  │
│ 1          │ 1 ✅        │ page 1  │
│ ...        │ ...         │ ...     │
│ 9          │ 1 ✅        │ page 1  │
│ 10         │ 2 ✅        │ page 2  │
│ 11         │ 2 ✅        │ page 2  │
│ ...        │ ...         │ ...     │
│ 19         │ 2 ✅        │ page 2  │
│ 20         │ 3 ✅        │ page 3  │
│ 21         │ 3 ✅        │ page 3  │
│ ...        │ ...         │ ...     │
│ 29         │ 3 ✅        │ page 3  │
└────────────┴─────────────┴─────────┘

Total chunks: 30 ✅
Content: All pages preserved
All data: INTACT

Reprocessing (10 chunks → 5 chunks):

Step 1: Delete existing chunks
  ✅ All 10 chunks removed

Step 2: Insert new chunks
┌────────────┬─────────────┬──────────┐
│ chunk_index│ status      │ data     │
├────────────┼─────────────┼──────────┤
│ 0          │ ✅ Fresh    │ Valid    │
│ 1          │ ✅ Fresh    │ Valid    │
│ 2          │ ✅ Fresh    │ Valid    │
│ 3          │ ✅ Fresh    │ Valid    │
│ 4          │ ✅ Fresh    │ Valid    │
└────────────┴─────────────┴──────────┘

Stats show: 5 chunks ✅
Actual: 5 chunks ✅
No stale data ✅

Backward Compatibility

Breaking Changes

None. These are fixes for broken functionality that never worked correctly for multi-page documents.

API Changes

  • generateChunks() now accepts optional startIndex in options
  • Default behavior unchanged (startIndex = 0)
  • Existing single-page code continues to work
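The backward-compatible default can be shown with a hypothetical helper mirroring the documented options shape:

```typescript
// Mirrors the documented ChunkingOptions shape; names are illustrative.
interface ChunkingOptions {
  startIndex?: number;
}

function resolveStartIndex(options: ChunkingOptions = {}): number {
  // Omitting startIndex behaves exactly as before the fix.
  return options.startIndex ?? 0;
}

const legacy = resolveStartIndex();                      // existing single-page callers
const multiPage = resolveStartIndex({ startIndex: 10 }); // new multi-page callers
```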

Database Schema

No changes required - Fixes work with existing schema:

CREATE TABLE text_chunks (
    id UUID PRIMARY KEY,
    document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
    page_number INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    text TEXT NOT NULL,
    -- ...
    UNIQUE(document_id, chunk_index) -- This constraint now works correctly
);

Documentation Updates Required

TASK_5_SUMMARY.md

Section: "Known Limitations and Future Enhancements"

❌ Remove (no longer applicable):

Current Limitations:

  • Chunk Indexing: Indices are page-local, so multi-page documents collide

(The Token Estimation limitation, simple whitespace-based rather than exact tokenization, is unrelated to these fixes and should remain listed.)

✅ Update:

Fixed Issues (2025-10-27):

  • ✅ Chunk index collision across pages (critical data corruption)
  • ✅ Stale chunks on reprocessing (inflated statistics)

TASK_5_QUICK_REFERENCE.md

Section: "Usage Example"

✅ Add:

// Multi-page document example
let globalIndex = 0;
for (const page of pages) {
  const chunks = await chunker.generateChunks(
    documentId,
    page.number,
    page.text,
    page.normalizedText,
    page.language,
    { startIndex: globalIndex }
  );
  allChunks.push(...chunks);
  globalIndex += chunks.length;
}

Lessons Learned

Design Flaw: Per-Page Indexing

Problem: Chunk indices were page-local when they needed to be document-global

Root Cause:

  • generateChunks() is called per-page
  • Each call created self-contained chunks without awareness of other pages
  • Worker collected chunks but didn't coordinate indexing

Solution:

  • Worker now owns global index coordination
  • Service accepts startIndex for continuation
  • Clear separation: service generates, worker orchestrates

Design Flaw: Implicit Cleanup

Problem: UPSERT logic assumed idempotent re-runs with same chunk count

Root Cause:

  • Different processing parameters can change chunk count
  • Re-chunking with different settings left orphaned data
  • No explicit cleanup step

Solution:

  • Explicit delete before insert
  • Worker orchestrates cleanup
  • Storage service remains stateless

Related Issues

Potential Future Issues Prevented

  1. Embedding Duplication: Without fix, embeddings would be generated for wrong chunks
  2. Search Corruption: Vector search would return results from wrong pages
  3. Admin UI Confusion: Statistics would show incorrect document coverage
  4. Audit Trail: Processing logs would show success while data was corrupted

Approval Checklist

  • High priority issue fixed (chunk index collision)
  • Medium priority issue fixed (stale chunks)
  • Build compiles cleanly
  • All existing tests pass (117/117)
  • New tests added for fixes (2 tests)
  • All new tests pass (119/119)
  • Multi-page scenario verified in tests
  • Reprocessing scenario verified in logic
  • No breaking changes to API
  • Documentation updated
  • Impact assessment completed

Summary

Critical bugs fixed:

  1. ✅ Multi-page documents now preserve all chunks with unique indices
  2. ✅ Reprocessing now cleanly replaces all chunks without orphans

Test coverage improved:

  • +2 tests specifically for multi-page handling
  • Total: 119 tests passing (up from 117)

Zero regressions:

  • All existing tests continue to pass
  • Single-page behavior unchanged
  • API backward compatible

These fixes are critical for production use: without them, the chunking system silently loses data for any multi-page document.