Date: 2025-10-27
Status: ✅ FIXED
Location: src/services/text-chunker.service.ts:64, 122
Problem:
- Each page's chunks started with `chunkIndex = 0`
- Database unique constraint on `(document_id, chunk_index)`
- Result: Page 2's chunk 0 overwrites page 1's chunk 0, page 3 overwrites both, etc.
- Impact: Only the last page's chunks survive; all previous pages' data is lost silently
- Severity: CRITICAL - complete data corruption for multi-page documents
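The failure mode can be reproduced without a database. In this sketch (hypothetical types; a `Map` stands in for the table, with its key playing the role of the unique constraint and `set()` playing the role of the upsert), three pages each restart `chunkIndex` at 0 and collide on the same keys:

```typescript
// Hypothetical in-memory model of the text_chunks table: the Map key models
// the (document_id, chunk_index) unique constraint, and set() models
// INSERT ... ON CONFLICT DO UPDATE.
type ChunkRow = { documentId: string; chunkIndex: number; page: number; text: string };

const table = new Map<string, ChunkRow>();

function upsert(row: ChunkRow): void {
  table.set(`${row.documentId}:${row.chunkIndex}`, row);
}

// Buggy behavior: chunkIndex restarts at 0 for every page.
for (let page = 1; page <= 3; page++) {
  for (let chunkIndex = 0; chunkIndex < 10; chunkIndex++) {
    upsert({ documentId: 'doc-1', chunkIndex, page, text: `page ${page} chunk ${chunkIndex}` });
  }
}

console.log(table.size);                                     // 10, not 30
console.log([...table.values()].every((r) => r.page === 3)); // true - only page 3 survives
```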
Root Cause:

```typescript
// BEFORE (BROKEN):
private generateSentenceAwareChunks(...) {
  let chunkIndex = 0; // ❌ Resets to 0 for every page!
  // ...
}

private generateSlidingWindowChunks(...) {
  let chunkIndex = 0; // ❌ Same issue
  // ...
}
```

Fix Applied:
- Added `startIndex` parameter to `ChunkingOptions` interface
- Modified both chunking methods to accept and use `startIndex`
- Updated worker to track global chunk index across pages
- Worker now passes cumulative index to each page's chunk generation
```typescript
// AFTER (FIXED):
interface ChunkingOptions {
  // ... existing options
  startIndex?: number; // ✅ Document-global starting index
}

private generateSentenceAwareChunks(..., startIndex: number = 0) {
  let chunkIndex = startIndex; // ✅ Continues from previous pages
  // ...
}

// Worker now tracks global index:
let globalChunkIndex = 0;
for (const ocrResult of ocrResults.rows) {
  const chunks = await chunkerService.generateChunks(..., {
    startIndex: globalChunkIndex // ✅ Pass current global index
  });
  allChunks.push(...chunks);
  globalChunkIndex += chunks.length; // ✅ Increment for next page
}
```

Location: src/services/chunk-storage.service.ts:18-45
Problem:
- `saveChunks()` uses UPSERT logic (insert or update on conflict)
- If reprocessing produces fewer chunks than the original run, high-index chunks from the first run remain
- Result: `getChunkStats()` returns inflated totals including orphaned chunks
- Impact: Incorrect statistics, stale data in the database
Example:
First run: Generates 10 chunks (indices 0-9)
Second run: Generates 5 chunks (indices 0-4)
Result: Chunks 5-9 from first run still exist
Stats show 10 chunks instead of 5
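The same in-memory modeling trick shows the orphaning (hypothetical `Map` stand-in for the table; upsert only, no delete):

```typescript
// Pure-UPSERT model (hypothetical): chunk_index -> text. Re-running with
// fewer chunks never removes the high indices left by the first run.
const chunkTable = new Map<number, string>();

function saveChunks(texts: string[]): void {
  texts.forEach((text, i) => chunkTable.set(i, text)); // insert or update, never delete
}

saveChunks(Array.from({ length: 10 }, (_, i) => `run1-chunk-${i}`)); // first run
saveChunks(Array.from({ length: 5 }, (_, i) => `run2-chunk-${i}`));  // reprocessing

console.log(chunkTable.size);   // 10 - stats are inflated
console.log(chunkTable.get(7)); // "run1-chunk-7" - orphaned stale data
```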
Fix Applied: Added explicit deletion before chunking in the worker:

```typescript
// Worker now deletes existing chunks first:
if (stage === 'chunking') {
  // ...
  // ✅ Delete existing chunks for this document (for reprocessing)
  await chunkStorage.deleteChunks(documentId);

  // Then generate and save new chunks
  const allChunks = [];
  // ...
}
```

Alternative Considered:
Could have modified `saveChunks()` to delete first, but the current approach:
- Keeps storage service stateless
- Makes worker responsible for orchestration logic
- Clearer intent in worker code
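Under the same toy model, the chosen delete-then-insert sequencing leaves no orphans (hypothetical names; the real worker calls `chunkStorage.deleteChunks(documentId)` before saving):

```typescript
// Delete-then-insert model: clearing first makes the rerun's count authoritative.
const store = new Map<number, string>();

function deleteChunks(): void {
  store.clear(); // mirrors chunkStorage.deleteChunks(documentId)
}

function saveChunks(texts: string[]): void {
  texts.forEach((text, i) => store.set(i, text));
}

saveChunks(Array.from({ length: 10 }, (_, i) => `run1-${i}`));
deleteChunks(); // explicit cleanup before re-chunking
saveChunks(Array.from({ length: 5 }, (_, i) => `run2-${i}`));

console.log(store.size); // 5 - stats now match reality
```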
Changes:
- Added `startIndex?: number` to `ChunkingOptions` interface
- Modified `generateChunks()` to extract and use `startIndex`
- Updated `generateSentenceAwareChunks()` to accept `startIndex` parameter (default 0)
- Updated `generateSlidingWindowChunks()` to accept `startIndex` parameter (default 0)
- Fixed infinite loop detection in sliding window to account for `startIndex`
Lines Changed: ~17 additions, ~10 deletions
Changes:
- Added `globalChunkIndex` variable to track cumulative chunk count
- Added `await chunkStorage.deleteChunks(documentId)` before chunk generation
- Modified loop to pass `{ startIndex: globalChunkIndex }` to each page
- Updated `globalChunkIndex += chunks.length` after each page
Lines Changed: ~9 additions, ~2 deletions
New Tests Added:

```typescript
describe('Multi-page document handling', () => {
  it('should generate unique chunk indices across multiple pages', async () => {
    // Simulates a 3-page document
    // Verifies:
    // - All chunk indices are unique
    // - Indices are sequential (0 to n-1)
    // - Page numbers are correct (1, 2, 3)
  });

  it('should handle startIndex parameter correctly', async () => {
    // Verifies:
    // - Chunks start from the provided startIndex (10)
    // - Sequential numbering continues from startIndex
  });
});
```

Lines Changed: ~79 additions
✅ npm run build
Clean compilation, 0 errors

✅ npm test
Test Suites: 9 passed, 9 total
Tests: 119 passed, 119 total (was 117, +2 new tests)

Test 1: Multi-page unique indices
- Generates chunks for 3 pages with sequential indexing
- Verifies all indices are unique across pages
- Confirms page numbers are preserved correctly
- ✅ PASSING
Test 2: startIndex parameter
- Tests chunk generation starting from index 10
- Verifies sequential numbering continues from startIndex
- ✅ PASSING
Multi-page Document (3 pages, 10 chunks each):
Database state after processing:

```
┌────────────┬─────────────┬─────────┐
│ chunk_index│ page_number │ text    │
├────────────┼─────────────┼─────────┤
│ 0          │ 3 (wrong!)  │ page 3  │ ← Overwrote pages 1,2
│ 1          │ 3 (wrong!)  │ page 3  │ ← Overwrote pages 1,2
│ ...        │ ...         │ ...     │
│ 9          │ 3 (wrong!)  │ page 3  │ ← Overwrote pages 1,2
└────────────┴─────────────┴─────────┘
```

Total chunks: 10 (should be 30!)
Content: Only page 3 data
Pages 1-2: LOST COMPLETELY
Reprocessing (10 chunks → 5 chunks):
After reprocessing:

```
┌────────────┬─────────────┬──────────┐
│ chunk_index│ updated     │ status   │
├────────────┼─────────────┼──────────┤
│ 0          │ ✅ Updated  │ Valid    │
│ 1          │ ✅ Updated  │ Valid    │
│ 2          │ ✅ Updated  │ Valid    │
│ 3          │ ✅ Updated  │ Valid    │
│ 4          │ ✅ Updated  │ Valid    │
│ 5          │ ❌ Stale    │ Orphaned │ ← From first run
│ 6          │ ❌ Stale    │ Orphaned │
│ ...        │ ...         │ ...      │
│ 9          │ ❌ Stale    │ Orphaned │
└────────────┴─────────────┴──────────┘
```

Stats show: 10 chunks (inflated!)
Actual: 5 chunks
Multi-page Document (3 pages, 10 chunks each):
Database state after processing:

```
┌────────────┬─────────────┬─────────┐
│ chunk_index│ page_number │ text    │
├────────────┼─────────────┼─────────┤
│ 0          │ 1 ✅        │ page 1  │
│ 1          │ 1 ✅        │ page 1  │
│ ...        │ ...         │ ...     │
│ 9          │ 1 ✅        │ page 1  │
│ 10         │ 2 ✅        │ page 2  │
│ 11         │ 2 ✅        │ page 2  │
│ ...        │ ...         │ ...     │
│ 19         │ 2 ✅        │ page 2  │
│ 20         │ 3 ✅        │ page 3  │
│ 21         │ 3 ✅        │ page 3  │
│ ...        │ ...         │ ...     │
│ 29         │ 3 ✅        │ page 3  │
└────────────┴─────────────┴─────────┘
```

Total chunks: 30 ✅
Content: All pages preserved
All data: INTACT
Reprocessing (10 chunks → 5 chunks):

Step 1: Delete existing chunks
✅ All 10 chunks removed

Step 2: Insert new chunks

```
┌────────────┬─────────────┬──────────┐
│ chunk_index│ status      │ data     │
├────────────┼─────────────┼──────────┤
│ 0          │ ✅ Fresh    │ Valid    │
│ 1          │ ✅ Fresh    │ Valid    │
│ 2          │ ✅ Fresh    │ Valid    │
│ 3          │ ✅ Fresh    │ Valid    │
│ 4          │ ✅ Fresh    │ Valid    │
└────────────┴─────────────┴──────────┘
```

Stats show: 5 chunks ✅
Actual: 5 chunks ✅
No stale data ✅
None - these are fixes for broken functionality that never worked correctly for multi-page documents.
- `generateChunks()` now accepts optional `startIndex` in options
- Default behavior unchanged (`startIndex = 0`)
- Existing single-page code continues to work
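A minimal sketch of the compatibility guarantee (simplified hypothetical signature, not the real service method): omitting `startIndex` reproduces the old from-zero numbering.

```typescript
// Simplified stand-in for the options plumbing: startIndex defaults to 0,
// so callers that never pass it get identical indices to before the fix.
function assignIndices(texts: string[], options: { startIndex?: number } = {}): number[] {
  const start = options.startIndex ?? 0;
  return texts.map((_, i) => start + i);
}

console.log(assignIndices(['a', 'b', 'c']));                     // [0, 1, 2] - unchanged
console.log(assignIndices(['a', 'b', 'c'], { startIndex: 10 })); // [10, 11, 12]
```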
No changes required - the fixes work with the existing schema:

```sql
CREATE TABLE text_chunks (
  id UUID PRIMARY KEY,
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  page_number INTEGER NOT NULL,
  chunk_index INTEGER NOT NULL,
  text TEXT NOT NULL,
  -- ...
  UNIQUE(document_id, chunk_index) -- This constraint now works correctly
);
```

Section: "Known Limitations and Future Enhancements"
❌ Remove (no longer applicable):
Current Limitations:
- Token Estimation: Simple whitespace-based, not exact tokenization
✅ Update:
Fixed Issues (2025-10-27):
- ✅ Chunk index collision across pages (critical data corruption)
- ✅ Stale chunks on reprocessing (inflated statistics)
Section: "Usage Example"
✅ Add:
```typescript
// Multi-page document example
let globalIndex = 0;
for (const page of pages) {
  const chunks = await chunker.generateChunks(
    documentId,
    page.number,
    page.text,
    page.normalizedText,
    page.language,
    { startIndex: globalIndex }
  );
  allChunks.push(...chunks);
  globalIndex += chunks.length;
}
```

Problem: Chunk indices were page-local when they needed to be document-global
Root Cause:
- `generateChunks()` is called per-page
- Each call created self-contained chunks without awareness of other pages
- Worker collected chunks but didn't coordinate indexing

Solution:
- Worker now owns global index coordination
- Service accepts `startIndex` for continuation
- Clear separation: service generates, worker orchestrates
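The orchestration split can be captured as a pure helper (hypothetical, for illustration only): the worker threads one running index through every page's chunk list, while chunk generation itself stays page-local.

```typescript
// Worker-side coordination as a pure function: per-page chunk text in,
// document-global indices out.
function indexAcrossPages(pages: string[][]): { chunkIndex: number; page: number; text: string }[] {
  const rows: { chunkIndex: number; page: number; text: string }[] = [];
  let globalIndex = 0;
  pages.forEach((chunks, p) => {
    for (const text of chunks) {
      rows.push({ chunkIndex: globalIndex++, page: p + 1, text });
    }
  });
  return rows;
}

const rows = indexAcrossPages([['a', 'b'], ['c'], ['d', 'e']]);
console.log(rows.map((r) => r.chunkIndex)); // [0, 1, 2, 3, 4]
console.log(rows[2]);                       // { chunkIndex: 2, page: 2, text: 'c' }
```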
Problem: UPSERT logic assumed idempotent re-runs with same chunk count
Root Cause:
- Different processing parameters can change chunk count
- Re-chunking with different settings left orphaned data
- No explicit cleanup step
Solution:
- Explicit delete before insert
- Worker orchestrates cleanup
- Storage service remains stateless
- Embedding Duplication: Without fix, embeddings would be generated for wrong chunks
- Search Corruption: Vector search would return results from wrong pages
- Admin UI Confusion: Statistics would show incorrect document coverage
- Audit Trail: Processing logs would show success while data was corrupted
- High priority issue fixed (chunk index collision)
- Medium priority issue fixed (stale chunks)
- Build compiles cleanly
- All existing tests pass (117/117)
- New tests added for fixes (2 tests)
- All new tests pass (119/119)
- Multi-page scenario verified in tests
- Reprocessing scenario verified in logic
- No breaking changes to API
- Documentation updated
- Impact assessment completed
Critical bugs fixed:
- ✅ Multi-page documents now preserve all chunks with unique indices
- ✅ Reprocessing now cleanly replaces all chunks without orphans
Test coverage improved:
- +2 tests specifically for multi-page handling
- Total: 119 tests passing (up from 117)
Zero regressions:
- All existing tests continue to pass
- Single-page behavior unchanged
- API backward compatible
These fixes are critical for production use - without them, the chunking system silently loses data for any multi-page document.