
Task 8 Critical Fixes - Search and Retrieval System

Overview

Two critical bugs that broke core search functionality were identified after the Task 8 implementation. Both have been resolved.

Bug 1: Missing Dataset in Qdrant Payload (HIGH PRIORITY)

Problem

Location: src/services/vector-db.service.ts:79-93

The storeEmbeddings() method was not including dataset or documentId in the Qdrant payload metadata. This caused:

  • All dataset filtering to fail silently (filters had no effect)
  • Document-specific searches (searchByDocument) to return unfiltered results
  • Complete failure of Task 8.2 (dataset scoping)

Root Cause

// BEFORE (BROKEN):
const points: QdrantPoint[] = embeddings.map((emb) => ({
  id: emb.chunkId,
  vector: emb.vector,
  payload: {
    chunkId: emb.chunkId,
    model: emb.model,
    dimensions: emb.dimensions,
    generatedAt: emb.generatedAt.toISOString(),
    // ❌ dataset and documentId missing!
  },
}));

When SearchService.vectorSearch() tried to filter by dataset:

filter = {
  should: query.datasets.map((dataset) => ({
    key: 'dataset',
    match: { value: dataset },
  })),
};

Qdrant would silently ignore the filter since the dataset field didn't exist in any payload.

Solution

Step 1: Update VectorDbService.storeEmbeddings() signature to accept dataset

// AFTER (FIXED):
async storeEmbeddings(
  embeddings: ChunkEmbedding[],
  documentId?: string,
  dataset?: string // ✅ Added dataset parameter
): Promise<void>

Step 2: Include dataset in payload

const points: QdrantPoint[] = embeddings.map((emb) => ({
  id: emb.chunkId,
  vector: emb.vector,
  payload: {
    chunkId: emb.chunkId,
    documentId: documentId || emb.chunkId.split('-')[0], // ✅ Include documentId
    dataset: dataset || 'default', // ✅ Include dataset for filtering
    model: emb.model,
    dimensions: emb.dimensions,
    generatedAt: emb.generatedAt.toISOString(),
  },
}));

Step 3: Update worker to retrieve and pass dataset name

// src/workers/document-processor.worker.ts
} else if (stage === 'embedding') {
  logger.info('Processing embedding stage', { documentId });
  
  await vectorDb.ensureCollection();
  
  // ✅ Get document dataset info
  const docResult = await db.query(
    'SELECT d.name as dataset FROM documents doc JOIN datasets d ON doc.dataset_id = d.id WHERE doc.id = $1',
    [documentId]
  );
  
  if (docResult.rows.length === 0) {
    throw new Error('Document not found');
  }
  
  const dataset = docResult.rows[0].dataset;
  
  const chunks = await chunkStorage.getChunksByDocument(documentId);
  const embeddings = await embeddingService.generateChunkEmbeddings(chunks);
  
  await embeddingStorage.saveEmbeddings(embeddings);
  
  // ✅ Pass dataset to vector database
  await vectorDb.storeEmbeddings(embeddings, documentId, dataset);
}

Step 4: Update test to verify dataset in payload

// src/services/vector-db.service.test.ts
it('should store embeddings with metadata', async () => {
  mockRequest.mockResolvedValue({ status: 200 });

  const embeddings: ChunkEmbedding[] = [
    {
      chunkId: 'chunk-1',
      vector: [0.1, 0.2, 0.3],
      model: 'test',
      dimensions: 3,
      generatedAt: new Date(),
    },
  ];

  await service.storeEmbeddings(embeddings, 'doc-123', 'legal-docs');

  expect(mockRequest).toHaveBeenCalledWith(
    expect.stringContaining('/points'),
    expect.objectContaining({
      points: expect.arrayContaining([
        expect.objectContaining({
          vector: [0.1, 0.2, 0.3],
          payload: expect.objectContaining({
            documentId: 'doc-123',
            dataset: 'legal-docs', // ✅ Verify dataset is included
          }),
        }),
      ]),
    })
  );
});

Impact

  • Before: All dataset filtering was non-functional
  • After: Dataset filtering works correctly
  • Before: searchByDocument() returned all chunks from all documents
  • After: searchByDocument() correctly scopes to specified document

Files Modified

  1. src/services/vector-db.service.ts - Added dataset parameter
  2. src/workers/document-processor.worker.ts - Query and pass dataset
  3. src/services/vector-db.service.test.ts - Verify dataset in payload

Bug 2: Text Search Hardcoded to English (MEDIUM PRIORITY)

Problem

Location: src/services/search.service.ts:125-168

The textSearch() method hardcoded PostgreSQL full-text search to the 'english' configuration, ignoring the language parameter from the query. This caused:

  • Arabic searches to return zero results despite having Arabic data
  • Mixed-language searches to fail for non-English content
  • Complete violation of language-aware processing claims

Root Cause

// BEFORE (BROKEN):
const result = await db.query(
  `SELECT 
     tc.id as chunk_id,
     ts_rank(to_tsvector('english', tc.normalized_text), plainto_tsquery('english', $1)) as score
   FROM text_chunks tc
   WHERE to_tsvector('english', tc.normalized_text) @@ plainto_tsquery('english', $1)
   ...`
);
// ❌ Hardcoded to English!

PostgreSQL text search configurations use language-specific stemming, stop words, and tokenization:

  • 'english' config: ignores Arabic text entirely
  • 'arabic' config: properly handles Arabic morphology
  • 'simple' config: language-agnostic for mixed content

Solution

Step 1: Detect language from query parameter

// src/services/search.service.ts
private async textSearch(
  queryText: string,
  limit: number,
  query: SearchQuery
): Promise<TextSearchResult[]> {
  logger.debug('Performing text search', { query: queryText, limit });

  // ✅ Select text search configuration based on language
  const language = query.language || 'english';
  let tsConfig = 'english'; // Default
  
  if (language === 'arabic') {
    tsConfig = 'arabic';
  } else if (language === 'mixed') {
    tsConfig = 'simple'; // Use simple for mixed language
  }

Step 2: Use language-specific configuration in SQL

  // ✅ Perform full-text search with language-specific configuration
  const result = await db.query(
    `SELECT 
       tc.id as chunk_id,
       ts_rank(to_tsvector('${tsConfig}', tc.normalized_text), plainto_tsquery('${tsConfig}', $1)) as score
     FROM text_chunks tc
     JOIN documents doc ON tc.document_id = doc.id
     JOIN datasets d ON doc.dataset_id = d.id
     WHERE to_tsvector('${tsConfig}', tc.normalized_text) @@ plainto_tsquery('${tsConfig}', $1)
     ${whereClause}
     ORDER BY score DESC
     LIMIT $${paramIndex}`,
    [...params, limit]
  );

Language Configuration Mapping

| Query Language | PostgreSQL Config | Behavior |
| --- | --- | --- |
| english | english | English stemming, stop words |
| arabic | arabic | Arabic stemming and stop words |
| mixed | simple | No stemming, language-agnostic |
| (default) | english | Fallback to English |
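
The mapping above can be captured in a small helper; a sketch, where the function name tsConfigFor is illustrative and not part of SearchService:

```typescript
// Illustrative helper mirroring the mapping table; `tsConfigFor` is a
// hypothetical name, not part of the codebase.
type QueryLanguage = 'english' | 'arabic' | 'mixed';

function tsConfigFor(language?: QueryLanguage): 'english' | 'arabic' | 'simple' {
  switch (language) {
    case 'arabic':
      return 'arabic'; // Arabic stemming and stop words
    case 'mixed':
      return 'simple'; // no stemming, language-agnostic
    default:
      return 'english'; // explicit english and the fallback case
  }
}
```

A lookup like this keeps the configuration choice in one place and makes the whitelist trivially enforceable, since the return type only admits the three known configurations.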

Impact

  • Before: Arabic queries such as "القانون" ("the law") returned 0 results
  • After: Arabic queries return correctly ranked results
  • Before: Mixed-language documents poorly indexed
  • After: Mixed-language uses simple config for broader matching

Files Modified

  1. src/services/search.service.ts - Language-aware text search configuration

Additional Fix: Qdrant Filter Format

Problem

Location: src/services/search.service.ts:85-121

Initial implementation used must for dataset filtering, which requires ALL datasets to match (impossible):

// WRONG:
filter = {
  must: query.datasets.map((dataset) => ({
    key: 'dataset',
    match: { value: dataset },
  })),
};
// ❌ "must" means AND - no document can be in ALL datasets

Solution

// CORRECT:
filter = {
  should: query.datasets.map((dataset) => ({
    key: 'dataset',
    match: { value: dataset },
  })),
};
// ✅ "should" means OR - match any of the datasets

Qdrant Filter Logic

  • must: ALL conditions must match (AND logic)
  • should: ANY condition can match (OR logic)
  • must_not: NONE of conditions can match (NOT logic)
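
The clauses can also be combined in a single filter; a hedged sketch using this codebase's payload field names (the values are made up for illustration):

```typescript
// Illustrative combined filter: scope to one document AND any of several
// datasets. Field names match the payload written by storeEmbeddings();
// the document and dataset values are placeholders.
const combinedFilter = {
  must: [{ key: 'documentId', match: { value: 'doc-123' } }],
  should: [
    { key: 'dataset', match: { value: 'legal-docs' } },
    { key: 'dataset', match: { value: 'contracts' } },
  ],
};
```

When should appears alongside must in a Qdrant filter, at least one should condition must still match, so this reads as "documentId = doc-123 AND dataset IN (legal-docs, contracts)".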

Testing

Build Verification

npm run build
# ✅ Clean compilation, no TypeScript errors

Test Verification

npm test
# ✅ 145/145 tests passing
# ✅ Dataset payload test includes verification

Test Coverage

// Vector database test verifies dataset in payload
it('should store embeddings with metadata', async () => {
  await service.storeEmbeddings(embeddings, 'doc-123', 'legal-docs');
  
  expect(mockRequest).toHaveBeenCalledWith(
    expect.stringContaining('/points'),
    expect.objectContaining({
      points: expect.arrayContaining([
        expect.objectContaining({
          payload: expect.objectContaining({
            documentId: 'doc-123',
            dataset: 'legal-docs', // ✅ Verified
          }),
        }),
      ]),
    })
  );
});

Open Questions & Risks

Question 1: Backfilling Existing Vectors

Issue: If there are existing vectors in Qdrant without dataset/documentId fields, they won't be filterable.

Options:

  1. Re-run embedding stage: Reset all documents to chunking stage and reprocess
  2. Manual backfill: Script to update existing payloads via Qdrant API
  3. Fresh start: Drop collection and reprocess all documents from scratch

Recommendation: For development, use fresh start (option 3). For production with existing data, implement backfill script (option 2).
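
Option 2 could be sketched as follows, assuming Qdrant's standard REST scroll and set-payload endpoints (verify against your Qdrant version) and a hypothetical lookupDataset() helper that resolves a chunk's document and dataset from PostgreSQL; QDRANT_URL is a placeholder:

```typescript
// Hedged sketch of a backfill (option 2). QDRANT_URL and lookupDataset()
// are placeholders, not part of the codebase.
const QDRANT_URL = 'http://localhost:6333';

// Hypothetical helper: resolve documentId/dataset for a chunk via PostgreSQL.
declare function lookupDataset(
  chunkId: string
): Promise<{ documentId: string; dataset: string }>;

async function backfillPayloads(collection: string): Promise<void> {
  let offset: unknown = null;
  do {
    // Scroll through points that are missing the `dataset` field.
    const res = await fetch(`${QDRANT_URL}/collections/${collection}/points/scroll`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        filter: { must: [{ is_empty: { key: 'dataset' } }] },
        limit: 100,
        offset,
        with_payload: true,
      }),
    });
    const { result } = await res.json();
    for (const point of result.points) {
      const { documentId, dataset } = await lookupDataset(point.payload.chunkId);
      // Merge the missing fields into the existing payload.
      await fetch(`${QDRANT_URL}/collections/${collection}/points/payload`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ payload: { documentId, dataset }, points: [point.id] }),
      });
    }
    offset = result.next_page_offset;
  } while (offset != null);
}
```

The set-payload endpoint merges the given fields into each point's existing payload rather than replacing it, so chunkId, model, and the other existing fields survive the backfill.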

Question 2: SQL Injection Risk

Issue: Using template literals with tsConfig variable in SQL query:

to_tsvector('${tsConfig}', tc.normalized_text)

Risk Assessment: LOW. tsConfig is derived from a controlled set of values ('english'/'arabic'/'simple'), not from user input.

Mitigation: If extra caution is warranted, validate against a whitelist:

const allowedConfigs = ['english', 'arabic', 'simple'];
if (!allowedConfigs.includes(tsConfig)) {
  throw new Error(`Invalid text search configuration: ${tsConfig}`);
}

Question 3: Performance Impact

Issue: PostgreSQL full-text search with different configurations may have varying performance.

Consideration: Arabic configuration may be slower than English due to morphological analysis. Monitor query performance in production.


Summary

Bugs Fixed

  1. HIGH: Missing dataset in Qdrant payload - dataset filtering now works
  2. MEDIUM: Hardcoded English text search - language-aware search now works
  3. BONUS: Wrong Qdrant filter format - OR logic for dataset filtering

Functionality Restored

  • Dataset scoping (Task 8.2) now functional
  • Document-specific search works correctly
  • Arabic and mixed-language searches return results
  • Hybrid search properly combines vector + text results with language awareness

Tests

  • All 145 tests passing
  • Dataset payload verified in tests
  • Clean TypeScript compilation

Next Steps

  • Consider backfill strategy for existing vectors (if any)
  • Monitor text search performance with different language configurations
  • Proceed to Task 9: Administrative Interface