RAGBase Codebase Summary

Last Updated: Task 1.3 - Fast Lane Processing (Dec 2024)

Token Compaction: 24,627 tokens (102,707 chars)

1. Project Structure

RAGBase/
├── apps/
│   ├── backend/                    # Node.js + Fastify API
│   │   ├── src/
│   │   │   ├── app.ts              # Fastify initialization
│   │   │   ├── middleware/
│   │   │   │   └── auth-middleware.ts   # Timing-safe API key validation
│   │   │   ├── services/
│   │   │   │   ├── database.ts     # Prisma singleton (NEW: Phase 04)
│   │   │   │   ├── hash-service.ts # MD5 hashing
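│   │   │   │   ├── embedding-service.ts # Vector embeddings
│   │   │   │   ├── chunker-service.ts   # Text chunking (see 2.4)
│   │   │   │   └── fast-lane-processor.ts # Fast lane orchestration (see 2.5)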
│   │   │   └── routes/
│   │   │       ├── documents/
│   │   │       │   ├── upload-route.ts      # File upload + path traversal protection
│   │   │       │   ├── status-route.ts      # Document status with Prisma singleton
│   │   │       │   └── list-route.ts        # List documents with SafeParse validation
│   │   │       ├── query/
│   │   │       │   └── search-route.ts      # Vector search with parameterized queries
│   │   │       └── health-route.ts          # Health check endpoint
│   │   ├── prisma/
│   │   │   └── schema.prisma       # Database schema
│   │   └── vitest.config.ts        # Test configuration
│   └── ai-worker/                  # Python worker (Phase 07)
├── tests/
│   ├── helpers/
│   │   └── api.ts                  # API test utilities
│   ├── integration/
│   │   ├── middleware/
│   │   │   └── auth-middleware.test.ts
│   │   └── routes/
│   │       ├── search-route.test.ts        # SQL injection prevention tests
│   │       ├── upload-route.test.ts        # Path traversal tests
│   │       └── status-route.test.ts
│   ├── setup/
│   │   └── global-setup.ts         # Test environment setup
│   └── fixtures/                   # Test data
├── docker/                         # Dockerfiles
├── docs/                           # Documentation (this directory)
└── plans/                          # Implementation plans

2. Core Services

2.1 Database Service (NEW: Phase 04)

File: apps/backend/src/services/database.ts

Implements Prisma Client singleton pattern to prevent connection pool exhaustion:

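import { PrismaClient } from '@prisma/client';

// Module-level cache, created lazily on first use
let prismaInstance: PrismaClient | null = null;

export function getPrismaClient(): PrismaClient {
  if (!prismaInstance) {
    prismaInstance = new PrismaClient({
      // Verbose query logging in development, errors only elsewhere
      log: process.env.NODE_ENV === 'development'
        ? ['query', 'warn', 'error']
        : ['error'],
    });
  }
  return prismaInstance;
}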

Why: Each PrismaClient instance opens its own connection pool, so creating several instances can exhaust the database's available connections. The singleton ensures:

Single connection pool across app
Clean shutdown via disconnectPrisma()
Environment-aware logging
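
The clean-shutdown half of the pattern is not shown above. A minimal sketch of how a cached instance might be released is given below. It uses a stand-in class (`FakeClient`, hypothetical) instead of PrismaClient so that it runs without a database; `getClient`, `disconnectClient` and `constructionCount` are illustrative names, not the codebase's. Only `$disconnect()` mirrors a real PrismaClient method.

```typescript
// Stand-in for PrismaClient (hypothetical) so the sketch runs without a database.
class FakeClient {
  connected = true;

  async $disconnect(): Promise<void> {
    this.connected = false;
  }
}

let instance: FakeClient | null = null;
let created = 0; // counts constructions, only to make the behaviour observable

function getClient(): FakeClient {
  if (!instance) {
    instance = new FakeClient();
    created += 1;
  }
  return instance;
}

// Disconnect and drop the cached instance so a later call builds a fresh one.
async function disconnectClient(): Promise<void> {
  if (instance) {
    await instance.$disconnect();
    instance = null;
  }
}

function constructionCount(): number {
  return created;
}
```

Repeated `getClient()` calls return the same object, so every route shares one pool. A shutdown hook would typically await `disconnectClient()` before the process exits.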

Adopted by:

status-route.ts (document queries)
upload-route.ts (duplicate check)
search-route.ts (vector search)
list-route.ts (document listing)

2.2 Hash Service

File: apps/backend/src/services/hash-service.ts

Provides MD5 hashing for file deduplication. Used in upload-route.ts for:

Detecting duplicate files
Generating unique storage paths (prevents filename collisions)
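
As an illustration of the two uses above, a content hash can be computed with Node's built-in `crypto` module. This is a sketch only: the function names `md5` and `storagePath` are assumptions and may not match hash-service.ts.

```typescript
import { createHash } from 'node:crypto';
import { join } from 'node:path';

// Hex-encoded MD5 digest of the file's bytes.
function md5(data: Buffer | string): string {
  return createHash('md5').update(data).digest('hex');
}

// Hypothetical helper: the storage path depends on content only, so two
// uploads with the same filename but different bytes get different paths,
// and identical bytes always map to the same path (duplicate detection).
function storagePath(uploadDir: string, data: Buffer): string {
  return join(uploadDir, md5(data));
}

console.log(md5('hello')); // 5d41402abc4b2a76b9719d911017c592
```

MD5 is adequate for spotting accidental duplicates, but it is not collision-resistant against deliberately crafted inputs, which matters if uploads come from untrusted users.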

2.3 Embedding Service

File: apps/backend/src/services/embedding-service.ts

Generates vector embeddings using fastembed (self-hosted ONNX-based):

Model: sentence-transformers/all-MiniLM-L6-v2 (384-dim)
Methods:
- embed(text) - Single text → vector
- embedBatch(texts) - Batch texts → vectors (parallel processing)
- cosineSimilarity(vec1, vec2) - Compute similarity score
- findSimilar(queryEmbedding, candidates, topK) - Find top-K similar vectors
Features: Lazy initialization (singleton pattern), batch processing with generators
Used by: Fast lane processing (upload-route), vector search (search-route)

NEW (Task 1.3): Full batch embedding support for fast lane processing.
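
The similarity helpers listed above can be sketched in plain TypeScript. This is an illustrative version under assumed signatures (number arrays in, index and score out); the actual service may return different shapes.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

interface Scored {
  index: number; // position of the candidate in the input array
  score: number; // cosine similarity to the query
}

// Score every candidate, sort by descending similarity, keep the best topK.
function findSimilar(query: number[], candidates: number[][], topK: number): Scored[] {
  return candidates
    .map((candidate, index) => ({ index, score: cosineSimilarity(query, candidate) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

For example, `findSimilar([1, 0], [[0, 1], [1, 0], [1, 1]], 2)` ranks candidate 1 first (identical direction) and candidate 2 second.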

2.4 Chunker Service

File: apps/backend/src/services/chunker-service.ts

Text chunking using LangChain MarkdownTextSplitter:

Config: 1000-char chunks with 200-char overlap
Returns: Chunks with metadata (charStart, charEnd, heading, page)
Heading extraction: Parses markdown headers from chunk content
Position tracking: Maintains character positions in original text
Used by: Fast lane processing (upload-route), quality validation

NEW (Task 1.3): Core component of fast lane processing pipeline.
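
To make the overlap and position tracking concrete, here is a simplified fixed-window chunker. It is not LangChain's MarkdownTextSplitter (which prefers to split on markdown structure); it only illustrates how 1000-char chunks with a 200-char overlap map back to offsets in the original text. The `Chunk` shape follows the metadata listed above, minus `page`; the heading rule (first header found in the chunk) is an assumption.

```typescript
interface Chunk {
  content: string;
  charStart: number; // inclusive offset in the original text
  charEnd: number;   // exclusive offset in the original text
  heading: string | null;
}

function chunkText(text: string, chunkSize = 1000, overlap = 200): Chunk[] {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const step = chunkSize - overlap; // each window starts 800 chars after the previous one
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    const end = Math.min(start + chunkSize, text.length);
    const content = text.slice(start, end);
    // First markdown header inside the chunk, if any (assumed rule).
    const match = content.match(/^#{1,6}\s+(.+)$/m);
    chunks.push({
      content,
      charStart: start,
      charEnd: end,
      heading: match ? match[1].trim() : null,
    });
    if (end === text.length) break; // the last window reached the end of the text
  }
  return chunks;
}
```

A 2500-character document yields three chunks covering [0, 1000), [800, 1800) and [1600, 2500), so consecutive chunks share 200 characters.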

2.5 Fast Lane Processor

File: apps/backend/src/services/fast-lane-processor.ts

High-level orchestrator for immediate JSON/TXT/MD processing:

Flow: Chunk → Quality Gate → Embed → Store
Quality gate: Validates text length and noise ratio before processing
Database: Uses raw SQL INSERT for chunk storage (pgvector compatibility)
Error handling: Marks documents as FAILED with reason codes
Status: Updates from PENDING → PROCESSING → COMPLETED/FAILED

Implementation note: upload-route.ts currently inlines the fast lane logic for tighter control. This service provides the same pipeline in reusable form for a potential future queue-based fast lane.
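
The quality gate step could look roughly like the sketch below. The thresholds (`MIN_CHARS`, `MAX_NOISE_RATIO`), the reason codes and the definition of "noise" are all assumptions made for illustration; the summary does not state the real values.

```typescript
type GateResult =
  | { ok: true }
  | { ok: false; reason: 'TEXT_TOO_SHORT' | 'TOO_NOISY' };

// Illustrative thresholds, not values taken from the codebase.
const MIN_CHARS = 50;
const MAX_NOISE_RATIO = 0.5;

// Noise ratio: share of characters that are not ASCII letters, digits or
// whitespace. A real implementation would likely be Unicode-aware.
function noiseRatio(text: string): number {
  if (text.length === 0) return 1;
  const noisy = text.replace(/[A-Za-z0-9\s]/g, '').length;
  return noisy / text.length;
}

// Runs before embedding so obviously unusable text is rejected early
// and the document can be marked FAILED with a reason code.
function qualityGate(text: string): GateResult {
  const trimmed = text.trim();
  if (trimmed.length < MIN_CHARS) {
    return { ok: false, reason: 'TEXT_TOO_SHORT' };
  }
  if (noiseRatio(trimmed) > MAX_NOISE_RATIO) {
    return { ok: false, reason: 'TOO_NOISY' };
  }
  return { ok: true };
}
```

Placing the gate before the embed step avoids spending model time on text that would be rejected anyway.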

3. Authentication & Security

3.1 Timing-Safe Auth Middleware (Phase 04)

File: apps/backend/src/middleware/auth-middleware.ts

Prevents timing attacks on API key comparison:

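// Constant-time comparison using crypto.timingSafeEqual.
// timingSafeEqual returns a boolean and throws only when the buffer
// lengths differ, so its result must be assigned rather than ignored.
// isValid keeps its initial false value when the lengths do not match.
if (apiKeyBuffer.length === expectedKeyBuffer.length) {
  isValid = timingSafeEqual(apiKeyBuffer, expectedKeyBuffer);
}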

Public Routes (no auth required):

/health - Health check
/internal/callback - Worker callback endpoint

All other routes require X-API-Key header.
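
Putting the public-route bypass and the key check together, the decision can be sketched as a pure function. This variant hashes both keys with SHA-256 before comparing, which is a different technique from the length check shown earlier: it gives `timingSafeEqual` equal-length inputs regardless of what the client sent. `isAuthorized` and `safeEqual` are hypothetical names, and the real middleware is a Fastify hook rather than a standalone function.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto';

const PUBLIC_ROUTES = new Set(['/health', '/internal/callback']);

// Hash both values to fixed-length digests so timingSafeEqual never
// receives buffers of different lengths (it throws in that case).
function safeEqual(a: string, b: string): boolean {
  const digestA = createHash('sha256').update(a).digest();
  const digestB = createHash('sha256').update(b).digest();
  return timingSafeEqual(digestA, digestB);
}

function isAuthorized(
  urlPath: string,
  apiKeyHeader: string | undefined,
  expectedKey: string,
): boolean {
  if (PUBLIC_ROUTES.has(urlPath)) return true; // no key needed
  if (!apiKeyHeader) return false;             // missing X-API-Key header
  return safeEqual(apiKeyHeader, expectedKey);
}
```

A caller would respond with 401 whenever `isAuthorized` returns false.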

3.2 Path Traversal Protection (Phase 04)

File: apps/backend/src/routes/documents/upload-route.ts

Prevents directory traversal attacks:

// Validate filename with basename() + length check
const sanitizedFilename = basename(filename);
if (sanitizedFilename !== filename || sanitizedFilename.length === 0 || sanitizedFilename.length > 255) {
  return reply.status(400).send({
    error: 'INVALID_FILENAME',
    message: 'Filename contains invalid characters or exceeds length limit',
  });
}

// Store using MD5 hash only (prevents path traversal)
const filePath = path.join(UPLOAD_DIR, md5Hash);

Why:

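basename() strips any directory components, so a filename that differs from its own basename contained a path separator and is rejected
Storing files under the MD5 hash means the user-supplied name never becomes part of the filesystem path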
Length limit (255 chars) prevents filesystem issues
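
A small standalone version of the filename check shows which inputs it rejects. `isSafeFilename` is a hypothetical name; `posix.basename` is used so the behaviour is the same on every platform, which also means backslashes are not treated as separators here.

```typescript
import { posix } from 'node:path';

// Accept a filename only if it is already a bare name of reasonable length.
function isSafeFilename(filename: string): boolean {
  const base = posix.basename(filename);
  return base === filename && base.length > 0 && base.length <= 255;
}

console.log(isSafeFilename('report.pdf'));       // true
console.log(isSafeFilename('../../etc/passwd')); // false: basename is 'passwd'
console.log(isSafeFilename('dir/file.txt'));     // false: basename is 'file.txt'
console.log(isSafeFilename(''));                 // false: empty
```

Note that a bare `..` contains no separator and passes this check, which is one reason storing files under the hash alone, never under the supplied name, is the stronger of the two defences.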

3.3 SQL Injection Prevention (Phase 04)

File: apps/backend/src/routes/query/search-route.ts

Prevents SQL injection in pgvector queries:

const results = await prisma.$queryRaw<...>`
  SELECT ... FROM chunks c
  ORDER BY c.embedding <=> ${JSON.stringify(queryEmbedding)}::vector
  LIMIT ${topK}
`;

Why: When used as a tagged template, Prisma's $queryRaw sends each interpolated value as a bound parameter instead of splicing it into the SQL text. User input is never concatenated into the query string.
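
To show why a tagged template is safe where string concatenation is not, here is a toy tag function that separates SQL text from values. It is not Prisma's implementation; `sql` and `ParameterizedQuery` are illustrative names, and the `$1`, `$2` placeholders follow PostgreSQL's convention.

```typescript
interface ParameterizedQuery {
  text: string;      // SQL with numbered placeholders
  values: unknown[]; // values sent separately to the database
}

// A tag function receives the literal string parts and the interpolated
// values separately, so values can be replaced by placeholders.
function sql(strings: TemplateStringsArray, ...values: unknown[]): ParameterizedQuery {
  let text = strings[0];
  for (let i = 0; i < values.length; i++) {
    text += '$' + (i + 1) + strings[i + 1];
  }
  return { text, values };
}

const malicious = "x'; DROP TABLE chunks; --";
const q = sql`SELECT id FROM chunks WHERE content = ${malicious} LIMIT ${5}`;
console.log(q.text);   // SELECT id FROM chunks WHERE content = $1 LIMIT $2
console.log(q.values); // the hostile string travels as data, not as SQL
```

Because the hostile string never appears in `text`, the database parses the query before it ever sees the value.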

4. API Routes

4.1 File Upload

Route: POST /api/documents

File: apps/backend/src/routes/documents/upload-route.ts

Flow:

Validate file size (50MB max)
Detect format (pdf, docx, xlsx, json, txt, md, csv)
Validate filename (no path traversal)
Calculate MD5 hash
Check for duplicates via Prisma singleton
Save file using MD5 hash path
Create document record with file I/O error handling + rollback
FAST LANE (NEW Task 1.3): Process JSON/TXT/MD files immediately:
- Read file content
- Chunk text using MarkdownTextSplitter (LangChain)
- Generate embeddings via fastembed (self-hosted ONNX)
- Store chunks + vectors directly in PostgreSQL/pgvector
- Mark document as COMPLETED
HEAVY LANE: Queue PDF/DOCX for Python worker processing

Fast Lane Processing (Task 1.3):

Supported formats: JSON, TXT, MD
Chunking: LangChain MarkdownTextSplitter (1000 char chunks, 200 char overlap)
Embeddings: fastembed all-MiniLM-L6-v2 (384-dim vectors)
Storage: Batch insert chunks with raw SQL + pgvector type cast
Status flow: PENDING → (immediate processing) → COMPLETED/FAILED
Error handling: Quality gate validation, proper error propagation
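
The lane split and the pgvector storage format described above can be sketched as two small helpers. `laneFor` and `toVectorLiteral` are hypothetical names. The summary lists xlsx and csv as accepted formats but does not say which lane handles them, so the sketch returns null for those instead of guessing.

```typescript
type Lane = 'FAST' | 'HEAVY';

const FAST_FORMATS = new Set(['json', 'txt', 'md']);
const HEAVY_FORMATS = new Set(['pdf', 'docx']);

function laneFor(format: string): Lane | null {
  const normalized = format.toLowerCase();
  if (FAST_FORMATS.has(normalized)) return 'FAST';
  if (HEAVY_FORMATS.has(normalized)) return 'HEAVY';
  return null; // e.g. xlsx, csv: lane not stated in this summary
}

// pgvector accepts a vector written as a bracketed, comma-separated list,
// which is what JSON.stringify produces for a number array. The string is
// then bound as a parameter and cast with ::vector in the INSERT.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.some((x) => !Number.isFinite(x))) {
    throw new Error('Embedding contains a non-finite value');
  }
  return JSON.stringify(embedding);
}

console.log(laneFor('md'));                     // FAST
console.log(toVectorLiteral([0.1, 0.2, 0.3]));  // [0.1,0.2,0.3]
```

Rejecting non-finite values up front matters because `JSON.stringify` would silently turn `NaN` into `null`, which the vector cast would then refuse.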

Error Handling:

400: Invalid file, unsupported format, path traversal attempt
409: Duplicate file detected
413: File exceeds 50MB (Payload Too Large)
500: Storage error (with DB cleanup on failure), fast lane processing errors

Features (Phase 04+):

File I/O rollback on DB failure (cleanup written file)
Path traversal protection via basename() + MD5 hash
Prisma singleton for connection efficiency
NEW (Task 1.3): Immediate fast lane processing with embeddings
NEW (Task 1.3): pgvector integration for semantic search readiness
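
The file I/O rollback listed above follows a write, then record, then undo-on-failure pattern. The sketch below uses synchronous `fs` calls and a callback standing in for the database insert so that it runs on its own; the real route is asynchronous and would use Prisma. `saveWithRollback` is a hypothetical name.

```typescript
import { existsSync, mkdtempSync, unlinkSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Write the file, then create the DB record. If the record cannot be
// created, delete the file again so no orphan is left on disk.
function saveWithRollback(filePath: string, data: Buffer, createRecord: () => void): void {
  writeFileSync(filePath, data);
  try {
    createRecord();
  } catch (err) {
    if (existsSync(filePath)) unlinkSync(filePath);
    throw err; // let the route turn this into a 500 response
  }
}

// Usage: a failing "DB insert" leaves no file behind.
const demoPath = join(mkdtempSync(join(tmpdir(), 'upload-')), 'demo-hash');
try {
  saveWithRollback(demoPath, Buffer.from('hello'), () => {
    throw new Error('DB unavailable');
  });
} catch {
  console.log(existsSync(demoPath)); // false
}
```

Re-throwing after cleanup keeps the original error visible to the caller while the filesystem and the database stay consistent.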

4.2 Document Status

Route: GET /api/documents/:id

File: apps/backend/src/routes/documents/status-route.ts

Returns document metadata including chunk count (when completed).

New Features (Phase 04):

SafeParse validation for UUID format
Prisma singleton for queries
Proper 400 vs 404 error codes
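
The 400 versus 404 distinction can be illustrated without Zod: a malformed id is rejected before any lookup, while a well-formed id that matches nothing is a 404. The regex is a simplified UUID shape check and `statusFor` is a hypothetical helper; the route itself uses SafeParse validation.

```typescript
// Simplified UUID shape: 8-4-4-4-12 hex digits (does not check version bits).
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function statusFor(id: string, exists: (id: string) => boolean): 200 | 400 | 404 {
  if (!UUID_RE.test(id)) return 400; // malformed id: client error, no DB lookup
  return exists(id) ? 200 : 404;     // well-formed id that matches no document
}
```

Validating first also keeps malformed ids from reaching the database layer, where they could surface as a 500 instead.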

4.3 Document Listing

Route: GET /api/documents

File: apps/backend/src/routes/documents/list-route.ts

Lists all documents with pagination support.

New Features (Phase 04):

SafeParse validation for query parameters
Proper error handling (400 for validation, 500 for server errors)

4.4 Vector Search

Route: POST /api/query

File: apps/backend/src/routes/query/search-route.ts

Semantic search across document chunks.

Request Body:

{
  "query": "search text",
  "topK": 5
}

Response:

{
  "results": [
    {
      "content": "chunk text",
      "score": 0.85,
      "documentId": "uuid",
      "metadata": {
        "charStart": 0,
        "charEnd": 100,
        "page": 1,
        "heading": "Section Title"
      }
    }
  ]
}

New Features (Phase 04):

SQL injection prevention via Prisma parameter binding
Proper 400 vs 503 error codes (validation vs service errors)

4.5 Health Check

Route: GET /health

File: apps/backend/src/routes/health-route.ts

No authentication required. Returns {"status":"ok"}.

5. Validation Layer

File: apps/backend/src/validators/index.ts (via Zod)

5.1 Upload Validation

File size ≤ 50MB
Supported formats: pdf, docx, xlsx, json, txt, md, csv
Mime type matching

5.2 Query Validation

const QuerySchema = z.object({
  // trim() runs before min(1) so a whitespace-only query is rejected
  query: z.string().trim().min(1).max(1000),
  topK: z.number().int().min(1).max(100).default(5),
});

All routes use SafeParse for proper error responses (400 with detailed messages).
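
The value of the SafeParse style is that validation returns a result object instead of throwing, so the route can map failures to a 400 directly. The sketch below re-creates that shape by hand, without Zod, so it runs standalone. `parseQuery` and `ParseResult` are hypothetical, and the sketch assumes the query is trimmed before its length is checked.

```typescript
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: string };

interface Query {
  query: string;
  topK: number;
}

function parseQuery(input: unknown): ParseResult<Query> {
  if (typeof input !== 'object' || input === null) {
    return { success: false, error: 'body must be an object' };
  }
  // topK falls back to 5 when it is absent.
  const { query, topK = 5 } = input as { query?: unknown; topK?: unknown };
  if (typeof query !== 'string') {
    return { success: false, error: 'query must be a string' };
  }
  const trimmed = query.trim();
  if (trimmed.length < 1 || trimmed.length > 1000) {
    return { success: false, error: 'query must be 1-1000 characters' };
  }
  if (typeof topK !== 'number' || !Number.isInteger(topK) || topK < 1 || topK > 100) {
    return { success: false, error: 'topK must be an integer from 1 to 100' };
  }
  return { success: true, data: { query: trimmed, topK } };
}
```

A handler would check `result.success` and reply with status 400 and `result.error` when it is false, never reaching the embedding or database code.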

6. Database Schema (Prisma)

File: apps/backend/prisma/schema.prisma

Core Models

Document

id (UUID, PK)
filename (String)
mimeType (String)
fileSize (Int)
format (Enum: pdf, docx, xlsx, json, txt, md, csv)
lane (Enum: FAST, HEAVY)
status (Enum: PENDING, PROCESSING, COMPLETED, FAILED)
filePath (String) - MD5-hashed path
md5Hash (String, unique index) - Deduplication
retryCount (Int, default: 0)
failReason (String, nullable)
createdAt, updatedAt (DateTime)
Relations: chunks (1-to-many)

Chunk

id (UUID, PK)
documentId (UUID, FK)
content (String) - Chunk text
embedding (Vector 384d) - pgvector type
charStart, charEnd (Int) - Character position in original
page (Int, nullable) - Page number
heading (String, nullable) - Markdown heading
Relations: document (many-to-1)

7. Test Infrastructure

7.1 Test Helpers

File: tests/helpers/api.ts

Provides utilities for:

Setting up test server
Making authenticated API requests
Mocking worker responses

7.2 Global Setup

File: tests/setup/global-setup.ts

Initializes:

Testcontainers (PostgreSQL + Redis)
Database migrations
Test environment variables

7.3 Integration Tests

Path Traversal Tests (upload-route.test.ts):

Reject filenames with path separators
Reject filenames exceeding 255 chars

SQL Injection Tests (search-route.test.ts):

Verify parameterized query execution
Test pgvector query safety

Timing Attack Tests (auth-middleware.test.ts):

Verify constant-time comparison
Test public route bypass

8. Configuration

Environment Variables

Database:

DATABASE_URL - PostgreSQL connection string

File Storage:

UPLOAD_DIR - Directory for file uploads (default: /tmp/uploads)

Security:

API_KEY - Shared secret for API authentication
NODE_ENV - development or production (affects logging)

Processing:

REDIS_URL - Redis connection (Phase 05+)
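
One way to read these variables into a typed object is sketched below. Treating DATABASE_URL and API_KEY as required, and REDIS_URL as optional, is an assumption based on the descriptions above; `loadConfig` and `AppConfig` are hypothetical names.

```typescript
interface AppConfig {
  databaseUrl: string;
  apiKey: string;
  uploadDir: string;
  nodeEnv: 'development' | 'production';
  redisUrl?: string; // only needed from Phase 05 onward
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const databaseUrl = env.DATABASE_URL;
  const apiKey = env.API_KEY;
  // Fail fast at startup when a required value is missing (assumed policy).
  if (!databaseUrl) throw new Error('DATABASE_URL is required');
  if (!apiKey) throw new Error('API_KEY is required');
  return {
    databaseUrl,
    apiKey,
    uploadDir: env.UPLOAD_DIR ?? '/tmp/uploads',
    // Anything other than "development" is treated as production.
    nodeEnv: env.NODE_ENV === 'development' ? 'development' : 'production',
    redisUrl: env.REDIS_URL,
  };
}
```

In the application this would typically be called once as `loadConfig(process.env)`; passing the environment in as an argument keeps the function easy to test.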

9. Error Handling Patterns

Validation Errors (400)

{
  "error": "VALIDATION_ERROR",
  "message": "Detailed validation issue"
}

Authentication Errors (401)

{
  "error": "UNAUTHORIZED",
  "message": "Invalid or missing API key"
}

Not Found (404)

{
  "error": "NOT_FOUND",
  "message": "Document not found"
}

Server Errors (500)

{
  "error": "STORAGE_ERROR",
  "message": "Failed to save file: ..."
}

Service Unavailable (503)

{
  "error": "EMBEDDING_SERVICE_ERROR",
  "message": "Failed to generate query embedding: ..."
}
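
The five response shapes above share one structure, which can be captured in a small typed helper. This is a sketch: `errorResponse` and `STATUS_BY_CODE` are hypothetical, and the mapping covers only the codes shown in this section (routes may define others, such as INVALID_FILENAME).

```typescript
type ErrorCode =
  | 'VALIDATION_ERROR'
  | 'UNAUTHORIZED'
  | 'NOT_FOUND'
  | 'STORAGE_ERROR'
  | 'EMBEDDING_SERVICE_ERROR';

// HTTP status for each error code, as listed in this section.
const STATUS_BY_CODE: Record<ErrorCode, number> = {
  VALIDATION_ERROR: 400,
  UNAUTHORIZED: 401,
  NOT_FOUND: 404,
  STORAGE_ERROR: 500,
  EMBEDDING_SERVICE_ERROR: 503,
};

interface ErrorResponse {
  status: number;
  body: { error: ErrorCode; message: string };
}

function errorResponse(code: ErrorCode, message: string): ErrorResponse {
  return { status: STATUS_BY_CODE[code], body: { error: code, message } };
}
```

Keeping the code-to-status mapping in one table makes it harder for two routes to disagree about which status a given error should produce.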

10. Development Workflow

Running Tests

# All tests
pnpm test

# Watch mode
pnpm test:watch

# Integration only
pnpm test:integration

# Coverage
pnpm test:coverage

Database Operations

# Generate Prisma client
pnpm --filter @ragbase/backend db:generate

# Push schema to DB
pnpm --filter @ragbase/backend db:push

# Create migration
pnpm --filter @ragbase/backend db:migrate

Development Server

# Start services (Docker required)
docker compose up -d

# Run server
pnpm dev

# Verify health
curl http://localhost:3000/health

11. Key Design Decisions (Phase 04 + Task 1.3)

Decision	Rationale	Implementation
Prisma Singleton	Prevent connection pool exhaustion	`services/database.ts`
Timing-Safe Auth	Prevent timing attacks on API key	`crypto.timingSafeEqual()`
Path Traversal Protection	Prevent directory escape attacks	`basename()` + MD5 hash storage
SQL Injection Prevention	Use parameterized queries	Prisma `$queryRaw` with template literals
File I/O Rollback	Maintain consistency if DB fails	Cleanup written files on DB errors
SafeParse Validation	Proper error codes (400 vs 500)	Zod `safeParse()` in all routes
MD5 Hash Storage	Content-derived paths that avoid filename collisions	`HashService.md5()` for filenames
Fast Lane Processing (Task 1.3)	Immediate response for simple formats	Inline chunking + embedding in upload-route
Self-Hosted Embeddings (Task 1.3)	No external API dependency	fastembed ONNX model + batch processing
Raw SQL for Chunks (Task 1.3)	pgvector type compatibility	Prisma `$executeRaw` with `::vector` cast
Dual Lane Architecture (Task 1.3)	Optimize for different file types	Fast lane (JSON/TXT/MD) vs Heavy lane (PDF/DOCX)

12. Next Phases

Phase 05: Queue integration (BullMQ) with proper job retry logic & callback handling
Phase 06: E2E pipeline testing with Docling (Python worker) processing for PDF/DOCX
Phase 07: Python AI Worker deployment (Docling → markdown extraction)
Phase 08: Frontend UI (React + Vite) with upload + search interface
Phase 09: Production hardening & scaling (monitoring, alerts, load testing)
