Skip to content

Latest commit

 

History

History
409 lines (340 loc) · 17.9 KB

File metadata and controls

409 lines (340 loc) · 17.9 KB

Embedding Pipeline End-to-End Failure Analysis

Overview

This document traces the complete embedding pipeline from document upload to vector database storage, identifying ALL potential failure points in both Redis/RQ and non-Redis scenarios.


Pipeline Flow: Non-Redis Scenario

Step 1: Frontend File Upload

Location: src/lib/apis/files/index.tsuploadFile() Endpoint: POST /files/

Flow:

  1. User selects file in frontend
  2. Frontend calls uploadFile() with file and token
  3. File sent as FormData to /files/ endpoint

Potential Failure Points:

  • F1.1: Network failure during upload
  • F1.2: File size exceeds server limits (not validated in frontend)
  • F1.3: Invalid/expired authentication token
  • F1.4: CORS errors preventing upload
  • F1.5: Browser timeout for large files
  • F1.6: File is empty (0 bytes) - frontend checks but race condition possible

Step 2: Backend File Upload Handler

Location: backend/open_webui/routers/files.pyupload_file()

Flow:

  1. FastAPI receives file via UploadFile
  2. Generate UUID for file ID
  3. Sanitize filename
  4. Save file to storage via Storage.upload_file()
  5. Insert file record into database via Files.insert_new_file()
  6. Trigger background processing via process_file()

Potential Failure Points:

  • F2.1: File upload fails - disk full, permissions error, storage unavailable
  • F2.2: UUID generation fails (extremely unlikely but possible)
  • F2.3: Filename sanitization fails or creates invalid path
  • F2.4: Storage.upload_file() fails - storage backend unavailable, quota exceeded
  • F2.5: Database insert fails - connection error, constraint violation, transaction rollback
  • F2.6: File record created but file not saved to disk (partial failure)
  • F2.7: File saved but database insert fails (orphaned file)
  • F2.8: process_file() call fails before background task is added
  • F2.9: Background task addition fails silently
  • F2.10: Exception during upload handler execution - file may be partially saved

Step 3: File Processing Trigger

Location: backend/open_webui/routers/retrieval.pyprocess_file()

Flow:

  1. Validate file_id format (UUID)
  2. Fetch file from database
  3. Check Redis lock (if Redis available) or database status
  4. Update file status to "pending"
  5. Enqueue job OR add BackgroundTask
  6. Release lock

Potential Failure Points:

  • F3.1: Invalid file_id format (not UUID) - validation should catch but edge cases exist
  • F3.2: File not found in database (deleted between upload and processing)
  • F3.3: Redis lock acquisition fails - Redis unavailable, connection timeout
  • F3.4: Database status check fails - connection error, query timeout
  • F3.5: Race condition: Multiple requests process same file simultaneously
  • F3.6: Status update fails - database write error, constraint violation
  • F3.7: Lock acquired but status update fails - lock held indefinitely
  • F3.8: Job enqueue fails - Redis unavailable, queue full, serialization error
  • F3.9: BackgroundTask addition fails - task queue full, memory error
  • F3.10: Lock not released after failure - file stuck in "pending" state
  • F3.11: API key retrieval fails - config not loaded, user not found
  • F3.12: No API key configured for user - processing will fail later but not caught here

Step 4: Background Processing (Non-Redis)

Location: backend/open_webui/routers/retrieval.py_process_file_sync()

Flow:

  1. Get user object from user_id
  2. Update status to "processing"
  3. Fetch file from database
  4. Extract content (if not provided)
  5. Save content to file.data
  6. Calculate hash
  7. Generate embeddings and save to vector DB
  8. Update status to "completed"

Potential Failure Points:

  • F4.1: User retrieval fails - user deleted, database error
  • F4.2: Status update to "processing" fails - database error
  • F4.3: File not found - deleted during processing
  • F4.4: File path is None or invalid
  • F4.5: Storage.get_file() fails - file not found on disk, path resolution error
  • F4.6: Content extraction fails - Loader initialization error, file corrupted, unsupported format
  • F4.7: PDF parsing fails - PyPDFLoader error, encrypted PDF, corrupted file
  • F4.8: Tika server unavailable (if using Tika engine)
  • F4.9: Document Intelligence API fails (if using Azure)
  • F4.10: Content extraction returns empty documents
  • F4.11: Content extraction returns None (loader error)
  • F4.12: File.data update fails - database error
  • F4.13: Hash calculation fails - encoding error, empty content
  • F4.14: Hash update fails - database error
  • F4.15: Embedding API key not available in background task context
  • F4.16: Embedding function initialization fails - config error, model not found
  • F4.17: Embedding generation fails - API error, rate limit, timeout, invalid key
  • F4.18: Vector DB connection fails - database unavailable, connection pool exhausted
  • F4.19: Vector DB insert fails - constraint violation, disk full, transaction error
  • F4.20: Status update to "completed" fails - database error
  • F4.21: Exception during processing - status may not be updated to "error"
  • F4.22: Background task killed by system (OOM, timeout)

Step 5: Content Extraction

Location: backend/open_webui/retrieval/loaders/main.pyLoader.load()

Flow:

  1. Check file exists
  2. Determine loader type based on engine and file type
  3. Initialize appropriate loader (PyPDFLoader, TikaLoader, etc.)
  4. Call loader.load()
  5. Process documents with ftfy

Potential Failure Points:

  • F5.1: File does not exist at path - file deleted, path incorrect
  • F5.2: File exists but not readable - permissions error
  • F5.3: File size is 0 bytes
  • F5.4: Loader selection fails - unknown file type, engine misconfiguration
  • F5.5: PyPDFLoader fails - corrupted PDF, encrypted PDF, unsupported PDF version
  • F5.6: TikaLoader fails - Tika server unavailable, network error, timeout
  • F5.7: Document Intelligence fails - API error, invalid credentials, quota exceeded
  • F5.8: TextLoader fails - encoding error, file locked
  • F5.9: Loader returns empty list - no content extracted
  • F5.10: Loader returns None - exception in loader
  • F5.11: ftfy processing fails - encoding error
  • F5.12: Document metadata missing or invalid

Step 6: Text Splitting

Location: backend/open_webui/routers/retrieval.pysave_docs_to_vector_db() or save_docs_to_multiple_collections()

Flow:

  1. Get user chunk settings (chunk_size, chunk_overlap)
  2. Initialize text splitter (RecursiveCharacterTextSplitter or TokenTextSplitter)
  3. Split documents
  4. Validate split documents not empty

Potential Failure Points:

  • F6.1: User chunk settings retrieval fails - user not found, config error
  • F6.2: Chunk size is 0 or negative - invalid configuration
  • F6.3: Text splitter initialization fails - invalid splitter type, encoding error
  • F6.4: TokenTextSplitter fails - tiktoken encoding not found, encoding error
  • F6.5: Text splitting returns empty list - all content filtered out
  • F6.6: Text splitting fails with exception - memory error, encoding error
  • F6.7: Split documents validation fails - empty content after splitting

Step 7: Embedding Generation

Location: backend/open_webui/routers/retrieval.pysave_docs_to_vector_db() or save_docs_to_multiple_collections()

Flow:

  1. Get user API key from config
  2. Get base URL from config
  3. Initialize embedding function
  4. Generate embeddings for all text chunks
  5. Handle fallback if single batch fails

Potential Failure Points:

  • F7.1: User API key not found - admin not configured key, user not in group
  • F7.2: API key is None or empty string
  • F7.3: Base URL not configured or invalid
  • F7.4: Embedding function initialization fails - model not found, config error
  • F7.5: Single batch embedding fails - API error, rate limit, timeout
  • F7.6: Fallback embedding fails - same issues as F7.5
  • F7.7: Embedding API returns invalid response - wrong format, empty embeddings
  • F7.8: Embedding API returns error - authentication failed, quota exceeded
  • F7.9: Network timeout during embedding API call
  • F7.10: Embedding dimension mismatch - model changed, config mismatch
  • F7.11: Embedding returns None or empty list
  • F7.12: Embedding count doesn't match text count - partial failure
  • F7.13: Embedding contains NaN or invalid values

Step 8: Vector Database Storage

Location: backend/open_webui/routers/retrieval.pysave_docs_to_vector_db() or save_docs_to_multiple_collections()

Flow:

  1. Check if collection exists
  2. Delete collection if overwrite=True
  3. Prepare items (id, text, vector, metadata)
  4. Insert items into vector DB
  5. Return success

Potential Failure Points:

  • F8.1: Vector DB connection fails - database unavailable, connection pool exhausted
  • F8.2: Collection check fails - query timeout, connection error
  • F8.3: Collection deletion fails - transaction error, constraint violation
  • F8.4: Item preparation fails - UUID generation error, metadata serialization error
  • F8.5: Metadata contains invalid types (datetime, dict, list) - conversion fails
  • F8.6: Vector dimension mismatch - collection expects different dimension
  • F8.7: Vector DB insert fails - constraint violation, disk full, transaction error
  • F8.8: Partial insert - some items inserted, others failed
  • F8.9: Insert succeeds but transaction not committed
  • F8.10: Collection created but items not inserted
  • F8.11: Duplicate hash check fails - query error, false positive
  • F8.12: Multiple collections insert - one succeeds, others fail

Pipeline Flow: Redis/RQ Scenario

Step 1-3: Same as Non-Redis (File Upload and Processing Trigger)

Step 4: Job Enqueue (Redis/RQ)

Location: backend/open_webui/utils/job_queue.pyenqueue_file_processing_job()

Flow:

  1. Validate inputs (file_id, etc.)
  2. Get job queue
  3. Check if job already exists
  4. Enqueue job with RQ
  5. Return job_id

Potential Failure Points:

  • F9.1: Input validation fails - file_id empty, non-serializable arguments
  • F9.2: Job queue unavailable - Redis down, ENABLE_JOB_QUEUE=False
  • F9.3: Redis connection fails - network error, authentication error
  • F9.4: Job ID collision - job with same ID exists
  • F9.5: Job enqueue fails - Redis error, serialization error, queue full
  • F9.6: Job enqueued but job_id not returned
  • F9.7: Race condition - job created between check and enqueue
  • F9.8: Embedding API key not passed to job - will fail in worker
  • F9.9: Job arguments too large - Redis memory limit

Step 5: Worker Job Processing

Location: backend/open_webui/workers/file_processor.pyprocess_file_job()

Flow:

  1. Initialize MockRequest with embedding_api_key
  2. Initialize AppConfig
  3. Initialize embedding functions
  4. Get user object
  5. Update file status to "processing"
  6. Extract content (same as Step 4 in non-Redis)
  7. Generate embeddings and save to vector DB
  8. Update status to "completed"

Potential Failure Points:

  • F10.1: MockRequest initialization fails - config error, database unavailable
  • F10.2: AppConfig initialization fails - database error, config corruption
  • F10.3: Embedding function initialization fails - model not found, API key invalid
  • F10.4: Embedding API key is None - not passed from enqueue, config error
  • F10.5: User retrieval fails - user deleted, database error
  • F10.6: Status update fails - database error, connection lost
  • F10.7: All failure points from Step 4 (F4.1-F4.22) apply here
  • F10.8: Worker process killed - OOM, timeout, system restart
  • F10.9: Worker loses Redis connection during processing
  • F10.10: Job timeout exceeded - large file processing takes too long
  • F10.11: Worker crashes - unhandled exception, segmentation fault
  • F10.12: Job retry fails - all retries exhausted
  • F10.13: Worker pod restarts - job lost if not persisted

Step 6-8: Same as Non-Redis (Content Extraction, Text Splitting, Embedding, Vector DB Storage)


Cross-Cutting Failure Points

Configuration Issues

  • F11.1: RAG_EMBEDDING_ENGINE not configured or invalid
  • F11.2: RAG_EMBEDDING_MODEL not configured or invalid
  • F11.3: RAG_OPENAI_API_KEY not configured for admin
  • F11.4: RAG_OPENAI_API_BASE_URL not configured or invalid
  • F11.5: UserScopedConfig not initialized properly
  • F11.6: Config changes during processing - race condition
  • F11.7: Config stored in database but not loaded
  • F11.8: Config cache stale - old values used

Database Issues

  • F12.1: Database connection lost during processing
  • F12.2: Database transaction timeout
  • F12.3: Database deadlock
  • F12.4: Database constraint violation
  • F12.5: Database disk full
  • F12.6: Database connection pool exhausted
  • F12.7: Database migration not applied
  • F12.8: Database schema mismatch

Vector Database Issues

  • F13.1: Vector DB connection lost
  • F13.2: Vector DB collection not created
  • F13.3: Vector DB dimension mismatch
  • F13.4: Vector DB disk full
  • F13.5: Vector DB index corruption
  • F13.6: Vector DB transaction error
  • F13.7: Vector DB connection pool exhausted

Redis Issues (RQ Scenario)

  • F14.1: Redis connection lost during job processing
  • F14.2: Redis memory full - job queue full
  • F14.3: Redis timeout - job processing exceeds timeout
  • F14.4: Redis failover - job lost if not persisted
  • F14.5: Redis authentication error
  • F14.6: Redis cluster partition - job stuck

User/Admin Configuration Issues

  • F15.1: Admin API key not set in Settings > Documents
  • F15.2: User not in group - no API key inheritance
  • F15.3: Admin API key expired or revoked
  • F15.4: User deleted during processing
  • F15.5: User role changed during processing
  • F15.6: Group membership changed - API key access revoked

File System Issues

  • F16.1: File deleted from disk during processing
  • F16.2: File moved or renamed
  • F16.3: Disk full - cannot save file
  • F16.4: File permissions changed
  • F16.5: Storage backend unavailable (S3, etc.)
  • F16.6: File corrupted after upload

Network/API Issues

  • F17.1: Embedding API unavailable
  • F17.2: Embedding API rate limit exceeded
  • F17.3: Embedding API authentication failed
  • F17.4: Embedding API timeout
  • F17.5: Network partition during API call
  • F17.6: DNS resolution failure
  • F17.7: SSL/TLS certificate error

Concurrency Issues

  • F18.1: Multiple workers process same file
  • F18.2: File deleted while being processed
  • F18.3: Status update race condition
  • F18.4: Lock not released - file stuck
  • F18.5: Job enqueued multiple times
  • F18.6: Background task added multiple times

Error Handling Issues

  • F19.1: Exception caught but status not updated
  • F19.2: Error message not logged
  • F19.3: Partial failure not detected
  • F19.4: Error status not persisted
  • F19.5: Exception swallowed silently

Summary

Total Failure Points Identified: 150+

Critical Failure Points (High Impact):

  1. F3.12 / F7.1: No API key configured - processing will always fail
  2. F4.15 / F10.4: API key not available in background/worker context
  3. F4.6-F4.10: Content extraction failures - no content to embed
  4. F7.5-F7.8: Embedding API failures - cannot generate embeddings
  5. F8.7-F8.10: Vector DB insert failures - embeddings not saved
  6. F10.8-F10.12: Worker failures - job lost or not completed
  7. F15.1-F15.3: Admin API key configuration issues

Common Failure Patterns:

  1. Configuration not loaded - API keys, URLs, models
  2. Database connection lost - during critical operations
  3. Partial failures - some steps succeed, others fail
  4. Race conditions - concurrent processing of same file
  5. Error handling gaps - exceptions not caught or not logged
  6. Context loss - user/API key not available in background tasks

Recommendations for Investigation

  1. Check logs for API key errors - Look for "No embedding API key configured"
  2. Verify API key is passed to workers - Check job enqueue includes embedding_api_key
  3. Check content extraction logs - Look for "[Content Extraction WARNING]"
  4. Verify embedding API calls - Check for rate limits, timeouts, auth errors
  5. Check vector DB connection - Verify connection pool, transactions
  6. Monitor worker logs - Check for worker crashes, timeouts
  7. Verify Redis connection - Check for connection errors, timeouts
  8. Check file status updates - Verify status transitions correctly
  9. Monitor database connections - Check for connection pool exhaustion
  10. Verify error handling - Check that exceptions are caught and logged

Next Steps

  1. Review each failure point against actual logs
  2. Add monitoring/alerting for critical failure points
  3. Improve error handling and logging
  4. Add retry logic for transient failures
  5. Add validation at each step
  6. Improve status tracking and reporting
  7. Add health checks for dependencies
  8. Implement circuit breakers for external APIs