Cortex ETL is an automated knowledge base creation system for manufacturing CPQ (Configure, Price, Quote) systems. It processes multi-format data (primarily PDFs) into structured, queryable databases with complete tenant isolation.
- Frontend: React + TypeScript + Vite
- Backend: FastAPI (Python)
- Database: PostgreSQL via Supabase
- AI Services: LiteLLM + Google Gemini API
- Deployment: Docker Compose
- Development: Local Supabase stack
Location: frontend/src/
- LoginPage: Authentication interface
- AdminPage: Admin dashboard for system management
- DocumentPage: Document viewing and management
- ClusterVisualizationPage: Visual representation of classification clusters
- `classification.hooks.tsx` - Manage classification categories
- `files.hooks.tsx` - File upload and management
- `migrations.hooks.tsx` - Database migration tracking
- `patternRecognition.hooks.tsx` - Pattern recognition operations
- `preprocess.hooks.tsx` - Preprocessing job management
- `useRealtimeSubscription.tsx` - Real-time updates from Supabase
- AuthContext: Session management and authentication
- QueryContext: React Query configuration for data fetching
Location: backend/app/
- Classification Routes (`/api/classification`)
  - GET classifications for tenant
  - POST create/update classifications
  - POST classify individual files
- Preprocess Routes (`/api/preprocess`)
  - POST retry extraction for failed files
  - Enqueues PDF processing jobs
- Webhook Routes (`/api/webhooks`)
  - POST `extract_data` - triggered by Supabase on file upload
  - Validates webhook secret for security
- Pattern Recognition Routes (`/api/pattern-recognition`)
  - POST analyze relationships between classifications
  - Returns discovered patterns and relationships
- Migration Routes (`/api/migrations`)
  - GET migrations for tenant
  - POST create new migration
  - POST execute pending migrations
Preprocess Service
- Manages PDF extraction pipeline
- Downloads files from Supabase Storage
- Orchestrates LiteLLM + Gemini for data extraction
- Generates vector embeddings for semantic search
- Tracks extraction status (queued → processing → complete/failed)
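The status lifecycle above can be sketched as a small state machine. This is illustrative only; the actual service may represent and persist statuses differently.

```python
# Illustrative sketch of the extraction status lifecycle
# (queued → processing → complete/failed); the real service
# may store and enforce these transitions differently.
from enum import Enum

class ExtractionStatus(str, Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETE = "complete"
    FAILED = "failed"

# Allowed transitions between statuses.
_TRANSITIONS = {
    ExtractionStatus.QUEUED: {ExtractionStatus.PROCESSING},
    ExtractionStatus.PROCESSING: {ExtractionStatus.COMPLETE, ExtractionStatus.FAILED},
    ExtractionStatus.FAILED: {ExtractionStatus.QUEUED},  # retry re-enqueues
    ExtractionStatus.COMPLETE: set(),
}

def advance(current: ExtractionStatus, target: ExtractionStatus) -> ExtractionStatus:
    """Move to `target` if the transition is legal, else raise."""
    if target not in _TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Encoding the legal transitions explicitly makes retry behavior (failed → queued) easy to audit and test.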
Classification Service
- CRUD operations for classification categories
- Assigns files to classifications
- Manages classification lifecycle
- Handles file unlinking when classifications are deleted
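The unlinking behavior can be sketched as below; the record shapes and field names here are assumptions for illustration, not the real models.

```python
# Hedged sketch of unlinking files when a classification is deleted;
# dict-based records stand in for the real database rows.
def delete_classification(classification_id: str,
                          classifications: dict[str, dict],
                          file_uploads: list[dict]) -> None:
    """Remove a classification and clear it from any files that reference it."""
    classifications.pop(classification_id, None)
    for f in file_uploads:
        if f.get("classification_id") == classification_id:
            f["classification_id"] = None  # the file survives, only the link is dropped
```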
Pattern Recognition Service
- Analyzes relationships between classification categories
- Uses AI to detect patterns in document types
- Calculates relationship cardinality (one-to-one, one-to-many, many-to-many)
- Stores relationships with confidence scores
Schema Generation Service
- Pure function-based schema generation
- Converts classifications to SQL table names
- Generates CREATE TABLE migrations
- Creates foreign key constraints from relationships
- Handles PostgreSQL naming constraints (63 char limit)
- Deterministic and idempotent
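The name-conversion step can be sketched as a pure function like the one below. The exact slug rules are an assumption; only the 63-character PostgreSQL identifier limit is a hard constraint.

```python
# Illustrative pure function turning a classification name into a
# PostgreSQL table name; the real naming rules may differ.
import re

PG_IDENTIFIER_MAX = 63  # PostgreSQL truncates identifiers beyond 63 bytes

def classification_to_table_name(name: str) -> str:
    """Lowercase, replace non-alphanumeric runs with underscores, respect the limit."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return slug[:PG_IDENTIFIER_MAX]
```

Because the function has no side effects, the same classification name always maps to the same table name, which is what makes repeated schema generation safe.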
Migration Service
- Stores migration SQL in database
- Executes migrations in sequence order
- Creates tenant-specific schemas
- Maintains migration history
Data access layer that abstracts Supabase operations:
- `ExtractionRepository` - `extracted_files` table operations
- `ClassificationRepository` - `classifications` table operations
- `RelationshipRepository` - `relationships` table operations
- `MigrationRepository` - `migrations` table operations
- `SchemaRepository` - Raw SQL execution for DDL
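The abstraction can be sketched with a `Protocol`: services depend on the interface, so the Supabase-backed implementation can be swapped for an in-memory fake in tests. Names below are illustrative, not the real method signatures.

```python
# Sketch of the repository pattern: services see only the interface.
from typing import Protocol

class ClassificationRepo(Protocol):
    def list_for_tenant(self, tenant_id: str) -> list[dict]: ...
    def create(self, tenant_id: str, name: str) -> dict: ...

class InMemoryClassificationRepo:
    """Test double standing in for the Supabase-backed repository."""
    def __init__(self) -> None:
        self._rows: list[dict] = []

    def list_for_tenant(self, tenant_id: str) -> list[dict]:
        return [r for r in self._rows if r["tenant_id"] == tenant_id]

    def create(self, tenant_id: str, name: str) -> dict:
        row = {"tenant_id": tenant_id, "name": name}
        self._rows.append(row)
        return row

def classification_names(repo: ClassificationRepo, tenant_id: str) -> list[str]:
    """A service function that depends only on the repository interface."""
    return [r["name"] for r in repo.list_for_tenant(tenant_id)]
```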
- Async job queue for PDF processing
- Prevents blocking on large file uploads
- Status tracking and monitoring
- Worker-based processing model
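The worker-based model can be sketched with `asyncio`; the real queue may be backed by a database table or an external broker, and the sleep below stands in for the download-and-extract work.

```python
# Minimal sketch of the worker-based queue pattern using asyncio.
import asyncio

async def worker(queue: asyncio.Queue, statuses: dict[str, str]) -> None:
    while True:
        job_id = await queue.get()
        statuses[job_id] = "processing"
        await asyncio.sleep(0)  # stand-in for download + extraction
        statuses[job_id] = "complete"
        queue.task_done()

async def run_jobs(job_ids: list[str]) -> dict[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    statuses = {j: "queued" for j in job_ids}
    for j in job_ids:
        queue.put_nowait(j)
    task = asyncio.create_task(worker(queue, statuses))
    await queue.join()   # wait until every job has been processed
    task.cancel()        # shut the worker down
    return statuses
```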
- Uses LiteLLM to orchestrate AI providers
- Sends PDFs to Google Gemini for structured extraction
- Returns JSON-formatted data
- Handles extraction errors and retries
- Generates vector embeddings for document content
- Enables semantic search capabilities
- Stores embeddings in PostgreSQL vector column
- AI-powered relationship detection
- Analyzes document content and classifications
- Determines relationship types and cardinality
- Calculates confidence scores
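As a rough intuition for cardinality, the sketch below infers it from the fan-out of observed document links. This is only an illustrative heuristic; the actual service delegates this judgment to an AI model.

```python
# Illustrative cardinality heuristic based on link fan-out between
# documents of two classifications; not the system's real logic.
def infer_cardinality(links: list[tuple[str, str]]) -> str:
    """links: (left_doc_id, right_doc_id) pairs observed between two classes."""
    left_fanout: dict[str, set] = {}
    right_fanout: dict[str, set] = {}
    for left, right in links:
        left_fanout.setdefault(left, set()).add(right)
        right_fanout.setdefault(right, set()).add(left)
    left_many = any(len(v) > 1 for v in left_fanout.values())
    right_many = any(len(v) > 1 for v in right_fanout.values())
    if left_many and right_many:
        return "many-to-many"
    if left_many or right_many:
        return "one-to-many"
    return "one-to-one"
```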
tenants
- Stores tenant organizations
- Each tenant gets isolated schema
file_uploads
- Tracks uploaded PDF files
- Links to tenant and classification
- References file in Supabase Storage
extracted_files
- Stores extracted structured data (JSONB)
- Contains vector embeddings
- Tracks extraction status and errors
- One-to-one with file_uploads
classifications
- User-defined document categories
- Tenant-scoped
- Examples: "Robot Specifications", "Product Brochure", "Safety Manual"
relationships
- AI-discovered relationships between classifications
- Includes cardinality and confidence score
- Used to generate foreign keys in tenant schemas
migrations
- SQL migration history per tenant
- Sequence-ordered execution
- Enables schema versioning
Each tenant gets a dedicated PostgreSQL schema (e.g., `tenant_kawasaki_robotics`)
Dynamic Table Creation
- Each classification becomes a table
- Table name derived from classification name
- Example: "Robot Specifications" → `robot_specifications`
Foreign Key Relationships
- Relationships table defines foreign keys
- Example: `robot_specifications.product_brochure_id` → `product_brochure.id`
Data Storage
- Extracted JSONB data stored in tables
- Queryable via SQL
- Tenant-isolated by schema
tenant-files
- Stores original PDF files
- Organized by tenant_id/filename
- Row-Level Security (RLS) enforces tenant isolation
- User uploads PDF via Frontend
- File stored in Supabase Storage bucket (`tenant-files`)
- Record created in `file_uploads` table
- Database webhook triggers on new `file_uploads` insert
- Webhook calls Backend `/api/webhooks/extract_data` endpoint
- Preprocessing queue receives job, creates `extracted_files` entry with status "queued"
- Worker picks up job, updates status to "processing"
- PDF downloaded from Supabase Storage
- LiteLLM + Gemini extracts structured data from the PDF
- Embedding vector generated for document content
- Results stored in `extracted_files` table with status "complete"
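The webhook-to-queue handoff in this flow can be sketched as below. The payload shape and in-memory structures are assumptions standing in for the real Supabase tables.

```python
# Sketch of the webhook → queue handoff: create a queued
# extracted_files entry and enqueue the job. Payload shape is assumed.
from collections import deque

extraction_queue: deque = deque()
extracted_files: dict[str, dict] = {}

def handle_extract_data_webhook(payload: dict) -> dict:
    """Record the new upload as queued and hand it to the worker queue."""
    file_id = payload["record"]["id"]
    entry = {"file_upload_id": file_id, "status": "queued"}
    extracted_files[file_id] = entry
    extraction_queue.append(file_id)
    return entry
```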
- Admin/User creates classification categories (e.g., "Robot Specs", "Product Brochure")
- Classifications stored in `classifications` table
- User manually assigns files to classifications
- `file_uploads.classification_id` updated
- User triggers pattern recognition analysis
- System fetches all classifications and extracted files with embeddings
- AI analyzes relationships between classification categories
- Relationships stored in `relationships` table
- Example: "Robot Specs" → "Product Brochure" (one-to-many)
- User requests schema generation
- System reads classifications and relationships
- Generates SQL migrations to create tenant-specific tables
- Each classification becomes a table (e.g., `robot_specifications`)
- Relationships become foreign keys
- Migrations stored in `migrations` table
- Migrations executed to create the actual database schema
- Tenant can now query their custom schema via SQL
- Supabase Realtime subscriptions push updates to Frontend
- Status changes, new files, completed extractions update UI instantly
- No polling required
Schema-per-Tenant
- Each tenant gets a dedicated PostgreSQL schema
- Complete data isolation
- No cross-tenant queries possible
- Scales to thousands of tenants
Row-Level Security (RLS)
- Supabase RLS policies on shared tables
- Enforces tenant_id filtering
- Storage bucket access control
Job Queue Pattern
- PDF extraction is async and non-blocking
- Status tracking (queued → processing → complete/failed)
- Enables retry logic
- Prevents timeout on large files
Database Webhooks
- Supabase triggers webhook on file upload
- Decouples upload from processing
- Enables horizontal scaling
Realtime Subscriptions
- Frontend subscribes to database changes
- Instant UI updates
- No polling overhead
Data Access Abstraction
- Repositories encapsulate Supabase operations
- Services depend on repositories, not raw client
- Testable and mockable
- Clean separation of concerns
Deterministic Schema Generation
- `SchemaGenerationService.create_migrations()` is a pure function
- Input: classifications + relationships + existing migrations
- Output: new migration SQL
- No side effects, fully testable
- Idempotent - same input always produces same output
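A deterministic generator along these lines can be sketched as follows. The column layout, slug rules, and SQL shapes are assumptions; the point is that sorted inputs and a pure function make the output reproducible.

```python
# Hedged sketch of deterministic migration generation: same
# classifications + relationships in, same SQL out. Real column
# layouts and naming rules will differ.
import re

def _table(name: str) -> str:
    """Slugify a classification name within PostgreSQL's 63-char limit."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")[:63]

def create_migrations(schema: str, classifications: list[str],
                      relationships: list[tuple[str, str]]) -> list[str]:
    sql: list[str] = []
    for name in sorted(classifications):          # sorted → deterministic order
        sql.append(
            f"CREATE TABLE IF NOT EXISTS {schema}.{_table(name)} "
            f"(id uuid PRIMARY KEY, data jsonb);"
        )
    for child, parent in sorted(relationships):   # FK from a discovered relationship
        sql.append(
            f"ALTER TABLE {schema}.{_table(child)} ADD COLUMN IF NOT EXISTS "
            f"{_table(parent)}_id uuid REFERENCES {schema}.{_table(parent)}(id);"
        )
    return sql
```

The `IF NOT EXISTS` guards are one way to keep re-running the same migration harmless.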
- Supabase Auth for user management
- JWT-based session tokens
- Role-based access control (Admin vs Tenant)
- Schema-per-tenant prevents data leakage
- RLS policies on shared tables
- Storage bucket policies
- Webhook secret validation
- HMAC signature verification
- Prevents unauthorized extraction requests
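Signature verification can be sketched with the standard library's constant-time compare; the header name and hex encoding are assumptions about the real setup.

```python
# Sketch of webhook signature checking with a constant-time compare.
import hashlib
import hmac

def verify_webhook(secret: str, body: bytes, signature_hex: str) -> bool:
    """Reject the request unless HMAC-SHA256(body) matches the supplied signature."""
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

Using `hmac.compare_digest` instead of `==` avoids leaking information through timing differences.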
- CORS middleware configured
- Dependency injection for auth checks
- Service role key for backend operations
Run `npm run fresh`. This command:
- Generates environment variables
- Starts local Supabase stack
- Builds and runs frontend/backend containers
- Frontend: http://localhost:5173
- Backend API: http://localhost:8000
- Supabase Studio: http://localhost:54323
- Admin: admin@cortex.com / password
- Tenants: eng@kawasaki-robotics.com / password (etc.)
Services
- `backend` - FastAPI container (port 8000)
- `frontend` - Vite dev server (port 5173)
- Supabase stack (via `supabase start`)
Networking
- Custom network: `cortex-network`
- Backend can access Supabase via `host.docker.internal`
Environment Variables
- Generated by `init-dev.js`
- Stored in `.env` (gitignored)
- Includes Supabase keys, webhook secrets, API keys
Cloud Deployment
- Frontend: Static hosting (Vercel, Netlify)
- Backend: Container hosting (Cloud Run, ECS)
- Database: Supabase Cloud (managed PostgreSQL)
Scaling
- Preprocessing queue can be distributed (Redis/Celery)
- Multiple backend workers for parallel processing
- Database connection pooling
- CSV/Excel Support - Extend beyond PDFs
- API Ingestion - Pull data from external APIs
- Advanced Querying - Natural language to SQL
- Visualization - Auto-generate charts from data
- Export - Export tenant schemas to other formats
- Caching - Redis for frequently accessed data
- Background Jobs - Celery for distributed processing
- Monitoring - Prometheus/Grafana for observability
- Testing - Comprehensive test coverage
- CI/CD - Automated deployment pipeline
Extraction Fails
- Check GEMINI_API_KEY is valid
- Verify PDF is not corrupted
- Check backend logs for LiteLLM errors
Webhook Not Triggering
- Verify WEBHOOK_BASE_URL is accessible from Supabase
- Check WEBHOOK_SECRET matches database config
- Ensure `configure_webhooks()` ran on startup
Schema Generation Errors
- Ensure classifications exist
- Run pattern recognition before schema generation
- Check for PostgreSQL naming conflicts
Tenant Isolation Issues
- Verify RLS policies are enabled
- Check tenant_id is correctly set in session
- Ensure schema name matches tenant
Cortex ETL provides a complete, production-ready solution for automated knowledge base creation. Its multi-tenant architecture, AI-powered extraction, and dynamic schema generation make it ideal for manufacturing CPQ systems and similar use cases requiring structured data from unstructured sources.
The system is designed for:
- Scalability - Schema-per-tenant isolation
- Flexibility - Dynamic schema generation
- Reliability - Async processing with status tracking
- Security - Complete tenant isolation
- Extensibility - Clean service architecture
For questions or contributions, see the main README.md.