This prototype was built as part of a client-student partnership through Codesmith's Future Code program. It explores solutions to a real-world case study provided by an external partner. # This work does not represent employment or contracting with the partner. All intellectual property belongs to the partner. This is a time-boxed MVP and not a production system.
This is my branch from https://github.com/kevinortiz43/Customer-support-AI-powered-product.
For the original project, I architected an experimental offline/local AI branch feature (backend), drove the project's cache-aside strategy, and orchestrated its OS-agnostic ETL pipeline for dynamically seeding the PostgreSQL database.
The goal was building a responsive AI chatbot using only free, open-source models running locally. Free models aren't as powerful as paid ones. Many of them on HuggingFace have no inference providers available so can only be run if downloaded directly. The question was: how useful could they be?
The offline AI setup uses 2 models:
- Model 1: Text-to-SQL model for query translation
- Model 2: Dual-purpose model
- Generates human-friendly responses from returned results
- Evaluates result quality
This setup includes a preloading script for seamless model switching. Both models warm up when the application starts. If they load during a user request, the delay could be up to 11,873% longer (comparing 299 ms vs. 35.80 seconds).
- Local inference: No external API dependencies
- Open source model comparison: Evaluate performance of freely available models
- GPU acceleration: Optional GPU support (see docker-compose.yml)
- Model preloading: Optional Linux script to pre-load both models
- Hot-swappable models: Optional Linux script to switch models at runtime without restart
- Cache-aside pattern: Optimize for frequent queries
- Query routing: Keyword text search for simple queries vs AI path for complex queries
- Text-to-SQL model: Natural language to SQL query AI conversion
- Response generation model: SQL results to human-readable text
- LLM-as-Judge: Automated quality evaluation
- Non-blocking evaluation: Async result scoring
- Dynamic database seeding: Automated ETL pipeline (OS-agnostic)
The offline AI system uses:
- Text-to-SQL model (7B): Translates natural language to SQL queries
- Response model (7B): Formats raw results into conversational answers
- Judge model: Asynchronously evaluates SQL quality without blocking users
Request Flow:
- Cache check: Exact-match caching (5-minute TTL)
- Complexity routing: Simple queries use keyword search; complex queries trigger SQL generation
- Response generation: Results formatted into natural language
- Non-blocking evaluation: SQL quality scored and logged asynchronously
See AI Architecture Deep Dive for flow diagrams and component details.
This system was built in under 2 weeks to demonstrate architectural patterns, not to be production-ready. Below is an honest assessment of where it stands and what a production version would require, such a hybrid approach with RAG and fine-tuned LLMs. Although this is an offline system, perhaps some security and privacy guardrails should still be in place:
| Layer | Current Implementation | What Production Would Add |
|---|---|---|
| Models | Model 1 handles SQL generation, Model 2 handles response formatting & evaluation | Specialized fine-tuned models for each task with higher accuracy |
| Context Strategy | In-context learning (schema + examples in prompt) | RAG for dynamic/large schemas |
| Security | Basic SQL execution with SELECT-only enforcement | AI gateway with prompt injection detection, SQL injection prevention |
| Validation | LLM-as-Judge (asynchronous) with results count verification | Human-in-the-loop evaluation, handling complex or edge cases + semantic correctness metrics |
| Caching | Exact query match and in-memory API response caching (5-min TTL) | Semantic caching (cache by meaning) + partial result caching |
| Post-processing | Regex-based SQL cleaning to handle model hallucinations | Fine-tuning reduces need for post-processing |
| Scalability | In-memory cache, single-node, rule-based routing | Distributed cache (Redis), horizontal scaling |
What Works Now:
- Complete end-to-end pipeline from query to response
- Model specialization (option for separate models for SQL generation, response generation & evaluation)
- Non-blocking evaluation preserves user experience
- Local-first ensures no API costs and protects data privacy
| Priority | Direction | Why It Matters |
|---|---|---|
| Immediate | Determine which RAG design is best for use case (likely involves hybrid search, reranking layer, structured metadata filtering) | RAG would enable handling new data without retraining but has trade-offs (memory drift, increased token usage, extra latency, extra costs and complexity) |
| Immediate | Fine-tune specialized models (determine how small these can be) | Replace generic models with versions trained on actual usage patterns for higher accuracy |
| Immediate | Add observability layer, possibly governance | Need tracing, logging, and way to ensure only updated, authorized docs are in vector DB |
| Immediate | Improve evaluation layer | Flag low-confidence SQL for human review; use corrections to continuously improve |
| Near-term | Semantic caching | Cache based on query meaning rather than exact text to improve hit rates at scale |
| Ongoing | Scalability | Distributed caching, horizontal scaling |
The PostgreSQL database can be dynamically seeded from raw Relay-like JSON files. Run bun run setup to execute the OS-agnostic ETL pipeline:
- Extract: Parse ALL JSON files from
src/server/data/ - Transform: Convert JSON to CSV format
- Load: Import data into PostgreSQL
- Generate: Create TypeScript schemas from database types
Requirements: Input JSON must be relatively flat and follow consistent structure.
- Setup Guide - Comprehensive setup guide
- AI Architecture Deep Dive -Detailed AI flow documentation
- GPU + Model Notes - Model specifications and VRAM requirements