The AI implementation is accessible at: http://localhost:3000/api/ai/query
| Step | Component | Description |
|---|---|---|
| 1. Query Reception | Frontend → API | User submits natural language question |
| 2. Cache Check | Exact-match cache | Returns cached response if exact normalized query seen within 5 min |
| 3. Complexity Scoring | requiresAIPath() function |
Scores query based on aggregations, entity mentions, phrase patterns; threshold ≥4 triggers AI path |
| 4a. Simple Path | Keyword Search | Extracts keywords, checks keyword cache, then searches searchText using PostgreSQL full-text |
| 4b. Complex Path | Text-to-SQL Model | Generates SQL → post-processing fixes → executes against database |
| 5. Response Generation | Response Model | Formats results into conversational answer |
| 6. Evaluation | Judge Model | Asynchronously scores SQL quality and logs results |
When a user interacts with the AI chatbot by sending a query, the application executes the following decision tree.
The application first checks if the exact query was sent within the past 5 minutes:
- Cache HIT: Returns cached results immediately
- Cache MISS: Proceeds to complexity analysis
The application determines whether the user prompt requires AI assistance:
A. Keyword Search Path (Low Complexity) If the query is not complex enough to need the TEXT2SQL_MODEL:
- Check for cached table data
- If cache MISS, query searchText fields (pre-optimized for rapid lookup)
- Store both query (as key) and results in cache
- Proceed to response generation
B. AI Path (High Complexity) If the TEXT2SQL_MODEL is required:
- Send user query to TEXT2SQL_MODEL
- Convert natural language to SQL
- Execute generated SQL against PostgreSQL
- Proceed to response generation
All results paths converge at the AI_RESPONSE_MODEL:
- Keyword search results
- AI-generated query results
- Cached query results
The response model converts raw data into human-friendly format before returning to the frontend chatbot.
Note: This flow could be improved if cache were set AFTER human-friendly responses were generated and validated, so valid responses could immediately be returned to the user and responses would not need to be generated every time. However, a more robust validation step would be necessary to prevent invalid responses from being cached.
The JUDGE_MODEL operates asynchronously, not blocking user response:
Evaluation Logic:
- Test set matches: If query matches
src/server/aiTest/test-questions.json, compare resultsCount against expected count from ground truth. (See also:src/server/aiTest/test_questions.mdfor quick copy/paste testing) - No match: LLM-as-Judge autonomously evaluates quality
Output: Scores and explanations are saved to: src/server/aiTest/judgements/
| Aspect | Current Implementation | Production Target |
|---|---|---|
| Purpose | Convert natural language to PostgreSQL queries | Same |
| Model | Generic 7B with prompt engineering | Fine-tuned on company-specific query patterns |
| Post-processing | Extensive regex fixes: SQL extraction, quote fixing, UNION correction, identifier validation | Fine-tuning would eliminate most post-processing |
| Limitations | May hallucinate tables/columns; requires aggressive cleaning; not all models follow instructions | Specialized model would generate cleaner SQL |
| Aspect | Current Implementation | Production Target |
|---|---|---|
| Purpose | Transform raw data into natural language answers | Same |
| Model | Generic 7B with instruction prompts | Fine-tuned on company tone and terminology |
| Note | Same model also serves as Judge (dual-purpose) | Would use specialized models in production |
| Aspect | Current Implementation | Production Target |
|---|---|---|
| Purpose | Asynchronous quality evaluation | Same + human validation |
| Logic | Test set match → compare counts; No match → LLM-as-Judge | Add semantic correctness checks, human sampling |
| Limitations | Count comparison doesn't verify data correctness; LLM-judge accuracy bounded by model capabilities | Human validation would catch subtle errors |
| Decision | Why | Trade-off |
|---|---|---|
| Model specialization | Each task (SQL, response, evaluation) has different requirements | More complex orchestration |
| Non-blocking evaluation | Users get immediate responses | Quality feedback is delayed |
| Local-only models | Data privacy, no API costs | Limited to 7B parameter models |
| 5-minute TTL cache | Simple implementation | Misses semantic similarities |
| Dual-purpose response/judge model | Reduces resource requirements | Likely compromises performance of both tasks |
| In-memory caching | Fast, simple for prototype | Does not scale beyond small datasets (RAM limits, no distribution) |
| Rule-based complexity scoring | Zero-cost, transparent routing | Brittle at scale - misses novel query patterns; would need better classifier for production |
| Aspect | Current Implementation | Production Enhancement |
|---|---|---|
| Type | In-memory, exact-match | Semantic + partial caching |
| TTL | 5 minutes | Variable based on data freshness needs |
| Scope | Full query results | Partial results, embeddings |
| Scale Limitation | Works only with small datasets | Would need Redis or Memcached, sharding, and eviction policies |
If this were moving to production, I'd prioritize:
| Phase | Focus | Key Improvements |
|---|---|---|
| Phase 1 | Accuracy | Add RAG and fine-tune models on actual query logs; add confidence scoring; basic human review for edge cases |
| Phase 2 | Reliability | Add observability, better evaluation, governance. Log everything, catch errors, regenerate; monitoring and alerting |
| Phase 3 | Consider adding Security & Privacy layer | Add PII detection/redaction; output guardrails if any chance data might be exposed to internet, even though this application runs locally and offline |
| Phase 4 | Continuous Improvement | Use human feedback to augment training data |
| Phase 5 | Scale | Replace in-memory cache with Redis; add semantic caching; implement query logging for analysis |
This project sparked curiosity about what comes next for future projects. I'm actively exploring:
| Topic | Why It Matters |
|---|---|
| Semantic caching | Moving beyond exact-match to cache by meaning - critical for production scale |
| Agentic AI | Moving from single-turn SQL generation to using an AI agent that can plan, decompose tasks, interact with filesystems, spawn sub-agents, call tools and use skills in a self-correcting loop |
| Agent observability | Understanding why an agent failed or what decision it made |
| PII handling | Protecting user data isn't optional - learning detection/redaction patterns |
| Guardrails | Preventing prompt injection and inappropriate outputs before they reach users |
| Human-in-the-loop | Designing efficient review workflows for edge cases |
| Evaluation metrics | Result counts aren't enough - need semantic correctness |
| Scaling AI systems | Understanding where this prototype would break |
- Main README - Project overview
- Setup Guide - Comprehensive setup guide
- GPU + Model Notes - Model specifications and VRAM requirements
