Skip to content

musemod/Offline_AI_Chatbot

Repository files navigation

Client Project: Experimental AI Chat Application with Local Model Integration

React · Vite · Docker · Bun · Elysia · pgAdmin · Ollama · Postgres

Required Disclaimer

This prototype was built as part of a client-student partnership through Codesmith's Future Code program. It explores solutions to a real-world case study provided by an external partner. # This work does not represent employment or contracting with the partner. All intellectual property belongs to the partner. This is a time-boxed MVP and not a production system.

Table of Contents

Overview

This is my branch from https://github.com/kevinortiz43/Customer-support-AI-powered-product.

For the original project, I architected an experimental offline/local AI branch feature (backend), drove the project's cache-aside strategy, and orchestrated its OS-agnostic ETL pipeline for dynamically seeding the PostgreSQL database.

The goal was building a responsive AI chatbot using only free, open-source models running locally. Free models aren't as powerful as paid ones. Many of them on HuggingFace have no inference providers available so can only be run if downloaded directly. The question was: how useful could they be?

Model Architecture

The offline AI setup uses 2 models:

  • Model 1: Text-to-SQL model for query translation
  • Model 2: Dual-purpose model
    • Generates human-friendly responses from returned results
    • Evaluates result quality

Performance Optimization

This setup includes a preloading script for seamless model switching. Both models warm up when the application starts. If they load during a user request, the delay could be up to 11,873% longer (comparing 299 ms vs. 35.80 seconds).

Why This Exists

  • Local inference: No external API dependencies
  • Open source model comparison: Evaluate performance of freely available models
  • GPU acceleration: Optional GPU support (see docker-compose.yml)
  • Model preloading: Optional Linux script to pre-load both models
  • Hot-swappable models: Optional Linux script to switch models at runtime without restart

Architecture Summary

High-Level Patterns

  • Cache-aside pattern: Optimize for frequent queries
  • Query routing: Keyword text search for simple queries vs AI path for complex queries
  • Text-to-SQL model: Natural language to SQL query AI conversion
  • Response generation model: SQL results to human-readable text
  • LLM-as-Judge: Automated quality evaluation
  • Non-blocking evaluation: Async result scoring
  • Dynamic database seeding: Automated ETL pipeline (OS-agnostic)

AI Implementation

The offline AI system uses:

  • Text-to-SQL model (7B): Translates natural language to SQL queries
  • Response model (7B): Formats raw results into conversational answers
  • Judge model: Asynchronously evaluates SQL quality without blocking users

Request Flow:

  1. Cache check: Exact-match caching (5-minute TTL)
  2. Complexity routing: Simple queries use keyword search; complex queries trigger SQL generation
  3. Response generation: Results formatted into natural language
  4. Non-blocking evaluation: SQL quality scored and logged asynchronously

See AI Architecture Deep Dive for flow diagrams and component details.

Prototype Status & Production Considerations

This system was built in under 2 weeks to demonstrate architectural patterns, not to be production-ready. Below is an honest assessment of where it stands and what a production version would require, such a hybrid approach with RAG and fine-tuned LLMs. Although this is an offline system, perhaps some security and privacy guardrails should still be in place:

Layer Current Implementation What Production Would Add
Models Model 1 handles SQL generation, Model 2 handles response formatting & evaluation Specialized fine-tuned models for each task with higher accuracy
Context Strategy In-context learning (schema + examples in prompt) RAG for dynamic/large schemas
Security Basic SQL execution with SELECT-only enforcement AI gateway with prompt injection detection, SQL injection prevention
Validation LLM-as-Judge (asynchronous) with results count verification Human-in-the-loop evaluation, handling complex or edge cases + semantic correctness metrics
Caching Exact query match and in-memory API response caching (5-min TTL) Semantic caching (cache by meaning) + partial result caching
Post-processing Regex-based SQL cleaning to handle model hallucinations Fine-tuning reduces need for post-processing
Scalability In-memory cache, single-node, rule-based routing Distributed cache (Redis), horizontal scaling

What Works Now:

  • Complete end-to-end pipeline from query to response
  • Model specialization (option for separate models for SQL generation, response generation & evaluation)
  • Non-blocking evaluation preserves user experience
  • Local-first ensures no API costs and protects data privacy

What I'd Explore Next

Priority Direction Why It Matters
Immediate Determine which RAG design is best for use case (likely involves hybrid search, reranking layer, structured metadata filtering) RAG would enable handling new data without retraining but has trade-offs (memory drift, increased token usage, extra latency, extra costs and complexity)
Immediate Fine-tune specialized models (determine how small these can be) Replace generic models with versions trained on actual usage patterns for higher accuracy
Immediate Add observability layer, possibly governance Need tracing, logging, and way to ensure only updated, authorized docs are in vector DB
Immediate Improve evaluation layer Flag low-confidence SQL for human review; use corrections to continuously improve
Near-term Semantic caching Cache based on query meaning rather than exact text to improve hit rates at scale
Ongoing Scalability Distributed caching, horizontal scaling

Dynamic Database Seeding

The PostgreSQL database can be dynamically seeded from raw Relay-like JSON files. Run bun run setup to execute the OS-agnostic ETL pipeline:

  1. Extract: Parse ALL JSON files from src/server/data/
  2. Transform: Convert JSON to CSV format
  3. Load: Import data into PostgreSQL
  4. Generate: Create TypeScript schemas from database types

Requirements: Input JSON must be relatively flat and follow consistent structure.

Documentation

About

From bunOllama branch (offline local AI branch)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors