Client Project: Experimental AI Chat Application with Local Model Integration

React · Vite · Docker · Bun · Elysia · pgAdmin · Ollama · Postgres

Required Disclaimer

This prototype was built as part of a client-student partnership through Codesmith's Future Code program. It explores solutions to a real-world case study provided by an external partner. This work does not represent employment or contracting with the partner. All intellectual property belongs to the partner. This is a time-boxed MVP and not a production system.

Overview

This is my branch from https://github.com/kevinortiz43/Customer-support-AI-powered-product.

For the original project, I architected an experimental offline/local AI branch feature (backend), drove the project's cache-aside strategy, and orchestrated its OS-agnostic ETL pipeline for dynamically seeding the PostgreSQL database.

The goal was to build a responsive AI chatbot using only free, open-source models running locally. Free models are less capable than paid ones, and many of them on HuggingFace have no inference provider available, so they can only be run after being downloaded. The question was: how useful could they be?

Model Architecture

The offline AI setup uses 2 models:

Model 1: Text-to-SQL model for query translation
Model 2: Dual-purpose model
- Generates human-friendly responses from returned results
- Evaluates result quality

Performance Optimization

This setup includes a preloading script for seamless model switching. Both models warm up when the application starts. If a model instead loads during a user request, that request can take roughly 120 times as long (35.80 seconds vs. 299 ms, an 11,873% increase).
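The warm-up idea can be sketched in TypeScript as follows. This is a minimal illustration, not the project's preloading script: the model names, the 30-minute keep_alive value, and the helper names buildWarmupBody and warmUpModels are assumptions. It relies on Ollama's documented behavior of loading a model into memory when /api/generate receives a request without a prompt.

```typescript
// Sketch: warm up Ollama models at startup so the first user request
// does not pay the model-load cost. All names here are hypothetical.

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

interface WarmupResult {
  model: string;
  ok: boolean;
  ms: number;
}

// A request with no prompt asks Ollama to load the model only;
// keep_alive controls how long it stays resident afterwards.
function buildWarmupBody(model: string, keepAlive = "30m"): string {
  return JSON.stringify({ model, keep_alive: keepAlive });
}

async function warmUpModels(
  models: string[],
  baseUrl: string,
  fetchImpl: FetchLike,
): Promise<WarmupResult[]> {
  const results: WarmupResult[] = [];
  // Load one model at a time so two loads do not compete for memory.
  for (const model of models) {
    const started = Date.now();
    let ok = false;
    try {
      const res = await fetchImpl(baseUrl + "/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: buildWarmupBody(model),
      });
      ok = res.ok;
    } catch {
      ok = false; // Ollama not reachable yet; the caller may retry
    }
    results.push({ model, ok, ms: Date.now() - started });
  }
  return results;
}
```

At startup this could be called as warmUpModels(["sql-model", "chat-model"], "http://localhost:11434", fetch), where both model names are placeholders. Passing fetch in as a parameter keeps the function easy to exercise without a running Ollama instance.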

Why This Exists

Local inference: No external API dependencies
Open source model comparison: Evaluate performance of freely available models
GPU acceleration: Optional GPU support (see docker-compose.yml)
Model preloading: Optional Linux script to pre-load both models
Hot-swappable models: Optional Linux script to switch models at runtime without restart

Architecture Summary

High-Level Patterns

Cache-aside pattern: Optimize for frequent queries
Query routing: Keyword text search for simple queries vs AI path for complex queries
Text-to-SQL model: AI conversion of natural language to SQL queries
Response generation model: SQL results to human-readable text
LLM-as-Judge: Automated quality evaluation
Non-blocking evaluation: Async result scoring
Dynamic database seeding: Automated ETL pipeline (OS-agnostic)
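To make the query-routing pattern concrete, here is a minimal rule-based sketch in TypeScript. The word limit and the hint list are invented for illustration and are not the project's actual rules; routeQuery is a hypothetical name.

```typescript
// Sketch of rule-based routing: short lookups go to keyword search,
// anything that looks analytical goes to the text-to-SQL path.
// The thresholds and hint words below are assumptions.

type Route = "keyword" | "ai";

const COMPLEX_HINTS =
  /\b(how many|count|average|total|compare|between|most|least|per)\b/i;
const MAX_SIMPLE_WORDS = 6;

function routeQuery(query: string): Route {
  const trimmed = query.trim();
  const wordCount = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  // Long questions usually need more than a keyword match.
  if (wordCount > MAX_SIMPLE_WORDS) return "ai";
  // Aggregation or comparison words suggest a real SQL query is needed.
  if (COMPLEX_HINTS.test(trimmed)) return "ai";
  return "keyword";
}
```

Rules like these are cheap and predictable, which suits a single-node prototype, but they will misroute some queries.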

AI Implementation

The offline AI system uses:

Text-to-SQL model (7B): Translates natural language to SQL queries
Response model (7B): Formats raw results into conversational answers
Judge model: Asynchronously evaluates SQL quality without blocking users (by default, the response model also fills this role)
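The non-blocking judge step can be sketched as a fire-and-forget call. This is an illustration under assumptions: the Judge signature, the evaluationLog array, and respondThenEvaluate are hypothetical names, and the real judge would call the local model instead of a stub.

```typescript
// Sketch: return the answer immediately and score the SQL in the background.
// All names are hypothetical.

interface Evaluation {
  question: string;
  sql: string;
  score: number;
}

type Judge = (question: string, sql: string) => Promise<number>;

const evaluationLog: Evaluation[] = [];

function respondThenEvaluate(
  question: string,
  sql: string,
  answer: string,
  judge: Judge,
): string {
  // Deliberately not awaited: the user never waits for the judge.
  void judge(question, sql)
    .then((score) => {
      evaluationLog.push({ question, sql, score });
    })
    .catch((err) => {
      // A failed evaluation must never break the user-facing response.
      console.error("evaluation failed:", err);
    });
  return answer;
}
```

The trade-off of this design is that a low score arrives only after the answer has already been shown, so it can inform logging and later review but cannot stop a bad response.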

Request Flow:

Cache check: Exact-match caching (5-minute TTL)
Complexity routing: Simple queries use keyword search; complex queries trigger SQL generation
Response generation: Results formatted into natural language
Non-blocking evaluation: SQL quality scored and logged asynchronously
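The cache check at the start of this flow can be sketched as an exact-match, in-memory cache with a 5-minute TTL. The class and function names are hypothetical, and the injectable clock exists only to make expiry easy to test.

```typescript
// Sketch of cache-aside with exact-match keys and a fixed TTL.
// TtlCache and cacheAside are illustrative names, not the project's code.

interface Entry<V> {
  value: V;
  expiresAt: number;
}

class TtlCache<V> {
  private readonly store = new Map<string, Entry<V>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Cache-aside: check the cache first, fall through to the loader on a miss,
// then store the fresh value for later requests.
async function cacheAside<V>(
  cache: TtlCache<V>,
  key: string,
  load: () => Promise<V>,
): Promise<V> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = await load();
  cache.set(key, value);
  return value;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;
```

Because the key is the raw query text, two differently worded questions with the same meaning are cached separately.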

See AI Architecture Deep Dive for flow diagrams and component details.

Prototype Status & Production Considerations

This system was built in under 2 weeks to demonstrate architectural patterns, not to be production-ready. Below is an honest assessment of where it stands and what a production version would require, such as a hybrid approach with RAG and fine-tuned LLMs. Although this is an offline system, some security and privacy guardrails should probably still be in place:

Layer	Current Implementation	What Production Would Add
Models	Model 1 handles SQL generation, Model 2 handles response formatting & evaluation	Specialized fine-tuned models for each task with higher accuracy
Context Strategy	In-context learning (schema + examples in prompt)	RAG for dynamic/large schemas
Security	Basic SQL execution with SELECT-only enforcement	AI gateway with prompt injection detection, SQL injection prevention
Validation	LLM-as-Judge (asynchronous) with results count verification	Human-in-the-loop evaluation, handling complex or edge cases + semantic correctness metrics
Caching	Exact query match and in-memory API response caching (5-min TTL)	Semantic caching (cache by meaning) + partial result caching
Post-processing	Regex-based SQL cleaning to handle model hallucinations	Fine-tuning reduces need for post-processing
Scalability	In-memory cache, single-node, rule-based routing	Distributed cache (Redis), horizontal scaling
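The Security and Post-processing rows can be illustrated together: clean the model's raw output, then refuse anything that is not a single SELECT statement. This is a sketch under assumptions, not the project's regexes; cleanModelSql and isSelectOnly are hypothetical names, and faqs in the usage note is a made-up table.

```typescript
// Sketch: regex-based cleanup of model output plus a SELECT-only gate.
// Deliberately conservative and not a substitute for a read-only DB role.

function cleanModelSql(raw: string): string {
  // Remove backtick fences and an optional "sql" language tag.
  let sql = raw.replace(/`+\s*(?:sql\b)?/gi, " ").trim();
  // Drop any chatter before the first SELECT keyword.
  const start = sql.search(/\bselect\b/i);
  if (start > 0) sql = sql.slice(start);
  // Keep only the first statement. A semicolon inside a string literal
  // would be cut here too, which this sketch accepts.
  const semi = sql.indexOf(";");
  if (semi !== -1) sql = sql.slice(0, semi);
  return sql.trim();
}

function isSelectOnly(sql: string): boolean {
  if (!/^\s*select\b/i.test(sql)) return false;
  if (sql.includes(";")) return false; // no stacked statements
  // Rejects write/DDL keywords anywhere, even inside string literals.
  const forbidden =
    /\b(insert|update|delete|drop|alter|create|truncate|grant|revoke)\b/i;
  return !forbidden.test(sql);
}
```

For example, a model reply that wraps SELECT id FROM faqs WHERE category = 'billing'; in a fenced block with a sentence of preamble is reduced to the bare statement without the semicolon, which then passes isSelectOnly. A keyword filter like this can reject legitimate queries and can be bypassed, so a database role limited to read access is the stronger control.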

What Works Now:

Complete end-to-end pipeline from query to response
Model specialization (option for separate models for SQL generation, response generation & evaluation)
Non-blocking evaluation preserves user experience
Local-first ensures no API costs and protects data privacy

What I'd Explore Next

Priority	Direction	Why It Matters
Immediate	Determine which RAG design is best for use case (likely involves hybrid search, reranking layer, structured metadata filtering)	RAG would enable handling new data without retraining but has trade-offs (memory drift, increased token usage, extra latency, extra costs and complexity)
Immediate	Fine-tune specialized models (determine how small these can be)	Replace generic models with versions trained on actual usage patterns for higher accuracy
Immediate	Add observability layer, possibly governance	Need tracing, logging, and a way to ensure only updated, authorized docs are in the vector DB
Immediate	Improve evaluation layer	Flag low-confidence SQL for human review; use corrections to continuously improve
Near-term	Semantic caching	Cache based on query meaning rather than exact text to improve hit rates at scale
Ongoing	Scalability	Distributed caching, horizontal scaling

Dynamic Database Seeding

The PostgreSQL database can be dynamically seeded from raw Relay-like JSON files. Run bun run setup to execute the OS-agnostic ETL pipeline:

Extract: Parse ALL JSON files from src/server/data/
Transform: Convert JSON to CSV format
Load: Import data into PostgreSQL
Generate: Create TypeScript schemas from database types

Requirements: Input JSON must be relatively flat and follow consistent structure.
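The Transform step can be sketched as below for records that are already flat. This is an illustration, not the project's pipeline: toCsv and escapeCsv are hypothetical helpers, the header row is taken from the first record (which is why a consistent structure matters), and unwrapping the Relay-like envelope is assumed to have happened earlier.

```typescript
// Sketch: convert flat JSON records to CSV text for loading into PostgreSQL.
// Helper names and the record shape are assumptions.

type Cell = string | number | boolean | null;
type FlatRecord = Record<string, Cell>;

function escapeCsv(value: Cell): string {
  if (value === null) return ""; // empty field for missing values
  const text = String(value);
  // Quote fields containing quotes, commas, or line breaks,
  // doubling any embedded quotes.
  if (/[",\n\r]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

function toCsv(records: FlatRecord[]): string {
  const [first] = records;
  if (first === undefined) return "";
  // The first record defines the columns; extra keys in later records
  // are ignored and missing keys become empty fields.
  const headers = Object.keys(first);
  const lines = [headers.map(escapeCsv).join(",")];
  for (const record of records) {
    lines.push(headers.map((h) => escapeCsv(record[h] ?? null)).join(","));
  }
  return lines.join("\n");
}
```

Nested objects or arrays would need to be flattened or serialized before this step, which is the reason for the flatness requirement above.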

Documentation

Setup Guide - Comprehensive setup guide
AI Architecture Deep Dive - Detailed AI flow documentation
GPU + Model Notes - Model specifications and VRAM requirements
