
Developer Guide

This guide provides step-by-step instructions for developing and testing NeuralNav.

Table of Contents

Development Environment Setup

Prerequisites

Ensure you have all required tools installed:

make check-prereqs

This checks for:

  • Docker or Podman (running)
  • Python 3.11+
  • Ollama
  • kubectl
  • KIND

Container Runtime Support

NeuralNav supports both Docker and Podman as container runtimes.

Compatibility Matrix

| Component | Docker | Podman | Notes |
|---|---|---|---|
| PostgreSQL (db-* targets) | ✅ | ✅ | Works with either |
| Simulator build/push/pull | ✅ | ✅ | Works with either |
| KIND cluster (cluster-* targets) | ✅ | ❌ | KIND requires Docker |
| Docker Compose (docker-* targets) | ✅ | ⚠️ | Requires podman-compose |

Auto-Detection Behavior

The Makefile automatically detects which container runtime is available and running:

  • If Docker daemon is running: Docker is used (for KIND compatibility)
  • If only Podman daemon is running: Podman is used automatically
  • If neither daemon is running: Commands will fail with a helpful error

This means if you quit Docker Desktop, the Makefile will automatically use Podman (if its machine is running), and vice versa.
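The detection order above can be sketched as a small helper (hypothetical Python, not the actual Makefile logic; the daemon probe is injectable so it can be stubbed in tests):

```python
import subprocess

def daemon_running(tool: str) -> bool:
    """Probe whether a runtime's daemon answers, like `docker info` does."""
    try:
        return subprocess.run(
            [tool, "info"], capture_output=True, timeout=10
        ).returncode == 0
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False

def pick_container_tool(probe=daemon_running) -> str:
    """Mirror the preference order described above: Docker first (for KIND
    compatibility), then Podman, otherwise fail with a helpful error."""
    for tool in ("docker", "podman"):
        if probe(tool):
            return tool
    raise RuntimeError(
        "Neither Docker nor Podman daemon is running; "
        "start one or set CONTAINER_TOOL explicitly"
    )
```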

Setting the Container Tool

Option 1: Per-command override

CONTAINER_TOOL=podman make db-start
CONTAINER_TOOL=podman make db-load-guidellm

Option 2: Export for your shell session

export CONTAINER_TOOL=podman
make db-start
make db-load-guidellm
# All commands in this session will use Podman

Option 3: Create a .env file (persistent)

echo "CONTAINER_TOOL=podman" >> .env

The Makefile automatically loads .env, so all subsequent make commands will use Podman. The .env file is in .gitignore so it won't affect other developers.

Using Podman on macOS

Podman on macOS requires a Linux VM to run containers:

# First time setup - initialize the Podman machine
podman machine init

# Start the Podman machine (required before each use, or after reboot)
podman machine start

Important: When starting the Podman machine, you may see this message:

Another process was listening on the default Docker API socket address.
You can still connect Docker API clients by setting DOCKER_HOST using the
following command in your terminal session:

    export DOCKER_HOST='unix:///var/folders/.../podman-machine-default-api.sock'

Do NOT set DOCKER_HOST - this redirects the Docker CLI to Podman, which causes confusion. Instead, use CONTAINER_TOOL=podman as described above.

Port Conflicts

If you switch between Docker and Podman, you may encounter port conflicts:

Error: listen tcp :5432: bind: address already in use

To resolve:

  1. Stop and remove the container in the other runtime:

    # If switching TO Podman, clean up Docker first:
    docker stop neuralnav-postgres && docker rm neuralnav-postgres
    
    # If switching TO Docker, clean up Podman first:
    podman stop neuralnav-postgres && podman rm neuralnav-postgres
  2. If the port is still in use, check what's holding it:

    lsof -i :5432
  3. You may need to restart Docker Desktop if it has a stale port binding.
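For step 2, a programmatic alternative to lsof is a bind probe (a sketch; it reports any listener on the port, regardless of which runtime holds it):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Rough equivalent of `lsof -i :PORT`: try to bind the port;
    failure to bind means something is already listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return False   # bind succeeded: port is free
        except OSError:
            return True    # bind failed: port is taken
```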

Mixed Docker/Podman Setup

You can use Podman for simple containers while keeping Docker for KIND:

  1. Keep Docker Desktop installed (required for KIND clusters)
  2. Use CONTAINER_TOOL=podman or .env for database operations
  3. KIND cluster commands (make cluster-*) will use Docker automatically via scripts/kind-cluster.sh

Example workflow:

# Use Podman for database
export CONTAINER_TOOL=podman
make db-start
make db-load-guidellm

# KIND still uses Docker (no change needed)
make cluster-start

Initial Setup

Create virtual environments and install dependencies:

make setup

This creates a single shared virtual environment in venv/ (at project root) used by both the backend and UI.

Component Startup Sequence

The system consists of 4 main components that must start in order:

1. Ollama Service

Purpose: LLM inference for intent extraction

Start:

make start-ollama

Manual start:

ollama serve

Verify:

curl http://localhost:11434/api/tags
ollama list  # Should show qwen2.5:7b

2. FastAPI Backend

Purpose: Recommendation engine, workflow orchestration, API endpoints

Start:

make start-backend

Manual start:

source venv/bin/activate
uvicorn neuralnav.api.main:app --reload --host 0.0.0.0 --port 8000

Verify:

curl http://localhost:8000/health
# Should return: {"status":"healthy"}

API Documentation: http://localhost:8000/docs (FastAPI's interactive Swagger UI)

3. Streamlit UI

Purpose: Conversational interface, recommendation display

Start:

make start-ui

Manual start:

source venv/bin/activate
streamlit run ui/app.py

Access: http://localhost:8501

Note: UI runs from project root to access docs/ assets

4. KIND Cluster (Optional)

Purpose: Local Kubernetes for deployment testing

Start:

make cluster-start

Manual start:

scripts/kind-cluster.sh start

Verify:

kubectl cluster-info
kubectl get pods -A
make cluster-status

Development Workflows

Quick Development Cycle

Start all services:

make start

Make code changes, then:

  • Backend changes: Auto-reloads (uvicorn --reload flag)
  • UI changes: Refresh browser (Streamlit auto-detects changes)
  • Data changes: Restart backend to reload JSON files

Stop services:

make stop       # Stop Backend + UI (leaves Ollama and DB running)
make stop-all   # Stop everything including Ollama and DB

Working on Specific Components

Backend only:

make start-backend
make logs-backend  # Tail logs

UI only (requires backend running):

make start-ui
make logs-ui

Test API endpoints:

# Get recommendation
curl -X POST http://localhost:8000/api/v1/recommend \
  -H "Content-Type: application/json" \
  -d '{"message": "I need a chatbot for 1000 users"}'

Database Management

Benchmark data can be managed via the CLI (make targets), the REST API, or the UI's Configuration tab.

CLI (local development):

make db-load-blis         # Load BLIS benchmark data
make db-load-guidellm     # Load GuideLLM benchmark data
make db-reset             # Reset database (remove all data and reinitialize)

REST API (remote/Kubernetes deployments):

# Check database status
curl http://localhost:8000/api/v1/db/status

# Upload a benchmark JSON file
curl -X POST -F 'file=@data/benchmarks/performance/benchmarks_BLIS.json' \
  http://localhost:8000/api/v1/db/upload-benchmarks

# Reset database (remove all benchmark data)
curl -X POST http://localhost:8000/api/v1/db/reset

UI (Configuration tab):

  1. Open the UI at http://localhost:8501
  2. Go to the Configuration tab
  3. Use Upload Benchmarks to load a JSON file with a top-level benchmarks array
  4. Use Reset Database to remove all benchmark data
  5. Database statistics (total benchmarks, models, hardware types) are displayed at the top and refresh after each action

All loading methods are append-mode — duplicates (same model/hardware/traffic/load config) are silently skipped via ON CONFLICT (config_id) DO NOTHING.
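The skip-on-duplicate behavior can be illustrated with SQLite, which shares the `ON CONFLICT ... DO NOTHING` syntax (the real loader targets PostgreSQL; the table and column names here are simplified stand-ins):

```python
import sqlite3

# Append-mode semantics: inserting the same config_id twice leaves the
# first row untouched and silently skips the second, matching
# ON CONFLICT (config_id) DO NOTHING in the real loader.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE benchmarks (config_id TEXT PRIMARY KEY, ttft_ms REAL)")

def load_benchmark(config_id, ttft_ms):
    db.execute(
        "INSERT INTO benchmarks VALUES (?, ?) "
        "ON CONFLICT (config_id) DO NOTHING",
        (config_id, ttft_ms),
    )

load_benchmark("llama3-8b/L4/tp1", 120.0)
load_benchmark("llama3-8b/L4/tp1", 999.0)   # duplicate: skipped, not updated
count, = db.execute("SELECT COUNT(*) FROM benchmarks").fetchone()
ttft, = db.execute("SELECT ttft_ms FROM benchmarks").fetchone()
```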

Core loading logic lives in src/neuralnav/knowledge_base/loader.py and is shared by the CLI script, API endpoints, and UI.

Cluster Development

Create cluster:

make cluster-start
# Builds simulator, creates cluster, loads image

Deploy from UI:

  1. Get recommendation
  2. Generate YAML
  3. Click "Deploy to Kubernetes"
  4. Monitor in "Deployment Management" tab

Manual deployment:

# After generating YAML via UI
kubectl apply -f generated_configs/kserve-inferenceservice.yaml
kubectl get inferenceservices
kubectl get pods

Clean up deployments:

make clean-deployments  # Delete all InferenceServices

Restart cluster:

make cluster-restart  # Fresh cluster

Testing

Unit Tests

Test individual components without external dependencies:

make test-unit

Database Tests

Test PostgreSQL benchmark queries using an isolated neuralnav_test database with static fixture data (your production database is never touched):

make test-db

Requires PostgreSQL running (make db-start).

Integration Tests

Test the full recommendation workflow including LLM-powered intent extraction:

make test-integration

Requires Ollama running with qwen2.5:7b model and PostgreSQL.

Run All Tests

make test

Runs all three tiers: unit, database, and integration.

Debugging

Logging

NeuralNav implements comprehensive logging to help you debug and monitor the system. For complete logging documentation, see docs/LOGGING.md.

Quick Start:

Enable debug logging to see full LLM prompts and responses:

# Enable debug mode
export NEURALNAV_DEBUG=true
make start-backend

# Or inline:
NEURALNAV_DEBUG=true make start-backend

Log Levels:

  • INFO (default): User requests, workflow steps, LLM metadata, results
  • DEBUG: Full LLM prompts, complete responses, detailed timing

Log Locations:

  • Console output (stdout/stderr)
  • logs/backend.log - Main application logs
  • logs/neuralnav.log - Structured detailed logs

Common Log Searches:

# View all user requests
grep "\[USER MESSAGE\]" logs/backend.log

# View LLM prompts (DEBUG mode only)
grep "\[LLM PROMPT\]" logs/backend.log

# View extracted intents
grep "\[EXTRACTED INTENT\]" logs/backend.log

# Follow a complete request flow
grep -A 50 "USER REQUEST" logs/backend.log

Log Tags:

  • [USER REQUEST] - User request start
  • [USER MESSAGE] - User's actual message
  • [LLM REQUEST] - Request to LLM (metadata)
  • [LLM PROMPT] - Full prompt text (DEBUG only)
  • [LLM RESPONSE] - Response from LLM (metadata)
  • [LLM RESPONSE CONTENT] - Full response text (DEBUG only)
  • [EXTRACTED INTENT] - Parsed intent from LLM
  • Step 1, Step 2, etc. - Workflow progress
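A grep-style helper can pull a tag's payload out of the log lines (a sketch; the timestamp format used in the example data is an assumption):

```python
import re

# Matches the bracketed tags listed above and captures the rest of the line.
TAG_RE = re.compile(r"\[(USER MESSAGE|LLM PROMPT|EXTRACTED INTENT)\]\s*(.*)")

def find_tagged(lines, tag):
    """Return the payload of every log line carrying the given tag."""
    out = []
    for line in lines:
        m = TAG_RE.search(line)
        if m and m.group(1) == tag:
            out.append(m.group(2))
    return out
```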

Privacy Note: DEBUG mode logs contain full user messages and LLM interactions. Only use in development/testing.

View Logs

Backend logs:

make logs-backend
# Or manually:
tail -f .pids/backend.pid.log
# Or for detailed logs:
tail -f logs/backend.log

UI logs:

make logs-ui
# Or manually:
tail -f .pids/ui.pid.log

Kubernetes pod logs:

kubectl logs -f <pod-name>
kubectl describe pod <pod-name>

Check Service Health

make health

Checks the status of each running service (backend, UI, Ollama, and database).

Debug Intent Extraction

Test LLM client directly:

source venv/bin/activate
python -c "
from neuralnav.llm.ollama_client import OllamaClient
from neuralnav.intent_extraction.extractor import IntentExtractor

client = OllamaClient()
extractor = IntentExtractor(client)

message = 'I need a chatbot for 5000 users with low latency'
intent = extractor.extract_intent(message)
print(intent)
"

Debug Recommendations

Test recommendation engine:

source venv/bin/activate
python -c "
from neuralnav.orchestration.workflow import RecommendationWorkflow

workflow = RecommendationWorkflow()
rec = workflow.generate_recommendation('I need a chatbot for 1000 users')
print(rec)
"

Debug Cluster Deployments

Check InferenceService status:

kubectl get inferenceservices
kubectl describe inferenceservice <deployment-id>

Check pod status:

kubectl get pods
kubectl describe pod <pod-name>
kubectl logs <pod-name>

Port-forward to service:

kubectl port-forward svc/<deployment-id>-predictor 8080:80
curl http://localhost:8080/health

Making Changes

Adding a New Model

  1. Add model to data/configuration/model_catalog.json:
{
  "model_id": "new-model-id",
  "name": "New Model Name",
  "size_parameters": "7B",
  "context_length": 8192,
  "supported_tasks": ["chat", "instruction_following"],
  "recommended_for": ["chatbot"],
  "domain_specialization": ["general"]
}
  2. Add benchmarks to the benchmark database
  3. Restart backend: make restart
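A small pre-flight check can catch a malformed catalog entry before restarting the backend (hypothetical helper; the required fields are taken from the JSON example above):

```python
# Fields every model_catalog.json entry carries, per the example above.
REQUIRED_FIELDS = {
    "model_id", "name", "size_parameters", "context_length",
    "supported_tasks", "recommended_for", "domain_specialization",
}

def validate_catalog_entry(entry: dict) -> list[str]:
    """Return the names of any required catalog fields missing from a new
    entry, so a typo is caught before the backend reloads the file."""
    return sorted(REQUIRED_FIELDS - entry.keys())
```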

Adding a New Use Case Template

  1. Add template to data/configuration/slo_templates.json:
{
  "use_case": "new_use_case",
  "description": "Description",
  "prompt_tokens_mean": 200,
  "generation_tokens_mean": 150,
  "ttft_p90_target_ms": 250,
  "tpot_p90_target_ms": 60,
  "e2e_p90_target_ms": 3000
}
  2. Update src/neuralnav/intent_extraction/extractor.py USE_CASE_MAP
  3. Restart backend

Modifying the UI

UI code is in ui/app.py. Changes auto-reload in the browser.

Key sections:

  • render_chat_interface() - Chat input/history
  • render_recommendation() - Recommendation tabs
  • render_deployment_management_tab() - Cluster management
  • render_configuration_tab() - Database management (in ui/components/settings.py)

Modifying the Recommendation Algorithm

Model scoring: src/neuralnav/recommendation/scorer.py

  • Scorer class - Adjust scoring weights

Capacity planning: src/neuralnav/recommendation/config_finder.py

  • plan_capacity() - GPU sizing logic
  • _calculate_required_replicas() - Scaling calculations

Traffic profiling: src/neuralnav/specification/traffic_profile.py

  • generate_profile() - Traffic estimation
  • generate_slo_targets() - SLO target generation
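The scaling arithmetic in _calculate_required_replicas() presumably reduces to a ceiling division over per-replica throughput; a hedged sketch (the headroom factor and exact formula are assumptions, not the real implementation):

```python
import math

def required_replicas(request_rate_rps, per_replica_throughput_rps, headroom=0.8):
    """Hypothetical sketch of replica sizing: run each replica at `headroom`
    of its benchmarked throughput and round up to cover the request rate."""
    usable = per_replica_throughput_rps * headroom
    return max(1, math.ceil(request_rate_rps / usable))
```

For example, 10 req/s against replicas benchmarked at 4 req/s with 80% headroom needs ceil(10 / 3.2) = 4 replicas.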

Code Quality

Lint code:

make lint

Format code:

make format

Both use the shared project venv at root.

Simulator Development

Building the Simulator

make build-simulator

Creates vllm-simulator:latest Docker image.

Testing the Simulator Locally

# Can use podman instead of docker
docker run -p 8080:8080 \
  -e MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.3 \
  -e GPU_TYPE=NVIDIA-L4 \
  -e TENSOR_PARALLEL_SIZE=1 \
  vllm-simulator:latest

# Test
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 10}'

Pushing to Quay.io

make push-simulator

Auto-prompts for login if not authenticated.

Clean Up

Remove generated files:

make clean

Remove everything (including venvs):

make clean-all

Remove cluster:

make cluster-stop

Code Quality

Linting and Formatting

NeuralNav uses Ruff for linting and code formatting.

Run linter:

make lint

Or manually:

source venv/bin/activate
ruff check src/ ui/

Auto-fix issues:

source venv/bin/activate
ruff check src/ ui/ --fix

Format code:

source venv/bin/activate
ruff format src/ ui/

Configuration: Ruff is configured in pyproject.toml with:

  • Line length: 100 characters
  • Python 3.11+ syntax
  • Import sorting (isort)
  • Modern Python upgrades
  • Common bug detection

Before committing: Always run make lint to catch issues early. Most issues can be auto-fixed with ruff check --fix.

Useful Commands

See all available make targets:

make help

Show configuration:

make info

Open UI in browser:

make open-ui

Open API docs:

make open-backend

Alternative Setup Methods

Manual Backend Installation

uv sync

Manual Frontend Installation

The UI shares the same virtual environment as the backend (managed by uv):

uv sync  # Same command — all deps are in pyproject.toml

Manual Ollama Model Pull

The POC uses qwen2.5:7b for intent extraction:

ollama pull qwen2.5:7b

Alternative models (if needed):

  • llama3.2:3b - Smaller/faster, less accurate
  • mistral:7b - Good balance of speed and quality

Verify Ollama Setup

# Test Ollama is working
ollama list  # Should show qwen2.5:7b

Running Services Manually

Option 1: Run Full Stack with UI (Recommended)

The easiest way to use NeuralNav:

# Terminal 1 - Start Ollama (if not already running)
ollama serve

# Terminal 2 - Start FastAPI Backend
scripts/run_api.sh

# Terminal 3 - Start Streamlit UI
scripts/run_ui.sh

Then open http://localhost:8501 in your browser.

Option 2: Test End-to-End Workflow

Test the complete recommendation workflow with demo scenarios. Requires Ollama running with qwen2.5:7b and PostgreSQL with benchmark data:

uv run pytest tests/test_recommendation_workflow.py -v

This tests all 3 demo scenarios end-to-end.

Option 3: Run FastAPI Backend Only

Start the API server:

scripts/run_api.sh

Or manually:

source venv/bin/activate
uvicorn neuralnav.api.main:app --reload --host 0.0.0.0 --port 8000

Test the API:

# Health check
curl http://localhost:8000/health

# Full recommendation
curl -X POST http://localhost:8000/api/v1/recommend \
  -H "Content-Type: application/json" \
  -d '{"message": "I need a chatbot for 5000 users with low latency"}'

Option 4: Test Individual Components

Test the LLM client:

source venv/bin/activate
python -c "
from neuralnav.llm.ollama_client import OllamaClient
client = OllamaClient(model='llama3.2:3b')
print('Ollama available:', client.is_available())
print('Pulling model...')
client.ensure_model_pulled()
print('Model ready!')
"

Troubleshooting

Ollama Connection Issues

# Check Ollama is running
curl http://localhost:11434/api/tags

# If not running
ollama serve

Model Not Found

ollama pull llama3.2:3b

Import Errors

# Reinstall dependencies
uv sync

Manual Kubernetes Cluster Setup

KIND Cluster Installation

Install KIND (if not already installed):

brew install kind

Create cluster with KServe:

# Ensure Docker Desktop is running

# Create cluster
kind create cluster --config config/kind-cluster.yaml

# Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml

# Wait for cert-manager
kubectl wait --for=condition=available --timeout=300s -n cert-manager deployment/cert-manager

# Install KServe
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve.yaml
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve-cluster-resources.yaml

# Wait for KServe
kubectl wait --for=condition=available --timeout=300s -n kserve deployment/kserve-controller-manager

# Configure KServe for RawDeployment mode
kubectl patch configmap/inferenceservice-config -n kserve --type=strategic -p '{"data": {"deploy": "{\"defaultDeploymentMode\": \"RawDeployment\"}"}}'
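The -p argument in the last command above is JSON nested inside JSON (the ConfigMap stores the deploy value as a string), which is easy to mis-escape in a shell; building it programmatically makes the nesting explicit:

```python
import json

# The ConfigMap stores `deploy` as a JSON *string*, so the patch body is
# JSON-encoded twice: once for the deploy value, once for the patch itself.
deploy_value = json.dumps({"defaultDeploymentMode": "RawDeployment"})
patch = json.dumps({"data": {"deploy": deploy_value}})
# `patch` is the exact string passed to `kubectl patch ... -p`
```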

Deploy Models Through UI

  1. Get a deployment recommendation from the chat interface
  2. Click "Generate Deployment YAML" in the Actions section
  3. If cluster is accessible, click "Deploy to Kubernetes"
  4. Go to Monitoring tab to see:
    • Real Kubernetes deployment status
    • InferenceService conditions
    • Pod information
    • Performance metrics

Manual Deployment Commands

Deploy generated YAML:

# After generating YAML via UI
kubectl apply -f generated_configs/kserve-inferenceservice.yaml
kubectl get inferenceservices
kubectl get pods

View all resources:

kubectl get pods -A

View deployments:

kubectl get inferenceservices
kubectl get pods

Delete a specific deployment:

kubectl delete inferenceservice <deployment-id>

Check cluster info:

kubectl cluster-info

YAML Deployment Generation

The system automatically generates production-ready Kubernetes configurations:

  • ✅ KServe InferenceService YAML with vLLM configuration
  • ✅ HorizontalPodAutoscaler (HPA) for autoscaling
  • ✅ Prometheus ServiceMonitor for metrics collection
  • ✅ Grafana Dashboard ConfigMap
  • ✅ Full YAML validation before generation
  • ✅ Files written to generated_configs/ directory

How to use:

  1. Get a deployment recommendation from the chat interface
  2. Go to the Cost tab and click "Generate Deployment YAML"
  3. View generated YAML file paths
  4. Check generated_configs/ directory for all YAML files
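The generated InferenceService presumably has roughly this shape (a hypothetical minimal skeleton for orientation only; the real generator adds the vLLM arguments, HPA, and monitoring resources listed above):

```python
def inferenceservice_skeleton(deployment_id, model_uri, gpu_count=1):
    """Hypothetical minimal shape of a KServe InferenceService manifest;
    field names follow the serving.kserve.io/v1beta1 API."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": deployment_id},
        "spec": {
            "predictor": {
                "model": {
                    "storageUri": model_uri,
                    "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
                }
            }
        },
    }
```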

vLLM Simulator Details

Deploy a Model in Simulator Mode (default)

Simulator mode is enabled by default for all deployments:

# Start the UI
scripts/run_ui.sh

# In the UI:
# 1. Get a deployment recommendation
# 2. Click "Generate Deployment YAML"
# 3. Click "Deploy to Kubernetes"
# 4. Go to Monitoring tab
# 5. Pod should become Ready in ~10-15 seconds

Test Inference

Once deployed:

  1. Go to Monitoring tab
  2. See "🧪 Inference Testing" section
  3. Enter a test prompt
  4. Click "🚀 Send Test Request"
  5. View the simulated response and metrics

Switch to Real vLLM

To use real vLLM with actual GPUs (requires GPU-enabled cluster):

# In src/neuralnav/api/routes.py
deployment_generator = DeploymentGenerator(simulator_mode=False)

Then deploy to a GPU-enabled cluster with:

  • NVIDIA GPU Operator installed
  • GPU nodes with appropriate labels
  • Sufficient GPU resources

Simulator vs Real vLLM

| Feature | Simulator Mode | Real vLLM Mode |
|---|---|---|
| GPU Required | ❌ No | ✅ Yes |
| Model Download | ❌ No | ✅ Yes (from HuggingFace) |
| Inference | Canned responses | Real generation |
| Latency | Simulated (from benchmarks) | Actual GPU performance |
| Use Case | Development, testing, demos | Production deployment |
| Cluster | Works on KIND (local) | Requires GPU-enabled cluster |

Testing Details

Quick Tests

Requires Ollama running with qwen2.5:7b and PostgreSQL with benchmark data:

# Test end-to-end workflow
uv run pytest tests/test_recommendation_workflow.py -v

# Test FastAPI endpoints
scripts/run_api.sh  # Start server in terminal 1
# In terminal 2:
curl -X POST http://localhost:8000/api/v1/test

For comprehensive testing instructions, see TESTING.md.