Reflexive Financial AI Agent with semantic routing, dynamic tool selection (ToolRAG), and MCP server deployment.
- Reflexive Architecture: Self-correcting agent with a Generator → Reflector → Revisor loop
- Semantic Routing: Intelligent model selection via vLLM Semantic Router (MoM architecture)
- ToolRAG: Dynamic tool selection - only relevant tools are bound to the LLM
- MCP Deployment: All tools executed via FastMCP (Model Context Protocol)
- LLM-as-a-Judge: Response evaluation using Gemini (configurable via `GEMINI_MODEL`; default `gemini-2.5-pro`; use `gemini-2.5-flash` for faster runs)
The system is organized into three layers:
- Routing layer – The vLLM Semantic Router (MoM) sits in front of all LLM calls and directs each request to the right model: Qwen3 (llama.cpp) for financial and general tasks, and Gemini for evaluation and data generation. The agent talks to a single router endpoint; the router chooses the backend.
- Tooling layer – ToolRAG and the MCP server provide the agent's tools. ToolRAG selects a subset of tools per query via semantic search (ChromaDB), and only those tools are bound to the Generator and Revisor. The MCP server (FastMCP) exposes the actual tools (e.g. yfinance-backed market data); all tool execution goes through this layer.
- Metacognitive layer – The reflexive loop (Generator → Reflector → Revisor) implements self-correction. The Generator produces an answer using the routing and tooling layers; the Reflector (Gemini, via the router) scores it 0–10; if the score is below 8, the Revisor revises and the Reflector re-evaluates, up to three times. This layer is what makes the agent "reflexive" rather than single-shot.
| Step | Flow | Description |
|---|---|---|
| 1 | Query → ToolRAG | Select relevant tools via semantic search |
| 2 | Agent → Router → Qwen3 | Generate response using selected tools |
| 3 | Agent → Router → Gemini | Evaluate response quality (score 0-10) |
| 4 | Score >= 8 → Output | Pass threshold, return response |
| 4 | Score < 8 → Revisor | Revise and re-evaluate (max 3 iterations) |
```
   Step 1        Step 2         Step 3          Step 4

User Query ──► ToolRAG ──► Generator ──► Reflector ──► [Score >= 8?]
                  │            │              ▲              │
                  │            │              │        ┌─────┴─────┐
                  │            │              │        ▼           ▼
                  │            │              │      Output     Revisor
                  │            │              │                    │
                  │            │              └─── re-evaluate ────┘ (max 3)
                  │            │
                  │            └── Router → Qwen3 (MCP tools)
                  │
                  └── Semantic search; selected tools bound to Generator/Revisor
```
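The loop above can be sketched as plain Python (a minimal sketch: the real agent wires these nodes into a LangGraph graph, and `generate`, `reflect`, and `revise` stand in for the router-backed LLM calls):

```python
PASS_THRESHOLD = 8.0   # Reflector scores below this trigger revision
MAX_ITERATIONS = 3     # at most three revise/re-evaluate rounds

def run_reflexive_loop(query, generate, reflect, revise):
    """Generate -> Reflect -> (Revise -> Reflect)* until pass or max iterations."""
    answer = generate(query)                   # Generator (Qwen3 via router)
    score = reflect(query, answer)             # Reflector (Gemini via router)
    iterations = 0
    while score < PASS_THRESHOLD and iterations < MAX_ITERATIONS:
        answer = revise(query, answer, score)  # Revisor (Qwen3 via router)
        score = reflect(query, answer)         # re-evaluate
        iterations += 1
    return answer, score, iterations
```

A run that passes on the first evaluation returns with zero iterations; otherwise the Revisor is invoked until the score clears the threshold or the iteration cap is hit.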
| Component | Technology | Description |
|---|---|---|
| Model Serving | llama.cpp + CUDA | GGUF quantized model serving |
| Agent Model | Qwen3-30B-A3B-Instruct | 30B MoE model (3B active params) |
| Judge Model | Gemini (via GEMINI_MODEL) | Default: gemini-2.5-pro; optional gemini-2.5-flash for speed |
| Orchestration | LangGraph | Stateful graph-based workflow |
| Market Data | yfinance | Stock fundamentals, prices, news |
| Tool Serving | FastMCP | Model Context Protocol server |
| Semantic Router | vLLM-SR | Intelligent request routing |
| Tool Selection | ChromaDB | Vector-based tool retrieval |
- Python 3.11+
- CUDA-capable GPU (DGX Spark / ZGX Nano recommended)
- ~20GB disk space for model
- HuggingFace token (model download)
- Google AI Studio API key (Gemini evaluation)
```
cd /home/vincent/Code/helix-financial-agent

# Make scripts executable
chmod +x scripts/*.sh

# Run setup script
./scripts/setup.sh

# Configure environment
cp .env.example .env
nano .env   # Add your API keys
```

Required in `.env`:

```
HF_TOKEN=hf_your_token_here          # https://huggingface.co/settings/tokens
GEMINI_API_KEY=your_gemini_key_here  # https://aistudio.google.com/app/apikey
# Optional: GEMINI_MODEL=gemini-2.5-pro (default, best quality) or gemini-2.5-flash (faster)
```

Build llama.cpp with CUDA support:

```
git clone https://github.com/ggerganov/llama.cpp.git ~/llama.cpp
cd ~/llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="90"
cmake --build build -j$(nproc)
```

Open 3 terminal windows. All services are required.
Terminal 1 - Model Server:

```
./scripts/start_llama_server.sh
# Wait for: "llama server listening at http://0.0.0.0:8081"
```

Terminal 2 - Semantic Router:

```
./scripts/start_router.sh
# Provides MoM routing between Qwen3 and Gemini
```

Terminal 3 - MCP Server:

```
./scripts/start_mcp_server.sh
# Serves 13 financial tools via FastMCP
```

Run the agent:

```
source .venv/bin/activate

# Random benchmark query (recommended for demo)
helix-agent --random

# Single query
helix-agent --query "What is AAPL's PE ratio?"

# Interactive mode
helix-agent
```

All ports are configured in .env. The same values control where services listen (on the server) and how the port-forward script maps remote → local ports, so configuration stays consistent.
Per vLLM Semantic Router docs: the router container exposes separate ports for the API and the web UI.
- Router service – The API the agent uses: 8801 (Envoy, chat completions) and 8889 (Classification API on the host; container port 8080 for health, `/v1/models`, and classify endpoints). This is the endpoint for model queries before they are routed to Qwen3 or Gemini. The agent talks to the router service; you do not port-forward it for normal agent use.
- Router UI – The web dashboard (Hub) for inspecting the semantic router. It runs on a different port: 8700 (vLLM-SR `DASHBOARD_PORT`). This is the one to port-forward if you want to open the router dashboard in your local browser.
| Service | Default port | .env variable | Description |
|---|---|---|---|
| llama.cpp | 8081 | (in LLAMA_CPP_BASE_URL) | Model inference (OpenAI-compatible) |
| MCP Server | 8000 | MCP_SERVER_PORT | FastMCP tool server (streamable-http) |
| vLLM-SR HTTP | 8801 | ROUTER_HTTP_PORT | Router service: semantic routing entry point (chat completions) |
| vLLM-SR Classify | 8889 | ROUTER_CLASSIFY_PORT | Router service: health, model listing (container port 8080) |
| vLLM-SR Hub UI | 8700 | (fixed in container) | Router dashboard (web UI); separate from Classify |
| vLLM-SR Metrics | 9190 | ROUTER_METRICS_PORT | Prometheus metrics |
| Streamlit UI | 8501 | STREAMLIT_PORT | Eval & Run app |
| MLflow UI | 5000 | MLFLOW_PORT | Experiment tracking |
Port forwarding local bind ports: LOCAL_STREAMLIT_PORT, LOCAL_MLFLOW_PORT, LOCAL_ROUTER_HUB_PORT (defaults 8501, 5001, 8701 to avoid macOS conflicts).
On your local machine, run (ports are read from .env):

```
./scripts/ssh_port_forward.sh <user>@<host>
```

This forwards the remote server ports for Streamlit, Router UI, and MLflow to your local LOCAL_* ports. Defaults use 5001 and 8701 locally (to avoid macOS port 5000/8700 conflicts); override in .env if needed.
After port forwarding, open in your local browser (using the LOCAL_* ports from .env):
- Streamlit Eval & Run UI: http://localhost:8501 (or `LOCAL_STREAMLIT_PORT`)
- Semantic Router Hub UI: http://localhost:8701 (or `LOCAL_ROUTER_HUB_PORT`) – forwards from server port 8700
- MLflow UI: http://localhost:5001 (or `LOCAL_MLFLOW_PORT`) – forwards from server port 5000
Note: Start MLflow UI on the server with ./scripts/run_mlflow_ui.sh (uses MLFLOW_PORT from .env).
ToolRAG selects only relevant tools for each query, keeping the LLM focused and efficient.
- Embed Query: User query embedded via sentence-transformers
- Search Tools: Compare against tool embeddings in ChromaDB
- Filter: Select tools with similarity >= 0.35
- Bind: Only selected tools bound to LLM (generator + revisor)
- Fallback: Use core tools if none selected
| Parameter | Default | Environment Variable |
|---|---|---|
| Threshold | 0.35 | TOOL_RAG_THRESHOLD |
| Embedding Model | all-MiniLM-L6-v2 | EMBEDDING_MODEL |
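The selection logic above can be illustrated with a toy, dependency-free version (a sketch only: the real system embeds with `all-MiniLM-L6-v2` and searches ChromaDB, while this uses a hypothetical bag-of-words embedding to show the threshold filter and core-tool fallback):

```python
import math

TOOL_RAG_THRESHOLD = 0.35
CORE_TOOLS = ["get_stock_fundamentals", "get_historical_prices"]  # assumed fallback set

def embed(text):
    """Toy bag-of-words 'embedding'; stands in for sentence-transformers."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def select_tools(query, tool_descriptions):
    """Return tools whose description similarity >= threshold, else core tools."""
    q = embed(query)
    selected = [name for name, desc in tool_descriptions.items()
                if cosine(q, embed(desc)) >= TOOL_RAG_THRESHOLD]
    return selected or CORE_TOOLS  # fallback when nothing clears the bar
```

Only the returned subset would then be bound to the Generator and Revisor.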
The LLM-as-a-Judge uses Gemini. Set GEMINI_MODEL in .env to choose the model:
| Value | Use case |
|---|---|
| `gemini-2.5-pro` (default) | Best quality for evaluation |
| `gemini-2.5-flash` | Faster and cheaper; good for high-volume runs |
Ensure the same model name is exposed in your semantic router config so evaluation requests are routed to it.
| Aspect | All 13 Tools | Selected Tools |
|---|---|---|
| LLM Context | Bloated | Focused |
| Tool Selection | May pick wrong tool | Precise |
| Latency | Slower | Faster |
The router automatically selects the best model based on request content using model="MoM" (Model of Models).
| Decision | Priority | Triggers | Routes To |
|---|---|---|---|
| `evaluation` | 15 | evaluate, judge, assess, score | Gemini (see GEMINI_MODEL) |
| `data_generation` | 15 | generate, synthetic, dataset | Gemini 2.5 Pro |
| `financial_analysis` | 10 | stock, price, PE ratio, dividend | Qwen3 (llama.cpp) |
| `general` | 5 | (fallback) | Qwen3 (llama.cpp) |
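A keyword-only caricature of this decision table (a sketch: the real router combines embedding similarity with keyword signals, and the model names here mirror those used elsewhere in this README):

```python
# Each decision mirrors a row of the table above; highest priority wins.
DECISIONS = [
    {"name": "evaluation", "priority": 15,
     "keywords": ["evaluate", "judge", "assess", "score"], "route": "gemini-2.5-pro"},
    {"name": "data_generation", "priority": 15,
     "keywords": ["generate", "synthetic", "dataset"], "route": "gemini-2.5-pro"},
    {"name": "financial_analysis", "priority": 10,
     "keywords": ["stock", "price", "pe ratio", "dividend"], "route": "qwen3-30b-a3b"},
    {"name": "general", "priority": 5,
     "keywords": [], "route": "qwen3-30b-a3b"},  # empty keywords = always matches
]

def route(prompt):
    """Pick the highest-priority decision whose keywords match (general is the fallback)."""
    text = prompt.lower()
    matches = [d for d in DECISIONS
               if any(k in text for k in d["keywords"]) or not d["keywords"]]
    best = max(matches, key=lambda d: d["priority"])
    return best["name"], best["route"]
```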
Prompts include markers to help classification:
"[FINANCIAL_ANALYSIS] What is AAPL's PE ratio?" # β Qwen3
"[EVALUATE] Assess this response for accuracy..." # β Gemini# All LLM calls use MoM routing
llm = ChatOpenAI(
base_url="http://localhost:8801/v1", # Router endpoint
model="MoM", # Auto-select model
)| Node | Purpose | Routes To |
|---|---|---|
| Generator | Draft response with tools | Qwen3 |
| Reflector | Evaluate quality (0-10) | Gemini |
| Revisor | Improve based on critique | Qwen3 |
- No hardcoded models – Router decides based on content
- Centralized config – Edit `config/router_config.yaml`
- Observable – Metrics at port 9190
- Easy model swaps – No code changes needed
This section explains the internal architecture of the vLLM Semantic Router and the customizations required for external API routing (e.g., Gemini).
The vLLM Semantic Router (vLLM-SR) consists of three main components running inside a Docker container:
```
┌─────────────────────────────────────────────────────────────────────┐
│                          vLLM-SR Container                          │
│                                                                     │
│  ┌──────────────┐      ┌──────────────┐     ┌─────────────────────┐ │
│  │    Envoy     │─────►│   ExtProc    │────►│   Classification    │ │
│  │    Proxy     │      │   Service    │     │   API (Python)      │ │
│  │    (8801)    │◄─────│   (gRPC)     │◄────│   (8080)            │ │
│  └──────────────┘      └──────────────┘     └─────────────────────┘ │
│         │                                            │              │
│         │ Routes to:                                 │ Downloads:   │
│         ▼                                            ▼              │
│  ┌──────────────┐               ┌───────────────────────────┐       │
│  │   Upstream   │               │    HuggingFace Models     │       │
│  │   Servers    │               │    - Embedding models     │       │
│  │   (Gemini,   │               │    - Classification models│       │
│  │   llama.cpp) │               └───────────────────────────┘       │
│  └──────────────┘                                                   │
└─────────────────────────────────────────────────────────────────────┘
```
| Component | Port | Purpose |
|---|---|---|
| Envoy Proxy | 8801 | HTTP entry point, routes requests based on x-selected-model header |
| ExtProc Service | 50051 (gRPC) | Processes requests, classifies intent, sets routing headers |
| Classification API | 8080 | Health checks, model listing, classification endpoints |
1. Request arrives at Envoy (port 8801) with `model: "MoM"`
2. ExtProc intercepts the request via the gRPC External Processing filter
3. Classification runs embeddings + keyword matching to determine intent
4. ExtProc sets the `x-selected-model` header (e.g., `gemini-2.5-pro` or `qwen3-30b-a3b`)
5. Envoy routes based on `x-selected-model` to the appropriate upstream cluster
6. Response flows back through Envoy to the client
| File | Purpose |
|---|---|
| `config/router_config.yaml` | User-facing config: decisions, signals, model endpoints |
| `config/.vllm-sr/envoy.template.yaml` | Jinja2 template for Envoy configuration |
| `config/.vllm-sr/processed_config.yaml` | Runtime config with resolved secrets (auto-generated) |
| `config/.vllm-sr/router-config.yaml` | Generated router config (inside container) |
The start script performs several important steps:
```
# 1. Load environment variables from .env
source "$PROJECT_ROOT/.env"

# 2. Preprocess config to resolve access_key_env references
#    Converts: access_key_env: "GEMINI_API_KEY"
#    To:       access_key: "actual_api_key_value"
python3 << 'PYEOF'
# ... Python preprocessing script ...
PYEOF

# 3. Start Docker container with volume mounts
docker run -d \
  -v "$CONFIG_PATH":/app/config.yaml:ro \                           # User config
  -v "$VLLM_SR_DIR":/app/.vllm-sr \                                 # State directory
  -v "$ENVOY_TEMPLATE":/app/cli/templates/envoy.template.yaml:ro \  # Custom template
  -e GEMINI_API_KEY=$GEMINI_API_KEY \                               # API key for template
  ...
```

Key Points:
- Config preprocessing happens before the container starts
- Custom Envoy template is mounted over the default template
- API keys are passed as environment variables to the container
Routing to external APIs like Gemini requires special handling because:
- Path Rewriting: Gemini's OpenAI-compatible API expects requests at `/v1beta/openai/v1/chat/completions`, not `/v1/chat/completions`
- Authentication: Requires an `Authorization: Bearer <api_key>` header

The default vLLM-SR configuration format doesn't fully support:

- `base_path` or `path_prefix` for external endpoints (only documented for vLLM instances)
- `access_key_env` resolution (only `access_key` with literal values)
We modified config/.vllm-sr/envoy.template.yaml to handle these cases:
1. Path Prefix Rewriting:

```
{% if model.path_prefix %}
# Generic path_prefix support (if router parses it)
regex_rewrite:
  pattern:
    google_re2: {}
    regex: "^(.*)$"
  substitution: "{{ model.path_prefix }}\\1"
{% elif model.endpoints[0].address == 'generativelanguage.googleapis.com' %}
# Hardcoded fallback for Gemini
regex_rewrite:
  pattern:
    google_re2: {}
    regex: "^(.*)$"
  substitution: "/v1beta/openai\\1"
{% endif %}
```

This rewrites `/v1/chat/completions` → `/v1beta/openai/v1/chat/completions` for Gemini.
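The effect of the rewrite is easy to reproduce in Python (an illustrative sketch, not project code: the regex `^(.*)$` captures the whole request path and the substitution prepends the prefix):

```python
import re

def rewrite_path(path, prefix="/v1beta/openai"):
    """Mimic Envoy's regex_rewrite: prepend the Gemini prefix to the request path."""
    return re.sub(r"^(.*)$", prefix + r"\1", path)
```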
2. Authorization Header Injection:

```
{% if model.access_key %}
request_headers_to_add:
  - header:
      key: "Authorization"
      value: "Bearer {{ model.access_key }}"
    append_action: OVERWRITE_IF_EXISTS_OR_ADD
{% endif %}
```

The `access_key` is populated at template render time from the preprocessed config.
Since vLLM-SR doesn't resolve access_key_env to actual values, start_router.sh preprocesses the config:
```python
# In start_router.sh (embedded Python)
for model in models:
    if 'access_key_env' in model:
        env_var = model.pop('access_key_env')  # e.g., "GEMINI_API_KEY"
        value = os.environ.get(env_var)        # Actual API key
        if value:
            model['access_key'] = value        # Set for template
```

Before preprocessing (router_config.yaml):
- name: "gemini-2.5-pro"
endpoints:
- endpoint: "generativelanguage.googleapis.com"
protocol: "https"
access_key_env: "GEMINI_API_KEY" # Reference, not valueAfter preprocessing (processed_config.yaml):
- name: "gemini-2.5-pro"
endpoints:
- endpoint: "generativelanguage.googleapis.com"
protocol: "https"
access_key: "AIzaSy..." # Actual API key (not committed to git)To add a new external LLM provider (e.g., Anthropic, OpenAI):
1. Add to `router_config.yaml`:

```yaml
providers:
  models:
    - name: "claude-3-opus"
      param_size: "unknown"
      path_prefix: "/v1"   # If different from standard
      endpoints:
        - name: "anthropic"
          endpoint: "api.anthropic.com"
          protocol: "https"
      access_key_env: "ANTHROPIC_API_KEY"
```

2. Add the environment variable to `.env`:

```
ANTHROPIC_API_KEY=sk-ant-...
```

3. Update `start_router.sh` to pass the env var to Docker:

```
[ -n "$ANTHROPIC_API_KEY" ] && ENV_VARS="$ENV_VARS -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY"
```

4. Update the Envoy template if the provider needs special handling (custom paths, headers, etc.)
Check if Envoy started correctly:

```
docker logs vllm-sr-container 2>&1 | grep -E "(error|critical|fatal)"
```

View the generated Envoy config:

```
docker exec vllm-sr-container cat /etc/envoy/envoy.yaml
```

Check routing headers on a request:

```
curl -v -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "test"}]}' 2>&1 | grep "x-vsr"
```

Common issues:
| Symptom | Cause | Fix |
|---|---|---|
| `404 Not Found` | Missing path prefix | Check `regex_rewrite` in Envoy config |
| `401 Unauthorized` | Missing API key | Check `access_key` in processed config |
| `500 Internal Server Error` | ExtProc not ready | Wait for model downloads to complete |
| Connection reset | Envoy crashed | Check logs for config errors |
End-to-end observability for the agent with automatic tracing and custom assessments.
| Component | Traced As | Details |
|---|---|---|
| Generator → ChatOpenAI | CHAT_MODEL span | Prompts, outputs, token usage |
| Tool Executor → ToolNode | TOOL spans | Tool name, arguments, outputs |
| Reflector → ChatOpenAI | CHAT_MODEL span | Evaluation prompts, responses |
| Revisor → ChatOpenAI | CHAT_MODEL span | Revision prompts, responses |
| Full graph execution | CHAIN span | End-to-end timeline |
Per-trace assessments logged for each agent run:
| Assessment | Type | Description |
|---|---|---|
| `tool_selection_successful` | Y/N | Did ToolRAG select the correct tools? |
| `model_selection_successful` | Y/N | Did the router select appropriate models? |
| `judge_score` | 0-10 | Score from LLM-as-a-Judge evaluation |
| `latency_seconds` | float | Total execution time |
| `iteration_count` | int | Number of revision iterations |
Tracing is enabled by default. View traces at http://localhost:5000 after starting the MLflow UI:
```
# Start MLflow UI
mlflow ui --port $MLFLOW_PORT

# Run agent (tracing enabled by default)
helix-agent -q "What is AAPL's PE ratio?"

# Run benchmark with tracing
helix-eval --max-queries 10

# Disable tracing
helix-agent -q "query" --no-tracing
```

| Variable | Default | Description |
|---|---|---|
| `MLFLOW_TRACKING_URI` | `./mlruns` | MLflow tracking URI |
| `MLFLOW_EXPERIMENT_NAME` | `helix-financial-agent` | Experiment name |
When running benchmarks, aggregate metrics are logged to MLflow:
| Metric | Description |
|---|---|
| `avg_correctness_score` | Average judge score for valid queries |
| `valid_pass_rate` | % of valid queries scoring >= 7 |
| `safety_pass_rate` | % of hazard queries correctly refused |
| `tool_selection_accuracy` | % of queries with correct tool selection |
| `avg_agent_time_sec` | Average execution time per query |
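How such aggregates could be computed from per-query results (a hypothetical sketch: the record field names `kind`, `judge_score`, `refused`, and `time_sec` are illustrative, not the project's actual schema):

```python
def aggregate(results):
    """Roll per-query benchmark records up into the metrics listed above."""
    valid = [r for r in results if r["kind"] == "valid"]
    hazard = [r for r in results if r["kind"] == "hazard"]
    return {
        "avg_correctness_score": sum(r["judge_score"] for r in valid) / len(valid),
        "valid_pass_rate": 100.0 * sum(r["judge_score"] >= 7 for r in valid) / len(valid),
        "safety_pass_rate": 100.0 * sum(r["refused"] for r in hazard) / len(hazard),
        "avg_agent_time_sec": sum(r["time_sec"] for r in results) / len(results),
    }
```

Each value would then be logged to MLflow once per benchmark run.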
For production deployments, use a remote MLflow server:
```
# Set remote tracking URI
export MLFLOW_TRACKING_URI=http://mlflow-server:5000

# Or in .env file
MLFLOW_TRACKING_URI=http://mlflow-server:5000
```

Detailed, real-time logging of all agent interactions for debugging and monitoring.
| Category | Details |
|---|---|
| LLM Requests | Model requested, prompt preview, timing |
| LLM Responses | Routed model, response preview, duration |
| Routing Decisions | Requested vs routed model, fallback warnings |
| Tool Calls | Tool name, arguments, outputs |
| Flow Events | Phase transitions, decisions |
| Errors | Full error details with context |
Verbose logging is enabled by default. Disable it with `--quiet`:

```
# Data generation with verbose logging (default)
helix-generate --count 20

# Data generation without verbose logging
helix-generate --count 20 --quiet

# Benchmark with verbose logging (default)
helix-eval --max-queries 10

# Benchmark without verbose logging
helix-eval --max-queries 10 --quiet
```

Real-time log entries show:
```
[  0.05s] Generator Initialized
   ├─ model: MoM
   └─ router_endpoint: http://localhost:8801/v1
[  0.12s] LLM Request [generator/fundamental_basic]
   ├─ model_requested: MoM
   └─ prompt_preview: [GENERATE SYNTHETIC DATA]...
[  1.85s] LLM Response [generator/fundamental_basic]
   ├─ routed_to: gemini-2.5-pro
   └─ duration: 1730ms
[  1.86s] Routing Decision
   ├─ requested: MoM
   └─ routed_to: gemini-2.5-pro
```
After completion, a summary table is printed:
```
════════════════════════════════════════════════════════════════════════
                          EXECUTION SUMMARY
════════════════════════════════════════════════════════════════════════
  Total Time    45.23s
  Log Entries   127
────────────────────────────────────────────────────────────────────────
  LLM Interactions:
  ┌─────────────────┬──────────────────┬──────────┬────────┐
  │ Node            │ Routed To        │ Duration │ Status │
  ├─────────────────┼──────────────────┼──────────┼────────┤
  │ generator/basic │ gemini-2.5-pro   │ 1730ms   │ OK     │
  │ generator/adv   │ gemini-2.5-pro   │ 2150ms   │ OK     │
  └─────────────────┴──────────────────┴──────────┴────────┘
  Total LLM time: 12500ms (12.50s)
  Requests: 10 (10 ok / 0 failed)
────────────────────────────────────────────────────────────────────────
  Routing Summary:
    Qwen3 (local): 0
    Gemini (API):  10
```
When the router selects an unexpected model (e.g., routing to local Qwen when Gemini was expected for data generation), verbose logging highlights this:
```
[  1.85s] Routing Decision (FALLBACK)
   ├─ requested: MoM
   ├─ routed_to: qwen3-30b-a3b
   └─ decision: fallback_to_local
[  1.85s] Routing fallback for fundamental_basic
   ├─ expected: gemini-2.5-pro
   ├─ got: qwen3-30b-a3b
   └─ hint: Generation keywords may not be triggering data_generation decision
```
This helps identify when the semantic router's rules need adjustment without requiring code changes.
| Tool | Description |
|---|---|
| `get_stock_fundamentals` | PE ratio, market cap, dividends, beta |
| `get_historical_prices` | OHLCV, returns, moving averages |
| `get_financial_statements` | Balance sheet, income, cash flow |
| `get_company_news` | Recent headlines |
| Tool | Description |
|---|---|
| `get_options_chain` | Options data |
| `get_institutional_holders` | Institutional ownership |
| `get_insider_transactions` | Insider trading |
| `get_analyst_recommendations` | Analyst ratings |
| `get_earnings_calendar` | Earnings dates |
| `get_sustainability_scores` | ESG scores |
| `get_dividend_history` | Historical dividends |
| `calculate_technical_indicators` | Technical analysis |
| `compare_sector_performance` | Sector comparison |
| Command | Description |
|---|---|
| `helix-agent` | Interactive mode |
| `helix-agent --random` | Random benchmark query |
| `helix-agent -q "query"` | Single query |
| `helix-generate` | Generate synthetic dataset |
| `helix-eval` | Run evaluation benchmark |
| `helix-mcp` | Start MCP server |
| Option | Description |
|---|---|
| `--random, -r` | Random query from benchmark |
| `--query, -q` | Specific query |
| `--dataset, -d` | Custom dataset path |
| `--eval, -e` | Enable evaluation |
| `--no-tool-rag` | Use all tools |
| `--no-tracing` | Disable MLflow tracing |
| `--quiet, -q` | Disable verbose logging |
| Option | Description |
|---|---|
| `--count, -n` | Total queries to generate (default: 100) |
| `--output-dir, -o` | Output directory (default: ./data) |
| `--eval-ratio` | Ratio for evaluation split (default: 0.10) |
| `--valid-ratio` | Ratio of valid vs hazard queries (default: 0.80) |
| `--quiet, -q` | Disable verbose logging |
| Option | Description |
|---|---|
| `--dataset` | Path to JSONL dataset |
| `--max-queries` | Maximum queries to run |
| `--no-tool-rag` | Disable ToolRAG |
| `--no-tracing` | Disable MLflow tracing |
| `--quiet, -q` | Disable verbose logging |
A single-page Streamlit app lets you generate evaluation data, browse the dataset in a table, select a record, and run the agent with evaluation. After each run you can inspect model routing, tool selection (ToolRAG table), and metacognition / reflexive loop (reflection steps).
All ports (where services listen and where the port-forward script binds) are configured in .env. Use the same .env on the server and on your local machine so forwarding stays consistent with the ports services use.
From the project root (with required services started):

```
./scripts/run_streamlit.sh
```

This uses STREAMLIT_PORT from .env (default 8501). Open the URL shown (e.g. http://localhost:8501). Use Generate evaluation data to create a dataset, Refresh from disk in the Dataset section to load it, select a row, and Run agent on selected record to execute and view the three insight panels.
When the app runs on a remote server (DGX Spark, ZGX Nano, etc.), use SSH port forwarding so you can open it in your local browser.
1. On the remote server (SSH into it or use its console), start the app:

```
cd /path/to/helix-financial-agent
./scripts/run_streamlit.sh
```

Leave this running. The app listens on STREAMLIT_PORT from .env (default 8501).
2. On your local machine, run the port-forwarding script (ensure .env has the same server port values, e.g. STREAMLIT_PORT, MLFLOW_PORT, ROUTER_HUB_PORT, so the script forwards to the correct remote ports):

```
./scripts/ssh_port_forward.sh <user>@<host>
```

- user@host – SSH target (e.g. `vincent@dgx-spark.local` or `vincent@192.168.1.50`).
Server ports (STREAMLIT_PORT, MLFLOW_PORT, ROUTER_HUB_PORT) and local bind ports (LOCAL_STREAMLIT_PORT, LOCAL_MLFLOW_PORT, LOCAL_ROUTER_HUB_PORT) are all read from .env.
3. In your browser, open:
- http://localhost:8501 (or the port you set as `LOCAL_STREAMLIT_PORT` in `.env`).
1. Open the Debug Panel: `Ctrl+Shift+D`
2. Select a configuration:
   - `Helix Agent - Single Query`
   - `Helix Agent - Interactive`
   - `MCP Server - Debug`
3. Set breakpoints in:
   - `agent/nodes.py` - Generator, Reflector, Revisor
   - `agent/runner.py` - Main runner
   - `tool_rag/tool_selector.py` - Tool selection
4. Press `F5`
Note: All 3 services must be running before debugging.
If debugging fails with path errors:
Ctrl+Shift+Pβ "Python: Select Interpreter"- Select
.venv/bin/python - Restart VS Code
```
helix-financial-agent/
├── src/helix_financial_agent/
│   ├── agent/            # LangGraph nodes, graph, runner
│   ├── tools/            # Financial tools + MCP server
│   ├── tool_rag/         # ChromaDB tool selection
│   ├── router/           # vLLM-SR client & config generator
│   ├── evaluation/       # LLM-as-a-Judge
│   └── data_generation/  # Synthetic data
├── scripts/
│   ├── start_llama_server.sh  # Start llama.cpp model server
│   ├── start_router.sh        # Start vLLM-SR (with config preprocessing)
│   ├── stop_router.sh         # Stop all router containers
│   ├── start_mcp_server.sh    # Start FastMCP tool server
│   └── ssh_port_forward.sh    # Port forwarding helper
├── config/
│   ├── router_config.yaml     # User-facing router configuration
│   └── .vllm-sr/              # Router runtime state (auto-generated)
│       ├── envoy.template.yaml    # Custom Envoy template (path rewrite, auth)
│       ├── processed_config.yaml  # Config with resolved secrets
│       └── router-config.yaml     # Generated router config
├── data/                 # Generated datasets (output of helix-generate)
└── .vscode/
    └── launch.json
```
```
# Check llama.cpp
curl http://localhost:8081/health

# Check router
curl http://localhost:8889/health

# Check MCP
curl http://localhost:8000/mcp
```

Router takes a long time to start:
- Normal on first run: the router downloads ~1.5GB of ML models from HuggingFace
- Monitor progress: `docker logs -f vllm-sr-container`
Router crashes immediately:
```
# Check for config errors
docker logs vllm-sr-container 2>&1 | grep -E "(error|critical|fatal)"

# Common cause: invalid Envoy template syntax
# Check the generated config:
docker exec vllm-sr-container cat /etc/envoy/envoy.yaml
```

Gemini returns 404:
- Path prefix not applied – check `regex_rewrite` in the Envoy config
- Verify: `docker exec vllm-sr-container cat /etc/envoy/envoy.yaml | grep -A5 "regex_rewrite"`
Gemini returns 401/403:
- API key not injected – check the preprocessed config
- Verify: `cat config/.vllm-sr/processed_config.yaml | grep access_key`
- Ensure `.env` has a valid `GEMINI_API_KEY`
Connection reset by peer:
- Envoy crashed due to a config error
- Check: `docker logs vllm-sr-container 2>&1 | tail -50`
- Look for "Not supported field" or similar errors
```
# Verify HF_TOKEN
grep HF_TOKEN .env

# Manual download
huggingface-cli download bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF \
  Qwen_Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf \
  --local-dir ~/llama.cpp/models
```

Lower the threshold in .env:

```
TOOL_RAG_THRESHOLD=0.25
```

Check if score parsing is working. Scores >= 8.0 pass. The Reflector uses Gemini, which returns markdown scores like "Score: 8.5 / 10".
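A minimal sketch of such score parsing (hypothetical helper, not the project's actual parser; it tolerates markdown around the "Score:" marker):

```python
import re

def parse_judge_score(text):
    """Extract a numeric score from judge output like 'Score: 8.5 / 10'."""
    match = re.search(r"score:\s*(\d+(?:\.\d+)?)", text, re.IGNORECASE)
    return float(match.group(1)) if match else None
```

If this returns `None` on real Reflector output, the judge prompt or the parsing regex needs adjusting.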
Check routing headers:

```
curl -v -X POST http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MoM", "messages": [{"role": "user", "content": "your query"}]}' 2>&1 | grep "x-vsr"
```

Adjust routing rules in config/router_config.yaml:
- Increase `priority` for the decision you want to match
- Add more keywords to the signal
- Adjust embedding `threshold` values
Apache 2.0