Simulation backend implementation #73
Merged
- Engine now takes tools[] as an argument; the route loads them via Supabase so AA-cron writes to latency_p50_ms and cost_model are picked up.
- avgTokensPerRequest replaced with avgInputTokens + avgOutputTokens; TOKEN_DEFAULTS provides per-use-case defaults (RAG-heavy input, chatbot-heavy output).
- CostModel.cost_per_call lets per_call tools project a real rate; unparameterised per_call falls under unprojected_cost.
- /simulate/results renders 5 panels: cost-over-time SVG line chart with first-breaking-point marker, latency breakdown bars, breaking points cards with recommendations, kill conditions + switch-away triggers, and a share-URL button.
- lib/simulateUrl.ts encodes/decodes SimulationInput <-> query params (uc/u/r/in/out/llm/vec/fw) for shareable links; 7 tests cover round trips and invalid input.
- BreakingPoint gains an optional recommendation field, populated for latency / cost / architecture rules.
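
The query-param round trip described above could look roughly like the sketch below. The `SimulationInput` field names, the mapping of each short key to a field, and the validation rules are assumptions made for illustration; only the key names (uc/u/r/in/out/llm/vec/fw) come from the commit message, and the real lib/simulateUrl.ts may differ.

```typescript
// Hypothetical input shape; the real SimulationInput may have other fields.
interface SimulationInput {
  useCase: string;
  users: number;
  requestsPerDay: number;
  avgInputTokens: number;
  avgOutputTokens: number;
  llm: string;
  vectorDb?: string;
  framework?: string;
}

// Encode into the compact keys named in the commit (assumed mapping).
function encodeSimulation(input: SimulationInput): string {
  const p = new URLSearchParams();
  p.set("uc", input.useCase);
  p.set("u", String(input.users));
  p.set("r", String(input.requestsPerDay));
  p.set("in", String(input.avgInputTokens));
  p.set("out", String(input.avgOutputTokens));
  p.set("llm", input.llm);
  if (input.vectorDb) p.set("vec", input.vectorDb);
  if (input.framework) p.set("fw", input.framework);
  return p.toString();
}

// Accept only strictly positive integers written in plain digits.
function positiveInt(raw: string | null): number | null {
  if (raw === null || !/^\d+$/.test(raw)) return null;
  const n = Number(raw);
  return Number.isSafeInteger(n) && n > 0 ? n : null;
}

// Decode a query string; return null when any required field is missing or invalid.
function decodeSimulation(query: string): SimulationInput | null {
  const p = new URLSearchParams(query);
  const useCase = p.get("uc");
  const llm = p.get("llm");
  const users = positiveInt(p.get("u"));
  const requestsPerDay = positiveInt(p.get("r"));
  const avgInputTokens = positiveInt(p.get("in"));
  const avgOutputTokens = positiveInt(p.get("out"));
  if (
    !useCase || !llm || users === null || requestsPerDay === null ||
    avgInputTokens === null || avgOutputTokens === null
  ) {
    return null;
  }
  return {
    useCase, users, requestsPerDay, avgInputTokens, avgOutputTokens, llm,
    vectorDb: p.get("vec") ?? undefined,
    framework: p.get("fw") ?? undefined,
  };
}
```

Returning `null` on invalid input, rather than throwing, lets a page fall back to defaults when someone pastes a malformed share link; that choice is also an assumption here.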
- New /simulate page: use-case chip → scale sliders → stack picker, with smart defaults per use case and "Import from Genome" pre-fill from ?s= (slot-based mapping to llm/vectorDb/framework).
- LogSlider primitive for the log-scale sliders (users 1k-10M, tokens 500-50k) and a linear one for requests/day.
- ToolPicker dropdown filters LLMs to per_token cost_model so only modelable providers appear.
- Submits to /simulate/results with avgTokens split into in/out via the per-use-case TOKEN_DEFAULTS ratio. 6 tests cover the split and default-bound invariants.
- lib/simulateDelta.ts: pure computeDelta(current, shadow) returns per-scale cost delta, per-layer latency table, crossover users, annualised savings, and a 4-state verdict (switch_now / switch_above_X / latency_only / stick) with a human-readable line.
- Results page parses llm2/vec2/fw2 from URL, runs a second simulate with the shadow stack, computes delta, passes both to the client.
- 5 new components: CostDeltaChart (two-line overlay with shaded delta region), LatencyDeltaTable, BreakingPointDelta, SwitchVerdict card, ShadowStackForm (inline 3-picker that updates the URL via router.push).
- lib/simulateUrl.ts gains appendShadowStack / dropShadowStack / parseShadowStack; 5 new URL-helper tests + 6 delta engine tests cover all four verdict states.
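
To make the four verdict states concrete, here is a minimal sketch of how a pure delta function might classify a shadow stack against the current one. The commit does not spell out the decision rules, so the ones below (cheaper at every scale, cheaper from some scale upward, faster only, otherwise stick) are assumptions, as are the `ScalePoint` shape, the `computeDeltaSketch` name, and the choice to annualise savings at the largest scale.

```typescript
type Verdict = "switch_now" | "switch_above_X" | "latency_only" | "stick";

// Hypothetical per-scale snapshot; both stacks are simulated at the same user counts.
interface ScalePoint {
  users: number;
  monthlyCost: number;
  latencyMs: number;
}

interface DeltaSketch {
  costDelta: { users: number; delta: number }[]; // shadow minus current, per scale
  crossoverUsers: number | null;                 // first scale where shadow becomes cheaper
  annualisedSavings: number;                     // at the largest scale (assumed definition)
  verdict: Verdict;
}

function computeDeltaSketch(current: ScalePoint[], shadow: ScalePoint[]): DeltaSketch {
  if (current.length === 0 || current.length !== shadow.length) {
    throw new Error("both stacks must be simulated at the same, non-empty set of scales");
  }
  const costDelta = current.map((c, i) => ({
    users: c.users,
    delta: shadow[i].monthlyCost - c.monthlyCost,
  }));

  // Index of the first scale where the shadow stack is cheaper, or -1 if none.
  const firstCheaper = costDelta.findIndex((d) => d.delta < 0);
  // Require the shadow stack to stay cheaper at every larger scale too.
  const cheaperFromThereUp =
    firstCheaper >= 0 && costDelta.slice(firstCheaper).every((d) => d.delta < 0);
  const fasterEverywhere = current.every((c, i) => shadow[i].latencyMs < c.latencyMs);

  let verdict: Verdict;
  let crossoverUsers: number | null = null;
  if (cheaperFromThereUp && firstCheaper === 0) {
    verdict = "switch_now";
  } else if (cheaperFromThereUp) {
    verdict = "switch_above_X";
    crossoverUsers = costDelta[firstCheaper].users;
  } else if (fasterEverywhere) {
    verdict = "latency_only";
  } else {
    verdict = "stick";
  }

  const top = costDelta[costDelta.length - 1];
  return { costDelta, crossoverUsers, annualisedSavings: -top.delta * 12, verdict };
}
```

Keeping the function pure (no I/O, no URL access) is what allows a results page to call it for both the server render and the tests with the same inputs.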
Replaces the 3-step gated flow + separate /simulate/results page with one /simulate route. Inputs sit in a sticky sidebar; results redraw instantly via the pure simulate() engine as sliders move (no API round-trip). URL stays in sync (debounced 300ms) so simulations remain shareable.

Token UX:
- "Avg tokens" single-slider replaced with explicit input + output sliders.
- Row of named presets (Quick chat, Long Q&A, RAG retrieval, Agent loop, Code generation) so users pick something concrete instead of guessing a number.

Chart hover (CostChart + CostDeltaChart):
- Hover bands per scale step snap a vertical guide line to the nearest data point.
- SVG-native tooltip shows user count, total cost, per-tool breakdown, latency; flips left near the right edge.

Wiring:
- /simulate/results route deleted; ShadowStackForm now takes callbacks instead of pushing to URL (parent owns the URL).
- ScaleStep/UseCaseStep/StackStep stripped of headings; the parent panel renders section labels.
- splitTokens helper removed (no longer needed); SCALE_BOUNDS gains per-direction tokens; SCALE_DEFAULTS now references TOKEN_DEFAULTS.
… panels

Engine
- LLM latency = TTFT + (output_tokens / tokens_per_second). TTFT and throughput are separate fields; the AA cron now syncs both.
- Prompt caching: per-token cost blends cached vs uncached input by cacheHitRate (90% off cached); batch pricing blends real-time and batch endpoints by batchPct (50% off).
- Vector DB cost: new per_vector_query cost model with storage_cost_per_gb_month + query_cost_per_million + min_monthly_cost. Engine projects storage from vectorCount × bytes_per_vector and queries from monthlyRequests.
- Embedding cost: RAG paths add per-query embedding cost (default text-embedding-3-small rate when no embedding tool selected).
- Eval cost: per_event cost model for Langfuse/Helicone/Braintrust.
- Model routing: optional routerCheapLlm + routerCheapPct splits LLM cost across two models.
- Rate-limit modeling: max_tpm + max_rpm + peakToAverageRatio surface a rate-limit breaking point with a tier-upgrade recommendation.
- Per-snapshot output adds costPerRequest, costPerUser, costByLayer, and per-stage latencyByStage (guardrails / embedding / vector / ttft / generation / framework).
- Result gains a bottleneck verdict (cost / latency / rate_limit / balanced / none) with a one-line diagnosis.

UI
- /simulate gains new sliders for cache-hit %, batch %, stored vectors (RAG only) and new pickers for eval and guardrails.
- New panels: Bottleneck verdict, Unit economics (cost/mo, /req, /user, /year), Cost composition stacked bar, Provider comparison table (re-runs simulate per per-token LLM, click to switch primary).
- LatencyBreakdown renders the six-stage split with share-of-total.
- CostChart hover tooltip now shows latency in addition to per-tool cost breakdown.

Data
- New columns: ttft_p50_ms, output_tokens_per_second, max_tpm, max_rpm, bytes_per_vector. AA cron writes ttft + throughput alongside cost_model.
- tools.json: 7 AA LLMs backfilled with realistic 2026 TTFT, throughput, cached/batch pricing, and rate limits; Pinecone/Weaviate/Turbopuffer switched from usage_based to per_vector_query; Langfuse/Helicone/Braintrust switched to per_event with published per-observation rates.

URL
- New compact params: cr (cacheHitRate), bp (batchPct), vc (vectorCount), et (embeddingTokensPerQuery), pk (peakRatio), em (embedding), ev (eval), gd (guardrails), llmC + rcp (router).

Heads-up: requires `make db-push && make seed` to populate the remote Supabase with the new columns and values; without that the /simulate page reads NULLs and falls back to default throughput (50 tok/s) for every LLM. The symptom is uniform ~10s latency across providers.
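
The latency formula and the caching/batch blends from the Engine notes can be sketched as below. The function names are hypothetical, cacheHitRate and batchPct are assumed to be fractions in [0, 1], and applying the batch discount to a whole per-request cost is an assumption; only the formula, the 90% and 50% discounts, and the 50 tok/s fallback come from the commit message.

```typescript
const DEFAULT_TOKENS_PER_SECOND = 50; // fallback named in the heads-up when the column is NULL
const CACHED_INPUT_DISCOUNT = 0.9;    // "90% off cached"
const BATCH_DISCOUNT = 0.5;           // "50% off"

// LLM latency = TTFT + (output_tokens / tokens_per_second), in milliseconds.
function llmLatencyMs(
  ttftMs: number,
  outputTokens: number,
  tokensPerSecond: number | null,
): number {
  const tps = tokensPerSecond ?? DEFAULT_TOKENS_PER_SECOND;
  return ttftMs + (outputTokens / tps) * 1000;
}

// Blend uncached and cached input pricing by the cache hit rate (assumed fraction 0..1).
function blendedInputPrice(pricePerToken: number, cacheHitRate: number): number {
  const cachedPrice = pricePerToken * (1 - CACHED_INPUT_DISCOUNT);
  return (1 - cacheHitRate) * pricePerToken + cacheHitRate * cachedPrice;
}

// Blend real-time and batch pricing by the share of traffic sent to batch (assumed fraction 0..1).
function applyBatchBlend(cost: number, batchPct: number): number {
  return (1 - batchPct) * cost + batchPct * cost * (1 - BATCH_DISCOUNT);
}
```

As an illustration of the heads-up: with throughput missing, `llmLatencyMs(0, 500, null)` uses the 50 tok/s fallback and gives 10 000 ms of generation time for any provider, which is consistent with the uniform latency symptom described there (the 500-token figure is only an example).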
…ggers
- LATENCY_CEILING_MS is now use-case-aware: chatbot 8s, RAG 12s, agent 25s, custom 10s. The old 2s ceiling fired for almost every realistic 2026 stack (Sonnet at 46 tok/s producing 600 output tokens takes 13s before anything else).
- New TTFT-slow breaking point fires when ttft_p50_ms > 1500ms; it is decoupled from total length since streaming hides generation time.
- Bottleneck verdict text drops the hardcoded "2s" and references the use-case comfort ceiling instead.
- KillConditionsPanel's "Switch away when..." list now only includes rate_limit and latency triggers; cost milestones and LLM-dominance signals are tuning advice (caching, routing, batch) and stay in the Breaking points panel instead.
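
A small sketch of the two latency checks described above, using the ceilings and the 1500 ms TTFT threshold from the commit message. The `UseCase` key spellings, the function name, the returned labels, and the strict `>` comparison for the ceiling are assumptions.

```typescript
type UseCase = "chatbot" | "rag" | "agent" | "custom";

// Ceilings from the commit message: chatbot 8s, RAG 12s, agent 25s, custom 10s.
const LATENCY_CEILING_MS: Record<UseCase, number> = {
  chatbot: 8_000,
  rag: 12_000,
  agent: 25_000,
  custom: 10_000,
};

const TTFT_SLOW_MS = 1_500;

// Return the latency-related breaking points that fire (labels are hypothetical).
function latencyBreakingPoints(
  useCase: UseCase,
  ttftMs: number,
  totalLatencyMs: number,
): string[] {
  const fired: string[] = [];
  if (totalLatencyMs > LATENCY_CEILING_MS[useCase]) fired.push("latency_ceiling");
  // Checked independently of total latency, since streaming hides generation time.
  if (ttftMs > TTFT_SLOW_MS) fired.push("ttft_slow");
  return fired;
}

// Worked example from the commit: 600 output tokens at 46 tok/s.
const generationMs = (600 / 46) * 1000; // about 13 043 ms
```

With that generation time alone, the chatbot (8s) and RAG (12s) ceilings are exceeded while the agent ceiling (25s) is not, and the old 2s ceiling would have fired in every case.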
Clears all high-severity npm audit findings the CI gate was failing on (npm audit --audit-level=high now exits 0):
- next 16.2.4 → 16.2.6: fixes the high-severity SSRF in WebSocket upgrades, middleware/proxy bypasses, cache poisoning, and CSP-nonce XSS issues (GHSA-c4j6-fc7j-m34r and friends).
- @hono/node-server, hono, fast-uri, ip-address, express-rate-limit: picked up via the dependency tree refresh; resolves the high-severity fast-uri path-traversal/host-confusion CVE (GHSA-q3j6-qgpj-74h6, GHSA-v39h-62p7-jpjc).

Five moderate findings remain, all tracing to the postcss copy bundled inside next 16.2.x. The only fixes available today are either next 16.3.x (not yet released as stable) or downgrading @vercel/speed-insights to 1.0.4 (breaking change), so these are left for the next routine bump.