All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- CI workflow reverted to strict mode: tag pushes run the full `quality` job before `docker`. v2.15.2 had skipped `quality` on tag pushes to avoid duplicate CI runs, but that left a security gap: a tag pointing at an unvalidated commit (e.g. `git tag v9.9.9 some-sha` directly) could trigger a Docker push without going through type check / lint / tests. Each release now runs `quality` twice (once on branch push, once on tag push) but guarantees Docker images are only built from validated commits
- CI: lint failures on `mainStarted`/`warmupCount` (unused locals in `benchmarkEngine.run.test.ts`) and `DEFAULT_CONFIRM_DELAY_MS` (unused fallback in `alertNotifier.ts`). Renamed the latter to `_DEFAULT_CONFIRM_DELAY_MS` and dropped the test locals; verification was already covered by `expect(execute).toHaveBeenCalledTimes(...)`
- CI workflow: `quality` job now skips on tag pushes (the same commit was already validated on the branch push). Eliminates the duplicate CI runs that fired on every release: one for `push branches:main`, one for `push tags:v*`. The `docker` job still triggers on tag pushes and no longer depends on `quality` (the underlying commit was already validated)
- CI: `supertest` and `@types/supertest` moved from root `package.json` to `backend/package.json`. Local tests passed because TypeScript resolves up the directory tree to root `node_modules`, but CI installs each sub-package independently and could not find the module. Type check now passes in CI
- Alert confirmation now uses K-of-N voting instead of "N consecutive failures or one ok abandons cycle". A new `alertConfirmFailThreshold` setting (default `N - 1`, e.g. 4-of-5) controls how many of the N attempts must fail to fire an alert. A single transient ok no longer drops the entire confirmation chain, which fixes the case where a flaky upstream that returns one healthy response between failures suppressed real outage alerts for 30+ minutes
- Confirmation cycles now exit early in both directions: the alert fires the moment failCount reaches the threshold (no longer waits for the full N attempts), and the cycle is abandoned the moment failThreshold becomes mathematically unreachable
- Health-check probe timeout reduced from 180 s (streaming) / 120 s (non-streaming) to 90 s for monitor probes and the "Test Connection" endpoint. Playground/benchmark calls retain their longer timeouts
- Confirmation probes within the same provider now run in parallel (independent API calls). Previously serialized — a single hung confirmation could delay every other model's confirmation in the same minute
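The K-of-N rule with early exit in both directions can be sketched roughly as below. This is only an illustration of the decision math described above, not the project's code: `decideConfirmation`, its parameters, and the `Decision` values are hypothetical names chosen for the example.

```typescript
// Hypothetical sketch of K-of-N voting with early exit (names are illustrative).
type Decision = "fire" | "abandon" | "continue";

function decideConfirmation(
  failCount: number, // failed confirmation attempts so far
  okCount: number, // healthy confirmation attempts so far
  n: number, // total attempts allowed in one cycle (N)
  k: number, // failures required to fire the alert (K)
): Decision {
  // Fire as soon as the threshold is reached; no need to use all N attempts.
  if (failCount >= k) return "fire";
  // Abandon as soon as K failures can no longer be reached.
  const remaining = n - failCount - okCount;
  if (failCount + remaining < k) return "abandon";
  return "continue";
}
```

With a 4-of-5 setting, one healthy response among three failures still leaves the cycle running, while a second healthy response makes the threshold unreachable and ends it.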
- Race in confirmation queue: between `pendingConfirmations.delete(key)` and the re-add after `await confirmProbe`, the dedup gate `has(key)` returned false, allowing a scheduled probe landing in that 60-180 s window to spawn a duplicate parallel confirmation cycle. Added an `inFlight` token map so the gate covers the await window, and any cycle whose token is overwritten/cleared mid-await drops its result
- Recovery alert no longer leaves a zombie down cycle in flight: when a scheduled probe sees `healthy`/`slow` after `down`, the recovery alert now explicitly cancels any pending or in-flight confirmation for that target, preventing a delayed redundant down alert
- Webhook delivery failures now retry instead of silently consuming the alert (bug #4): when the Feishu webhook returns non-2xx, `sendFeishuAlert` throws instead of just logging. The fire path catches the throw and re-queues the confirmation cycle rather than recording `lastAlertAt`; previously a failed delivery still recorded the alert, suppressing all retries for 6 hours
- `useMonitor.saveConfig` no longer "phantom-saves" (bug #8): UI no longer mirrors the new config into local state on non-2xx responses. Returns `boolean` so callers can detect failures
- `usePlaygroundHistory` no longer "phantom-deletes" (bug #2): `deleteEntry` and `clearAll` now check `res.ok` before mutating local state, so failed server deletions no longer hide items locally
- `useWorkflow` mutations surface server errors (bug #1): `cancelWorkflow`/`deleteWorkflow`/`duplicateWorkflow` now check `res.ok` and propagate the server's error message into `state.error` instead of silently returning false/null
- `useBenchmark` rejects malformed responses (bug #3): `fetchBenchmarks` validates that the body is an array, `fetchBenchmark` validates it's a plain object. Non-2xx and shape mismatches set an error rather than polluting React state with `{error:'…'}` placeholders
- `PUT /api/monitor/targets` now accepts an empty array (bug #7): `MonitorTargetsArraySchema` dropped `.min(1)`, letting users clear the monitor list entirely
- `startWorkflow` correctly toggles `isRunning` (bug #6): set `true` at the start of the try-block so the catch-branch's `setIsRunning(false)` is no longer a no-op
- `providerStore.create/update` reject duplicate model id/name within a provider (bug #5): collisions previously corrupted monitor target tracking. Routes return 400 with the conflict message
- New monitor config field `alertConfirmFailThreshold` (range 1-20, clamped to `[1, alertConfirmCount]` server-side)
- Settings UI exposes the K threshold as a `K / N` selector that auto-adjusts options when N changes
- Comprehensive test coverage expansion: 906 total tests (712 backend + 194 frontend) covering alert state coordination, K-of-N decision math, multi-provider streaming token fields, route HTTP semantics via supertest, store CRUD with sqlite migrations, and full executeWorkflow integration with real benchmarkEngine
- "Writing tests" discipline section in `CLAUDE.md` capturing the lesson from the May 2026 reverse-review: 8 bugs were silently rationalized by tests that matched current behavior instead of expected behavior
- Configurable alert confirmation: number of consecutive failures (default 5, range 1-20) and delay between checks (default 1 min, range 1-60) before sending alerts, replacing the previous fixed single 1-minute re-check
- Monitor settings UI exposes confirm count and confirm delay alongside language and reminder interval
- Alert reminder interval ignored: every save of monitor settings was wiping `last_alert_at` because `setTargets`/`addTarget` rebuilt the row without preserving the column, so reminders fired roughly every probe interval instead of every 6 hours
- Down/very_slow status oscillation triggered spurious "new failure" alerts instead of reminders; `wasDown` now treats both as the same down state
- Backend dev watcher missed source edits made by atomic-replace writes (inode changes); switched from `tsx watch` to `nodemon --legacy-watch` polling
- Frontend dev watcher hardened with `usePolling` for parity
- PUT `/api/monitor/config` silently dropped `alertConfirmCount` and `alertConfirmDelayMinutes` from the request body, so UI changes were not persisted
- Alert confirmation probe now records a ping on error (previously failed probes left no DB trace) and re-queues on transient failures instead of silently dropping the confirmation
- Alert confirmation: down/reminder alerts now require a second probe after 1 minute to reduce false positives
- Recovery alerts are still sent immediately without confirmation
- Switch docker-compose.yml to use Docker Hub image (`idemerge/llm-api-bench`)
- Remove unused variables flagged by code quality analysis
- Full i18n support with Chinese/English language switcher (react-i18next)
- Feishu webhook alert notifications for monitor
- Per-target alert enable/disable toggle
- Status change detection: new failure, repeated failure (configurable interval), recovery
- DB-persisted alert state (survives restarts)
- Optional webhook signature verification
- Configurable notification language (en/zh, default en)
- Alert bell indicator on monitor model cards (color-coded by health status)
- All hardcoded UI strings replaced with i18n translation keys
- Monitor settings modal now includes alert configuration section (webhook URL, secret, language, reminder interval)
- Touch targets undersized: removed `size="small"` from Settings buttons, increased model tag padding
- Heading scale too flat: increased H1 from 20px to 24px
- Capability tags (T/S/V) nearly illegible: increased font from 8px to 10px with larger padding
- Mobile parameter labels overflow: responsive grid for Core Parameters section
- Playground history panel overlaps form on mobile: full-screen overlay on mobile
- Grammar: "1 models" now correctly pluralized across Monitor and History pages
- antd deprecation: replaced Alert `message` prop with `title` (5 instances)
- History page duplicate heading: removed redundant H2 title (topbar already shows page name)
- Naming validation rules for Provider name, Model ID, and DisplayName (backend + frontend)
- Provider name: alphanumeric/dash/underscore, no spaces, 1-64 chars
- Model ID: alphanumeric/dash/underscore/dot/slash, 1-64 chars (LiteLLM compatible)
- DisplayName: alphanumeric/space/dash/underscore/dot, 1-64 chars
- Frontend real-time validation with error hints on Settings provider form
- Frontend validation unit tests (16 cases)
- Backend validation boundary tests (4 cases)
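The three naming rules above could be expressed as regular expressions along these lines. This is a sketch under assumptions: the constant and function names are hypothetical, and "alphanumeric" is taken to mean ASCII letters and digits, which the changelog does not state.

```typescript
// Hypothetical patterns for the naming rules; "alphanumeric" is assumed to be ASCII.
const PROVIDER_NAME_RE = /^[A-Za-z0-9_-]{1,64}$/; // no spaces
const MODEL_ID_RE = /^[A-Za-z0-9_.\/-]{1,64}$/; // allows dot and slash
const DISPLAY_NAME_RE = /^[A-Za-z0-9 _.-]{1,64}$/; // allows spaces

function isValidProviderName(value: string): boolean {
  return PROVIDER_NAME_RE.test(value);
}

function isValidModelId(value: string): boolean {
  return MODEL_ID_RE.test(value);
}

function isValidDisplayName(value: string): boolean {
  return DISPLAY_NAME_RE.test(value);
}
```

Anchoring each pattern with `^` and `$` and bounding the length inside the quantifier keeps the 1-64 character limit and the character set in a single check.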
- Renamed project from LLM API Radar to LLM API Bench (repo, UI, docs, Docker image)
- Playground history sidebar now shows `ProviderName/DisplayName` instead of raw model ID
- Backend stores model displayName in playground history for friendly display
- Adaptive QuickButtons sizing: auto-shrink when >7 options to prevent line wrapping
- Getting Started hint no longer flashes on page refresh (waits for data load)
- Playground provider/model selectors no longer flash raw IDs before names load
- Playground history correctly resolves model displayName from provider data
- Raised max concurrency from 1000 to 5000 (frontend InputNumber, backend validation schemas, route caps)
- Raised max iterations from 1M to 10M (frontend InputNumber, backend validation schemas, route caps)
- Added quick-select buttons for 2K/5K concurrency and 5M/10M iterations
- Updated README (EN/CN) with corrected `cd` path and new concurrency/iterations limits
- Fixed Quick Start instructions: `cd llm-benchmark` → `cd llm-api-bench`
- Demo mode now masks vendor-prefixed model names (e.g. `z-ai/glm-4.7` → `ProviderX/glm-4.7`) and workflow `providerSummaries`, sharing a single id-stable letter namespace across providers and vendors
- Masking is fully applied at the React hook fetch boundary (`useWorkflow`, `useMonitor`, `usePlaygroundHistory`, `useProviders`); the legacy DOM regex redactor is now a deprecated no-op safety net
- Regenerated all 6 README screenshots and `docs/demo.gif` under `VITE_DEMO_MODE=true`
- Workflow result table no longer leaks raw provider names through `summary.providerSummaries[*].provider` (previously masked only by the DOM regex layer)
- Sensitive info redaction module (`scripts/redact-sensitive.mjs`) for screenshots and GIF recording: provider names, API URLs, and keys are automatically replaced with generic labels
- Screenshot script (`take-screenshots.mjs`) now calls `redactPage()` before each capture
- Demo recorder (`record-demo.mjs`) installs a persistent `MutationObserver` to redact text as React re-renders during screencast
- Playground: disable image upload button for non-vision models and clear uploaded images when switching to a non-vision model
- Workflow SSE: fix race condition where `activeRunIdRef` was cleared after `fetchWorkflow`, causing stale state; now fetches final workflow state directly before clearing ref
- Regenerated all 6 screenshots and demo GIF with redacted sensitive information
- Removed `prettier` from frontend and backend devDependencies (unused)
- Workflow page now shows the same Mission Control header (status, duration, edit) and Live Metrics strip (avg RT, TPS, last RT) + cooldown countdown that History Detail had, exposed via `liveMetrics` and `cooldown` from `useWorkflow`
- New shared `WorkflowHeader` component used by both the active Workflow page and History Detail
- Refactored `HistoryDetailPage` to compose `WorkflowHeader` instead of duplicating header markup (~280 line reduction)
- Tightened pre-commit lint gate: `lint-staged` now runs `eslint --max-warnings 0` on staged frontend files
- CI lint failure on v2.10.1: removed empty `catch {}` block in `HistoryDetailPage` and silenced react-hooks warnings via targeted disables (no behavior change)
- Various react-hooks lint warnings across `ConfigPanel`, `MonitorPage`, `PlaygroundPage`, `WorkflowConfigPanel`, `WorkflowProgress`
- Raised concurrency limit from 200 to 1000 and iterations limit from 2000 to 1M (frontend InputNumber + backend Math.min caps)
- Updated quick-select buttons: concurrency adds 500 and 1K options, iterations adds 10K, 100K, and 1M options
- Workflow name inline editing with PATCH endpoint and edit UI in History Detail header
- Running workflow "Mission Control" experience: live metrics strip (avg RT, TPS, last RT), cooldown countdown timer between tasks, real-time elapsed timer
- Completed workflow stat-card dashboard: Duration, Tokens, Best Avg RT, Success Rate, Total T/s in a 6-column grid
- History list redesign: colored status icons, config chips (concurrency × iterations × tokens + cache rate + stream), dedicated Models column with provider-colored tags, Duration and Tokens columns
- Monitor Settings as Modal dialog (replaces inline collapsible panel) with scrollable Targets area
- CSS design system additions: `stat-card`/`stat-value`/`stat-label`, `section-header` with color variants, `running-card-glow` animation, `running-row-active` styling, Ant Design overrides for tables, tooltips, and popconfirm
- History Detail running state: animated amber border glow, live metrics from SSE `latestResults`, per-task completed summaries showing fastest RT and highest TPS providers
- History Detail completed state: stat-card grid replaces flat text metrics for visual impact
- History Panel: complete rewrite with richer row content and consistent visual hierarchy
- Monitor: summary bar uses `stat-card` CSS class, threshold inputs use Ant Design `InputNumber`, chart tooltip uses CSS variables, removed redundant tok/s display, unified TTFT/TPS status coloring
- Playground: MetricsRow uses `stat-card` with provider-colored accents, provider label uses `getProviderColor`
- WorkflowResults: removed bar charts (MetricBarChart, TaskCharts) for a cleaner table-only layout
- WorkflowProgress: added live metrics strip, cooldown timer, elapsed timer, completed task summary pills
- WorkflowConfigPanel: cache hit rate input width narrowed for compact layout
- Output Scope selector for long-context presets (16K/64K/150K/256K): controls how many documents the model reads, limiting output length (~500 tokens for 3 docs, unlimited for All docs)
- Output Scope available in Benchmark, Workflow, and Playground pages with persistent selection via localStorage
- Input/Output/Total throughput metrics in Workflow Detail: calculated as concurrency × avg tokens per request / avg response time
- Throughput columns (In T/s, Out T/s, Total T/s) in provider comparison tables
- Throughput summary in workflow header and results summary bar
- Tooltips on all metric labels, table column headers, and parameter controls across all pages (WorkflowResults, ResultsPanel, ConfigPanel, PlaygroundPage, HistoryDetailPage)
- Long-context 64K preset prompt suffix updated to support configurable output scope
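The throughput formula above (concurrency × avg tokens per request / avg response time) works out as in the following sketch. The function and field names are hypothetical, and the response time is assumed to be in seconds so that the result reads as tokens per second.

```typescript
// Hypothetical helper; assumes avgResponseTimeSec is expressed in seconds.
interface Throughput {
  inputTps: number;
  outputTps: number;
  totalTps: number;
}

function estimateThroughput(
  concurrency: number,
  avgInputTokens: number,
  avgOutputTokens: number,
  avgResponseTimeSec: number,
): Throughput {
  // Guard against division by zero when no timing data is available.
  if (avgResponseTimeSec <= 0) {
    return { inputTps: 0, outputTps: 0, totalTps: 0 };
  }
  const inputTps = (concurrency * avgInputTokens) / avgResponseTimeSec;
  const outputTps = (concurrency * avgOutputTokens) / avgResponseTimeSec;
  return { inputTps, outputTps, totalTps: inputTps + outputTps };
}
```

For example, 10 concurrent requests averaging 4000 input and 100 output tokens with a 2 s average response time give 20000 In T/s, 500 Out T/s, and 20500 Total T/s.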
- Workflow task editor: duplicate button to clone an existing task with all its configuration
- Duplicating, deleting, or reordering tasks now correctly preserves heavy prompts (>10K chars) instead of silently truncating them
- History list: show concurrency and iteration count columns
- History detail: show input/output token counts and ratio (In:Out)
- History detail: real-time iteration progress bar for running workflows via SSE
- Backfill input/output token stats for older workflows on first access
- Long context preset prompts: balanced for ~40:1 input-to-output token ratio with "Don't overthink this" guidance
- Cache hit rate: reduced sliding window from concurrency-sized (e.g. 50) to fixed 5, keeping KV cache memory pressure realistic for large prompts
- Cache hit rate: reuse now picks from a sliding window of recent prefixes (sized to concurrency) instead of the entire pool, avoiding stale entries that inference engines (SGLang, vLLM) may have evicted under memory pressure
- Cache hit rate: replaced fixed-K-prefixes + shuffle with per-request Bernoulli scheduling — each request independently rolls miss/hit with the target probability, producing a uniform distribution throughout the run instead of clustering all misses at the start
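Per-request Bernoulli scheduling can be illustrated with a short sketch. The function name and the injectable `random` parameter are assumptions made for the example, not the engine's actual interface.

```typescript
// Hypothetical sketch: each request independently rolls hit (true) or miss (false).
function bernoulliSchedule(
  iterations: number,
  hitRate: number, // target cache hit probability in [0, 1)
  random: () => number = Math.random, // injectable for deterministic tests
): boolean[] {
  const schedule: boolean[] = [];
  for (let i = 0; i < iterations; i++) {
    // A roll below the target rate reuses a cached prefix; otherwise it is a miss.
    schedule.push(random() < hitRate);
  }
  return schedule;
}
```

Because every roll is independent, misses are spread across the run in expectation, although the realized hit rate of any single run only approximates the target.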
- Cache hit rate: `targetCacheHitRate` was silently dropped by both the benchmark and workflow route handlers; the field was validated but never passed to the engine, so the feature had no effect
- Cache hit rate: prefix size now adapts to prompt length (~5%, clamped 128–4096 chars) to avoid inflating short prompts — previously a fixed ~4 KB prefix would double a 1K-token input
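The adaptive prefix sizing amounts to a clamp, sketched below. The function name is hypothetical and rounding to the nearest character is an assumption.

```typescript
// Hypothetical sketch: prefix is ~5% of the prompt, clamped to 128-4096 characters.
function adaptivePrefixSize(promptLength: number): number {
  const target = Math.round(promptLength * 0.05);
  return Math.min(4096, Math.max(128, target));
}
```

Short prompts bottom out at the 128-character floor, and very long prompts are capped at 4096 characters.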
- Cache hit rate: replaced short UUID prefix (~5 tokens) with ~1024-token random prefix to reliably bust block-level KV cache on inference engines (vLLM, SGLang, etc.)
- Cache hit rate: replaced round-robin variant assignment with Fisher–Yates shuffled schedule so cache misses are spread evenly across the run instead of clustered at the start
- Cache hit rate control (`targetCacheHitRate`): prepends unique UUID prefixes to each request to simulate realistic multi-user traffic with configurable prefix-cache hit rate (0–99%). Available as a toggle + percentage input in the WorkflowConfigPanel Advanced section. Formula: K = iterations × (1 − rate) unique variants, cycled round-robin.
- Raised concurrency limit from 50 to 200 and iterations limit from 1000 to 2000
- Replaced batch-based concurrency with sliding-window worker pool to maintain steady in-flight request count — previously, requests that completed early left slots idle causing actual concurrency to drop over time; now a new request starts immediately whenever one finishes
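A sliding-window worker pool of the kind described above can be sketched as follows. This is a generic illustration, not the benchmark engine's implementation; `runWithWorkerPool` and its parameters are hypothetical names.

```typescript
// Hypothetical sketch: a fixed number of workers each pull the next index as soon
// as their current task finishes, so the in-flight count stays steady.
async function runWithWorkerPool<T>(
  total: number,
  concurrency: number,
  task: (index: number) => Promise<T>,
): Promise<T[]> {
  const results: T[] = new Array(total);
  let next = 0; // shared cursor; safe because JavaScript runs callbacks one at a time

  async function worker(): Promise<void> {
    while (next < total) {
      const index = next++;
      results[index] = await task(index);
    }
  }

  const workerCount = Math.min(concurrency, total);
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}
```

Unlike batching, no slot waits for the slowest request in its batch: a finished worker immediately claims the next index.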
- Long Context 150K preset: a new built-in prompt preset (~150,000 tokens) bridging the gap between the existing 64K and 256K presets. Available in ConfigPanel, WorkflowConfigPanel, and PlaygroundPage. Loaded on demand via dynamic import to avoid bundle size impact.
- `maxQps` parameter for workflow tasks: global token bucket rate limiting across all concurrent slots. Set to a positive integer to cap requests per second; `0` means unlimited. Available in the WorkflowConfigPanel Advanced section with quick-select buttons (Off / 1 / 5 / 10).
- Token bucket implementation in the benchmark engine with cancellation support; rate limiting integrates cleanly with existing cancel flow and does not affect `requestInterval` or `concurrency` behavior.
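A token bucket shared by all concurrent slots might look roughly like the sketch below. It is a simplified, synchronous illustration with an injected clock; the class name, the burst capacity equal to `maxQps`, and the omission of cancellation handling are all assumptions of the example rather than details of the engine.

```typescript
// Hypothetical sketch of a token bucket; time is passed in so behavior is deterministic.
class TokenBucket {
  private readonly maxQps: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(maxQps: number, nowMs: number) {
    this.maxQps = maxQps;
    this.tokens = maxQps; // assumption: bucket starts full, capacity equals maxQps
    this.lastRefillMs = nowMs;
  }

  // Returns true if a request may start now, false if the caller should wait.
  tryAcquire(nowMs: number): boolean {
    if (this.maxQps <= 0) return true; // 0 means unlimited
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.maxQps, this.tokens + elapsedSec * this.maxQps);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Since every slot draws from the same bucket, the cap applies to the task as a whole rather than per slot.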
- Eliminated all hardcoded secrets: JWT secret, encryption key, and salt are now auto-generated and persisted to `data/` directory
- Force password change on first login with default credentials (`changeme`)
- Restricted CORS to configured origin (default: same-origin only)
- Added login rate limiting (5 attempts per 5 minutes per IP)
- Added Helmet security headers with Content Security Policy
- Moved auth verify and change-password endpoints behind authentication middleware
- Replaced JWT-in-query-string with short-lived one-time tokens for SSE and download URLs
- New password must differ from current password when changing
- JWT token storage moved from `localStorage` to `sessionStorage`
- Zod schema validation for all API request bodies with descriptive error messages
- `ProviderConfigUpdateSchema` for partial provider updates
- `POST /api/auth/change-password` endpoint
- `POST /api/auth/sse-token` endpoint for one-time token exchange
- Shared SQLite connection singleton with WAL mode and busy timeout
- CSV escaping utility to prevent injection in exports
- Express Request type augmentation (`req.user`)
- 5 new test suites: encryption, auth middleware, validation schemas, benchmark engine, store sync (65 backend tests total)
- SQLite-first write pattern across all stores: DB writes before in-memory Map updates to prevent inconsistency on failures
- Encryption migration runs synchronously before server startup to prevent race conditions
- Monitor cleanup runs daily at 3am with 7-day retention
- `cancelledRuns` cleanup in benchmark engine on both success and error paths
- PRAGMA `table_info` migration pattern replaces try/catch `ALTER TABLE`
- `apiKeys` field in benchmark and workflow schemas is now optional with empty default
- `supportsVision`/`supportsTools` use nullish coalescing (`??`) instead of logical OR
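The difference between `??` and `||` matters for boolean flags because an explicit `false` is falsy but not nullish. The helper names and the fallback value below are hypothetical, used only to show the behavior.

```typescript
// Hypothetical helpers contrasting logical OR with nullish coalescing.
function resolveWithOr(value: boolean | undefined, fallback: boolean): boolean {
  // An explicit false is treated as "missing" and replaced by the fallback.
  return value || fallback;
}

function resolveWithNullish(value: boolean | undefined, fallback: boolean): boolean {
  // Only undefined (or null) falls through to the fallback.
  return value ?? fallback;
}
```

With a fallback of `true`, `resolveWithOr(false, true)` yields `true`, while `resolveWithNullish(false, true)` preserves the stored `false`.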
- Encryption migration race condition: server could accept requests before migration completed
- `store.delete()` violated SQLite-first pattern (deleted Map before DB)
- `PUT /api/providers/:id` had no input validation
- Monitor error responses returned HTTP 200 instead of 500
- Provider test endpoints could crash the server on connection failure (now returns 502)
- Workflow error recovery could overwrite cancellation status
- Frontend infinite re-render loop on History page with running workflows
- EventSource not cleaned up on component unmount in useWorkflow hook
- Missing `pageConfig` fallback for unknown routes
- Redundant `method`/`action` attributes on login form
- Dead code: `backend/src/services/store-old.ts`
- Merged Docker publish into CI pipeline — Docker image build now requires all quality checks to pass first
- CI workflow also triggers on version tags so quality gate runs before Docker push
- CI lint warnings: cleaned up unused imports/variables across frontend and backend
- Upgraded GitHub Actions to v5 (Node.js 24 compatible)
- Root docs (CHANGELOG, README, docker-compose) excluded from Prettier formatting
- lint-staged glob expanded to cover config `.js` files
- Vitest test framework with initial test suites (frontend: tokenCount, MonitorPage helpers; backend: MonitorStore CRUD)
- Prettier for consistent code formatting across frontend and backend
- ESLint for backend (flat config, typescript-eslint)
- Husky pre-commit hook with lint-staged (auto-format on commit)
- GitHub Actions CI pipeline: typecheck, lint, format check, tests, build
- `scripts/release.sh` for automated version bump, changelog update, tag, and push
- 11 TypeScript type errors across the codebase (ResultsPanel JSX, App.tsx, WorkflowResults, etc.)
- CI error messages now show specific failures and actionable fix instructions
- Monitor trend charts: expandable TTFT, TPS, and Latency time-series graphs per model with 1h/6h/24h range selector and threshold reference lines
- Playground history: clicking a failed history entry no longer crashes (undefined metrics guard)
- Playground history: selecting a record whose provider was deleted no longer causes a blank screen (graceful fallback to empty provider selection)
- Monitor health classification now based on TPS (tokens per second) instead of raw latency
- Monitor probe prompt upgraded to generate longer responses for accurate TPS measurement
- Provider deletion cascades to monitor targets cleanup
- Provider model rename auto-syncs monitor targets (preserves monitoring config)
- Anthropic streaming fallback: read `input_tokens` from `message_delta` for LiteLLM compatibility
- Monitor health thresholds: `latencySlowMs`/`latencyVerySlowMs` replaced with `tpsSlowThreshold` (default 20) / `tpsVerySlowThreshold` (default 5)
- Monitor UI tooltips show TPS instead of latency as primary metric
- Workflow templates default `warmupRuns` changed from 2 to 0
- Quick Benchmark and Workflow config default `warmupRuns` changed from 2 to 0
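TPS-based health classification with the default thresholds above can be sketched as below. The function name, the status labels, and the use of strict less-than at the boundaries are assumptions for illustration; the changelog does not specify how boundary values are treated.

```typescript
// Hypothetical sketch of TPS-based classification using the documented defaults.
type Health = "healthy" | "slow" | "very_slow";

function classifyByTps(
  tps: number,
  tpsSlowThreshold: number = 20,
  tpsVerySlowThreshold: number = 5,
): Health {
  // Assumption: a value exactly at a threshold counts as the better status.
  if (tps < tpsVerySlowThreshold) return "very_slow";
  if (tps < tpsSlowThreshold) return "slow";
  return "healthy";
}
```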
- `formatNumber` crash when token values are undefined (WorkflowResults page)
- Workflow detail table showing Tokens as 0 (field mismatch: `promptTokens` vs `totalTokens`)
- Orphaned monitor targets remaining after provider model rename or deletion
- Playground image input redesigned: inline button at prompt bottom, drag-and-drop, clipboard paste support (matching ChatGPT/Claude UX)
- Playground Run button moved to prompt textarea bottom-right for faster access
- Config row (Max Tokens, Streaming, Thinking) moved above prompt area
- Presets integrated into prompt bottom bar
- Removed URL image input (file upload only)
- Playground history with SQLite persistence, auto-save on every run
- History sidebar with replay: click any past run to restore prompt, config, and response
- Thinking/reasoning toggle for Anthropic extended thinking and OpenAI reasoning effort
- Copy response button in Playground
- Long Context presets (8K/16K/32K/64K/128K) in Playground
- Backend image validation (size and count limits)
- Anthropic extended thinking not working (wrong API version, missing thinking params)
- Anthropic required CLI headers accidentally removed
- Playground upstream API calls not aborted when client disconnects (resource leak)
- Gemini using fake conversation turns instead of native `systemInstruction`
- Playground `/run` returning HTTP 200 on errors instead of 502
- Image URLs silently dropped for Anthropic/Gemini (now fetched and converted to base64)
- Backend `maxTokens` default misaligned (512 vs 4096)
- Flash of empty state on Monitor, Config, and Workflow pages before data loads
- History sidebar defaults to open for better discoverability
- Playground design polish: label sizes, config row layout, mobile responsiveness
- Gemini streaming support (`streamGenerateContent` with `alt=sse`) for accurate TTFT measurement
- Gemini streaming in Playground with real-time token output
- History page refresh button for manual data reload
- Auto-refresh History page every 30s when running workflows exist
- Reload workflow data when navigating to History page
- Gemini TTFT always showing 0 due to missing streaming implementation
- Playground non-streaming TTFT showing fabricated value (`responseTime * 0.3`) instead of N/A
- Playground Gemini image input not being passed to API (images were silently dropped)
- Playground Gemini format falling back to non-streaming instead of using native streaming
- Frontend TTFT displaying `0ms` instead of `N/A` for non-streaming requests (WorkflowResults, ResultsPanel, PlaygroundPage)
- Non-streaming `/run` endpoint not passing images parameter to provider
- `backend/public/` added to `.gitignore` (build artifact)
- Renamed project from LLM Benchmark to LLM API Radar
- Updated all UI references, branding, screenshots, and demo GIF
Initial release.
- Multi-provider benchmark engine (OpenAI, Anthropic, Gemini, OpenAI-Compatible)
- Workflow engine with multi-task sequential execution
- Per-task prompt, concurrency, and iteration configuration
- Warmup runs to eliminate cold-start bias
- Live streaming metrics with per-provider area charts
- Radar comparison across all dimensions
- Persistent run history with full result details
- JSON and CSV export
- Playground page with streaming, vision support, and image upload
- Monitor page with periodic health checks, configurable thresholds, and 24h history
- JWT-based authentication with configurable credentials
- Docker deployment with multi-stage alpine build
- One-click build script (`start.sh`) and Docker Compose
- GitHub Actions workflow for Docker Hub auto-publish on tag push
- Dark theme UI with Ant Design 5
- SQLite storage with single-file database