Releases: idemerge/llm-api-bench

v2.15.3 — strict CI tag validation

15 May 06:39

Revert v2.15.2's CI dedup. Tag pushes now run the full quality job before docker.

Why

v2.15.2 skipped quality on tag pushes to avoid duplicate CI runs. That left a security gap: a tag pointing at an unvalidated commit (e.g. git tag v9.9.9 some-sha directly) would trigger a Docker push without going through type check / lint / tests.

Trade-off

Each release runs quality twice (~1m each) instead of once, but guarantees Docker images are only built from validated commits. Worth it.

No code changes. Same 906 tests.

🤖 Generated with Claude Code

v2.15.2 — fix CI lint + dedup CI runs

15 May 06:29

Patch release: fix CI lint failures from v2.15.1 + stop double-triggering CI on releases.

Fixed

  • Backend lint: unused DEFAULT_CONFIRM_DELAY_MS and test locals (mainStarted, warmupCount) flagged by ESLint
  • CI workflow: quality job now skips on tag pushes. Releases previously fired CI twice (branch push + tag push for the same commit). The docker job still triggers on tags

No code changes beyond lint cleanup. 906 tests, same as v2.15.0.

🤖 Generated with Claude Code

v2.15.1 — fix CI

15 May 06:24

Patch release to fix CI typecheck failure on v2.15.0.

supertest and @types/supertest were installed at the workspace root instead of backend/package.json. TypeScript resolved them locally via the parent node_modules, but GitHub Actions installs each sub-package independently, so the type check failed with Cannot find module 'supertest'.

No code changes. Same 906 tests as v2.15.0.

🤖 Generated with Claude Code

v2.15.0 — K-of-N alerts + 8 bug fixes + 906 tests

15 May 06:21

Reliability + correctness sweep on the monitor/alert pipeline, plus 8 frontend/backend bugs caught by a reverse-review of the test suite. Tests grew from ~148 → 906.

Changed

  • K-of-N alert voting (default 4-of-5) replaces the strict rule that required N consecutive failures and abandoned the cycle on a single ok. A flaky upstream that returns one healthy response between failures no longer suppresses real outage alerts for 30+ minutes. New alertConfirmFailThreshold config (range 1–alertConfirmCount).
  • Confirmation cycles exit early in both directions: the alert fires as soon as failCount reaches the threshold, and the cycle is abandoned as soon as the threshold becomes mathematically unreachable.
  • Health-check probe timeout: 180s → 90s for monitor probes and "Test Connection". Playground/benchmark calls keep their longer timeouts.
  • Confirmation probes within the same provider now run in parallel instead of serially.
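The K-of-N decision with early exit in both directions can be sketched as follows. This is a minimal illustration, not the project's code: decideCycle, runCycle, and the Verdict type are hypothetical names, with k standing in for alertConfirmFailThreshold and n for alertConfirmCount.

```typescript
type Verdict = "fire" | "abandon" | "continue";

// Hypothetical helper: classify a confirmation cycle after each probe.
// k = failures required (alertConfirmFailThreshold), n = total probes (alertConfirmCount).
function decideCycle(failCount: number, probesDone: number, k: number, n: number): Verdict {
  if (k < 1 || k > n) throw new RangeError("k must be in 1..n");
  // Early exit 1: enough failures already, no need to run the remaining probes.
  if (failCount >= k) return "fire";
  // Early exit 2: even if every remaining probe fails, k cannot be reached.
  const remaining = n - probesDone;
  if (failCount + remaining < k) return "abandon";
  return "continue";
}

// Replays probe results (true = failed probe) and stops at the first final verdict.
function runCycle(results: boolean[], k = 4, n = 5): { verdict: Verdict; probesUsed: number } {
  const limit = Math.min(results.length, n);
  let failCount = 0;
  for (let i = 0; i < limit; i++) {
    if (results[i]) failCount++;
    const verdict = decideCycle(failCount, i + 1, k, n);
    if (verdict !== "continue") return { verdict, probesUsed: i + 1 };
  }
  return { verdict: "continue", probesUsed: limit };
}
```

With the default 4-of-5, a sequence of fail, ok, fail, fail, fail still fires on the fifth probe, whereas the old strict rule would have abandoned the cycle at the single ok. Two healthy probes in a row abandon the cycle after only two probes, because at most three failures remain possible.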

Fixed

  • Race in the confirmation queue: between delete(key) and the re-add after await confirmProbe, a second cycle could start for the same key, producing duplicate parallel cycles. Added inFlight token map + stale-token check.
  • Recovery alert no longer leaves a zombie down cycle in flight — explicitly cancels pending/in-flight confirmation on recovery.
  • Webhook delivery failures now retry instead of silently consuming the alert: sendFeishuAlert throws on non-2xx; the fire path re-queues the cycle instead of recording lastAlertAt (which would suppress retries for 6h).
  • useMonitor.saveConfig no longer "phantom-saves" on non-2xx — was reflecting saved state into UI even when server rejected.
  • usePlaygroundHistory.deleteEntry / clearAll no longer "phantom-deletes" — checks res.ok before mutating local state.
  • useWorkflow mutations (cancel/delete/duplicate) surface server errors into state.error instead of silently returning false.
  • useBenchmark rejects malformed responses — validates array shape before setBenchmarks(data) to prevent state pollution.
  • PUT /api/monitor/targets now accepts [] — dropped .min(1) so users can clear the monitor list.
  • startWorkflow correctly toggles isRunning — sets true at try-block start so the catch's setIsRunning(false) actually has work to do.
  • providerStore rejects duplicate model id/name within a provider — collisions previously corrupted monitor target tracking.
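The in-flight token map and stale-token check from the first two fixes above might look roughly like this. It is a sketch under assumptions: the InFlightRegistry class and its method names are hypothetical, and the real implementation may be structured differently. The idea is that each cycle holds a unique token, and after any await it verifies the token is still the current one before touching shared state.

```typescript
// Hypothetical in-flight registry: each confirmation cycle gets a unique token per key.
class InFlightRegistry {
  private tokens = new Map<string, symbol>();

  // Start (or restart) a cycle for key; any earlier cycle for that key becomes stale.
  begin(key: string): symbol {
    const token = Symbol(key);
    this.tokens.set(key, token);
    return token;
  }

  // True only if the token still belongs to the latest cycle for key.
  isCurrent(key: string, token: symbol): boolean {
    return this.tokens.get(key) === token;
  }

  // Cancel whatever cycle is in flight, e.g. when a recovery is observed.
  cancel(key: string): void {
    this.tokens.delete(key);
  }

  // Finish a cycle; only the current owner may clear the entry.
  finish(key: string, token: symbol): boolean {
    if (!this.isCurrent(key, token)) return false;
    this.tokens.delete(key);
    return true;
  }
}
```

A cycle would call begin before its first probe and check isCurrent after every await. A stale cycle simply returns without re-queueing or alerting, which covers both the duplicate-cycle race and the zombie down cycle after recovery.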

Added

  • Settings UI exposes the K threshold as a K / N selector that auto-adjusts options when N changes.
  • Comprehensive test coverage expansion: 906 total tests (712 backend + 194 frontend) covering alert state coordination, K-of-N decision math, multi-provider streaming token fields, route HTTP semantics via supertest, store CRUD with sqlite migrations, full executeWorkflow integration.
  • CLAUDE.md gains a "Writing tests" discipline section recording the meta-lesson: 8 of these fixes came from a reverse-review where tests had been silently rewritten to match buggy code. The Iron Rule: when a test fails, suspect the code first.

🤖 Generated with Claude Code

v2.14.0 — Configurable alert confirmation & reminder fixes

13 May 15:57

Added

  • Configurable alert confirmation — number of consecutive failures (default 5, range 1-20) and delay between checks (default 1 min, range 1-60) before sending alerts, replacing the previous fixed single 1-minute re-check
  • Monitor settings UI exposes confirm count and confirm delay alongside language and reminder interval

Fixed

  • Alert reminder interval ignored — every save of monitor settings was wiping last_alert_at because setTargets/addTarget rebuilt the row without preserving the column, so reminders fired roughly every probe interval instead of every 6 hours
  • Status oscillation triggered spurious alerts: wasDown now treats down and very_slow as the same down state, so flipping between them doesn't fire a new "down" alert
  • PUT /api/monitor/config silently dropped alertConfirmCount and alertConfirmDelayMinutes from the request body — UI changes were not persisted
  • Alert confirmation probe now records a ping on error (previously failed probes left no DB trace) and re-queues on transient failures instead of silently dropping the confirmation
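The status-oscillation fix can be illustrated with a small transition classifier. This is only a sketch: down and very_slow come from the note above, while the "ok" and "slow" status names, the Transition type, and the classify function are assumptions made for illustration.

```typescript
// "down" and "very_slow" are named in the release note; "ok" and "slow" are assumed.
type Status = "ok" | "slow" | "very_slow" | "down";
type Transition = "new_down" | "recovery" | "none";

// Both statuses count as the same "down" state for alerting purposes.
const isDownState = (s: Status): boolean => s === "down" || s === "very_slow";

// Hypothetical classifier: only crossing the down/not-down boundary produces an alert.
function classify(prev: Status, next: Status): Transition {
  const wasDown = isDownState(prev);
  const nowDown = isDownState(next);
  if (!wasDown && nowDown) return "new_down";
  if (wasDown && !nowDown) return "recovery";
  return "none";
}
```

Under this grouping, classify("down", "very_slow") yields "none", so a target that flips between the two degraded statuses does not re-trigger a "down" alert.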

Dev experience

  • Backend dev watcher swapped from tsx watch to nodemon --legacy-watch polling — tsx watch was missing source edits made by atomic-replace writes (inode changes), causing "the code didn't update" frustration
  • Frontend Vite watcher hardened with usePolling for parity

Full Changelog: v2.13.1...v2.14.0

v2.13.1

11 May 13:28

What's Changed

🔔 Alert Confirmation Check

  • Down/reminder alerts now require a second probe after 1 minute to reduce false positives from transient failures
  • Recovery alerts are still sent immediately without confirmation

🐳 Docker

  • docker-compose.yml now uses Docker Hub image (idemerge/llm-api-bench) instead of local build

🧹 Code Quality

  • Removed 8 unused variables flagged by code quality analysis

Full Changelog: v2.13.0...v2.13.1

v2.13.0

11 May 11:43

What's New

🌐 Full i18n Support

  • Chinese/English language switcher in sidebar and login page
  • All hardcoded UI strings replaced with translation keys
  • Language preference persisted in localStorage

🔔 Feishu Webhook Alerts

  • Per-target alert enable/disable toggle
  • Status change detection: new failure, repeated failure (configurable interval), recovery
  • DB-persisted alert state — survives server restarts
  • Optional webhook signature verification
  • Configurable notification language (en/zh, default en)
  • Alert bell indicator on monitor model cards (color-coded by health status)

📝 Monitoring Settings

  • New alert configuration section: webhook URL, signing secret, language, reminder interval

Full Changelog: v2.12.0...v2.13.0

v2.12.1

28 Apr 08:34

Fixed

  • Touch targets undersized: removed size="small" from Settings buttons, increased model tag padding
  • Heading scale too flat: increased H1 from 20px to 24px
  • Capability tags (T/S/V) nearly illegible: increased font from 8px to 10px with larger padding
  • Mobile parameter labels overflow: responsive grid for Core Parameters section
  • Playground history panel overlaps form on mobile: full-screen overlay on mobile
  • Grammar: "1 models" now correctly pluralized across Monitor and History pages
  • antd deprecation: replaced Alert message prop with title (5 instances)
  • History page duplicate heading: removed redundant H2 title

Full Changelog: v2.12.0...v2.12.1

v2.12.0

28 Apr 06:08

What's Changed

Added

  • Naming validation for Provider name, Model ID, and DisplayName (backend schemas + frontend real-time hints)
    • Provider name: [a-zA-Z0-9][a-zA-Z0-9_-]{0,63} — no spaces
    • Model ID: [a-zA-Z0-9][a-zA-Z0-9._/-]{0,63} — LiteLLM vendor/model compatible
    • DisplayName: [a-zA-Z0-9][a-zA-Z0-9 ._-]{0,63} — human-readable
  • Frontend validation unit tests (16 cases) + backend boundary tests (4 cases)
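The three naming patterns above can be exercised with a few lines of TypeScript. In this sketch the constant names and the validateNames helper are hypothetical, and anchoring each pattern with ^ and $ is an assumption (the release note lists the patterns without anchors). Each pattern allows one leading alphanumeric character plus up to 63 more, so 64 characters in total.

```typescript
// Patterns copied from the release note; the ^...$ anchors are an assumption.
const PROVIDER_NAME = /^[a-zA-Z0-9][a-zA-Z0-9_-]{0,63}$/;
const MODEL_ID = /^[a-zA-Z0-9][a-zA-Z0-9._\/-]{0,63}$/;
const DISPLAY_NAME = /^[a-zA-Z0-9][a-zA-Z0-9 ._-]{0,63}$/;

interface NameInput {
  providerName: string;
  modelId: string;
  displayName: string;
}

// Hypothetical helper: returns the names of the fields that fail validation.
function validateNames(input: NameInput): string[] {
  const invalid: string[] = [];
  if (!PROVIDER_NAME.test(input.providerName)) invalid.push("providerName");
  if (!MODEL_ID.test(input.modelId)) invalid.push("modelId");
  if (!DISPLAY_NAME.test(input.displayName)) invalid.push("displayName");
  return invalid;
}
```

For example, a provider name containing a space fails, while a vendor/model style Model ID such as "openai/gpt-4o" and a display name such as "GPT 4o" both pass.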

Changed

  • Project renamed from LLM API Radar to LLM API Bench (repo, UI, docs, Docker, CI)
  • Playground history sidebar shows ProviderName/DisplayName instead of raw model ID
  • Backend stores model displayName in playground history
  • Adaptive QuickButtons: auto-shrink when >7 options to prevent line wrapping
  • Max concurrency raised to 5000, max iterations to 10M

Fixed

  • Getting Started hint no longer flashes on page refresh
  • Playground provider/model selectors no longer flash raw IDs before names load
  • Playground history correctly resolves model displayName from provider data
  • Quick Start cd path fixed in both READMEs

Tests

  • 175 tests total (frontend 76 + backend 99), all passing

Full Changelog: v2.11.2...v2.12.0

v2.11.3

28 Apr 03:53

What's Changed

Changed

  • Raised max concurrency from 1000 to 5000 (frontend, backend validation, route caps)
  • Raised max iterations from 1M to 10M (frontend, backend validation, route caps)
  • Added quick-select buttons for 2K/5K concurrency and 5M/10M iterations
  • Fixed Quick Start instructions in both READMEs: cd llm-benchmark → cd llm-api-radar
  • Updated README (EN/CN) with new concurrency/iterations limits

Tests

  • Added boundary validation tests for concurrency (5000/5001) and iterations (10M/10M+1)
  • All 155 tests passing (frontend 60, backend 95)
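The boundary cases named above (5000/5001 and 10M/10M+1) correspond to a check along these lines. This is an illustrative sketch only: validateRunLimits and the constant names are hypothetical, and the lower bound of 1 is an assumption not stated in the release note.

```typescript
const MAX_CONCURRENCY = 5000;      // raised from 1000 in this release
const MAX_ITERATIONS = 10000000;   // raised from 1M in this release

// Hypothetical validator: returns one message per out-of-range field.
// The lower bound of 1 is an assumption.
function validateRunLimits(concurrency: number, iterations: number): string[] {
  const errors: string[] = [];
  if (!Number.isInteger(concurrency) || concurrency < 1 || concurrency > MAX_CONCURRENCY) {
    errors.push(`concurrency must be an integer in 1..${MAX_CONCURRENCY}`);
  }
  if (!Number.isInteger(iterations) || iterations < 1 || iterations > MAX_ITERATIONS) {
    errors.push(`iterations must be an integer in 1..${MAX_ITERATIONS}`);
  }
  return errors;
}
```

Testing exactly at the limit and one past it, as the release does, catches off-by-one mistakes such as writing >= where > was intended.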

Full Changelog: v2.11.2...v2.11.3