Add per-run cost and duration tracking to Orchestra. Every LLM invocation records token counts, cost (USD), duration, and phase metadata. Aggregated metrics are surfaced per thread, with configurable budget caps that integrate with the existing HITL interrupt system. The frontend displays real-time cost badges and a detailed breakdown panel.
Spec: .claude/specs/cost-tracking.md
- As a user, I want to see the running cost of my current thread in the chat header so I can monitor spend in real time.
- As a user, I want to set a budget cap on a thread or assistant so that execution pauses before exceeding my limit.
- As a user, I want to view a per-turn cost breakdown (model, tokens, cost, duration) so I can identify expensive steps.
- As an admin, I want to configure model pricing tables via API so costs stay accurate when providers change rates.
- As a team lead, I want historical cost data per thread so I can analyze spending trends across projects.
Goal: Define schemas, database tables, and pricing constants.
Tasks:
- Create `backend/src/schemas/entities/metrics.py` with `TurnMetrics`, `ThreadCostSummary`, and `RunBudget` Pydantic models.
- Create `backend/src/constants/pricing.py` with hardcoded pricing dictionary for OpenAI, Anthropic, Google models.
- Create Alembic migration for `turn_metrics` table (columns: turn_id, thread_id, assistant_id, user_id, model, input_tokens, output_tokens, total_tokens, input_cost, output_cost, total_cost, duration_ms, phase, created_at).
- Create Alembic migration for `run_budgets` table (columns: id, thread_id, assistant_id, user_id, max_cost_usd, max_tokens, action_on_exceed, is_active, created_at, updated_at).
- Create Alembic migration for `pricing_overrides` table (columns: id, model, input_per_1k, output_per_1k, currency, created_at, updated_at).
Acceptance Criteria:
- Migrations run without errors on a clean database.
- Pydantic schemas validate correctly with sample data.
- Pricing dictionary covers at least 10 commonly used models.
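
As a rough illustration of the pricing constants and the cost math they imply, the sketch below assumes rates are stored per 1K tokens (mirroring the `input_per_1k` / `output_per_1k` columns of `pricing_overrides`). The model names, the rates, and the `MODEL_PRICING` and `compute_cost` names are hypothetical placeholders, not real provider prices or the final module layout.

```python
from decimal import Decimal
from typing import Dict

# Hypothetical placeholder models and rates (USD per 1K tokens).
# These are NOT real provider prices; the real table would list actual models.
MODEL_PRICING: Dict[str, Dict[str, Decimal]] = {
    "example-small": {
        "input_per_1k": Decimal("0.0005"),
        "output_per_1k": Decimal("0.0015"),
    },
    "example-large": {
        "input_per_1k": Decimal("0.0100"),
        "output_per_1k": Decimal("0.0300"),
    },
}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> Dict[str, Decimal]:
    """Return input, output, and total cost in USD for one LLM call."""
    rates = MODEL_PRICING[model]  # raises KeyError for an unknown model
    input_cost = rates["input_per_1k"] * input_tokens / 1000
    output_cost = rates["output_per_1k"] * output_tokens / 1000
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }
```

`Decimal` is used here to avoid binary floating-point drift when many small per-turn costs are summed; whether the real schema uses `Decimal` or `float` is an open design choice.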
Goal: Instrument LLM calls to capture token usage, cost, and duration.
Tasks:
- Create `backend/src/utils/metrics.py` with `capture_turn_metrics()` wrapper.
- Hook into the LangChain callback system (or wrap the LLM service layer) to intercept response metadata.
- Extract `usage` (input_tokens, output_tokens) from LLM response.
- Compute cost using pricing lookup (DB override -> hardcoded fallback).
- Measure wall-clock duration with `time.perf_counter()`.
- Persist `TurnMetrics` asynchronously (background task or fire-and-forget DB write).
- Emit a `cost_update` event on the SSE/streaming channel for real-time frontend updates.
Acceptance Criteria:
- Every LLM call produces a `TurnMetrics` record in the database.
- Cost calculation matches expected values for known token counts and pricing.
- Metrics capture adds < 5ms overhead to request latency.
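
One possible shape for the capture wrapper is sketched below. It assumes the LLM call returns a dict with an optional `usage` entry, which is an assumption about the response shape rather than the LangChain API. `TurnRecord`, `lookup_rates`, `HARDCODED_PRICING`, and the placeholder rates are hypothetical, and persistence and event emission are left out.

```python
import logging
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple

logger = logging.getLogger(__name__)

# (input_per_1k, output_per_1k) in USD. Placeholder values, not real prices.
Rates = Tuple[float, float]
HARDCODED_PRICING: Dict[str, Rates] = {"example-model": (0.001, 0.002)}


@dataclass
class TurnRecord:
    """Stdlib stand-in for the planned TurnMetrics model (hypothetical)."""
    model: str
    phase: str
    input_tokens: int
    output_tokens: int
    total_cost: float
    duration_ms: float


def lookup_rates(model: str, overrides: Dict[str, Rates]) -> Optional[Rates]:
    """Prefer a DB override, then the hardcoded table; None if unknown."""
    if model in overrides:
        return overrides[model]
    return HARDCODED_PRICING.get(model)


def capture_turn_metrics(
    call_llm: Callable[[], Dict[str, Any]],
    model: str,
    phase: str,
    overrides: Optional[Dict[str, Rates]] = None,
) -> Tuple[Dict[str, Any], TurnRecord]:
    """Run one LLM call and return its response plus a metrics record."""
    start = time.perf_counter()
    response = call_llm()
    duration_ms = (time.perf_counter() - start) * 1000.0

    # Assumed response shape: {"usage": {"input_tokens": ..., "output_tokens": ...}}
    usage = response.get("usage") or {}
    input_tokens = int(usage.get("input_tokens", 0))
    output_tokens = int(usage.get("output_tokens", 0))
    rates = lookup_rates(model, overrides or {})

    if not usage or rates is None:
        # Graceful fallback: log a warning and record zero cost.
        logger.warning("No usage or pricing for model %s; recording zero cost", model)
        total_cost = 0.0
    else:
        total_cost = (input_tokens * rates[0] + output_tokens * rates[1]) / 1000.0

    record = TurnRecord(model, phase, input_tokens, output_tokens, total_cost, duration_ms)
    return response, record
```

In the real service, the returned record would be handed to a background task for persistence so the DB write stays off the request path.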
Goal: Enforce spending limits with HITL integration.
Tasks:
- Create `backend/src/services/budget.py` with budget check logic.
- Before each LLM call, query cumulative thread cost (cached, refresh per turn).
- Compare against active `RunBudget` for the thread/assistant.
- Implement three actions: `pause` (HITL interrupt), `warn` (log + UI event), `stop` (raise error).
- Integrate with existing HITL interrupt flow for the `pause` action.
- Add budget status to the `cost_update` SSE event payload.
Acceptance Criteria:
- Thread pauses when cost exceeds budget with `pause` action.
- User can approve continuation after HITL interrupt.
- `stop` action halts execution and returns a clear error message.
- `warn` action logs and sends notification but does not block.
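
The three actions could be expressed as a pure decision function along these lines. This is a minimal sketch: `Budget`, `BudgetDecision`, `BudgetExceededError`, and `check_budget` are hypothetical names, the HITL interrupt itself is left to the caller, and treating "cost has reached the cap" as exceeded (to line up with a >= 100% indicator) is an assumption.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Optional

logger = logging.getLogger(__name__)


class BudgetAction(str, Enum):
    PAUSE = "pause"
    WARN = "warn"
    STOP = "stop"


class BudgetExceededError(RuntimeError):
    """Raised for the `stop` action (hypothetical error type)."""


@dataclass
class Budget:
    max_cost_usd: float
    action_on_exceed: BudgetAction
    is_active: bool = True


@dataclass
class BudgetDecision:
    proceed: bool    # may the LLM call go ahead right now?
    interrupt: bool  # should the caller raise a HITL interrupt?
    message: Optional[str] = None


def check_budget(cumulative_cost_usd: float, budget: Optional[Budget]) -> BudgetDecision:
    """Decide what to do before the next LLM call."""
    if budget is None or not budget.is_active:
        return BudgetDecision(proceed=True, interrupt=False)
    # Assumption: reaching the cap counts as exceeded.
    if cumulative_cost_usd < budget.max_cost_usd:
        return BudgetDecision(proceed=True, interrupt=False)

    message = (
        f"Thread cost ${cumulative_cost_usd:.4f} reached the budget cap "
        f"of ${budget.max_cost_usd:.4f}"
    )
    if budget.action_on_exceed is BudgetAction.STOP:
        raise BudgetExceededError(message)
    if budget.action_on_exceed is BudgetAction.WARN:
        logger.warning(message)
        return BudgetDecision(proceed=True, interrupt=False, message=message)
    # PAUSE: the caller hands control to the HITL interrupt flow.
    return BudgetDecision(proceed=False, interrupt=True, message=message)
```

Keeping the check free of DB and HITL calls makes the threshold logic easy to unit test; the service layer would supply the cached cumulative cost and act on the decision.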
Goal: Expose metrics, pricing, and budget management via REST API.
Tasks:
- Create `backend/src/routes/v0/metrics.py` with thread metrics endpoints.
- Create `backend/src/routes/v0/pricing.py` with pricing CRUD endpoints.
- Create `backend/src/routes/v0/budgets.py` with budget CRUD endpoints.
- Create corresponding controller and service layers.
- Add OpenAPI documentation for all new endpoints.
- Register routes in the v0 router.
Acceptance Criteria:
- All endpoints return correct data with proper HTTP status codes.
- Pricing CRUD is restricted to admin users.
- Thread metrics are scoped to the requesting user's threads.
- Endpoints appear in `/docs` with full OpenAPI schemas.
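
The service behind a thread metrics endpoint might aggregate per-turn rows roughly as follows. This is a framework-free sketch: `summarize_thread` and the summary keys are hypothetical, rows are plain dicts keyed by the `turn_metrics` column names, and user scoping is done by filtering on `user_id`.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def summarize_thread(
    turns: Iterable[Dict[str, Any]], thread_id: str, user_id: str
) -> Dict[str, Any]:
    """Aggregate turn rows for one thread, scoped to the requesting user."""
    cost_by_phase: Dict[str, float] = defaultdict(float)
    summary: Dict[str, Any] = {
        "thread_id": thread_id,
        "turn_count": 0,
        "total_tokens": 0,
        "total_cost": 0.0,
        "total_duration_ms": 0,
    }
    for turn in turns:
        # Skip rows from other threads or other users.
        if turn["thread_id"] != thread_id or turn["user_id"] != user_id:
            continue
        summary["turn_count"] += 1
        summary["total_tokens"] += turn["total_tokens"]
        summary["total_cost"] += turn["total_cost"]
        summary["total_duration_ms"] += turn["duration_ms"]
        cost_by_phase[turn["phase"]] += turn["total_cost"]
    summary["cost_by_phase"] = dict(cost_by_phase)
    return summary
```

In practice the filtering and summing would more likely be pushed into a SQL `GROUP BY`; the in-memory version only illustrates the expected result shape.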
Goal: Display cost data in the chat UI with budget configuration.
Tasks:
- Add `CostBadge` component to chat header showing running thread cost.
- Subscribe to `cost_update` SSE events for real-time updates.
- Build `CostBreakdownPanel` (expandable side panel or modal) with per-turn table and per-phase summary.
- Build `BudgetConfigDialog` for setting budget caps on threads/assistants.
- Add color-coded warning indicators: green (< 75%), yellow (75-99%), red (>= 100%) of budget.
- Add cost column to thread list view for historical comparison.
Acceptance Criteria:
- Cost badge updates within 1 second of each LLM response.
- Breakdown panel displays accurate per-turn data with model, tokens, cost, and duration.
- Budget dialog saves configuration and reflects active budget in the UI.
- Warning colors transition correctly at threshold boundaries.
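
The threshold boundaries can be pinned down with a small pure function. The sketch below uses Python for consistency with the backend examples, although the real logic would live in the frontend; `budget_color` is a hypothetical name, and treating the yellow band as 75% up to but not including 100% is an assumption about how "75-99%" should handle fractional values.

```python
def budget_color(spent_usd: float, max_cost_usd: float) -> str:
    """Map spend against a budget cap to a warning color."""
    if max_cost_usd <= 0:
        raise ValueError("max_cost_usd must be positive")
    percent = spent_usd * 100 / max_cost_usd
    if percent >= 100:
        return "red"
    if percent >= 75:
        return "yellow"  # covers 75% up to, but not including, 100%
    return "green"
```

Testing the exact boundary values (74%, 75%, 99%, 100%) is what the "transition correctly" criterion would come down to.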
Goal: Comprehensive test coverage and developer documentation.
Tasks:
- Unit tests for pricing calculation (`test_pricing.py`).
- Unit tests for budget threshold logic (`test_budget.py`).
- Unit tests for metrics aggregation (`test_metrics.py`).
- Integration tests for metrics capture during LLM calls.
- Integration tests for budget enforcement with HITL interrupt.
- Integration tests for all API endpoints.
- Frontend tests for `CostBadge`, `CostBreakdownPanel`, `BudgetConfigDialog`.
- Update API documentation and `llm.txt` if public-facing.
Acceptance Criteria:
- >= 80% code coverage on new backend modules.
- All integration tests pass in CI.
- Frontend component tests cover rendering and user interaction flows.
- Existing HITL system: Budget `pause` action depends on the interrupt flow in `backend/src/schemas/entities/hitl.py`.
- LangChain callback system: Metrics capture hooks into the existing LLM invocation layer.
- SSE/streaming infrastructure: Real-time cost updates use the existing event streaming channel.
- Database: Requires PostgreSQL with Alembic migrations.
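
For the real-time cost updates carried over the streaming channel, a `cost_update` event could be serialized in the standard Server-Sent Events wire format as sketched below. The payload fields (`thread_id`, `total_cost_usd`, `budget_status`) and the `format_cost_update` helper are hypothetical; only the `event:` / `data:` framing with a blank-line terminator comes from the SSE format itself.

```python
import json
from typing import Any, Dict, Optional


def format_cost_update(
    thread_id: str,
    total_cost_usd: float,
    budget_status: Optional[str] = None,
) -> str:
    """Serialize a cost_update event as one SSE message."""
    payload: Dict[str, Any] = {
        "thread_id": thread_id,
        "total_cost_usd": total_cost_usd,
    }
    if budget_status is not None:
        payload["budget_status"] = budget_status
    # json.dumps emits a single line by default, so one data: field suffices.
    return f"event: cost_update\ndata: {json.dumps(payload)}\n\n"
```

A browser client would receive this through an `EventSource` listener registered for the `cost_update` event name.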
| Risk | Impact | Mitigation |
|---|---|---|
| Provider pricing changes frequently | Stale costs | DB-backed overrides + periodic review cadence |
| Metrics capture adds latency | Slower responses | Async DB writes, fire-and-forget pattern |
| Token count unavailable for some models | Missing cost data | Graceful fallback: log warning, record zero cost |
| Budget check race condition (concurrent turns) | Overspend | Optimistic check + post-turn reconciliation |
| Large turn_metrics table over time | DB performance | Indexed queries, future: partitioning or archival |
- Unit: Pure function tests for pricing math, budget thresholds, aggregation logic. No DB or network needed.
- Integration: Full request cycle with test database. Verify metrics records created, budget enforcement triggers, API responses correct.
- Frontend: Vitest + Testing Library for component rendering, mock SSE events, form validation.
- Manual QA: End-to-end flow with real LLM calls in dev environment, verify cost badge accuracy.