RFC: Admin /stats hourly rollups for high-volume usage analytics

## Summary

Admin UI `/stats` recently exposed a scaling problem in Cloud API prod: billing-summary errors were logged as generic database failures, and the first mitigation attempt in #722 proposed several regular `CREATE INDEX` migrations on the high-volume `organization_usage_log` table.

Review feedback on #722 correctly called out that normal `refinery` PostgreSQL migrations run in transactions. Creating indexes this way on hot usage tables can block writes long enough to exhaust Cloud API database connections. PR #722 is being reduced to diagnostics only; this issue tracks the durable design.

Related context:
- #722
- #704

## Goals

- Make admin `/stats` safe and fast for millions+ of usage rows.
- Keep public/admin API response shapes unchanged.
- Allow `/stats` numbers to be a few minutes stale.
- Avoid blocking DDL on hot usage tables.
- Preserve current dashboard attribution semantics unless explicitly changed later.

## Non-goals

- Do not make #722 the long-term performance fix.
- Do not use normal `refinery` migrations for hot-table index builds.
- Do not turn this into finance-grade recognized revenue; `/stats` remains consumed-cost analytics.
- Do not solve p95 latency rollups in v1 unless we add a separate histogram design.

## Proposed Architecture

Create fine-grain hourly rollup tables for usage-derived admin analytics:

- Inference rollups by hour, organization, workspace, API key, and model.
- Service usage rollups by hour, organization, workspace, API key, and service.

Rollup rows should aggregate additive metrics such as:

- request count
- input/output/cache-read tokens
- total tokens
- total cost in nano-USD
- provider-error/timeout counts where applicable
- last usage timestamp where needed for org revenue ranking

For model attribution, store stable model identifiers/names in rollups and keep joining current `models` metadata for verifiable/provider splits. This preserves the current `/stats` behavior documented in the API types, where historical usage is interpreted through current model metadata.

## Worker And Backfill

Run an async Cloud API rollup worker with:

- a Postgres advisory lock so only one instance processes at a time
- bounded time chunks
- a safe watermark, for example `now() - a few minutes`
- a resumable progress table per source table
- idempotent upserts into rollup tables
- sanitized progress/failure logging

Historical backfill should be chunked async, not a single large migration. Endpoints should continue using raw logs until the relevant rollup range is caught up and parity has been validated.

## Query Cutover

Cut endpoints over behind a config flag.

Use rollups for heavy additive aggregates in:

- `/v1/admin/platform/metrics`
- `/v1/admin/platform/metrics/timeseries`
- `/v1/admin/platform/model-revenue`
- `/v1/admin/platform/org-revenue`
- `/v1/admin/platform/billing-summary` inference/service consumed split

Keep non-usage snapshot queries on their current tables, for example users, organizations, active credit limits, and source breakdowns.

For p95 latency metrics, use a hybrid v1 approach:

- rollups serve revenue/token/request aggregates
- raw logs continue serving p95 latency until a histogram rollup design exists

## DDL Safety

If indexes are later proven necessary on hot usage tables, build them out-of-band with `CREATE INDEX CONCURRENTLY`, not through normal `refinery` migrations.

Any schema migration for small rollup/progress tables can stay in regular migrations, because those tables are new and not hot write paths at creation time.

## Validation Plan

- Compare raw-log analytics vs rollup analytics on fixture data for platform metrics, timeseries, model revenue, org revenue, and billing split.
- Cover sort, pagination, search, provider/verifiable filters, and empty ranges.
- Test worker chunking, advisory-lock behavior, resume after interruption, duplicate-safe upserts, and raw-tail freshness.
- Keep endpoint JSON response shapes unchanged.
- After rollout, compare Datadog `/stats` latency/error rates before and after rollup cutover.

## Implementation Checklist

- [ ] Define rollup/progress table schema.
- [ ] Add repository methods to roll up bounded chunks from `organization_usage_log` and `organization_service_usage_log`.
- [ ] Add advisory-lock guarded async worker with configurable interval/chunk size/watermark.
- [ ] Add parity tests comparing raw queries and rollup queries.
- [ ] Add feature flag for rollup-backed `/stats` reads.
- [ ] Cut billing-summary inference/service split to rollups after backfill validation.
- [ ] Cut platform metrics/timeseries/model revenue/org revenue to rollups after parity validation.
- [ ] Decide whether p95 latency should get histogram rollups in a follow-up RFC.

## Open Questions

- What exact freshness watermark should production use?
- Should rollup backfill be enabled automatically in prod, or manually started after deploy?
- What Datadog monitors should gate rollout and cutover?
- When, if ever, should `/stats` switch from current model metadata attribution to usage-time attribution?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Admin /stats hourly rollups for high-volume usage analytics #725

Summary

Goals

Non-goals

Proposed Architecture

Worker And Backfill

Query Cutover

DDL Safety

Validation Plan

Implementation Checklist

Open Questions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

RFC: Admin /stats hourly rollups for high-volume usage analytics #725

Description

Summary

Goals

Non-goals

Proposed Architecture

Worker And Backfill

Query Cutover

DDL Safety

Validation Plan

Implementation Checklist

Open Questions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions