Skip to content

RFC: Admin /stats hourly rollups for high-volume usage analytics #725

@PierreLeGuen

Description

@PierreLeGuen

Summary

Admin UI /stats recently exposed a scaling problem in Cloud API prod: billing-summary errors were logged as generic database failures, and the first mitigation attempt in #722 proposed several regular CREATE INDEX migrations on the high-volume organization_usage_log table.

Review feedback on #722 correctly called out that normal refinery PostgreSQL migrations run in transactions. Creating indexes this way on hot usage tables can block writes long enough to exhaust Cloud API database connections. PR #722 is being reduced to diagnostics only; this issue tracks the durable design.

Related context:

Goals

  • Make admin /stats safe and fast for millions+ of usage rows.
  • Keep public/admin API response shapes unchanged.
  • Allow /stats numbers to be a few minutes stale.
  • Avoid blocking DDL on hot usage tables.
  • Preserve current dashboard attribution semantics unless explicitly changed later.

Non-goals

  • Do not make Add admin stats diagnostics and indexes #722 the long-term performance fix.
  • Do not use normal refinery migrations for hot-table index builds.
  • Do not turn this into finance-grade recognized revenue; /stats remains consumed-cost analytics.
  • Do not solve p95 latency rollups in v1 unless we add a separate histogram design.

Proposed Architecture

Create fine-grain hourly rollup tables for usage-derived admin analytics:

  • Inference rollups by hour, organization, workspace, API key, and model.
  • Service usage rollups by hour, organization, workspace, API key, and service.

Rollup rows should aggregate additive metrics such as:

  • request count
  • input/output/cache-read tokens
  • total tokens
  • total cost in nano-USD
  • provider-error/timeout counts where applicable
  • last usage timestamp where needed for org revenue ranking

For model attribution, store stable model identifiers/names in rollups and keep joining current models metadata for verifiable/provider splits. This preserves the current /stats behavior documented in the API types, where historical usage is interpreted through current model metadata.

Worker And Backfill

Run an async Cloud API rollup worker with:

  • a Postgres advisory lock so only one instance processes at a time
  • bounded time chunks
  • a safe watermark, for example now() - a few minutes
  • a resumable progress table per source table
  • idempotent upserts into rollup tables
  • sanitized progress/failure logging

Historical backfill should be chunked async, not a single large migration. Endpoints should continue using raw logs until the relevant rollup range is caught up and parity has been validated.

Query Cutover

Cut endpoints over behind a config flag.

Use rollups for heavy additive aggregates in:

  • /v1/admin/platform/metrics
  • /v1/admin/platform/metrics/timeseries
  • /v1/admin/platform/model-revenue
  • /v1/admin/platform/org-revenue
  • /v1/admin/platform/billing-summary inference/service consumed split

Keep non-usage snapshot queries on their current tables, for example users, organizations, active credit limits, and source breakdowns.

For p95 latency metrics, use a hybrid v1 approach:

  • rollups serve revenue/token/request aggregates
  • raw logs continue serving p95 latency until a histogram rollup design exists

DDL Safety

If indexes are later proven necessary on hot usage tables, build them out-of-band with CREATE INDEX CONCURRENTLY, not through normal refinery migrations.

Any schema migration for small rollup/progress tables can stay in regular migrations, because those tables are new and not hot write paths at creation time.

Validation Plan

  • Compare raw-log analytics vs rollup analytics on fixture data for platform metrics, timeseries, model revenue, org revenue, and billing split.
  • Cover sort, pagination, search, provider/verifiable filters, and empty ranges.
  • Test worker chunking, advisory-lock behavior, resume after interruption, duplicate-safe upserts, and raw-tail freshness.
  • Keep endpoint JSON response shapes unchanged.
  • After rollout, compare Datadog /stats latency/error rates before and after rollup cutover.

Implementation Checklist

  • Define rollup/progress table schema.
  • Add repository methods to roll up bounded chunks from organization_usage_log and organization_service_usage_log.
  • Add advisory-lock guarded async worker with configurable interval/chunk size/watermark.
  • Add parity tests comparing raw queries and rollup queries.
  • Add feature flag for rollup-backed /stats reads.
  • Cut billing-summary inference/service split to rollups after backfill validation.
  • Cut platform metrics/timeseries/model revenue/org revenue to rollups after parity validation.
  • Decide whether p95 latency should get histogram rollups in a follow-up RFC.

Open Questions

  • What exact freshness watermark should production use?
  • Should rollup backfill be enabled automatically in prod, or manually started after deploy?
  • What Datadog monitors should gate rollout and cutover?
  • When, if ever, should /stats switch from current model metadata attribution to usage-time attribution?

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestrfcDesign/architecture RFC

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions