You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Admin UI /stats recently exposed a scaling problem in Cloud API prod: billing-summary errors were logged as generic database failures, and the first mitigation attempt in #722 proposed several regular CREATE INDEX migrations on the high-volume organization_usage_log table.
Review feedback on #722 correctly called out that normal refinery PostgreSQL migrations run in transactions. Creating indexes this way on hot usage tables can block writes long enough to exhaust Cloud API database connections. PR #722 is being reduced to diagnostics only; this issue tracks the durable design.
Do not use normal refinery migrations for hot-table index builds.
Do not turn this into finance-grade recognized revenue; /stats remains consumed-cost analytics.
Do not solve p95 latency rollups in v1 unless we add a separate histogram design.
Proposed Architecture
Create fine-grain hourly rollup tables for usage-derived admin analytics:
Inference rollups by hour, organization, workspace, API key, and model.
Service usage rollups by hour, organization, workspace, API key, and service.
Rollup rows should aggregate additive metrics such as:
request count
input/output/cache-read tokens
total tokens
total cost in nano-USD
provider-error/timeout counts where applicable
last usage timestamp where needed for org revenue ranking
For model attribution, store stable model identifiers/names in rollups and keep joining current models metadata for verifiable/provider splits. This preserves the current /stats behavior documented in the API types, where historical usage is interpreted through current model metadata.
Worker And Backfill
Run an async Cloud API rollup worker with:
a Postgres advisory lock so only one instance processes at a time
bounded time chunks
a safe watermark, for example now() - a few minutes
a resumable progress table per source table
idempotent upserts into rollup tables
sanitized progress/failure logging
Historical backfill should be chunked async, not a single large migration. Endpoints should continue using raw logs until the relevant rollup range is caught up and parity has been validated.
Keep non-usage snapshot queries on their current tables, for example users, organizations, active credit limits, and source breakdowns.
For p95 latency metrics, use a hybrid v1 approach:
rollups serve revenue/token/request aggregates
raw logs continue serving p95 latency until a histogram rollup design exists
DDL Safety
If indexes are later proven necessary on hot usage tables, build them out-of-band with CREATE INDEX CONCURRENTLY, not through normal refinery migrations.
Any schema migration for small rollup/progress tables can stay in regular migrations, because those tables are new and not hot write paths at creation time.
Validation Plan
Compare raw-log analytics vs rollup analytics on fixture data for platform metrics, timeseries, model revenue, org revenue, and billing split.
Cover sort, pagination, search, provider/verifiable filters, and empty ranges.
Test worker chunking, advisory-lock behavior, resume after interruption, duplicate-safe upserts, and raw-tail freshness.
Keep endpoint JSON response shapes unchanged.
After rollout, compare Datadog /stats latency/error rates before and after rollup cutover.
Implementation Checklist
Define rollup/progress table schema.
Add repository methods to roll up bounded chunks from organization_usage_log and organization_service_usage_log.
Add advisory-lock guarded async worker with configurable interval/chunk size/watermark.
Add parity tests comparing raw queries and rollup queries.
Add feature flag for rollup-backed /stats reads.
Cut billing-summary inference/service split to rollups after backfill validation.
Cut platform metrics/timeseries/model revenue/org revenue to rollups after parity validation.
Decide whether p95 latency should get histogram rollups in a follow-up RFC.
Open Questions
What exact freshness watermark should production use?
Should rollup backfill be enabled automatically in prod, or manually started after deploy?
What Datadog monitors should gate rollout and cutover?
When, if ever, should /stats switch from current model metadata attribution to usage-time attribution?
Summary
Admin UI
/statsrecently exposed a scaling problem in Cloud API prod: billing-summary errors were logged as generic database failures, and the first mitigation attempt in #722 proposed several regularCREATE INDEXmigrations on the high-volumeorganization_usage_logtable.Review feedback on #722 correctly called out that normal
refineryPostgreSQL migrations run in transactions. Creating indexes this way on hot usage tables can block writes long enough to exhaust Cloud API database connections. PR #722 is being reduced to diagnostics only; this issue tracks the durable design.Related context:
Goals
/statssafe and fast for millions+ of usage rows./statsnumbers to be a few minutes stale.Non-goals
refinerymigrations for hot-table index builds./statsremains consumed-cost analytics.Proposed Architecture
Create fine-grain hourly rollup tables for usage-derived admin analytics:
Rollup rows should aggregate additive metrics such as:
For model attribution, store stable model identifiers/names in rollups and keep joining current
modelsmetadata for verifiable/provider splits. This preserves the current/statsbehavior documented in the API types, where historical usage is interpreted through current model metadata.Worker And Backfill
Run an async Cloud API rollup worker with:
now() - a few minutesHistorical backfill should be chunked async, not a single large migration. Endpoints should continue using raw logs until the relevant rollup range is caught up and parity has been validated.
Query Cutover
Cut endpoints over behind a config flag.
Use rollups for heavy additive aggregates in:
/v1/admin/platform/metrics/v1/admin/platform/metrics/timeseries/v1/admin/platform/model-revenue/v1/admin/platform/org-revenue/v1/admin/platform/billing-summaryinference/service consumed splitKeep non-usage snapshot queries on their current tables, for example users, organizations, active credit limits, and source breakdowns.
For p95 latency metrics, use a hybrid v1 approach:
DDL Safety
If indexes are later proven necessary on hot usage tables, build them out-of-band with
CREATE INDEX CONCURRENTLY, not through normalrefinerymigrations.Any schema migration for small rollup/progress tables can stay in regular migrations, because those tables are new and not hot write paths at creation time.
Validation Plan
/statslatency/error rates before and after rollup cutover.Implementation Checklist
organization_usage_logandorganization_service_usage_log./statsreads.Open Questions
/statsswitch from current model metadata attribution to usage-time attribution?