feat: expose cumulative token usage on success + budget enforcement #2392

Open: mimran-khan wants to merge 2 commits into 567-labs:main from mimran-khan:feat/token-usage-budget

mimran-khan commented Jun 24, 2026

Fixes #2391. Related to #2056.

Problem

The retry system tracks cumulative token usage across all attempts, but this data is only accessible when retries fail (via InstructorRetryException.total_usage). On success, it's computed and thrown away. Users have no visibility into how expensive a successful extraction actually was, and no way to cap runaway costs from complex schemas that trigger many retries.

This came up in the #2056 discussion where someone mentioned losing "hundreds of dollars a day" from retries they couldn't observe or control.

What this PR does

Three things, all backward-compatible:

1. _total_usage on successful responses

After a successful extraction, the parsed model now has _total_usage attached (same pattern as _raw_response):

user = client.chat.completions.create(
    model="gpt-4o",
    response_model=User,
    messages=[...],
    max_retries=5,
)
print(f"Total tokens across all retries: {user._total_usage.total_tokens}")

2. completion:usage hook

New hook that fires after each API attempt with the running total. Enables integration with metrics/observability without touching core logic:

def on_usage(usage, *, attempt_number=0):
    metrics.gauge("instructor.tokens.cumulative", usage.total_tokens)

client.on("completion:usage", on_usage)
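
As a rough illustration of what a handler could do with the running total, the sketch below records each cumulative value and derives per-attempt deltas from it. `UsageRecorder` and the simulated totals are hypothetical; only the handler signature `(usage, *, attempt_number=0)` and the `total_tokens` field come from this PR, and the 1-based attempt numbering is an assumption.

```python
from types import SimpleNamespace


class UsageRecorder:
    """Hypothetical handler: stores the running total reported after each attempt."""

    def __init__(self):
        self.cumulative = []  # list of (attempt_number, running_total_tokens)

    def __call__(self, usage, *, attempt_number=0):
        self.cumulative.append((attempt_number, usage.total_tokens))

    def per_attempt(self):
        """Turn running totals back into the tokens spent by each attempt."""
        deltas, previous = [], 0
        for _, total in self.cumulative:
            deltas.append(total - previous)
            previous = total
        return deltas


recorder = UsageRecorder()

# Simulate three attempts; SimpleNamespace stands in for the real usage object.
for attempt, running_total in enumerate([400, 950, 1300], start=1):
    recorder(SimpleNamespace(total_tokens=running_total), attempt_number=attempt)

print(recorder.per_attempt())  # [400, 550, 350]
```

An instance like this could presumably be registered the same way as `on_usage` above, since it is callable with the same arguments.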

3. token_budget parameter

Optional parameter that raises TokenBudgetExceeded if cumulative tokens exceed the limit:

from instructor.v2.core.errors import TokenBudgetExceeded

try:
    user = client.chat.completions.create(
        response_model=ComplexSchema,
        max_retries=10,
        token_budget=5000,  # hard cap across all attempts
        # ... remaining arguments elided
    )
except TokenBudgetExceeded as e:
    print(f"Aborted: used {e.total_usage.total_tokens} tokens in {e.n_attempts} attempts")
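
For readers who want the intended semantics spelled out, here is a minimal, self-contained sketch of a cumulative budget check inside a retry loop. It is not the code in `retry.py`: `Usage`, `BudgetExceeded`, and `run_with_budget` are hypothetical stand-ins, and checking the budget after every attempt, before the result is accepted, is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    total_tokens: int = 0


class BudgetExceeded(Exception):
    """Hypothetical stand-in for TokenBudgetExceeded."""

    def __init__(self, total_usage, token_budget, n_attempts):
        super().__init__(
            f"used {total_usage.total_tokens} tokens "
            f"(budget {token_budget}) in {n_attempts} attempts"
        )
        self.total_usage = total_usage
        self.token_budget = token_budget
        self.n_attempts = n_attempts


def run_with_budget(attempts, token_budget=None):
    """Consume (tokens_used, result_or_None) pairs until one succeeds."""
    total = Usage()
    for n_attempts, (tokens_used, result) in enumerate(attempts, start=1):
        total.total_tokens += tokens_used
        # Assumption: the check runs after every attempt, before accepting the result.
        if token_budget is not None and total.total_tokens > token_budget:
            raise BudgetExceeded(total, token_budget, n_attempts)
        if result is not None:
            return result, total
    raise RuntimeError("all attempts failed validation")


# Two failed attempts followed by a success, with made-up token counts.
simulated = [(3000, None), (2500, None), (1000, "parsed")]

try:
    run_with_budget(simulated, token_budget=5000)
except BudgetExceeded as exc:
    print(exc)  # used 5500 tokens (budget 5000) in 2 attempts

result, usage = run_with_budget(simulated)  # token_budget=None: no enforcement
print(result, usage.total_tokens)  # parsed 6500
```

With the default `token_budget=None` the loop behaves exactly as it would without the check, which is the backward-compatible behavior described in this PR.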

Files changed

  • instructor/v2/core/hooks.py - added COMPLETION_USAGE hook name + CompletionUsageHandler protocol + emit_completion_usage() method
  • instructor/v2/core/errors.py - added TokenBudgetExceeded exception
  • instructor/v2/core/retry.py - emit usage hook, check budget, attach _total_usage on success, let TokenBudgetExceeded escape the retry wrapper
  • instructor/v2/core/patch.py - pass token_budget from create() down to retry logic
  • tests/test_token_budget.py - 6 tests covering all new behavior

Backward compatibility

All changes are additive. Existing code is unaffected:

  • token_budget defaults to None (no enforcement)
  • _total_usage is only attached, never required
  • The new hook only fires if handlers are registered
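
Because `_total_usage` is attached rather than required, calling code may want to read it defensively. The sketch below is one way to do that; `total_tokens_or_none` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for a parsed response.

```python
from types import SimpleNamespace


def total_tokens_or_none(result):
    """Return cumulative total tokens if the result carries _total_usage, else None."""
    usage = getattr(result, "_total_usage", None)
    return getattr(usage, "total_tokens", None)


# Stand-in for a parsed response that has usage attached.
with_usage = SimpleNamespace(_total_usage=SimpleNamespace(total_tokens=1234))

# Stand-in for a result produced by code that never attached usage.
without_usage = SimpleNamespace()

print(total_tokens_or_none(with_usage))     # 1234
print(total_tokens_or_none(without_usage))  # None
```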

Checklist before requesting a review

  • I have performed a self-review of my code
  • If it is a core feature, I have added thorough tests.
  • If it is a core feature, I have added documentation.

Previously, total_usage (accumulated token counts across retry attempts)
was only available when retries FAILED via InstructorRetryException.
On successful extraction, this data was computed but discarded.

Changes:
- Attach _total_usage to successful parsed responses (like _raw_response)
- Add completion:usage hook that fires after each attempt with cumulative
  token counts, enabling observability integration
- Add token_budget parameter to create() that raises TokenBudgetExceeded
  if cumulative tokens across all attempts exceed the configured limit
- TokenBudgetExceeded exception includes usage data, budget, and attempt count

This prevents runaway retry costs in production and gives users visibility
into how many tokens their retries actually consume when things work.

Refs 567-labs#2391

jxnl (Collaborator) commented Jun 29, 2026

Reviewed for merge readiness. Directionally this is the right PR to keep for #2391 / the #2056 retry-cost thread, and I closed the older conflicted #2296 in favor of this one. Before merge, I would like this refreshed with CI and one follow-up check: make sure cumulative usage is exposed consistently for non-BaseModel successful response shapes too, especially list[Model] / ListResponse, or explicitly document that _total_usage is only attached to BaseModel responses. The hook and token_budget path should also have async coverage matching the sync tests.

Addresses reviewer feedback:
- _total_usage is now attached to ListResponse results (list[Model]),
  not just single BaseModel responses
- Added 5 async tests mirroring the sync coverage (usage attachment,
  hook firing, budget enforcement, under-budget success, list response)
- Added 3 ListResponse-specific tests (sync) for budget enforcement
  and usage attachment on list results

AdapterBase (primitive return types like str/int) cannot carry attributes,
so _total_usage is only available on BaseModel and ListResponse shapes.

mimran-khan (Author) commented

Thanks for the review and for closing #2296 in favor of this. Pushed an update addressing both points:

ListResponse coverage: _total_usage is now attached to ListResponse results (the list[Model] path). The _finalize_parsed_response helper handles both the case where the parser returns a plain list (converted to ListResponse) and where it returns a ListResponse directly. Added 3 dedicated sync tests for this path.

Async coverage: Added 5 async tests matching the sync suite - usage attachment, hook firing, budget exceeded, under-budget success, and list response attachment. All use asyncio.run() with AsyncMock to exercise retry_async_v2 directly.
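
The general shape of such a test can be sketched as follows. This is not the PR's test code: `retry_async_sketch` is a hypothetical stand-in for the real retry function, and the `valid` flag and `make_response` helper are invented for illustration. Only `asyncio.run()` and `unittest.mock.AsyncMock` are taken from the description above.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock


async def retry_async_sketch(create, max_retries):
    """Hypothetical async retry loop that sums usage across attempts."""
    total_tokens = 0
    for _ in range(max_retries):
        response = await create()
        total_tokens += response.usage.total_tokens
        if response.valid:
            return response, total_tokens
    raise RuntimeError("retries exhausted")


def make_response(tokens, valid):
    """Build a fake API response with a usage object and a validity flag."""
    return SimpleNamespace(usage=SimpleNamespace(total_tokens=tokens), valid=valid)


# side_effect yields one fake response per awaited call: a failure, then a success.
create = AsyncMock(side_effect=[make_response(300, False), make_response(200, True)])

response, total = asyncio.run(retry_async_sketch(create, max_retries=3))

assert response.valid
assert total == 500            # both attempts are counted
assert create.await_count == 2  # the mock was awaited once per attempt
```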

Note on AdapterBase: For primitive return types (like str, int via AdapterBase), Python doesn't allow attaching attributes to built-in types, so _total_usage is only available on BaseModel and ListResponse shapes. The hook and token_budget still work regardless of response shape since they operate before finalization.
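
The limitation is easy to reproduce in plain Python. In the sketch below, `TaggedList` is a hypothetical list subclass standing in for a wrapper type, and `try_attach` is an illustrative helper, not part of the library.

```python
class TaggedList(list):
    """Hypothetical list subclass; its instances get a __dict__ for extra attributes."""


def try_attach(value, usage):
    """Try to set _total_usage on value; report whether the assignment worked."""
    try:
        value._total_usage = usage
    except AttributeError:
        return False
    return True


usage = {"total_tokens": 1234}

print(try_attach("some text", usage))            # False: str instances have no __dict__
print(try_attach(42, usage))                     # False: same for int
print(try_attach([1, 2, 3], usage))              # False: same for a plain list
print(try_attach(TaggedList([1, 2, 3]), usage))  # True: subclass instances accept attributes
```

The plain `list` case is the reason a wrapper type is needed at all for list results.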

14 tests total now, all passing.
