Feature: support cached input tokens in LLM span cost tracking #2741

@carlitose

Description

Is your feature request related to a problem? Please describe.

LlmSpan (and update_llm_span / @observe(type="llm")) only supports four cost-related
fields: input_token_count, output_token_count, cost_per_input_token,
cost_per_output_token. There is no way to report cached input tokens, even though
every major provider now bills them at a discounted rate:

  • OpenAI / Azure OpenAI: prompt_tokens_details.cached_tokens (inclusive, i.e. a subset
    of prompt_tokens), billed at ~50% of input price (up to 100% off on provisioned
    deployments).
  • Anthropic: cache_read_input_tokens (exclusive of input_tokens), billed at ~10% of
    input price.
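
To make the inclusive/exclusive difference concrete, here is a minimal sketch that normalizes both usage payloads to one inclusive convention. The NormalizedUsage type and the two helper functions are hypothetical names for illustration, not part of deepeval; only the provider field names come from the APIs described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedUsage:
    """Hypothetical container; input_tokens INCLUDES cached tokens."""
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int


def from_openai_usage(usage: dict) -> NormalizedUsage:
    # OpenAI / Azure OpenAI: cached_tokens is already a subset of prompt_tokens.
    details = usage.get("prompt_tokens_details") or {}
    return NormalizedUsage(
        input_tokens=usage["prompt_tokens"],
        cached_input_tokens=details.get("cached_tokens") or 0,
        output_tokens=usage["completion_tokens"],
    )


def from_anthropic_usage(usage: dict) -> NormalizedUsage:
    # Anthropic: cache_read_input_tokens is reported separately from
    # input_tokens, so add it back to get an inclusive total.
    # Cache-write tokens are deliberately ignored in this sketch.
    cached = usage.get("cache_read_input_tokens") or 0
    return NormalizedUsage(
        input_tokens=usage["input_tokens"] + cached,
        cached_input_tokens=cached,
        output_tokens=usage["output_tokens"],
    )


openai_usage = {
    "prompt_tokens": 10_000,
    "completion_tokens": 500,
    "prompt_tokens_details": {"cached_tokens": 8_000},
}
anthropic_usage = {
    "input_tokens": 2_000,
    "cache_read_input_tokens": 8_000,
    "output_tokens": 500,
}
```

With these made-up numbers, both payloads describe the same request and normalize to the same record.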

Because cost is computed as input_token_count × cost_per_input_token, any application
with a decent cache hit rate sees its cost systematically overestimated on Confident
AI. No workaround keeps both token counts and cost exact: a weighted effective
cost_per_input_token keeps cost exact but shows a misleading per-token rate.

Your own codebase has already hit this limitation: the OpenAI Agents integration
(deepeval/openai_agents/extractors.py) extracts usage.input_tokens_details.cached_tokens
from the SDK response, but has to stash it in the generic span.metadata dict because
LlmSpan has no first-class field for it, so the count is invisible to cost/token analytics.

The OpenTelemetry exporter has the same gap: the confident.llm.* namespace exposes the
same four fields, and GenAI semconv cache attributes (e.g.
gen_ai.usage.cache_read.input_tokens as emitted by OpenLLMetry) are not mapped.

Notably, Confident AI's own AI agent observability playbook
recommends capturing "Tokens by category (prompt, completion, cached, reasoning),
estimated dollar cost". The SDK currently can't express that schema.

Describe the solution you'd like

  1. New optional fields on LlmSpan (+ update_llm_span params):
    cached_input_token_count and cost_per_cached_input_token.
  2. Server-side cost formula when present:
    (input − cached) × cost_per_input + cached × cost_per_cached + output × cost_per_output
    (assuming inclusive counting; Anthropic's exclusive counting would need either
    normalization at ingestion or a documented convention).
  3. A cached-input price column in Settings → Model Costs, so automatic cost inference
    benefits too.
  4. OTel exporter: map the GenAI semconv cached-token attributes to the new field.
  5. The OpenAI Agents extractor could then promote cached_input_tokens from metadata
    to the first-class field.
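
The formula in item 2 can be sketched as follows, assuming inclusive counting (cached tokens are a subset of input tokens). The function names are hypothetical and the per-token prices are illustrative placeholders, not real provider rates; this is not the server's actual implementation.

```python
def llm_span_cost(
    input_token_count: int,
    cached_input_token_count: int,
    output_token_count: int,
    cost_per_input_token: float,
    cost_per_cached_input_token: float,
    cost_per_output_token: float,
) -> float:
    """Proposed formula, assuming cached tokens are included in the input count."""
    if not 0 <= cached_input_token_count <= input_token_count:
        raise ValueError("cached tokens must be a subset of input tokens")
    uncached = input_token_count - cached_input_token_count
    return (
        uncached * cost_per_input_token
        + cached_input_token_count * cost_per_cached_input_token
        + output_token_count * cost_per_output_token
    )


def current_cost(input_token_count, output_token_count,
                 cost_per_input_token, cost_per_output_token) -> float:
    """Today's four-field formula, for comparison."""
    return (input_token_count * cost_per_input_token
            + output_token_count * cost_per_output_token)


# Illustrative prices only: cached input at 50% of the regular input price.
P_IN, P_CACHED, P_OUT = 2.5e-6, 1.25e-6, 1.0e-5

proposed = llm_span_cost(10_000, 8_000, 500, P_IN, P_CACHED, P_OUT)
today = current_cost(10_000, 500, P_IN, P_OUT)
```

With an 80% cache hit rate in this example, the four-field formula reports roughly 0.03 while the proposed formula gives roughly 0.02, which is the overestimation described above.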

I'm happy to contribute the SDK side of this (types, update_llm_span, the
_convert_span_to_api_span mapping, OTel exporter, extractors) as a PR if the
/v1/traces schema and cost computation can be extended to accept it.

Describe alternatives you've considered

  • Passing a weighted effective cost_per_input_token per call (exact cost, misleading
    unit rate, requires maintaining a provider price table client-side).
  • Recording cached tokens as span metadata (what the OpenAI Agents integration does
    today; visible on the trace but excluded from cost/token analytics).
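
For reference, the first alternative (a weighted effective rate) works out as in the sketch below. The helper name is hypothetical and the prices are illustrative placeholders, assuming inclusive counting.

```python
def effective_input_rate(
    input_token_count: int,
    cached_input_token_count: int,
    cost_per_input_token: float,
    cost_per_cached_input_token: float,
) -> float:
    """Blend the two input prices into one per-token rate for a single call."""
    if input_token_count == 0:
        return cost_per_input_token  # nothing to weight
    uncached = input_token_count - cached_input_token_count
    blended = (uncached * cost_per_input_token
               + cached_input_token_count * cost_per_cached_input_token)
    return blended / input_token_count


# Illustrative prices only; the client must keep this table up to date itself.
P_IN, P_CACHED = 2.5e-6, 1.25e-6

rate = effective_input_rate(10_000, 8_000, P_IN, P_CACHED)
input_cost = 10_000 * rate  # what input_token_count × cost_per_input_token yields
```

The input cost comes out exact, but the reported rate (about 1.5e-6 here) matches neither the real input price nor the cached price, and it changes on every call with the cache hit rate.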

Additional context

Verified against deepeval 4.0.4 and current main (deepeval/tracing/types.py,
deepeval/tracing/api.py, deepeval/tracing/otel/exporter.py). Use case: a WhatsApp
customer-support agent on Azure OpenAI where system prompt + tool schemas dominate input
and cache hit rates are high, making the dashboard cost consistently inflated.
