🧠 Feature: Token Bloat Detector
Problem
We’re losing 10–30% of LLM spend to silent token inefficiencies — verbose generations, over-sized context windows, retry storms, and model misuse (e.g. GPT-4 where GPT-3.5 would suffice).
Existing observability tools show token counts; they don’t explain waste or enforce prevention. CrashLens needs to detect, attribute, and enforce token efficiency as policy.
Goals
- Identify and quantify “token bloat” across requests, routes, and models.
- Attribute waste to specific prompt templates, model versions, or user flows.
- Auto-suggest fixes and integrate with CI/CD or runtime policy enforcement.
- Output audit-grade financial reports showing cost impact per % of bloat.
Detection Logic (initial heuristics)
- Compare actual tokens used vs. expected tokens (based on prompt length and model efficiency benchmarks).
- Flag overage thresholds:
  - 25% deviation = warning
  - 50% deviation = violation
- Classify root causes:
  - Prompt verbosity – unnecessary or repetitive phrasing.
  - Context overreach – excessive retrieval/context size.
  - Model misuse – a high-cost model where a smaller model suffices.
  - Retry storms – repeated completions for the same intent.
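The deviation thresholds above could be sketched as a small classifier. This is an illustrative sketch only; the function name and return labels are assumptions, not the CrashLens API.

```python
def classify_bloat(actual_tokens: int, expected_tokens: int) -> str:
    """Classify a request by how far actual token usage deviates from
    the expected baseline: >25% = warning, >50% = violation."""
    if expected_tokens <= 0:
        raise ValueError("expected_tokens must be positive")
    # Relative overage vs. the expected-token baseline.
    deviation = (actual_tokens - expected_tokens) / expected_tokens
    if deviation > 0.50:
        return "violation"
    if deviation > 0.25:
        return "warning"
    return "ok"
```

A request that used 130 tokens against an expected 100 would land in the warning band; 160 against 100 would be a violation.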
Output & Integration
- CLI & API endpoints:
  - `crashlens detect --bloat` → outputs a JSON report per route.
  - `crashlens enforce --policy token_bloat.yaml` → blocks CI/CD merges if the threshold is exceeded.
- Dashboard view: cost leakage chart (% of spend wasted).
- Optional webhook → send alerts to Slack or Grafana.
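One possible shape for the per-route JSON report that `crashlens detect --bloat` could emit is sketched below; every field name here is an assumption for illustration, not a committed schema.

```python
import json

def build_report(route: str, actual_tokens: int, expected_tokens: int,
                 cost_per_1k_tokens: float) -> dict:
    """Build a per-route bloat report (hypothetical schema)."""
    # Only count tokens above the expected baseline as waste.
    wasted = max(actual_tokens - expected_tokens, 0)
    return {
        "route": route,
        "actual_tokens": actual_tokens,
        "expected_tokens": expected_tokens,
        "bloat_pct": round(100 * wasted / expected_tokens, 1),
        "wasted_cost_usd": round(wasted / 1000 * cost_per_1k_tokens, 4),
    }

print(json.dumps(build_report("/chat/summarize", 1500, 1000, 0.03), indent=2))
```

Keeping cost attribution (`wasted_cost_usd`) in the same record as the deviation makes the audit-grade financial reporting goal a straight aggregation over routes.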
Example Policy
```yaml
policies:
  - id: token_bloat_limit
    threshold: 25
    action: block
    message: "Token usage exceeds expected by >25%."
```
Metrics of Success
- Detect 90%+ of token inefficiencies with <5% false positives.
- Reduce token spend by 15–20% in pilot customers.
- Median detection latency <2s for 1M+ log entries.
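As a concrete reference for how `crashlens enforce` might apply the example policy in CI, here is a minimal sketch. The policy dict mirrors the YAML above (file loading, e.g. via PyYAML, is omitted), and all names and the exit-code convention are assumptions.

```python
# Mirrors one entry of the example token_bloat.yaml policy file.
policy = {
    "id": "token_bloat_limit",
    "threshold": 25,      # max allowed deviation, in percent
    "action": "block",
    "message": "Token usage exceeds expected by >25%.",
}

def enforce(bloat_pct: float, policy: dict) -> int:
    """Return a CI exit code: 0 to pass, 1 to block the merge."""
    if bloat_pct > policy["threshold"] and policy["action"] == "block":
        print(policy["message"])
        return 1
    return 0
```

With this convention, a route measured at 30% bloat would fail the merge gate, while 25% or below passes.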
Open Questions
- Which model metadata (usage, retries, latency) is exposed per vendor?
- How do we baseline “expected token use” per model-family?
- Should detection run inline (real-time) or as nightly batch job?
- How do we integrate fixes with OPA policy layer seamlessly?