Area(s)
No response
What's missing?
vLLM has deprecated the Time Per Output Token (TPOT) metric, and the recommendation is to use the Inter-Token Latency (ITL) metric instead.
https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/
Other engines, such as TGI and NIM, also support the ITL metric.
However, the Generative AI model server/client metrics semantic convention does not include an Inter-Token Latency metric.
Describe the solution you'd like
Since the two metrics are defined differently, I'd like to hear the community's thoughts:

- Can we consider adding ITL to the GenAI metrics semantic convention?
- If not, can ITL and TPOT be used interchangeably?
- Was there a reason ITL was not added to the convention before?
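To make the definitional difference concrete, here is a minimal sketch based on the commonly cited definitions (the function names and formulas below are illustrative assumptions, not taken from any spec): TPOT is typically one averaged value per request over the decode phase, while ITL is one sample per gap between consecutive tokens, so it preserves the latency distribution.

```python
def tpot(ttft: float, e2e_latency: float, num_output_tokens: int) -> float:
    """Time Per Output Token (common definition): average decode-phase
    latency per token, excluding the first token (covered by TTFT).
    One value per request."""
    return (e2e_latency - ttft) / (num_output_tokens - 1)

def itl(token_timestamps: list[float]) -> list[float]:
    """Inter-Token Latency (common definition): the gap between each
    pair of consecutive tokens. One sample per gap, so percentiles
    (p50/p99) can be computed over the distribution."""
    return [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]

# Hypothetical request: first token at t=0.5 (TTFT), last token at t=1.0,
# 4 output tokens emitted at these timestamps:
timestamps = [0.5, 0.6, 0.9, 1.0]
print(tpot(ttft=0.5, e2e_latency=1.0, num_output_tokens=4))  # mean per-token time
print(itl(timestamps))  # individual gaps; the 0.3s stall stays visible
```

Note that the mean of the ITL samples equals TPOT for the same request, but TPOT alone hides tail behavior such as the 0.3 s stall above, which is one reason the two are not trivially interchangeable.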