feat: add inter-chunk-latency and tokens-per-chunk metrics #291
Conversation
ajcasagrande commented on Sep 23, 2025
Pull Request Overview
This PR implements two new streaming metrics for measuring token generation patterns and response timing. The metrics help analyze streaming performance characteristics by measuring the average tokens delivered per response chunk and the average time between consecutive responses.
- Add `TokensPerChunkMetric` to calculate average tokens per response chunk
- Add `InterChunkLatencyMetric` to measure average time between consecutive responses
- Comprehensive test coverage for both metrics, including edge cases and error conditions
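A minimal, hypothetical sketch of the two per-request computations follows (this is not the PR's actual class structure; the record type, field names, and helper functions are assumed for illustration only):

```python
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    """Hypothetical record for one streaming response chunk."""
    timestamp_ns: int  # arrival time of the chunk, in nanoseconds
    num_tokens: int    # tokens decoded from this chunk


def tokens_per_chunk(chunks: list[ChunkRecord]) -> float:
    """Average tokens delivered per response chunk for one request."""
    if not chunks:
        raise ValueError("cannot compute tokens-per-chunk with no responses")
    return sum(c.num_tokens for c in chunks) / len(chunks)


def inter_chunk_latency(chunks: list[ChunkRecord]) -> float:
    """Average time between consecutive response chunks for one request."""
    if len(chunks) < 2:
        raise ValueError("need at least two responses to compute inter-chunk latency")
    gaps = [b.timestamp_ns - a.timestamp_ns for a, b in zip(chunks, chunks[1:])]
    return sum(gaps) / len(gaps)
```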
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| aiperf/metrics/types/tokens_per_chunk.py | Implements TokensPerChunkMetric to calculate average tokens per response chunk |
| aiperf/metrics/types/inter_chunk_latency_metric.py | Implements InterChunkLatencyMetric to measure average latency between consecutive responses |
| tests/metrics/test_tokens_per_chunk.py | Comprehensive test suite for TokensPerChunkMetric including edge cases and error handling |
| tests/metrics/test_inter_chunk_latency_metric.py | Test suite for InterChunkLatencyMetric covering various scenarios and error conditions |
Codecov Report: ❌ Patch coverage is …
Not sure I would define tokens per chunk this way. I would do `(OSL - 1) / (count(responses) - 1)`.
Hmm, but wouldn't that skew the results if the first chunk has more than one token? @IzzyPutterman
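To make the concern concrete, here is a toy comparison of the two formulations (the numbers are illustrative assumptions, not from the PR):

```python
# Toy numbers: 32 output tokens (OSL) streamed in 8 chunks, where the first
# chunk carried 4 tokens and the remaining 7 chunks carried 4 tokens each.
osl, num_chunks, first_chunk_tokens = 32, 8, 4

naive = osl / num_chunks                   # 32 / 8  = 4.0 tokens per chunk
excl_first = (osl - 1) / (num_chunks - 1)  # 31 / 7 ≈ 4.43, skewed upward
                                           # because the first chunk actually
                                           # carried more than one token
```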
Also, is inter-chunk latency aggregated per request, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric.
Yes, that could happen. I think many inference frameworks return the first token immediately to keep the TTFT reasonable. The equation could be generalized to account for the number of tokens in that first chunk.
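A sketch of that generalization, assuming the first chunk's token count can be measured (this is an assumption based on the comment above, not a settled definition):

```python
def tokens_per_chunk_excluding_first(
    osl: int, num_chunks: int, first_chunk_tokens: int
) -> float:
    """Subtract the measured size of the first chunk instead of assuming 1 token."""
    if num_chunks < 2:
        raise ValueError("need at least two chunks to exclude the first one")
    return (osl - first_chunk_tokens) / (num_chunks - 1)


# With the toy numbers above: (32 - 4) / (8 - 1) == 4.0
```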
Right now it is implemented the same way as ITL: you get a single average value for each request, which is then aggregated across all requests into a single avg/min/max/p90/pXX for the whole run. (As is the case for all metrics extending …
Not 100% set in my current reasoning, but I think not doing per-request aggregation is better: keep a list of all ICLs across all requests, then take quantiles etc. from that. By collapsing to per-request values we lose some information.
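A rough illustration of the two aggregation strategies under discussion (function names and data shapes are hypothetical):

```python
import statistics


def per_request_aggregation(icls_by_request: list[list[float]]) -> list[float]:
    """Current ITL-style approach: one mean ICL per request, then percentiles
    are taken over those per-request means."""
    per_request_means = [statistics.mean(icls) for icls in icls_by_request if icls]
    return statistics.quantiles(per_request_means, n=100)


def pooled_aggregation(icls_by_request: list[list[float]]) -> list[float]:
    """Suggested alternative: keep every individual inter-chunk latency and
    take percentiles over the combined list."""
    all_icls = [icl for icls in icls_by_request for icl in icls]
    return statistics.quantiles(all_icls, n=100)
```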
Gotcha, ya. I was thinking of a meaningful way we could keep all the data but still aggregate it somehow at the end.
the-david-oy left a comment
Looks good, pending the conversation with Izzy.
It looks like one of the error conditions doesn't have a unit test, so it'd be good to add that. What happens when a metric returns an error? For example, if a user runs this with an OSL distribution that sometimes produces only one chunk (e.g. an OSL of 1 is valid), what happens?
Somehow I lost my comment on this. Essentially, if a metric is not able to be computed due to insufficient data, it should raise a …
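For illustration, a hypothetical guard along those lines (the actual exception class name is cut off above, so the one used here is invented):

```python
class InsufficientStreamingDataError(Exception):
    """Hypothetical exception signalling a metric could not be computed."""


def parse_inter_chunk_latency(timestamps_ns: list[int]) -> float:
    # e.g. an OSL of 1 can yield a single chunk, leaving no gaps to average
    if len(timestamps_ns) < 2:
        raise InsufficientStreamingDataError(
            "inter-chunk latency requires at least two response chunks"
        )
    gaps = [b - a for a, b in zip(timestamps_ns, timestamps_ns[1:])]
    return sum(gaps) / len(gaps)
```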
Perfect, thanks for answering the above! That approach sounds great.
Closing in favor of #293. Also, Tokens Per Chunk is still under discussion and not yet confirmed, so holding off on it for now.