
Conversation

@ajcasagrande
Contributor

Inter Chunk Latency = Average time between consecutive responses
Tokens Per Chunk = Output Sequence Length / Count(Responses)
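
For illustration, a minimal sketch of these two definitions (hypothetical standalone helpers, not the PR's actual metric classes):

```python
# Sketch only: helpers mirroring the definitions above.
# Timestamps are in seconds; "responses" are streamed chunks.

def inter_chunk_latency(response_times: list[float]) -> float:
    """Average time between consecutive responses (needs >= 2 responses)."""
    gaps = [b - a for a, b in zip(response_times, response_times[1:])]
    return sum(gaps) / len(gaps)

def tokens_per_chunk(output_sequence_length: int, response_count: int) -> float:
    """Output Sequence Length / Count(Responses)."""
    return output_sequence_length / response_count
```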

Copilot AI left a comment

Pull Request Overview

This PR implements two new streaming metrics for measuring token generation patterns and response timing. The metrics help analyze streaming performance characteristics by measuring the average tokens delivered per response chunk and the average time between consecutive responses.

  • Add TokensPerChunkMetric to calculate average tokens per response chunk
  • Add InterChunkLatencyMetric to measure average time between consecutive responses
  • Comprehensive test coverage for both metrics including edge cases and error conditions

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| aiperf/metrics/types/tokens_per_chunk.py | Implements TokensPerChunkMetric to calculate average tokens per response chunk |
| aiperf/metrics/types/inter_chunk_latency_metric.py | Implements InterChunkLatencyMetric to measure average latency between consecutive responses |
| tests/metrics/test_tokens_per_chunk.py | Comprehensive test suite for TokensPerChunkMetric including edge cases and error handling |
| tests/metrics/test_inter_chunk_latency_metric.py | Test suite for InterChunkLatencyMetric covering various scenarios and error conditions |


@codecov

codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 95.55556% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| aiperf/metrics/types/tokens_per_chunk.py | 90.47% | 1 Missing and 1 partial ⚠️ |


@IzzyPutterman

> Inter Chunk Latency = Average time between consecutive responses
> Tokens Per Chunk = Output Sequence Length / Count(Responses)

Not sure I would define Tokens per chunk this way. I would do (OSL-1)/(Count(responses)-1)
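
To make the difference concrete (numbers purely illustrative):

```python
osl, n_chunks = 100, 25
naive = osl / n_chunks                  # 100 / 25 = 4.0
itl_style = (osl - 1) / (n_chunks - 1)  # 99 / 24 = 4.125
# The second form drops the first token and first chunk, presumably
# analogous to how inter-token latency excludes TTFT from its average.
```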

@ajcasagrande
Contributor Author

> Not sure I would define Tokens per chunk this way. I would do (OSL-1)/(Count(responses)-1)

Hmm, but wouldn't that skew the results if the first chunk has >1 tokens? @IzzyPutterman

@IzzyPutterman

Also, is inter-chunk latency request-aggregated, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric

@IzzyPutterman

> Not sure I would define Tokens per chunk this way. I would do (OSL-1)/(Count(responses)-1)
>
> Hmm, but wouldn't that skew the results if the first chunk has >1 tokens? @IzzyPutterman

Yes, that could happen. I think many inference frameworks return the first token immediately to keep the TTFT reasonable. The equation could be generalized to attempt to count the number of tokens in that first chunk
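
A sketch of that generalization, assuming the first chunk's token count can be measured (hypothetical helper, not part of this PR):

```python
def tokens_per_chunk_after_first(osl: int, first_chunk_tokens: int,
                                 response_count: int) -> float:
    # Exclude the first chunk entirely: frameworks often flush it
    # immediately to keep TTFT low, so it would skew the average.
    if response_count < 2:
        raise ValueError("need at least two responses")
    return (osl - first_chunk_tokens) / (response_count - 1)
```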

@ajcasagrande
Contributor Author

ajcasagrande commented Sep 23, 2025

> Also, is inter-chunk latency request-aggregated, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric

Right now it is implemented the same way as ITL. You get a single average value for each request, which is then aggregated across all the requests into a single avg/min/max/p90/pXX for the whole run. (As is the case for all metrics extending BaseRecordMetric.) However, that will be adjusted as part of the benchmark-interval time-slicing of metrics story.
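
Roughly, the current per-request shape (a sketch, not the actual BaseRecordMetric code):

```python
def per_request_icl_averages(requests: list[list[float]]) -> list[float]:
    """One average inter-chunk latency value per request."""
    averages = []
    for times in requests:  # each entry: response timestamps for one request
        gaps = [b - a for a, b in zip(times, times[1:])]
        averages.append(sum(gaps) / len(gaps))
    return averages

# Run-level avg/min/max/p90/pXX are then computed over these
# per-request averages.
per_request = per_request_icl_averages([[0.0, 0.1, 0.3], [0.0, 0.2, 0.4, 0.5]])
avg = sum(per_request) / len(per_request)
lo, hi = min(per_request), max(per_request)
```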

@IzzyPutterman

> Also, is inter-chunk latency request-aggregated, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric
>
> Right now it is implemented the same way as ITL. You get a single average value for each request, which is then aggregated across all the requests into a single avg/min/max/p90/pXX for the whole run.

Not 100% set in my current reasoning, but I think not doing request aggregation is better: keep a list of all ICLs across all requests, then take quantiles etc. from that. By collapsing to per-request averages we lose some information
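
A sketch of that flat alternative, pooling every gap before taking quantiles:

```python
def pooled_icls(requests: list[list[float]]) -> list[float]:
    """Every inter-chunk gap from every request, in one flat list."""
    return [b - a
            for times in requests
            for a, b in zip(times, times[1:])]

# Quantiles computed over the pooled gaps keep every sample: long
# requests contribute more points, and nothing is averaged away
# per request.
```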

@ajcasagrande
Contributor Author

> Also, is inter-chunk latency request-aggregated, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric
>
> Right now it is implemented the same way as ITL. You get a single average value for each request, which is then aggregated across all the requests into a single avg/min/max/p90/pXX for the whole run.
>
> Not 100% set in my current reasoning, but I think not doing request aggregation is better: keep a list of all ICLs across all requests, then take quantiles etc. from that. By collapsing to per-request averages we lose some information

Gotcha, ya. Was thinking of a meaningful way we could keep all the data but still aggregate it somehow at the end.

@the-david-oy
Contributor

Looks good, pending the conversation with Izzy.

It looks like one of the error conditions doesn't have a unit test, so it'd be good to add that. What happens when a metric returns an error, e.g. if a user runs this with an OSL distribution where a single chunk can sometimes occur (an OSL of 1 is valid)?

@ajcasagrande
Contributor Author

> Looks good, pending the conversation with Izzy.
>
> It looks like one of the error conditions doesn't have a unit test, so it'd be good to add that. What happens when a metric returns an error, e.g. if a user runs this with an OSL distribution where a single chunk can sometimes occur (an OSL of 1 is valid)?

Somehow I lost my comment on this. Essentially, if a metric cannot be computed due to insufficient data, it should raise a NoMetricValue, which causes it to be skipped and no value to be added for that record. If, however, there is sufficient data but the data is bad, it should raise a ValueError, which will also cause it not to be added for that record; I believe it will log a warning as well.
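
A sketch of that contract (the exception name is taken from the comment above; the actual aiperf signatures may differ):

```python
class NoMetricValue(Exception):
    """Raised when a metric cannot be computed from the available data."""

def compute_icl(response_times: list[float]) -> float:
    if len(response_times) < 2:
        # Insufficient data: the record is skipped, no value added, no warning.
        raise NoMetricValue("need at least two responses")
    gaps = [b - a for a, b in zip(response_times, response_times[1:])]
    if any(g < 0 for g in gaps):
        # Sufficient but bad data: the value is skipped and a warning is logged.
        raise ValueError("timestamps are not monotonically increasing")
    return sum(gaps) / len(gaps)
```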

@the-david-oy
Contributor

> Looks good, pending the conversation with Izzy.
> It looks like one of the error conditions doesn't have a unit test, so it'd be good to add that. What happens when a metric returns an error, e.g. if a user runs this with an OSL distribution where a single chunk can sometimes occur (an OSL of 1 is valid)?
>
> Somehow I lost my comment on this. Essentially, if a metric cannot be computed due to insufficient data, it should raise a NoMetricValue, which causes it to be skipped and no value to be added for that record. If, however, there is sufficient data but the data is bad, it should raise a ValueError, which will also cause it not to be added for that record; I believe it will log a warning as well.

Perfect, thanks for answering the above! That approach sounds great.

@ajcasagrande
Contributor Author

Closing in favor of #293.

Also, Tokens Per Chunk is still under discussion and not yet confirmed, so holding off for now.

@ajcasagrande deleted the ajc/inter-chunk-latency branch October 18, 2025 19:05