
Conversation

@ajcasagrande
Contributor

Inter Chunk Latency = Average time between consecutive responses
Tokens Per Chunk = Output Sequence Length / Count(Responses)
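
For illustration, a minimal sketch of these two definitions (hypothetical standalone helpers, not the PR's actual metric classes):

```python
# Sketch only: helpers mirroring the definitions above.
# Timestamps are in seconds; "responses" are streamed chunks.

def inter_chunk_latency(response_times: list[float]) -> float:
    """Average time between consecutive responses (needs >= 2 responses)."""
    gaps = [b - a for a, b in zip(response_times, response_times[1:])]
    return sum(gaps) / len(gaps)

def tokens_per_chunk(output_sequence_length: int, response_count: int) -> float:
    """Output Sequence Length / Count(Responses)."""
    return output_sequence_length / response_count
```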

Copilot AI left a comment

Pull Request Overview

This PR implements two new streaming metrics for measuring token generation patterns and response timing. The metrics help analyze streaming performance characteristics by measuring the average tokens delivered per response chunk and the average time between consecutive responses.

  • Add TokensPerChunkMetric to calculate average tokens per response chunk
  • Add InterChunkLatencyMetric to measure average time between consecutive responses
  • Comprehensive test coverage for both metrics including edge cases and error conditions

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| aiperf/metrics/types/tokens_per_chunk.py | Implements TokensPerChunkMetric to calculate average tokens per response chunk |
| aiperf/metrics/types/inter_chunk_latency_metric.py | Implements InterChunkLatencyMetric to measure average latency between consecutive responses |
| tests/metrics/test_tokens_per_chunk.py | Comprehensive test suite for TokensPerChunkMetric including edge cases and error handling |
| tests/metrics/test_inter_chunk_latency_metric.py | Test suite for InterChunkLatencyMetric covering various scenarios and error conditions |


@codecov

codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 95.55556% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| aiperf/metrics/types/tokens_per_chunk.py | 90.47% | 1 Missing and 1 partial ⚠️ |


@IzzyPutterman

> Inter Chunk Latency = Average time between consecutive responses
> Tokens Per Chunk = Output Sequence Length / Count(Responses)

Not sure I would define Tokens per chunk this way. I would do (OSL-1)/(Count(responses)-1)
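
To make the difference concrete (numbers purely illustrative):

```python
osl, n_chunks = 100, 25
naive = osl / n_chunks                  # 100 / 25 = 4.0
itl_style = (osl - 1) / (n_chunks - 1)  # 99 / 24 = 4.125
# The second form drops the first token and first chunk, presumably
# analogous to how inter-token latency excludes TTFT from its average.
```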

@ajcasagrande
Contributor Author

> Not sure I would define Tokens per chunk this way. I would do (OSL-1)/(Count(responses)-1)

Hmm, but wouldn't that skew the results if the first chunk has >1 tokens? @IzzyPutterman

@IzzyPutterman

Also, is inter-chunk latency request-aggregated, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric

@IzzyPutterman

> Not sure I would define Tokens per chunk this way. I would do (OSL-1)/(Count(responses)-1)
>
> Hmm, but wouldn't that skew the results if the first chunk has >1 tokens? @IzzyPutterman

Yes, that could happen. I think many inference frameworks return the first token immediately to keep the TTFT reasonable. The equation could be generalized to attempt to count the number of tokens in that first chunk
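
A sketch of that generalization, assuming the first chunk's token count can be measured (hypothetical helper, not part of this PR):

```python
def tokens_per_chunk_after_first(osl: int, first_chunk_tokens: int,
                                 response_count: int) -> float:
    # Exclude the first chunk entirely: frameworks often flush it
    # immediately to keep TTFT low, so it would skew the average.
    if response_count < 2:
        raise ValueError("need at least two responses")
    return (osl - first_chunk_tokens) / (response_count - 1)
```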

@ajcasagrande
Contributor Author

ajcasagrande commented Sep 23, 2025

> Also, is inter-chunk latency request-aggregated, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric

Right now it is implemented the same way as ITL. You get a single average value for each request, which is then aggregated across all the requests into a single avg/min/max/p90/pXX for the whole run. (As is the case for all metrics extending BaseRecordMetric.) However, that will be adjusted as part of the benchmark-interval time-slicing of metrics story.
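
Roughly, the current per-request shape (a sketch, not the actual BaseRecordMetric code):

```python
def per_request_icl_averages(requests: list[list[float]]) -> list[float]:
    """One average inter-chunk latency value per request."""
    averages = []
    for times in requests:  # each entry: response timestamps for one request
        gaps = [b - a for a, b in zip(times, times[1:])]
        averages.append(sum(gaps) / len(gaps))
    return averages

# Run-level avg/min/max/p90/pXX are then computed over these
# per-request averages.
per_request = per_request_icl_averages([[0.0, 0.1, 0.3], [0.0, 0.2, 0.4, 0.5]])
avg = sum(per_request) / len(per_request)
lo, hi = min(per_request), max(per_request)
```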

@IzzyPutterman

> Also, is inter-chunk latency request-aggregated, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric
>
> Right now it is implemented the same way as ITL. You get a single average value for each request, which is then aggregated across all the requests into a single avg/min/max/p90/pXX for the whole run.

Not 100% set in my current reasoning, but I think not doing request aggregation is better: keep a list of all ICLs across all requests, then take quantiles etc. from that. By collapsing to per-request averages we lose some information
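
A sketch of that flat alternative, pooling every gap before taking quantiles:

```python
def pooled_icls(requests: list[list[float]]) -> list[float]:
    """Every inter-chunk gap from every request, in one flat list."""
    return [b - a
            for times in requests
            for a, b in zip(times, times[1:])]

# Quantiles computed over the pooled gaps keep every sample: long
# requests contribute more points, and nothing is averaged away
# per request.
```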

@ajcasagrande
Contributor Author

> Also, is inter-chunk latency request-aggregated, or are you aggregating over all requests? I think doing it over all requests might be better, as it's more of a server metric
>
> Right now it is implemented the same way as ITL. You get a single average value for each request, which is then aggregated across all the requests into a single avg/min/max/p90/pXX for the whole run.
>
> Not 100% set in my current reasoning, but I think not doing request aggregation is better: keep a list of all ICLs across all requests, then take quantiles etc. from that. By collapsing to per-request averages we lose some information

Gotcha, ya. Was thinking of a meaningful way we could keep all the data but still aggregate it somehow at the end.

@the-david-oy
Contributor

Looks good, pending the conversation with Izzy.

It looks like one of the error conditions doesn't have a unit test, so it'd be good to add that. What happens when a metric returns an error, e.g. if a user runs this with an OSL distribution where a single chunk can sometimes occur (an OSL of 1 is valid)?

@ajcasagrande
Contributor Author

> Looks good, pending the conversation with Izzy.
>
> It looks like one of the error conditions doesn't have a unit test, so it'd be good to add that. What happens when a metric returns an error, e.g. if a user runs this with an OSL distribution where a single chunk can sometimes occur (an OSL of 1 is valid)?

Somehow I lost my comment on this. Essentially, if a metric cannot be computed due to insufficient data, it should raise a NoMetricValue, which causes it to be skipped and no value to be added for that record. If, however, there is sufficient data but the data is bad, it should raise a ValueError, which will also cause it not to be added for that record; I believe it will log a warning as well.
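
A sketch of that contract (the exception name is taken from the comment above; the actual aiperf signatures may differ):

```python
class NoMetricValue(Exception):
    """Raised when a metric cannot be computed from the available data."""

def compute_icl(response_times: list[float]) -> float:
    if len(response_times) < 2:
        # Insufficient data: the record is skipped, no value added, no warning.
        raise NoMetricValue("need at least two responses")
    gaps = [b - a for a, b in zip(response_times, response_times[1:])]
    if any(g < 0 for g in gaps):
        # Sufficient but bad data: the value is skipped and a warning is logged.
        raise ValueError("timestamps are not monotonically increasing")
    return sum(gaps) / len(gaps)
```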

@the-david-oy
Contributor

> Looks good, pending the conversation with Izzy.
> It looks like one of the error conditions doesn't have a unit test, so it'd be good to add that. What happens when a metric returns an error, e.g. if a user runs this with an OSL distribution where a single chunk can sometimes occur (an OSL of 1 is valid)?
>
> Somehow I lost my comment on this. Essentially, if a metric cannot be computed due to insufficient data, it should raise a NoMetricValue, which causes it to be skipped and no value to be added for that record. If, however, there is sufficient data but the data is bad, it should raise a ValueError, which will also cause it not to be added for that record; I believe it will log a warning as well.

Perfect, thanks for answering the above! That approach sounds great.

@ajcasagrande
Contributor Author

Closing in favor of #293.

Also, Tokens Per Chunk is still under discussion and not yet confirmed, so holding off for now.

@ajcasagrande deleted the ajc/inter-chunk-latency branch October 18, 2025 19:05