Commit 5663e16

Exclude yield/reply time from first token latency metric (opea-project#973)
While the metrics are accurate for a small number of requests, when the megaservice was handling many (hundreds of) _parallel_ requests, it reported a clearly (~10%) larger first token latency than what the client receiving the tokens from the megaservice measured.

Taking the timestamp before the token is yielded means that the reported first token latency can be slightly shorter than it actually is. However, testing with ChatQnA shows the latencies to be clearly closer to those seen by the client (within a couple of percent) and typically smaller, i.e. logical.

PS. Taking the metrics timestamp after yielding the token meant that the time spent sending the reply to the client, and waiting for that send to complete, was also included in the token time. I suspect that with many parallel requests, processing often switched to other megaservice request-processing threads, and getting control back to the yielding thread for timing could be delayed much longer than sending the response to the client took.

Signed-off-by: Eero Tamminen <[email protected]>
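The fix can be illustrated with a minimal, self-contained sketch. The generator shape and the `token_update` name follow the patch, but the `Metrics` class here is a toy stand-in (an assumption, not the orchestrator's actual implementation): the point is that the timestamp must be taken before `yield`, because `yield` suspends the generator until the consumer, which may be busy sending the chunk to the client while other request threads run, asks for the next item.

```python
import time


class Metrics:
    """Toy stand-in for the megaservice metrics object.
    Assumption: like the real token_update, it records the first-token
    latency and returns a fresh timestamp for the next interval."""

    def __init__(self):
        self.first_token_latency = None

    def token_update(self, token_start, is_first):
        now = time.time()
        if is_first:
            self.first_token_latency = now - token_start
        return now


def generate(chunks, metrics):
    token_start = time.time()
    is_first = True
    for chunk in chunks:
        # Measure BEFORE yielding: once we yield, control leaves this
        # generator until the consumer has handled the chunk (e.g. sent
        # it over the network), so measuring afterwards would include
        # that send time in the token latency.
        token_start = metrics.token_update(token_start, is_first)
        yield chunk
        is_first = False


metrics = Metrics()
for chunk in generate(["a", "b", "c"], metrics):
    time.sleep(0.01)  # simulate a slow client / network send
```

With the previous ordering (yield first, then `token_update`), the simulated 10 ms per-chunk send time would have been folded into `first_token_latency`; with the measurement taken before the yield, it is not.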
1 parent 3328ea3 commit 5663e16

File tree

1 file changed, 2 insertions(+), 2 deletions(-)

comps/cores/mega/orchestrator.py

Lines changed: 2 additions & 2 deletions
@@ -237,8 +237,8 @@ def generate():
                     )
                     token_start = time.time()
                 else:
-                    yield chunk
                     token_start = self.metrics.token_update(token_start, is_first)
+                    yield chunk
                 is_first = False
         self.metrics.request_update(req_start)
         self.metrics.pending_update(False)
@@ -306,7 +306,7 @@ def token_generator(self, sentence: str, token_start: float, is_first: bool, is_
         suffix = "\n\n"
         tokens = re.findall(r"\s?\S+\s?", sentence, re.UNICODE)
         for token in tokens:
-            yield prefix + repr(token.replace("\\n", "\n").encode("utf-8")) + suffix
             token_start = self.metrics.token_update(token_start, is_first)
+            yield prefix + repr(token.replace("\\n", "\n").encode("utf-8")) + suffix
         if is_last:
             yield "data: [DONE]\n\n"
