Conversation
- Introduced a new `InferenceMetricsTracker` class in both backend and frontend to accurately track inference performance metrics, including Time to First Token (TTFT) and Time Per Output Token (TPOT); a minimal sketch of these two metrics follows this list.
- Updated `stream_to_cloud_model` and `stream_response_from_external_api` functions to utilize the new metrics tracker for logging and calculating performance statistics.
- Enhanced frontend components to display detailed inference statistics, including client-side TTFT and per-token timing metrics.
- Added support for progressive statistics during streaming, allowing real-time performance monitoring.
- Updated related types to accommodate new metrics and statistics structures.
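For readers new to these metrics, the sketch below shows one way TTFT and TPOT can be derived from token arrival timestamps. The class and method names are hypothetical and are not the PR's actual `InferenceMetricsTracker` implementation.

```python
# Minimal sketch only: hypothetical names, not the PR's InferenceMetricsTracker.
import time
from typing import List


class SimpleTokenTimer:
    def __init__(self) -> None:
        self.start = time.monotonic()
        self.token_times: List[float] = []

    def record_token(self) -> None:
        # Record the arrival time of each streamed token.
        self.token_times.append(time.monotonic())

    def ttft(self) -> float:
        # Time to First Token: first token arrival minus request start.
        return self.token_times[0] - self.start if self.token_times else 0.0

    def tpot(self) -> float:
        # Time Per Output Token: average gap between consecutive tokens.
        if len(self.token_times) < 2:
            return 0.0
        gaps = [b - a for a, b in zip(self.token_times, self.token_times[1:])]
        return sum(gaps) / len(gaps)
```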
Pull request overview
This PR implements comprehensive metrics improvements for inference performance tracking, adding both client-side and backend measurement capabilities, along with UI enhancements for displaying "thinking" processes in LLM responses.
Changes:
- Adds new metrics tracking infrastructure on both frontend (TypeScript) and backend (Python) to measure TTFT, TPOT, network latency, and per-token timing
- Implements thinking block visualization in the streaming message UI, allowing users to see and toggle the LLM's reasoning process
- Enhances the InferenceStats component to display new metrics including client TTFT, network latency, and per-token timing statistics (see the sketch after this list for one way the network-latency figure can relate to the two TTFT measurements)
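For context, one plausible way a network-latency estimate could be derived from the client-side and backend TTFT measurements is sketched below. This is an illustrative assumption, not necessarily the formula this PR uses.

```python
# Illustrative assumption only; the PR may derive network latency differently.
def estimate_network_latency(client_ttft_s: float, backend_ttft_s: float) -> float:
    # Time the first token spent outside the model server (network transit,
    # proxying, client-side parsing). Clamped at zero to absorb clock skew.
    return max(0.0, client_ttft_s - backend_ttft_s)
```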
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 15 comments.
Summary per file
| File | Description |
|---|---|
| app/frontend/tsconfig.json | Adds explicit type declarations for vite/client, react, and react-dom |
| app/frontend/src/components/chatui/types.ts | Defines new interfaces for token timestamps, progressive stats, and enhanced inference metrics |
| app/frontend/src/components/chatui/metricsTracker.ts | Implements client-side metrics tracker for measuring TTFT, recording token timestamps, and calculating network latency |
| app/frontend/src/components/chatui/runInference.ts | Integrates metrics tracker into inference flow to capture timing data during streaming |
| app/frontend/src/components/chatui/StreamingMessage.tsx | Adds thinking block detection, extraction, and visualization with collapsible UI |
| app/frontend/src/components/chatui/InferenceStats.tsx | Enhances stats display with new client-side metrics, network latency, and per-token timing percentiles |
| app/backend/model_control/metrics_tracker.py | Implements backend metrics tracker with TTFT/TPOT calculation and per-token timing |
| app/backend/model_control/model_utils.py | Integrates backend metrics tracker into streaming functions for both cloud and external API calls |
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 9 comments.

    // SPDX-FileCopyrightText: © 2025 Tenstorrent AI ULC
    // SPDX-FileCopyrightText: © 2025 Tenstorrent AI ULC
    // SPDX-FileCopyrightText: © 2026 Tenstorrent AI ULC
    // SPDX-FileCopyrightText: © 2026 Tenstorrent AI ULC
The file header contains the SPDX-FileCopyrightText line twice. This can cause header/CI checks to fail and should be deduplicated to a single line.
Suggested change:

    // SPDX-FileCopyrightText: © 2026 Tenstorrent AI ULC

      continue; // Skip processing this chunk as part of the generated text
    }
    // Handle final statistics from backend (after [DONE])
    if (!isAgentSelected && jsonData.ttft && jsonData.tpot) {
The final-stats detection uses a truthy check (jsonData.ttft && jsonData.tpot). Since the backend can legitimately emit 0.0 for TTFT/TPOT (e.g., no tokens or only one token), this condition will skip attaching stats. Use an explicit numeric/undefined check (e.g., typeof jsonData.ttft === "number" and typeof jsonData.tpot === "number").
Suggested change:

    -if (!isAgentSelected && jsonData.ttft && jsonData.tpot) {
    +if (
    +  !isAgentSelected &&
    +  typeof jsonData.ttft === "number" &&
    +  typeof jsonData.tpot === "number"
    +) {

    def get_stats(self) -> Dict[str, float]:
        """
        Get final statistics

        Returns:
get_stats() is annotated as returning Dict[str, float], but the returned dict includes integer values for token counts/context length. Update the annotation to a mixed type (e.g., Dict[str, float | int]) or normalize values to floats to keep typing accurate.
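As a sketch of the two options this comment describes (keys and helper names here are illustrative, not the tracker's exact stats dict):

```python
# Illustrative only: keys are hypothetical, not the tracker's actual fields.
from typing import Dict, Union


def mixed_stats(ttft: float, tpot: float, num_tokens: int) -> Dict[str, Union[float, int]]:
    # Option A: widen the annotation (Python 3.10+ can write Dict[str, float | int]).
    return {"ttft": ttft, "tpot": tpot, "num_tokens": num_tokens}


def float_only_stats(ttft: float, tpot: float, num_tokens: int) -> Dict[str, float]:
    # Option B: keep Dict[str, float] and normalize integer counts to float.
    return {"ttft": ttft, "tpot": tpot, "num_tokens": float(num_tokens)}
```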

    }) {
      const [renderedContent, setRenderedContent] = useState("");
      const showThinking = externalShowThinking ?? false;
      const [showThinking, setShowThinking] = useState(false);
externalShowThinking is destructured from props but never used. With noUnusedLocals: true, this will fail type-checking; remove the destructuring or wire it into the new internal showThinking state (e.g., treat it as a controlled prop).
Suggested change:

    -const [showThinking, setShowThinking] = useState(false);
    +const [showThinking, setShowThinking] = useState(externalShowThinking ?? false);

    for _ in range(tokens_arrived):
        self.token_times.append(current_time)
For token count jumps, record_token() appends multiple identical timestamps (same current_time). This biases get_tpot() downward by treating batch-arrived tokens as zero-time intervals. Consider deriving per-token time from (delta_time / delta_tokens) or using the median-based approach (get_accurate_tpot()) for reported stats.
Suggested change:

    -for _ in range(tokens_arrived):
    -    self.token_times.append(current_time)
    +# Distribute timestamps evenly between last token time and now
    +if self.token_times:
    +    last_time = self.token_times[-1]
    +else:
    +    # Fallback: if for some reason we have no prior timestamps,
    +    # treat all tokens as arriving at current_time.
    +    last_time = current_time
    +if tokens_arrived == 1 or current_time <= last_time:
    +    # Single token or no positive time delta: keep current behavior
    +    self.token_times.append(current_time)
    +else:
    +    delta_time = current_time - last_time
    +    step = delta_time / tokens_arrived
    +    for i in range(1, tokens_arrived + 1):
    +        self.token_times.append(last_time + step * i)

    class InferenceMetricsTracker:
        """Track inference metrics during streaming"""
There are existing backend tests under app/backend/model_control/, but the new InferenceMetricsTracker logic is untested. Adding unit tests for record_token() (including token-count jumps) and get_stats() would help prevent regressions in TTFT/TPOT calculations.
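A possible starting point for such tests is sketched below. It assumes `record_token()` accepts a cumulative token count, that the class can be constructed with no arguments, and that `get_stats()` exposes `ttft`/`tpot` keys; all of these are assumptions to check against the real API.

```python
# Hypothetical test sketch: import path, constructor, and record_token()
# signature are assumptions; adapt to the real InferenceMetricsTracker API.
from app.backend.model_control.metrics_tracker import InferenceMetricsTracker


def test_record_token_handles_count_jumps():
    tracker = InferenceMetricsTracker()
    tracker.record_token(1)  # first token arrives
    tracker.record_token(4)  # jump: three tokens arrive in a single chunk
    stats = tracker.get_stats()
    # Batch arrivals should not collapse TPOT to ~0 via zero-width intervals.
    assert stats["tpot"] >= 0.0


def test_get_stats_with_single_token():
    tracker = InferenceMetricsTracker()
    tracker.record_token(1)
    stats = tracker.get_stats()
    # With a single token there are no inter-token gaps; TPOT should be 0.0,
    # and TTFT should still be reported as a number.
    assert stats["tpot"] == 0.0
    assert isinstance(stats["ttft"], (int, float))
```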
| "module": "ESNext", | ||
| "skipLibCheck": true, | ||
| "baseUrl": ".", | ||
| "types": ["vite/client", "react", "react-dom"], |
Setting compilerOptions.types overrides TypeScript’s automatic inclusion of @types/* packages. With the current list, @types/node is no longer included, but the frontend source uses NodeJS.* types (e.g. NodeJS.Timeout), which will cause type-check/build failures. Consider either removing the types override or adding "node" to this list (or refactoring those call sites to use ReturnType<typeof setTimeout>).
| "types": ["vite/client", "react", "react-dom"], | |
| "types": ["vite/client", "react", "react-dom", "node"], |

    const [showThinking, setShowThinking] = useState(false);
    const [isThinkingActive, setIsThinkingActive] = useState(false);
    const contentRef = useRef(processContent(content).cleanedContent);
    const contentRef = useRef(processContent(content, isStreamFinished).cleanedContent);
processContent is declared as (content: string), but it’s called as processContent(content, isStreamFinished) here. This is a TS compile error; either update processContent to accept the second parameter (and use it) or remove the extra argument.
Suggested change:

    -const contentRef = useRef(processContent(content, isStreamFinished).cleanedContent);
    +const contentRef = useRef(processContent(content).cleanedContent);