
Added token metrics with the model response#80

Merged
madclaws merged 5 commits into main from benchmarl
Jan 30, 2026
Conversation

@madclaws
Member

No description provided.

@coderabbitai

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

This pull request introduces performance metrics tracking across the ML backend system. A new GenerationMetrics dataclass is added to capture time-to-first-token, total tokens generated, tokens per second, and total latency. The Python backend modules are updated to initialize and track these metrics during token streaming, detecting and capturing metrics payloads in the streaming loop and attaching aggregated metrics to final responses. The Rust runtime adds a corresponding BenchmarkMetrics struct with serialization support, updates the ChatResponse to include optional metrics, and propagates metrics through the chat response pipeline.
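Based on the walkthrough, the new GenerationMetrics dataclass can be sketched roughly as follows (field names are taken from the metrics payloads shown in the review diffs below; the exact definition lives in the PR's schemas module and may differ):

```python
from dataclasses import dataclass

@dataclass
class GenerationMetrics:
    """Per-generation performance metrics (sketch, not the PR's exact code)."""
    ttft_ms: float           # time to first token, in milliseconds
    total_tokens: int        # tokens generated in this response
    tokens_per_second: float # throughput over the whole generation
    total_latency_s: float   # wall-clock generation time, in seconds

m = GenerationMetrics(ttft_ms=120.0, total_tokens=256,
                      tokens_per_second=42.7, total_latency_s=6.0)
print(m.total_tokens)  # → 256
```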

Sequence Diagram

sequenceDiagram
    actor Client
    participant MLXRunner as Python: mlx_runner
    participant MLX as Python: mlx
    participant Schemas as Python: schemas
    participant RustRuntime as Rust: mlx.rs

    Client->>MLXRunner: Send generation request
    activate MLXRunner
    MLXRunner->>MLXRunner: Initialize ttft = None
    
    loop Token streaming
        MLXRunner->>Schemas: Emit token/metrics payload
        MLXRunner->>MLX: Yield token (or metrics)
    end
    
    MLXRunner->>MLXRunner: Finalize reasoning parser if active
    MLXRunner->>Schemas: Emit final GenerationMetrics
    deactivate MLXRunner
    
    MLX->>MLX: Detect GenerationMetrics in stream
    MLX->>MLX: Store metrics from payload
    MLX->>MLX: Assemble final_response with metrics
    MLX->>Client: Yield response with metrics data event
    
    Client->>RustRuntime: Receive response
    activate RustRuntime
    RustRuntime->>RustRuntime: Parse metrics from response
    RustRuntime->>RustRuntime: Convert to BenchmarkMetrics
    RustRuntime->>RustRuntime: Update bench_metrics accumulator
    RustRuntime->>RustRuntime: Display formatted metrics summary
    deactivate RustRuntime
    
    RustRuntime->>Client: Return ChatResponse with metrics

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check | ❓ Inconclusive | No description was provided by the author, making it impossible to assess relevance to the changeset. Resolution: add a brief description explaining what metrics are tracked, why they were added, and how they are used in the response pipeline.

✅ Passed checks (1 passed)
  • Title check | ✅ Passed | The title 'Added token metrics with the model response' directly and clearly summarizes the main change: adding metrics tracking (token counts, latency measurements) to model responses across multiple files.


@madclaws linked an issue Jan 30, 2026 that may be closed by this pull request

@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@server/backend/mlx.py`:
- Around line 173-181: Fix the indentation and replace the debug print with the
module logger: align the "if metrics:" block indentation to match surrounding
code and add the metrics into final_response as shown, then replace the
print(f"data: {json.dumps(final_response)}") call with logger.debug(...) (or
logger.info(...) per convention) to log the serialized final_response; reference
the final_response variable, the metrics block, and the existing logger object
when making the change.

In `@tiles/Cargo.toml`:
- Line 18: Remove the unused chrono dependency from the tiles crate by deleting
the line `chrono = "0.4"` in Cargo.toml, then search for and remove any stray
`use chrono::...` or `extern crate chrono;` references (none should be needed
because BenchmarkMetrics uses f64 and timing uses std::time), run cargo check to
confirm no remaining chrono references, and commit the Cargo.toml change.

In `@tiles/src/runtime/mlx.rs`:
- Around line 454-456: bench_metrics.total_tokens /
bench_metrics.total_latency_s as i32 can panic because casting total_latency_s
(f64) to i32 truncates values <1.0 to 0 and integer division truncates the
result; change the computation to operate in floating point and guard against
zero/near-zero latency by checking bench_metrics.total_latency_s <= 0.0 (or <
epsilon) and returning 0.0 in that case, otherwise compute rate as
(bench_metrics.total_tokens as f64) / bench_metrics.total_latency_s and then
format or round that f64 for display instead of casting the latency to i32.
🧹 Nitpick comments (3)
tiles/src/runtime/mlx.rs (1)

36-46: tokens_per_second accumulation is semantically incorrect.

Adding tokens_per_second values from multiple responses doesn't produce a meaningful aggregate throughput. Since the display code at line 455-456 recalculates throughput from total_tokens / total_latency_s, consider removing the accumulation or documenting that this field represents a sum rather than a rate.

♻️ Option: Don't accumulate tokens_per_second
     fn update(&mut self, metrics: BenchmarkMetrics) -> &Self {
         if self.ttft_ms == 0.0 {
             self.ttft_ms += metrics.ttft_ms;
         }
         self.total_tokens += metrics.total_tokens;
-        self.tokens_per_second += metrics.tokens_per_second;
+        // tokens_per_second is recalculated from total_tokens/total_latency_s when displayed
         self.total_latency_s += metrics.total_latency_s;
         self
     }
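The nitpick is easy to verify with a toy example (hypothetical numbers, sketched in Python for brevity): summing per-response rates does not equal the rate recomputed from the totals, which is why the display code's total_tokens / total_latency_s recalculation is the meaningful figure.

```python
# Two hypothetical responses as (tokens, latency_seconds):
# 100 tok/s and 25 tok/s respectively.
responses = [(100, 1.0), (100, 4.0)]

summed_rates = sum(t / s for t, s in responses)  # 125.0 -- not a throughput
total_tokens = sum(t for t, _ in responses)      # 200
total_latency = sum(s for _, s in responses)     # 5.0
true_rate = total_tokens / total_latency         # 40.0 tok/s

print(summed_rates, true_rate)  # → 125.0 40.0
```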
server/backend/mlx_runner.py (2)

569-580: Code duplication: Metrics emission repeated three times.

The GenerationMetrics construction and yield logic is duplicated at three exit points (native stop, chat stop, end of generation). Consider extracting a helper method to reduce duplication and ensure consistency.

♻️ Proposed helper method

Add a helper method within generate_streaming:

def _emit_metrics():
    total_latency = time.time() - start_time
    tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
    ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)
    return GenerationMetrics(
        ttft_ms=ttft_ms,
        total_tokens=tokens_generated,
        tokens_per_second=tokens_per_second,
        total_latency_s=total_latency
    )

Then replace each emission site with yield _emit_metrics().

Also applies to: 610-621, 644-654
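A standalone sketch of that helper, with the closure variables (start_time, tokens_generated, ttft) passed in explicitly so it can run outside generate_streaming; the real helper would read them from the enclosing scope and return a GenerationMetrics instance rather than a dict:

```python
import time

def build_metrics(start_time, tokens_generated, ttft):
    """Hypothetical free-function variant of the proposed _emit_metrics closure."""
    total_latency = time.time() - start_time
    tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
    # Fall back to total latency when no first-token timestamp was recorded.
    ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)
    return {
        "ttft_ms": ttft_ms,
        "total_tokens": tokens_generated,
        "tokens_per_second": tokens_per_second,
        "total_latency_s": total_latency,
    }

# Simulate a generation that started ~2 s ago and produced 100 tokens.
m = build_metrics(time.time() - 2.0, 100, 0.15)
print(round(m["tokens_per_second"]))  # → 50
```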


573-574: TTFT of 0ms is misleading when no tokens were generated before stop.

When ttft is None (e.g., stop token hit before first token yield), ttft_ms defaults to 0, which suggests instant response rather than "not measured". Consider using None or the total latency as fallback for clearer semantics.

💡 Alternative: Use total_latency as fallback
-ttft_ms = (ttft * 1000) if ttft is not None else 0
+ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)

This makes TTFT equal to total latency when no intermediate tokens were generated, which is technically accurate.

Also applies to: 614-615, 647-648
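The suggested fallback semantics can be checked directly; a sketch (the function name is illustrative, the real code computes this inline):

```python
def ttft_ms(ttft, total_latency):
    # When no first-token timestamp was recorded (ttft is None), report the
    # total latency instead of a misleading 0 ms.
    return (ttft * 1000) if ttft is not None else (total_latency * 1000)

print(ttft_ms(0.25, 3.0))  # → 250.0  (first token arrived after 250 ms)
print(ttft_ms(None, 3.0))  # → 3000.0 (stopped before any token was yielded)
```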

Comment thread server/backend/mlx.py Outdated
Comment on lines +173 to +181
       # Include benchmarking metrics if available
    if metrics:
        final_response["metrics"] = {
            "ttft_ms": metrics.ttft_ms,
            "total_tokens": metrics.total_tokens,
            "tokens_per_second": metrics.tokens_per_second,
            "total_latency_s": metrics.total_latency_s,
        }
    print(f"data: {json.dumps(final_response)}")

⚠️ Potential issue | 🟡 Minor

Indentation inconsistency and debug print statement.

  1. Lines 173-180 have inconsistent indentation (extra leading spaces on line 173).
  2. Line 181 uses print() while the rest of the file uses logger. Consider using logger.debug() for consistency.
🐛 Proposed fix
-       # Include benchmarking metrics if available
+    # Include benchmarking metrics if available
     if metrics:
         final_response["metrics"] = {
             "ttft_ms": metrics.ttft_ms,
             "total_tokens": metrics.total_tokens,
             "tokens_per_second": metrics.tokens_per_second,
             "total_latency_s": metrics.total_latency_s,
         }
-    print(f"data: {json.dumps(final_response)}")
+    logger.debug(f"Final response: {json.dumps(final_response)}")

Comment thread tiles/Cargo.toml Outdated
Comment thread tiles/src/runtime/mlx.rs Outdated
Comment on lines +454 to +456
"💡".yellow(),
bench_metrics.total_tokens
/ bench_metrics.total_latency_s as i32,

⚠️ Potential issue | 🔴 Critical

Critical: Integer division can cause division by zero panic.

Line 456 casts total_latency_s (f64) to i32, which truncates values less than 1.0 to 0, causing a panic. Additionally, dividing i32 by i32 produces integer truncation.

🐛 Proposed fix
                             println!(
                                 "{}",
                                 format!(
                                     "\n{} {:.1} tok/s | {} tokens | {:.0}ms TTFT",
                                     "💡".yellow(),
-                                    bench_metrics.total_tokens
-                                        / bench_metrics.total_latency_s as i32,
+                                    bench_metrics.total_tokens as f64
+                                        / bench_metrics.total_latency_s,
                                     bench_metrics.total_tokens,
                                     bench_metrics.ttft_ms
                                 )
                                 .dimmed()
                             );
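The guard described above, sketched in Python for illustration (the actual fix lands in Rust): return 0.0 when latency is zero or near-zero, otherwise divide in floating point instead of truncating the latency to an integer.

```python
def throughput(total_tokens, total_latency_s, eps=1e-9):
    # Guard against zero/near-zero latency; never cast the latency to int.
    if total_latency_s <= eps:
        return 0.0
    return total_tokens / total_latency_s

# With the buggy i32 cast, a 0.5 s latency would truncate to 0 and panic.
print(throughput(100, 0.5))  # → 200.0
print(throughput(100, 0.0))  # → 0.0
```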



Development

Successfully merging this pull request may close these issues.

Create a benchmark for chat session

1 participant