
Added token metrics with the model response#80

Merged
madclaws merged 5 commits into main from benchmarl
Jan 30, 2026
Conversation

@madclaws
Member

No description provided.

@coderabbitai

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

This pull request introduces performance metrics tracking across the ML backend system. A new GenerationMetrics dataclass is added to capture time-to-first-token, total tokens generated, tokens per second, and total latency. The Python backend modules are updated to initialize and track these metrics during token streaming, detecting and capturing metrics payloads in the streaming loop and attaching aggregated metrics to final responses. The Rust runtime adds a corresponding BenchmarkMetrics struct with serialization support, updates the ChatResponse to include optional metrics, and propagates metrics through the chat response pipeline.
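Based on the walkthrough, the new GenerationMetrics dataclass can be sketched roughly as follows (field names are taken from the metrics payloads shown in the review diffs below; the exact definition lives in the PR's schemas module and may differ):

```python
from dataclasses import dataclass

@dataclass
class GenerationMetrics:
    """Per-generation performance metrics (sketch, not the PR's exact code)."""
    ttft_ms: float           # time to first token, in milliseconds
    total_tokens: int        # tokens generated in this response
    tokens_per_second: float # throughput over the whole generation
    total_latency_s: float   # wall-clock generation time, in seconds

m = GenerationMetrics(ttft_ms=120.0, total_tokens=256,
                      tokens_per_second=42.7, total_latency_s=6.0)
print(m.total_tokens)  # → 256
```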

Sequence Diagram

sequenceDiagram
    actor Client
    participant MLXRunner as Python: mlx_runner
    participant MLX as Python: mlx
    participant Schemas as Python: schemas
    participant RustRuntime as Rust: mlx.rs

    Client->>MLXRunner: Send generation request
    activate MLXRunner
    MLXRunner->>MLXRunner: Initialize ttft = None
    
    loop Token streaming
        MLXRunner->>Schemas: Emit token/metrics payload
        MLXRunner->>MLX: Yield token (or metrics)
    end
    
    MLXRunner->>MLXRunner: Finalize reasoning parser if active
    MLXRunner->>Schemas: Emit final GenerationMetrics
    deactivate MLXRunner
    
    MLX->>MLX: Detect GenerationMetrics in stream
    MLX->>MLX: Store metrics from payload
    MLX->>MLX: Assemble final_response with metrics
    MLX->>Client: Yield response with metrics data event
    
    Client->>RustRuntime: Receive response
    activate RustRuntime
    RustRuntime->>RustRuntime: Parse metrics from response
    RustRuntime->>RustRuntime: Convert to BenchmarkMetrics
    RustRuntime->>RustRuntime: Update bench_metrics accumulator
    RustRuntime->>RustRuntime: Display formatted metrics summary
    deactivate RustRuntime
    
    RustRuntime->>Client: Return ChatResponse with metrics

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check | ❓ Inconclusive | No description was provided by the author, making it impossible to assess relevance to the changeset. Resolution: add a brief description explaining what metrics are tracked, why they were added, and how they are used in the response pipeline.

✅ Passed checks (1 passed)
  • Title check | ✅ Passed | The title 'Added token metrics with the model response' directly and clearly summarizes the main change: adding metrics tracking (token counts, latency measurements) to model responses across multiple files.


@madclaws linked an issue Jan 30, 2026 that may be closed by this pull request

@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@server/backend/mlx.py`:
- Around line 173-181: Fix the indentation and replace the debug print with the
module logger: align the "if metrics:" block indentation to match surrounding
code and add the metrics into final_response as shown, then replace the
print(f"data: {json.dumps(final_response)}") call with logger.debug(...) (or
logger.info(...) per convention) to log the serialized final_response; reference
the final_response variable, the metrics block, and the existing logger object
when making the change.

In `@tiles/Cargo.toml`:
- Line 18: Remove the unused chrono dependency from the tiles crate by deleting
the line `chrono = "0.4"` in Cargo.toml, then search for and remove any stray
`use chrono::...` or `extern crate chrono;` references (none should be needed
because BenchmarkMetrics uses f64 and timing uses std::time), run cargo check to
confirm no remaining chrono references, and commit the Cargo.toml change.

In `@tiles/src/runtime/mlx.rs`:
- Around line 454-456: bench_metrics.total_tokens /
bench_metrics.total_latency_s as i32 can panic because casting total_latency_s
(f64) to i32 truncates values <1.0 to 0 and integer division truncates the
result; change the computation to operate in floating point and guard against
zero/near-zero latency by checking bench_metrics.total_latency_s <= 0.0 (or <
epsilon) and returning 0.0 in that case, otherwise compute rate as
(bench_metrics.total_tokens as f64) / bench_metrics.total_latency_s and then
format or round that f64 for display instead of casting the latency to i32.
🧹 Nitpick comments (3)
tiles/src/runtime/mlx.rs (1)

36-46: tokens_per_second accumulation is semantically incorrect.

Adding tokens_per_second values from multiple responses doesn't produce a meaningful aggregate throughput. Since the display code at line 455-456 recalculates throughput from total_tokens / total_latency_s, consider removing the accumulation or documenting that this field represents a sum rather than a rate.

♻️ Option: Don't accumulate tokens_per_second
     fn update(&mut self, metrics: BenchmarkMetrics) -> &Self {
         if self.ttft_ms == 0.0 {
             self.ttft_ms += metrics.ttft_ms;
         }
         self.total_tokens += metrics.total_tokens;
-        self.tokens_per_second += metrics.tokens_per_second;
+        // tokens_per_second is recalculated from total_tokens/total_latency_s when displayed
         self.total_latency_s += metrics.total_latency_s;
         self
     }
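The nitpick is easy to verify with a toy example (hypothetical numbers, sketched in Python for brevity): summing per-response rates does not equal the rate recomputed from the totals, which is why the display code's total_tokens / total_latency_s recalculation is the meaningful figure.

```python
# Two hypothetical responses as (tokens, latency_seconds):
# 100 tok/s and 25 tok/s respectively.
responses = [(100, 1.0), (100, 4.0)]

summed_rates = sum(t / s for t, s in responses)  # 125.0 -- not a throughput
total_tokens = sum(t for t, _ in responses)      # 200
total_latency = sum(s for _, s in responses)     # 5.0
true_rate = total_tokens / total_latency         # 40.0 tok/s

print(summed_rates, true_rate)  # → 125.0 40.0
```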
server/backend/mlx_runner.py (2)

569-580: Code duplication: Metrics emission repeated three times.

The GenerationMetrics construction and yield logic is duplicated at three exit points (native stop, chat stop, end of generation). Consider extracting a helper method to reduce duplication and ensure consistency.

♻️ Proposed helper method

Add a helper method within generate_streaming:

def _emit_metrics():
    total_latency = time.time() - start_time
    tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
    ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)
    return GenerationMetrics(
        ttft_ms=ttft_ms,
        total_tokens=tokens_generated,
        tokens_per_second=tokens_per_second,
        total_latency_s=total_latency
    )

Then replace each emission site with yield _emit_metrics().

Also applies to: 610-621, 644-654
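A standalone sketch of that helper, with the closure variables (start_time, tokens_generated, ttft) passed in explicitly so it can run outside generate_streaming; the real helper would read them from the enclosing scope and return a GenerationMetrics instance rather than a dict:

```python
import time

def build_metrics(start_time, tokens_generated, ttft):
    """Hypothetical free-function variant of the proposed _emit_metrics closure."""
    total_latency = time.time() - start_time
    tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
    # Fall back to total latency when no first-token timestamp was recorded.
    ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)
    return {
        "ttft_ms": ttft_ms,
        "total_tokens": tokens_generated,
        "tokens_per_second": tokens_per_second,
        "total_latency_s": total_latency,
    }

# Simulate a generation that started ~2 s ago and produced 100 tokens.
m = build_metrics(time.time() - 2.0, 100, 0.15)
print(round(m["tokens_per_second"]))  # → 50
```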


573-574: TTFT of 0ms is misleading when no tokens were generated before stop.

When ttft is None (e.g., stop token hit before first token yield), ttft_ms defaults to 0, which suggests instant response rather than "not measured". Consider using None or the total latency as fallback for clearer semantics.

💡 Alternative: Use total_latency as fallback
-ttft_ms = (ttft * 1000) if ttft is not None else 0
+ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)

This makes TTFT equal to total latency when no intermediate tokens were generated, which is technically accurate.

Also applies to: 614-615, 647-648
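The suggested fallback semantics can be checked directly; a sketch (the function name is illustrative, the real code computes this inline):

```python
def ttft_ms(ttft, total_latency):
    # When no first-token timestamp was recorded (ttft is None), report the
    # total latency instead of a misleading 0 ms.
    return (ttft * 1000) if ttft is not None else (total_latency * 1000)

print(ttft_ms(0.25, 3.0))  # → 250.0  (first token arrived after 250 ms)
print(ttft_ms(None, 3.0))  # → 3000.0 (stopped before any token was yielded)
```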

Comment thread server/backend/mlx.py Outdated
Comment on lines +173 to +181
       # Include benchmarking metrics if available
    if metrics:
        final_response["metrics"] = {
            "ttft_ms": metrics.ttft_ms,
            "total_tokens": metrics.total_tokens,
            "tokens_per_second": metrics.tokens_per_second,
            "total_latency_s": metrics.total_latency_s,
        }
    print(f"data: {json.dumps(final_response)}")

⚠️ Potential issue | 🟡 Minor

Indentation inconsistency and debug print statement.

  1. Lines 173-180 have inconsistent indentation (extra leading spaces on line 173).
  2. Line 181 uses print() while the rest of the file uses logger. Consider using logger.debug() for consistency.
🐛 Proposed fix
-       # Include benchmarking metrics if available
+    # Include benchmarking metrics if available
     if metrics:
         final_response["metrics"] = {
             "ttft_ms": metrics.ttft_ms,
             "total_tokens": metrics.total_tokens,
             "tokens_per_second": metrics.tokens_per_second,
             "total_latency_s": metrics.total_latency_s,
         }
-    print(f"data: {json.dumps(final_response)}")
+    logger.debug(f"Final response: {json.dumps(final_response)}")

Comment thread tiles/Cargo.toml Outdated
Comment thread tiles/src/runtime/mlx.rs Outdated
Comment on lines +454 to +456
"💡".yellow(),
bench_metrics.total_tokens
/ bench_metrics.total_latency_s as i32,

⚠️ Potential issue | 🔴 Critical

Critical: Integer division can cause division by zero panic.

Line 456 casts total_latency_s (f64) to i32, which truncates values less than 1.0 to 0, causing a panic. Additionally, dividing i32 by i32 produces integer truncation.

🐛 Proposed fix
                             println!(
                                 "{}",
                                 format!(
                                     "\n{} {:.1} tok/s | {} tokens | {:.0}ms TTFT",
                                     "💡".yellow(),
-                                    bench_metrics.total_tokens
-                                        / bench_metrics.total_latency_s as i32,
+                                    bench_metrics.total_tokens as f64
+                                        / bench_metrics.total_latency_s,
                                     bench_metrics.total_tokens,
                                     bench_metrics.ttft_ms
                                 )
                                 .dimmed()
                             );
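The guard described above, sketched in Python for illustration (the actual fix lands in Rust): return 0.0 when latency is zero or near-zero, otherwise divide in floating point instead of truncating the latency to an integer.

```python
def throughput(total_tokens, total_latency_s, eps=1e-9):
    # Guard against zero/near-zero latency; never cast the latency to int.
    if total_latency_s <= eps:
        return 0.0
    return total_tokens / total_latency_s

# With the buggy i32 cast, a 0.5 s latency would truncate to 0 and panic.
print(throughput(100, 0.5))  # → 200.0
print(throughput(100, 0.0))  # → 0.0
```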



Development

Successfully merging this pull request may close these issues.

Create a benchmark for chat session

1 participant