Conversation
thanks to @kshitijgetsac for the reference #43
📝 Walkthrough

This pull request introduces performance metrics tracking across the ML backend system.

Sequence Diagram

```mermaid
sequenceDiagram
    actor Client
    participant MLXRunner as Python: mlx_runner
    participant MLX as Python: mlx
    participant Schemas as Python: schemas
    participant RustRuntime as Rust: mlx.rs
    Client->>MLXRunner: Send generation request
    activate MLXRunner
    MLXRunner->>MLXRunner: Initialize ttft = None
    loop Token streaming
        MLXRunner->>Schemas: Emit token/metrics payload
        MLXRunner->>MLX: Yield token (or metrics)
    end
    MLXRunner->>MLXRunner: Finalize reasoning parser if active
    MLXRunner->>Schemas: Emit final GenerationMetrics
    deactivate MLXRunner
    MLX->>MLX: Detect GenerationMetrics in stream
    MLX->>MLX: Store metrics from payload
    MLX->>MLX: Assemble final_response with metrics
    MLX->>Client: Yield response with metrics data event
    Client->>RustRuntime: Receive response
    activate RustRuntime
    RustRuntime->>RustRuntime: Parse metrics from response
    RustRuntime->>RustRuntime: Convert to BenchmarkMetrics
    RustRuntime->>RustRuntime: Update bench_metrics accumulator
    RustRuntime->>RustRuntime: Display formatted metrics summary
    deactivate RustRuntime
    RustRuntime->>Client: Return ChatResponse with metrics
```
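The runner-side flow in the diagram can be sketched as a plain generator (an illustrative sketch only, not the actual mlx_runner code; the `GenerationMetrics` fields are taken from the review comments below, and `token_source` is a stand-in for the model's token stream):

```python
import time
from dataclasses import dataclass

@dataclass
class GenerationMetrics:
    ttft_ms: float
    total_tokens: int
    tokens_per_second: float
    total_latency_s: float

def generate_streaming(token_source):
    """Yield tokens as they arrive, then a final GenerationMetrics payload."""
    start_time = time.time()
    ttft = None                      # time-to-first-token; unset until the first yield
    tokens_generated = 0
    for token in token_source:
        if ttft is None:
            ttft = time.time() - start_time
        tokens_generated += 1
        yield token
    total_latency = time.time() - start_time
    tps = tokens_generated / total_latency if total_latency > 0 else 0.0
    # Fall back to total latency when no token was ever produced
    ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)
    yield GenerationMetrics(ttft_ms, tokens_generated, tps, total_latency)
```

The consumer then treats the trailing `GenerationMetrics` item specially, matching the "Detect GenerationMetrics in stream" step on the mlx side.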
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@server/backend/mlx.py`:
- Around lines 173-181: Fix the indentation and replace the debug print with the module logger. Align the `if metrics:` block's indentation with the surrounding code and keep the block that adds the metrics into final_response; then replace the print(f"data: {json.dumps(final_response)}") call with logger.debug(...) (or logger.info(...) per convention) to log the serialized final_response, using the existing final_response variable, metrics block, and logger object.
In `@tiles/Cargo.toml`:
- Line 18: Remove the unused chrono dependency from the tiles crate by deleting
the line `chrono = "0.4"` in Cargo.toml, then search for and remove any stray
`use chrono::...` or `extern crate chrono;` references (none should be needed
because BenchmarkMetrics uses f64 and timing uses std::time), run cargo check to
confirm no remaining chrono references, and commit the Cargo.toml change.
In `@tiles/src/runtime/mlx.rs`:
- Around lines 454-456: `bench_metrics.total_tokens / bench_metrics.total_latency_s as i32` can panic, because casting total_latency_s (f64) to i32 truncates values below 1.0 to 0, and integer division truncates the result. Change the computation to operate in floating point and guard against zero or near-zero latency: if bench_metrics.total_latency_s <= 0.0 (or below an epsilon), return 0.0; otherwise compute the rate as (bench_metrics.total_tokens as f64) / bench_metrics.total_latency_s and format or round that f64 for display instead of casting the latency to i32.
🧹 Nitpick comments (3)
tiles/src/runtime/mlx.rs (1)
36-46: `tokens_per_second` accumulation is semantically incorrect. Adding `tokens_per_second` values from multiple responses doesn't produce a meaningful aggregate throughput. Since the display code at lines 455-456 recalculates throughput from `total_tokens / total_latency_s`, consider removing the accumulation or documenting that this field represents a sum rather than a rate.

♻️ Option: Don't accumulate tokens_per_second

```diff
 fn update(&mut self, metrics: BenchmarkMetrics) -> &Self {
     if self.ttft_ms == 0.0 {
         self.ttft_ms += metrics.ttft_ms;
     }
     self.total_tokens += metrics.total_tokens;
-    self.tokens_per_second += metrics.tokens_per_second;
+    // tokens_per_second is recalculated from total_tokens/total_latency_s when displayed
     self.total_latency_s += metrics.total_latency_s;
     self
 }
```

server/backend/mlx_runner.py (2)
569-580: Code duplication: metrics emission repeated three times. The `GenerationMetrics` construction and yield logic is duplicated at three exit points (native stop, chat stop, end of generation). Consider extracting a helper method to reduce duplication and ensure consistency.

♻️ Proposed helper method

Add a helper method within `generate_streaming`:

```python
def _emit_metrics():
    total_latency = time.time() - start_time
    tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
    ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)
    return GenerationMetrics(
        ttft_ms=ttft_ms,
        total_tokens=tokens_generated,
        tokens_per_second=tokens_per_second,
        total_latency_s=total_latency,
    )
```

Then replace each emission site with `yield _emit_metrics()`.

Also applies to: 610-621, 644-654
573-574: A TTFT of 0 ms is misleading when no tokens were generated before the stop. When `ttft` is `None` (e.g., a stop token is hit before the first token yield), `ttft_ms` defaults to 0, which suggests an instant response rather than "not measured". Consider using `None` or the total latency as a fallback for clearer semantics.

💡 Alternative: Use total_latency as fallback

```diff
-ttft_ms = (ttft * 1000) if ttft is not None else 0
+ttft_ms = (ttft * 1000) if ttft is not None else (total_latency * 1000)
```

This makes TTFT equal to total latency when no intermediate tokens were generated, which is technically accurate.

Also applies to: 614-615, 647-648
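The `tokens_per_second` accumulation nitpick above can be illustrated numerically (hypothetical response values; the field names follow `BenchmarkMetrics` from the review):

```python
# Two responses, each running at 50 tok/s.
responses = [
    {"total_tokens": 100, "total_latency_s": 2.0},
    {"total_tokens": 100, "total_latency_s": 2.0},
]

# Summing per-response rates, as the current accumulator does:
summed_rates = sum(r["total_tokens"] / r["total_latency_s"] for r in responses)
# -> 100.0, which is not a throughput anything actually achieved

# Recomputing from the accumulated totals, as the display code does:
total_tokens = sum(r["total_tokens"] for r in responses)
total_latency = sum(r["total_latency_s"] for r in responses)
aggregate_rate = total_tokens / total_latency
# -> 50.0 tok/s, the actual end-to-end throughput
```

This is why the review suggests either dropping the accumulation or documenting the field as a sum rather than a rate.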
```python
        # Include benchmarking metrics if available
    if metrics:
        final_response["metrics"] = {
            "ttft_ms": metrics.ttft_ms,
            "total_tokens": metrics.total_tokens,
            "tokens_per_second": metrics.tokens_per_second,
            "total_latency_s": metrics.total_latency_s,
        }
    print(f"data: {json.dumps(final_response)}")
```
Indentation inconsistency and debug print statement.

- Lines 173-180 have inconsistent indentation (extra leading spaces on line 173).
- Line 181 uses `print()` while the rest of the file uses `logger`. Consider using `logger.debug()` for consistency.
🐛 Proposed fix

```diff
-        # Include benchmarking metrics if available
+    # Include benchmarking metrics if available
     if metrics:
         final_response["metrics"] = {
             "ttft_ms": metrics.ttft_ms,
             "total_tokens": metrics.total_tokens,
             "tokens_per_second": metrics.tokens_per_second,
             "total_latency_s": metrics.total_latency_s,
         }
-    print(f"data: {json.dumps(final_response)}")
+    logger.debug(f"Final response: {json.dumps(final_response)}")
```
```rust
"💡".yellow(),
bench_metrics.total_tokens
    / bench_metrics.total_latency_s as i32,
```
Critical: Integer truncation can cause a division-by-zero panic.

Line 456 applies `as i32` to `total_latency_s` (an f64), which truncates any value below 1.0 to 0, so the subsequent integer division panics with a division by zero. Even when the cast is nonzero, dividing i32 by i32 truncates the result.
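The failure mode can be reproduced in Python terms (the Rust `as i32` truncation is modeled with `int()`; in Rust the division would panic at runtime rather than raise an exception):

```python
total_tokens = 42
total_latency_s = 0.8              # generation finished in under a second

truncated = int(total_latency_s)   # 0 -- mirrors the `as i32` cast toward zero
try:
    rate = total_tokens // truncated
except ZeroDivisionError:          # the Rust equivalent is a runtime panic
    rate = None

# Staying in floating point, with a zero guard, avoids both problems:
safe_rate = total_tokens / total_latency_s if total_latency_s > 0 else 0.0
```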
🐛 Proposed fix

```diff
 println!(
     "{}",
     format!(
         "\n{} {:.1} tok/s | {} tokens | {:.0}ms TTFT",
         "💡".yellow(),
-        bench_metrics.total_tokens
-            / bench_metrics.total_latency_s as i32,
+        bench_metrics.total_tokens as f64
+            / bench_metrics.total_latency_s,
         bench_metrics.total_tokens,
         bench_metrics.ttft_ms
     )
     .dimmed()
 );
```
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
println!(
    "{}",
    format!(
        "\n{} {:.1} tok/s | {} tokens | {:.0}ms TTFT",
        "💡".yellow(),
        bench_metrics.total_tokens as f64
            / bench_metrics.total_latency_s,
        bench_metrics.total_tokens,
        bench_metrics.ttft_ms
    )
    .dimmed()
);
```
No description provided.