feat: add benchmarking harness with TTFT, throughput, and latency metrics #43
kshitijgetsac wants to merge 1 commit into tilesprivacy:main
Conversation
📝 Walkthrough
Adds generation-metrics support end-to-end via a new Python `GenerationMetrics` dataclass.
Sequence Diagram(s)
```mermaid
sequenceDiagram
participant Client
participant Server.mlx as MLX API (mlx.py)
participant Server.runner as MLX Runner (mlx_runner.py)
participant Model
participant ClientStream as Client Stream
Client->>Server.mlx: start chat stream request
Server.mlx->>Server.runner: start generation stream
Server.runner->>Model: request tokens (streaming)
Model-->>Server.runner: token
alt first token
Server.runner->>Server.runner: record ttft
end
Server.runner-->>Server.mlx: token (chunks) or GenerationMetrics objects intermittently
Server.mlx-->>ClientStream: emit token chunks (skip metric-only chunks)
Server.runner-->>Server.mlx: final GenerationMetrics at end-of-stream
Server.mlx->>ClientStream: final completion chunk including metrics field
```

```mermaid
sequenceDiagram
participant UserCLI as tiles CLI
participant Runtime
participant MLXRuntime
participant ServerProcess
participant LogFile
UserCLI->>Runtime: commands::bench(modelfile)
Runtime->>MLXRuntime: bench(run_args)
MLXRuntime->>ServerProcess: ensure server running & send benchmark prompt
ServerProcess->>MLXRuntime: streaming tokens + GenerationMetrics
MLXRuntime->>MLXRuntime: convert metrics to BenchmarkMetrics
MLXRuntime->>LogFile: append benchmark JSONL (timestamp, model, metrics)
MLXRuntime->>UserCLI: print metrics
```
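For orientation, the metrics object that rides on the final completion chunk is just the four dataclass fields serialized into a `metrics` entry. A minimal sketch of that shape (the chunk layout and the numbers here are illustrative only; the field names come from `GenerationMetrics` in server/schemas.py):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class GenerationMetrics:
    ttft_ms: float            # time to first token, milliseconds
    total_tokens: int         # tokens generated
    tokens_per_second: float  # throughput
    total_latency_s: float    # end-to-end latency, seconds

# Illustrative values; the server fills these in during generation.
metrics = GenerationMetrics(ttft_ms=42.0, total_tokens=128,
                            tokens_per_second=55.4, total_latency_s=2.31)

# Sketch of a final streaming chunk carrying the metrics field.
final_chunk = {
    "choices": [{"delta": {}, "finish_reason": "stop"}],
    "metrics": asdict(metrics),
}
print(json.dumps(final_chunk, indent=2))
```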
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
server/backend/mlx_runner.py (2)
426-426: Return type annotation is inconsistent with yielded values. The method signature declares `Iterator[str]` but now yields both `str` and `GenerationMetrics` objects. This breaks type safety and will cause issues for callers that expect only strings (e.g., `interactive_chat` at line 849 prints tokens directly, which would print the raw `GenerationMetrics` object).
🔧 Proposed fix
Update the return type and handle metrics in callers:
```diff
- ) -> Iterator[str]:
+ ) -> Iterator[str | GenerationMetrics]:
```
Or consider yielding metrics through a separate mechanism (e.g., callback, out-parameter, or wrapper class) to maintain backward compatibility.
Also applies to: 572-584
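One shape the "separate mechanism" alternative could take is an optional callback, which keeps the `Iterator[str]` contract intact. A rough, self-contained sketch (the `on_metrics` parameter and the simplified generator are hypothetical, not the actual `generate_streaming` signature):

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterator, Optional

@dataclass
class GenerationMetrics:  # stand-in for server/schemas.GenerationMetrics
    ttft_ms: float
    total_tokens: int
    tokens_per_second: float
    total_latency_s: float

def generate_streaming(
    tokens: list,
    on_metrics: Optional[Callable[[GenerationMetrics], None]] = None,
) -> Iterator[str]:
    """Yields only strings; metrics are reported out-of-band via the callback."""
    start = time.time()
    ttft = None
    count = 0
    for token in tokens:
        if ttft is None:
            ttft = time.time() - start
        count += 1
        yield token
    total = time.time() - start
    if on_metrics is not None:
        on_metrics(GenerationMetrics(
            ttft_ms=(ttft or 0.0) * 1000,
            total_tokens=count,
            tokens_per_second=count / total if total > 0 else 0.0,
            total_latency_s=total,
        ))

# Usage: callers that only want text can ignore on_metrics entirely.
text = "".join(generate_streaming(["Hello", " ", "world"], on_metrics=print))
```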
838-851: Callers need to handle `GenerationMetrics` objects in the stream. This loop prints each yielded item directly. Since `generate_streaming` now yields `GenerationMetrics` objects, this will print raw object representations like `GenerationMetrics(ttft_ms=12.5, ...)` to the user.
🔧 Proposed fix
```diff
 for token in self.generate_streaming(
     prompt=prompt,
     max_tokens=max_tokens,
     temperature=temperature,
     top_p=top_p,
     repetition_penalty=repetition_penalty,
     use_chat_template=False,  # Already applied in _format_conversation
     use_chat_stop_tokens=True,  # Enable chat stop tokens in interactive mode
     interactive=True,  # Enable full context length for interactive mode
 ):
-    # Stream all tokens directly (already formatted by generate_streaming)
-    print(token, end="", flush=True)
-    response_tokens.append(token)
+    if isinstance(token, GenerationMetrics):
+        # Handle metrics separately if needed
+        continue
+    # Stream all tokens directly (already formatted by generate_streaming)
+    print(token, end="", flush=True)
+    response_tokens.append(token)
```
🤖 Fix all issues with AI agents
In @tiles/src/runtime/mlx.rs:
- Around line 449-454: The code currently calls
serde_json::from_str(data).unwrap() (in mlx.rs) which will panic on malformed
SSE JSON; change it to handle parsing errors gracefully by replacing unwrap()
with a fallible path (e.g., match or if let Ok(v) =
serde_json::from_str::<Value>(data) { ... } else { log the parse error and
continue; }); ensure subsequent code that reads v.get("metrics") only runs when
parsing succeeded and preserve existing metrics assignment logic (the metrics =
serde_json::from_value(metrics_obj.clone()).ok() line) inside the successful
branch.
🧹 Nitpick comments (4)
tiles/src/runtime/cpu.rs (1)
27-29: Consider providing a cleaner error for unsupported platforms. The `unimplemented!()` macro will panic if `tiles bench` is invoked on non-macOS systems (where `CPURuntime` is selected). A friendlier approach would print a message and return gracefully, similar to how the MLX bench handles errors.
♻️ Suggested improvement
```diff
 pub async fn bench(&self, _run_args: super::RunArgs) {
-    unimplemented!()
+    eprintln!("Benchmarking is not yet supported on this platform (CPU runtime).");
 }
```
server/backend/mlx_runner.py (1)
572-584: Consider extracting duplicated metrics calculation logic. The same metrics calculation pattern is repeated in three places (native stop token path, chat stop token path, and end of generation). This increases maintenance burden and risk of inconsistencies.
♻️ Suggested refactor
```diff
+    def _create_generation_metrics(self, start_time: float, tokens_generated: int, ttft: Optional[float]) -> GenerationMetrics:
+        total_latency = time.time() - start_time
+        tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
+        ttft_ms = (ttft * 1000) if ttft is not None else 0
+        return GenerationMetrics(
+            ttft_ms=ttft_ms,
+            total_tokens=tokens_generated,
+            tokens_per_second=tokens_per_second,
+            total_latency_s=total_latency
+        )
```
Then replace the inline calculations with calls to this helper.
Also applies to: 615-627, 652-662
tiles/src/runtime/mlx.rs (2)
215-215: Remove redundant import.
`std::io::Write` is already imported at line 10. This inline import is unnecessary.
♻️ Proposed fix
```diff
 // Append to log file
 let mut file = std::fs::OpenOptions::new()
     .create(true)
     .append(true)
     .open(&log_file)?;
-use std::io::Write;
 writeln!(file, "{}", log_entry)?;
```
129-145: Consider extracting shared modelfile parsing logic. The modelfile parsing code (lines 129-145) is nearly identical to lines 48-64 in `run()`. Extracting a helper method would reduce duplication.
♻️ Suggested refactor
```rust
fn parse_modelfile(modelfile_path: Option<&String>) -> Result<Modelfile, ()> {
    const DEFAULT_MODELFILE: &str = "FROM driaforall/mem-agent-mlx-4bit";
    let result = if let Some(path) = modelfile_path {
        tilekit::modelfile::parse_from_file(path.as_str())
    } else {
        tilekit::modelfile::parse(DEFAULT_MODELFILE)
    };
    match result {
        Ok(mf) => Ok(mf),
        Err(_) => {
            println!("Invalid Modelfile");
            Err(())
        }
    }
}
```
Then use `self.parse_modelfile(run_args.modelfile_path.as_ref())` in both methods.
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- server/backend/mlx.py
- server/backend/mlx_runner.py
- server/schemas.py
- tiles/Cargo.toml
- tiles/src/commands/mod.rs
- tiles/src/main.rs
- tiles/src/runtime/cpu.rs
- tiles/src/runtime/mlx.rs
- tiles/src/runtime/mod.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}
⚙️ CodeRabbit configuration file
Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.
Files:
- tiles/src/commands/mod.rs
- tiles/src/runtime/mod.rs
- tiles/src/main.rs
- tiles/src/runtime/cpu.rs
- tiles/Cargo.toml
- tiles/src/runtime/mlx.rs
🧬 Code graph analysis (7)
tiles/src/commands/mod.rs (3)
- tiles/src/runtime/cpu.rs (1): bench (27-29)
- tiles/src/runtime/mlx.rs (1): bench (129-190)
- tiles/src/runtime/mod.rs (1): bench (41-46)
tiles/src/runtime/mod.rs (3)
- tiles/src/commands/mod.rs (1): bench (22-24)
- tiles/src/runtime/cpu.rs (1): bench (27-29)
- tiles/src/runtime/mlx.rs (1): bench (129-190)
tiles/src/main.rs (3)
- tiles/src/commands/mod.rs (1): bench (22-24)
- tiles/src/runtime/mlx.rs (1): bench (129-190)
- tiles/src/runtime/mod.rs (1): bench (41-46)
tiles/src/runtime/cpu.rs (3)
- tiles/src/commands/mod.rs (1): bench (22-24)
- tiles/src/runtime/mlx.rs (1): bench (129-190)
- tiles/src/runtime/mod.rs (1): bench (41-46)
server/backend/mlx_runner.py (2)
- server/schemas.py (1): GenerationMetrics (70-75)
- server/reasoning_utils.py (1): finalize (424-438)
server/backend/mlx.py (1)
- server/schemas.py (1): GenerationMetrics (70-75)
tiles/src/runtime/mlx.rs (4)
- tiles/src/commands/mod.rs (1): bench (22-24)
- tiles/src/runtime/cpu.rs (1): bench (27-29)
- tiles/src/runtime/mod.rs (1): bench (41-46)
- tilekit/src/modelfile.rs (2): parse_from_file (246-251), parse (253-258)
🔇 Additional comments (14)
tiles/Cargo.toml (1)
17-17: LGTM! Adding `chrono` for timestamp handling in benchmark logging is appropriate. The default features should suffice for the benchmarking use case.
server/schemas.py (1)
69-75: LGTM! Using a plain `@dataclass` for internal metrics transfer is a good choice—it avoids Pydantic overhead for a structure that doesn't require request validation. The field types and documentation are clear.
tiles/src/commands/mod.rs (1)
21-24: LGTM! The `bench` command handler follows the established pattern of delegating to the runtime layer. Clean and consistent with the other command functions.
tiles/src/main.rs (2)
25-29: LGTM! The `Bench` subcommand is well-integrated with the CLI. The doc comment clearly describes the purpose.
79-85: Reasonable approach for initial implementation. Setting `relay_count: 0` with an explanatory comment is clear. If bench-specific options are needed later (warmup iterations, repetitions, etc.), consider introducing a `BenchFlags` struct similar to `RunFlags`.
tiles/src/runtime/mod.rs (1)
40-46: LGTM! The `bench` dispatch follows the established pattern for runtime method delegation. Returning `()` is consistent with how errors are handled internally in the MLX implementation.
server/backend/mlx.py (3)
4-4: LGTM! Import of `GenerationMetrics` is correctly added to support the new metrics flow.
116-132: Clean integration of metrics collection into the streaming loop. Using `isinstance()` to differentiate the final metrics object from string tokens is straightforward and effective. The `continue` ensures metrics aren't accidentally emitted as content.
165-181: Metrics propagation looks good. The conditional inclusion of metrics in the final response is clean. Explicitly constructing the dict from dataclass fields ensures proper JSON serialization.
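The pattern described here - filtering the trailing metrics object out of the token stream and serializing it from the dataclass fields - looks roughly like the following. This is a sketch with hypothetical names, not the actual mlx.py code:

```python
from dataclasses import dataclass, asdict

@dataclass
class GenerationMetrics:  # stand-in for server/schemas.GenerationMetrics
    ttft_ms: float
    total_tokens: int
    tokens_per_second: float
    total_latency_s: float

def consume_stream(stream):
    """Collect text tokens and capture the final GenerationMetrics, if present."""
    metrics = None
    tokens = []
    for item in stream:
        if isinstance(item, GenerationMetrics):
            metrics = item   # keep for the final response chunk
            continue         # never emit metrics as content
        tokens.append(item)
    # Explicit dict construction mirrors what the final chunk needs for JSON.
    metrics_dict = asdict(metrics) if metrics is not None else None
    return "".join(tokens), metrics_dict
```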
server/backend/mlx_runner.py (2)
23-23: LGTM! Import added correctly to support the new metrics emission feature.
631-634: LGTM! TTFT capture logic correctly measures time to first visible token. Placing it after stop token checks ensures we're measuring meaningful latency.
tiles/src/runtime/mlx.rs (3)
22-28: LGTM! The `BenchmarkMetrics` struct correctly mirrors the Python `GenerationMetrics` dataclass from server/schemas.py with matching field names for proper deserialization.
335-345: LGTM! Clear and concise metrics display in REPL. The formatting provides useful performance visibility without being intrusive.
97-97: LGTM with a note. Setting `PYTHONDONTWRITEBYTECODE=1` prevents stale `.pyc` files as noted in the PR description. This is a reasonable trade-off - minor startup performance impact for more predictable behavior during development cycles.
```rust
let modelname = modelfile.from.as_ref().unwrap();

// Start server if not running
if (ping().await).is_err() {
    println!("Starting server...");
    let _res = self.start_server_daemon().await;
    let _ = wait_until_server_is_up().await;
}

// Load model
let memory_path = get_memory_path()
    .context("Retrieving memory_path failed")
    .unwrap();
if let Err(e) = load_model(modelname, &memory_path).await {
    println!("Failed to load model: {}", e);
    return;
}
```
Several unwrap() calls risk panics; server start failure isn't handled.
- Line 147: `unwrap()` on `modelfile.from` will panic if `from` is `None`.
- Line 152-153: If `start_server_daemon` fails, the error is discarded and `wait_until_server_is_up()` may loop indefinitely.
- Line 159: `unwrap()` on `get_memory_path()` will panic on failure.
🔧 Proposed fix
```diff
- let modelname = modelfile.from.as_ref().unwrap();
+ let modelname = match modelfile.from.as_ref() {
+ Some(name) => name,
+ None => {
+ println!("Modelfile missing 'FROM' directive");
+ return;
+ }
+ };
// Start server if not running
if (ping().await).is_err() {
println!("Starting server...");
- let _res = self.start_server_daemon().await;
+ if let Err(e) = self.start_server_daemon().await {
+ println!("Failed to start server: {}", e);
+ return;
+ }
let _ = wait_until_server_is_up().await;
}
// Load model
- let memory_path = get_memory_path()
- .context("Retrieving memory_path failed")
- .unwrap();
+ let memory_path = match get_memory_path() {
+ Ok(path) => path,
+ Err(e) => {
+ println!("Failed to get memory path: {}", e);
+ return;
+ }
+ };
```
```rust
let v: Value = serde_json::from_str(data).unwrap();

// Check for metrics in the response
if let Some(metrics_obj) = v.get("metrics") {
    metrics = serde_json::from_value(metrics_obj.clone()).ok();
}
```
unwrap() on JSON parsing can panic on malformed SSE data.
If the server sends malformed JSON in the SSE stream, this will panic. Consider graceful error handling.
🔧 Proposed fix
```diff
// Parse JSON
- let v: Value = serde_json::from_str(data).unwrap();
+ let v: Value = match serde_json::from_str(data) {
+ Ok(v) => v,
+ Err(e) => {
+ eprintln!("Failed to parse JSON: {}", e);
+ continue;
+ }
+ };
```
Also, we do up to relay_count relays with the model, but we are not showing the aggregate of the total communication when it comes to total_tokens, latency, etc.
This is a tad different from a normal model conversation: because this is a memory model, we try to communicate with the model until we get <reply></reply>.
Right now we are only showing the metrics for the final reply, and not the aggregate.
So how should we approach this @feynon @kshitijgetsac?
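If the aggregate route is taken, one straightforward definition is: sum tokens and latency across all relays, take TTFT from the first relay, and recompute throughput from the totals. A sketch of that arithmetic (`aggregate` and the record shape here are hypothetical, not existing code):

```python
from dataclasses import dataclass

@dataclass
class RelayMetrics:  # per-relay record, mirroring the GenerationMetrics fields
    ttft_ms: float
    total_tokens: int
    total_latency_s: float

def aggregate(relays: list) -> dict:
    """Combine per-relay metrics into one summary for the whole conversation."""
    total_tokens = sum(r.total_tokens for r in relays)
    total_latency = sum(r.total_latency_s for r in relays)
    return {
        "ttft_ms": relays[0].ttft_ms if relays else 0.0,  # first token of the first relay
        "total_tokens": total_tokens,
        "total_latency_s": total_latency,
        "tokens_per_second": total_tokens / total_latency if total_latency > 0 else 0.0,
    }
```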
```rust
if let Some(metrics) = response.metrics {
    println!(
        "\n{} {:.1} tok/s | {} tokens | {:.0}ms TTFT",
        "💡".yellow(),
        metrics.tokens_per_second,
        metrics.total_tokens,
        metrics.ttft_ms
    );
}
```
```python
if reasoning_parser:
    yield from reasoning_parser.finalize()
```
The StreamingReasoningParser buffers and formats this content as it streams. When generation stops (either naturally or via stop token), we need to:
- Flush any buffered content - The parser may have partial reasoning text that hasn't been yielded yet
- Finalize formatting - add closing markers or format the final output properly
If we skip finalize() when a stop token triggers an early return:
- Buffered reasoning content would be lost
- The response might be incomplete or malformed
- The user would see truncated output
The condition just ensures we only call it when the model is a reasoning model (the parser is only created for those models).
Should we benchmark just basic prompts which don't use memory (mem-agent is smart enough to understand it), or do the benchmarking of the memory usage?
```rust
const DEFAULT_MODELFILE: &str = "FROM driaforall/mem-agent-mlx-4bit";

// Parse modelfile
let modelfile_parse_result = if let Some(modelfile_str) = run_args.modelfile_path {
    tilekit::modelfile::parse_from_file(modelfile_str.as_str())
} else {
    tilekit::modelfile::parse(DEFAULT_MODELFILE)
};

let modelfile = match modelfile_parse_result {
    Ok(mf) => mf,
    Err(_err) => {
        println!("Invalid Modelfile");
        return;
    }
};

let modelname = modelfile.from.as_ref().unwrap();
```
There is some code duplication in bench and run, can we move them to a single function and use it?
Yes @madclaws, I tried the pycode thing and the pycache wasn't getting cleared; sure, I will change this part.
Ok, so if you check the bundler.sh and install.sh: during packing the release we remove the pycache folders, and while extracting as part of installation we remove the existing server folder first too. So this won't be an issue in prod.
If you want to test this:
- go to the tiles root project directory
- run `just bundle`
- run `just install`
Now you can try tiles run from anywhere.
No memory; we just need to benchmark pure token inference.
…rics
- Add `tiles bench` command for running benchmarks
- Track TTFT, tokens/sec, total tokens, and latency
- Display metrics after each REPL response
- Save benchmark results to ~/.config/tiles/benchmark_log.jsonl
- Add GenerationMetrics dataclass in Python server
I think we should do the aggregate of the total communication.
2e50868 to 621a7e0
Also for your aggregate thing, we just have to show the results for one particular chat, and that is already implemented.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @server/backend/mlx_runner.py:
- Around line 631-633: The TTFT calculation is done after the stop-token checks,
so if the first generated token is a stop token the function returns before ttft
is set; move the ttft capture (set ttft = time.time() - start_time) immediately
after the code that obtains the first generated token (i.e., right after the
token is produced/decoded) and before any stop-token checks/early returns, and
then remove the existing conditional block that sets ttft later (the ttft is
None check at the end).
- Around line 572-584: The metrics calculation uses tokens_generated before
incrementing it when an early native/chat stop token is handled, causing
total_tokens to be off by one; update the stop-token branches (the branch that
yields reasoning_parser.finalize() and the chat stop branch referenced around
the same area) to increment tokens_generated immediately upon consuming the stop
token (i.e., before computing total_latency, tokens_per_second, ttft_ms and
yielding GenerationMetrics), ensuring GenerationMetrics.ttft_ms, total_tokens,
tokens_per_second and total_latency_s reflect the consumed stop token; apply the
same increment fix in the chat stop token branch mentioned around line 615.
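Taken together, the two fixes amount to reordering the loop body: count the token and capture TTFT as soon as it is consumed, before any stop-token early return, so the emitted metrics include the stop token and a TTFT even when the first token stops generation. A minimal sketch of that ordering (illustrative only; `is_stop` and `make_metrics` are hypothetical helpers, not mlx_runner functions):

```python
import time

def generate_with_metrics(tokens, is_stop, make_metrics):
    """Count tokens and capture TTFT before stop-token checks and early returns."""
    start_time = time.time()
    ttft = None
    tokens_generated = 0
    for token in tokens:
        tokens_generated += 1                # counted even if it is a stop token
        if ttft is None:
            ttft = time.time() - start_time  # captured before any early return
        if is_stop(token):
            yield make_metrics(start_time, tokens_generated, ttft)
            return
        yield token
    yield make_metrics(start_time, tokens_generated, ttft)
```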
🧹 Nitpick comments (4)
server/schemas.py (1)
69-75: Consider adding validation for metric values. The `GenerationMetrics` dataclass is well-structured, but consider adding validation to ensure metrics are non-negative (e.g., `ttft_ms >= 0`, `total_tokens >= 0`). This would catch bugs earlier if invalid values are produced. If validation is desired, you could either:
- Use Pydantic's `BaseModel` with `Field` validators
- Add `__post_init__` to the dataclass to validate values
Example with dataclass validation
```python
@dataclass
class GenerationMetrics:
    """Benchmarking metrics for token generation."""
    ttft_ms: float  # Time to first token in milliseconds
    total_tokens: int  # Total tokens generated
    tokens_per_second: float  # Throughput
    total_latency_s: float  # End-to-end latency in seconds

    def __post_init__(self):
        if self.ttft_ms < 0:
            raise ValueError("ttft_ms must be non-negative")
        if self.total_tokens < 0:
            raise ValueError("total_tokens must be non-negative")
        if self.tokens_per_second < 0:
            raise ValueError("tokens_per_second must be non-negative")
        if self.total_latency_s < 0:
            raise ValueError("total_latency_s must be non-negative")
```
tiles/src/runtime/mlx.rs (3)
22-28: Consider making BenchmarkMetrics fields public. The `BenchmarkMetrics` fields are private, which is fine for internal use within this module. However, if this struct is intended to be used as a data transfer object across module boundaries, consider making the fields `pub` for easier access and a more conventional Rust API for serializable data structures.
Proposed change
```diff
 #[derive(Debug, Deserialize, Serialize)]
 pub struct BenchmarkMetrics {
-    ttft_ms: f64,
-    total_tokens: i32,
-    tokens_per_second: f64,
-    total_latency_s: f64,
+    pub ttft_ms: f64,
+    pub total_tokens: i32,
+    pub tokens_per_second: f64,
+    pub total_latency_s: f64,
 }
```
45-59: Error details lost in parse failure message. When `parse_from_file` or `parse` fails, the actual error is discarded and only "Invalid Modelfile" is printed. Including the underlying error would significantly improve debuggability.
Proposed fix
```diff
 match parse_result {
     Ok(mf) => Some(mf),
-    Err(_) => {
-        println!("Invalid Modelfile");
+    Err(e) => {
+        println!("Invalid Modelfile: {}", e);
         None
     }
 }
```
185-213: Remove redundant import.
`std::io::Write` is already imported at the module level (line 10), so the local import at line 207 is unnecessary.
Proposed fix
```diff
 let mut file = std::fs::OpenOptions::new()
     .create(true)
     .append(true)
     .open(&log_file)?;
-use std::io::Write;
 writeln!(file, "{}", log_entry)?;
```
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- server/backend/mlx.py
- server/backend/mlx_runner.py
- server/schemas.py
- tiles/Cargo.toml
- tiles/src/commands/mod.rs
- tiles/src/main.rs
- tiles/src/runtime/cpu.rs
- tiles/src/runtime/mlx.rs
- tiles/src/runtime/mod.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- tiles/src/main.rs
- tiles/Cargo.toml
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}
⚙️ CodeRabbit configuration file
Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.
Files:
- tiles/src/runtime/cpu.rs
- tiles/src/commands/mod.rs
- tiles/src/runtime/mlx.rs
- tiles/src/runtime/mod.rs
🧬 Code graph analysis (6)
tiles/src/runtime/cpu.rs (3)
- tiles/src/commands/mod.rs (1): bench (22-24)
- tiles/src/runtime/mlx.rs (1): bench (134-182)
- tiles/src/runtime/mod.rs (1): bench (41-46)
server/backend/mlx.py (1)
- server/schemas.py (4): ChatMessage (16-18), ChatCompletionRequest (21-31), downloadRequest (65-66), GenerationMetrics (70-75)
server/backend/mlx_runner.py (2)
- server/schemas.py (1): GenerationMetrics (70-75)
- server/reasoning_utils.py (1): finalize (424-438)
tiles/src/commands/mod.rs (3)
- tiles/src/runtime/cpu.rs (1): bench (27-29)
- tiles/src/runtime/mlx.rs (1): bench (134-182)
- tiles/src/runtime/mod.rs (1): bench (41-46)
tiles/src/runtime/mlx.rs (3)
- tilekit/src/modelfile.rs (2): parse_from_file (246-251), parse (253-258)
- tiles/src/commands/mod.rs (2): run (6-8), bench (22-24)
- tiles/src/runtime/mod.rs (2): run (20-25), bench (41-46)
tiles/src/runtime/mod.rs (3)
- tiles/src/commands/mod.rs (1): bench (22-24)
- tiles/src/runtime/cpu.rs (1): bench (27-29)
- tiles/src/runtime/mlx.rs (1): bench (134-182)
🔇 Additional comments (15)
tiles/src/runtime/cpu.rs (1)
27-29: LGTM! Stub implementation follows the established pattern. The `bench` method correctly mirrors the existing stub pattern for other CPURuntime methods. This is appropriate for the placeholder CPU runtime implementation.
tiles/src/commands/mod.rs (1)
22-24: LGTM! Clean delegation pattern. The `bench` command function correctly follows the established pattern of delegating to the runtime layer, consistent with `run`, `start_server`, and `stop_server` commands.
tiles/src/runtime/mod.rs (1)
41-46: LGTM! Correct enum dispatch pattern. The `bench` method properly dispatches to the concrete runtime implementations, following the same pattern as `run`, `start_server_daemon`, and `stop_server_daemon`.
server/schemas.py (1)
3-3: LGTM! Appropriate import for dataclass.
server/backend/mlx.py (4)
4-4: LGTM! Correct import of GenerationMetrics.
116-116: LGTM! Appropriate initialization of metrics variable.
129-132: LGTM! Correct metrics capture logic. The implementation correctly identifies `GenerationMetrics` objects in the stream and stores them without emitting them as text chunks. The `continue` statement appropriately skips the yield for metrics objects.
165-181: LGTM! Correct conditional metrics inclusion in final response. The implementation properly checks for the presence of metrics and conditionally includes them in the final streaming response. The field extraction is correct and matches the `GenerationMetrics` schema.
server/backend/mlx_runner.py (2)
652-662: LGTM—metrics correctly calculated at normal termination. The metrics calculation here is correct for the normal generation completion path (EOS or max_tokens reached).
23-23: LGTM—import matches schema definition. The GenerationMetrics import aligns with the schema definition in server/schemas.py.
tiles/src/runtime/mlx.rs (5)
102-102: Note: `PYTHONDONTWRITEBYTECODE` only set in release builds. As mentioned in the PR discussion, this environment variable is set only when the server is started via `start_server_daemon`, which is skipped in debug mode (see line 275: `if !cfg!(debug_assertions)`). This means `.pyc` files are only suppressed in release builds. If this is intentional, consider adding a comment explaining the behavior; otherwise, verify whether this matches the intended design.
Based on the PR discussion around .pyc caching behavior.
134-182: LGTM—benchmark implementation is clean and functional. The `bench` method correctly:
- Ensures the server is running
- Loads the model
- Executes a fixed benchmark prompt for consistency
- Displays and persists metrics
Using a hardcoded simple prompt ("What is 2+2?") ensures reproducible benchmarks focused on token inference performance, aligning with the PR objectives.
425-446: LGTM—metrics correctly extracted and propagated. The metrics are properly parsed from the streaming response JSON and passed through to `ChatResponse`. The optional handling ensures backward compatibility when metrics are absent.
328-337: LGTM—clear and concise metrics display. The metrics formatting in the REPL provides immediate feedback with throughput, token count, and TTFT in an easy-to-read format.
67-69: LGTM—clean early-return pattern. The `let Some(...) else { return; }` pattern is idiomatic Rust and cleanly centralizes modelfile parsing with early exit on failure.
```python
# Yield metrics before returning
if reasoning_parser:
    yield from reasoning_parser.finalize()
total_latency = time.time() - start_time
tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
ttft_ms = (ttft * 1000) if ttft is not None else 0
yield GenerationMetrics(
    ttft_ms=ttft_ms,
    total_tokens=tokens_generated,
    tokens_per_second=tokens_per_second,
    total_latency_s=total_latency
)
```
Same token-count issue for chat stop tokens.
This branch has the same token-counting issue as the native stop token branch (lines 572-584). See the earlier comment for details and fix.
If that's the case, then it should be mentioned, right?
Yeah sure, will update the description.
thanks to @kshitijgetsac for the reference #43
Closing this, since I have included this in a PR with some extra additions. Thanks.

`tiles bench` command for running benchmarks