
feat: add benchmarking harness with TTFT, throughput, and latency metrics#43

Closed
kshitijgetsac wants to merge 1 commit into tilesprivacy:main from kshitijgetsac:chat-benchmarking-v2

Conversation

@kshitijgetsac
Contributor

  • Add tiles bench command for running benchmarks
  • Track TTFT, tokens/sec, total tokens, and latency
  • Display metrics after each REPL response
  • Save benchmark results to ~/.config/tiles/benchmark_log.jsonl
  • Add GenerationMetrics dataclass in the Python server. I had a problem with the compiled Python code: the pycache wasn't getting cleared at all and the server somehow kept running the old code, so I added .env("PYTHONDONTWRITEBYTECODE", "1") to disable .pyc caching. @madclaws, if this is a problem we can discuss it.
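
For context, a minimal sketch of how that flag might be set on the spawned server process; the entrypoint path here is an assumption, not the repository's actual layout:

use std::process::Command;

// Spawn the Python server with bytecode caching disabled so stale
// .pyc files can't shadow edited source between runs.
fn spawn_server() -> std::io::Result<std::process::Child> {
    Command::new("python3")
        .arg("server_entrypoint.py") // assumed path
        .env("PYTHONDONTWRITEBYTECODE", "1")
        .spawn()
}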

@coderabbitai

coderabbitai Bot commented Jan 8, 2026

📝 Walkthrough

Adds generation-metrics support end-to-end. A new Python dataclass GenerationMetrics is emitted during token streaming in mlx_runner.py; mlx.py consumes these metric objects and attaches a metrics object to the final chat completion chunk. On the Rust side, BenchmarkMetrics and an optional metrics field were added; a new bench CLI subcommand and runtime bench methods run a benchmark prompt, collect metrics, and append them to a JSONL log.
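
As a rough, self-contained sketch of the streaming shape described above (names and logic are illustrative, not the repository code), the runner-side generator might look like:

import time
from dataclasses import dataclass
from typing import Iterator, Union

@dataclass
class GenerationMetrics:
    ttft_ms: float
    total_tokens: int
    tokens_per_second: float
    total_latency_s: float

def generate_streaming(tokens: list) -> Iterator[Union[str, GenerationMetrics]]:
    start = time.time()
    ttft = None
    for tok in tokens:
        if ttft is None:
            ttft = time.time() - start  # record time to first token
        yield tok
    total = time.time() - start
    # The final item in the stream carries the metrics.
    yield GenerationMetrics(
        ttft_ms=(ttft or 0.0) * 1000,
        total_tokens=len(tokens),
        tokens_per_second=len(tokens) / total if total > 0 else 0.0,
        total_latency_s=total,
    )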

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server.mlx as MLX API (mlx.py)
    participant Server.runner as MLX Runner (mlx_runner.py)
    participant Model
    participant ClientStream as Client Stream

    Client->>Server.mlx: start chat stream request
    Server.mlx->>Server.runner: start generation stream
    Server.runner->>Model: request tokens (streaming)
    Model-->>Server.runner: token
    alt first token
        Server.runner->>Server.runner: record ttft
    end
    Server.runner-->>Server.mlx: token (chunks) or GenerationMetrics objects intermittently
    Server.mlx-->>ClientStream: emit token chunks (skip metric-only chunks)
    Server.runner-->>Server.mlx: final GenerationMetrics at end-of-stream
    Server.mlx->>ClientStream: final completion chunk including metrics field
sequenceDiagram
    participant UserCLI as tiles CLI
    participant Runtime
    participant MLXRuntime
    participant ServerProcess
    participant LogFile

    UserCLI->>Runtime: commands::bench(modelfile)
    Runtime->>MLXRuntime: bench(run_args)
    MLXRuntime->>ServerProcess: ensure server running & send benchmark prompt
    ServerProcess->>MLXRuntime: streaming tokens + GenerationMetrics
    MLXRuntime->>MLXRuntime: convert metrics to BenchmarkMetrics
    MLXRuntime->>LogFile: append benchmark JSONL (timestamp, model, metrics)
    MLXRuntime->>UserCLI: print metrics
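For illustration, one line of ~/.config/tiles/benchmark_log.jsonl might look like this; the field names are taken from the diagram and the BenchmarkMetrics struct reviewed below, but the exact schema, values, and timestamp format are assumptions:

{"timestamp": "2026-01-09T12:00:00Z", "model": "driaforall/mem-agent-mlx-4bit", "metrics": {"ttft_ms": 42.0, "total_tokens": 128, "tokens_per_second": 35.5, "total_latency_s": 3.6}}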

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 22.73%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title accurately describes the main feature: adding a benchmarking harness with performance metrics (TTFT, throughput, latency).
  • Description check ✅ Passed: the description clearly outlines the key changes: the new bench command, metrics tracking, REPL display, log persistence, and the Python GenerationMetrics addition.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
server/backend/mlx_runner.py (2)

426-426: Return type annotation is inconsistent with yielded values.

The method signature declares Iterator[str] but now yields both str and GenerationMetrics objects. This breaks type safety and will cause issues for callers that expect only strings (e.g., interactive_chat at line 849 prints tokens directly, which would print the raw GenerationMetrics object).

🔧 Proposed fix

Update the return type and handle metrics in callers:

-    ) -> Iterator[str]:
+    ) -> Iterator[str | GenerationMetrics]:

Or consider yielding metrics through a separate mechanism (e.g., callback, out-parameter, or wrapper class) to maintain backward compatibility.
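
A hedged sketch of that callback alternative (invented names, not the repository's API; a plain dict stands in for GenerationMetrics so the sketch stays self-contained):

import time
from typing import Callable, Iterator, Optional

MetricsCallback = Callable[[dict], None]

def generate_streaming(tokens: list, on_metrics: Optional[MetricsCallback] = None) -> Iterator[str]:
    start = time.time()
    for tok in tokens:
        yield tok  # the stream itself remains Iterator[str]
    total = time.time() - start
    if on_metrics is not None:
        # Metrics travel through the callback, not the token stream.
        on_metrics({"total_tokens": len(tokens), "total_latency_s": total})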

Also applies to: 572-584


838-851: Callers need to handle GenerationMetrics objects in the stream.

This loop prints each yielded item directly. Since generate_streaming now yields GenerationMetrics objects, this will print raw object representations like GenerationMetrics(ttft_ms=12.5, ...) to the user.

🔧 Proposed fix
                 for token in self.generate_streaming(
                     prompt=prompt,
                     max_tokens=max_tokens,
                     temperature=temperature,
                     top_p=top_p,
                     repetition_penalty=repetition_penalty,
                     use_chat_template=False,  # Already applied in _format_conversation
                     use_chat_stop_tokens=True,  # Enable chat stop tokens in interactive mode
                     interactive=True,  # Enable full context length for interactive mode
                 ):
-                    # Stream all tokens directly (already formatted by generate_streaming)
-                    print(token, end="", flush=True)
-                    response_tokens.append(token)
+                    if isinstance(token, GenerationMetrics):
+                        # Handle metrics separately if needed
+                        continue
+                    # Stream all tokens directly (already formatted by generate_streaming)
+                    print(token, end="", flush=True)
+                    response_tokens.append(token)
🤖 Fix all issues with AI agents
In @tiles/src/runtime/mlx.rs:
- Around line 449-454: The code currently calls
serde_json::from_str(data).unwrap() (in mlx.rs) which will panic on malformed
SSE JSON; change it to handle parsing errors gracefully by replacing unwrap()
with a fallible path (e.g., match or if let Ok(v) =
serde_json::from_str::<Value>(data) { ... } else { log the parse error and
continue; }); ensure subsequent code that reads v.get("metrics") only runs when
parsing succeeded and preserve existing metrics assignment logic (the metrics =
serde_json::from_value(metrics_obj.clone()).ok() line) inside the successful
branch.
🧹 Nitpick comments (4)
tiles/src/runtime/cpu.rs (1)

27-29: Consider providing a cleaner error for unsupported platforms.

The unimplemented!() macro will panic if tiles bench is invoked on non-macOS systems (where CPURuntime is selected). A friendlier approach would print a message and return gracefully, similar to how the MLX bench handles errors.

♻️ Suggested improvement
     pub async fn bench(&self, _run_args: super::RunArgs) {
-        unimplemented!()
+        eprintln!("Benchmarking is not yet supported on this platform (CPU runtime).");
     }
server/backend/mlx_runner.py (1)

572-584: Consider extracting duplicated metrics calculation logic.

The same metrics calculation pattern is repeated in three places (native stop token path, chat stop token path, and end of generation). This increases maintenance burden and risk of inconsistencies.

♻️ Suggested refactor
+    def _create_generation_metrics(self, start_time: float, tokens_generated: int, ttft: Optional[float]) -> GenerationMetrics:
+        total_latency = time.time() - start_time
+        tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
+        ttft_ms = (ttft * 1000) if ttft is not None else 0
+        return GenerationMetrics(
+            ttft_ms=ttft_ms,
+            total_tokens=tokens_generated,
+            tokens_per_second=tokens_per_second,
+            total_latency_s=total_latency
+        )

Then replace the inline calculations with calls to this helper.

Also applies to: 615-627, 652-662

tiles/src/runtime/mlx.rs (2)

215-215: Remove redundant import.

std::io::Write is already imported at line 10. This inline import is unnecessary.

♻️ Proposed fix
     // Append to log file
     let mut file = std::fs::OpenOptions::new()
         .create(true)
         .append(true)
         .open(&log_file)?;

-    use std::io::Write;
     writeln!(file, "{}", log_entry)?;

129-145: Consider extracting shared modelfile parsing logic.

The modelfile parsing code (lines 129-145) is nearly identical to lines 48-64 in run(). Extracting a helper method would reduce duplication.

♻️ Suggested refactor
fn parse_modelfile(modelfile_path: Option<&String>) -> Result<Modelfile, ()> {
    const DEFAULT_MODELFILE: &str = "FROM driaforall/mem-agent-mlx-4bit";
    
    let result = if let Some(path) = modelfile_path {
        tilekit::modelfile::parse_from_file(path.as_str())
    } else {
        tilekit::modelfile::parse(DEFAULT_MODELFILE)
    };
    
    match result {
        Ok(mf) => Ok(mf),
        Err(_) => {
            println!("Invalid Modelfile");
            Err(())
        }
    }
}

Then use self.parse_modelfile(run_args.modelfile_path.as_ref()) in both methods.

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe14a70 and 2e50868.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • server/backend/mlx.py
  • server/backend/mlx_runner.py
  • server/schemas.py
  • tiles/Cargo.toml
  • tiles/src/commands/mod.rs
  • tiles/src/main.rs
  • tiles/src/runtime/cpu.rs
  • tiles/src/runtime/mlx.rs
  • tiles/src/runtime/mod.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}

⚙️ CodeRabbit configuration file

Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.

Files:

  • tiles/src/commands/mod.rs
  • tiles/src/runtime/mod.rs
  • tiles/src/main.rs
  • tiles/src/runtime/cpu.rs
  • tiles/Cargo.toml
  • tiles/src/runtime/mlx.rs
🧬 Code graph analysis (7)
tiles/src/commands/mod.rs (3)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mlx.rs (1)
  • bench (129-190)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
tiles/src/runtime/mod.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mlx.rs (1)
  • bench (129-190)
tiles/src/main.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/mlx.rs (1)
  • bench (129-190)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
tiles/src/runtime/cpu.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/mlx.rs (1)
  • bench (129-190)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
server/backend/mlx_runner.py (2)
server/schemas.py (1)
  • GenerationMetrics (70-75)
server/reasoning_utils.py (1)
  • finalize (424-438)
server/backend/mlx.py (1)
server/schemas.py (1)
  • GenerationMetrics (70-75)
tiles/src/runtime/mlx.rs (4)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
tilekit/src/modelfile.rs (2)
  • parse_from_file (246-251)
  • parse (253-258)
🔇 Additional comments (14)
tiles/Cargo.toml (1)

17-17: LGTM!

Adding chrono for timestamp handling in benchmark logging is appropriate. The default features should suffice for the benchmarking use case.
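
For reference, a minimal sketch of the kind of timestamp call chrono enables; the exact call site in the benchmark logger is an assumption:

// Produce an RFC 3339 timestamp suitable for a JSONL log entry.
fn rfc3339_now() -> String {
    chrono::Utc::now().to_rfc3339()
}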

server/schemas.py (1)

69-75: LGTM!

Using a plain @dataclass for internal metrics transfer is a good choice—it avoids Pydantic overhead for a structure that doesn't require request validation. The field types and documentation are clear.

tiles/src/commands/mod.rs (1)

21-24: LGTM!

The bench command handler follows the established pattern of delegating to the runtime layer. Clean and consistent with the other command functions.

tiles/src/main.rs (2)

25-29: LGTM!

The Bench subcommand is well-integrated with the CLI. The doc comment clearly describes the purpose.


79-85: Reasonable approach for initial implementation.

Setting relay_count: 0 with an explanatory comment is clear. If bench-specific options are needed later (warmup iterations, repetitions, etc.), consider introducing a BenchFlags struct similar to RunFlags.
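
Illustratively, such a struct could look like the following; the field names are assumptions, not existing code:

// Hypothetical bench-specific options, mirroring the RunFlags pattern.
pub struct BenchFlags {
    pub warmup_iterations: u32, // untimed runs before measurement
    pub repetitions: u32,       // timed runs to aggregate
}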

tiles/src/runtime/mod.rs (1)

40-46: LGTM!

The bench dispatch follows the established pattern for runtime method delegation. Returning () is consistent with how errors are handled internally in the MLX implementation.

server/backend/mlx.py (3)

4-4: LGTM!

Import of GenerationMetrics is correctly added to support the new metrics flow.


116-132: Clean integration of metrics collection into the streaming loop.

Using isinstance() to differentiate the final metrics object from string tokens is straightforward and effective. The continue ensures metrics aren't accidentally emitted as content.


165-181: Metrics propagation looks good.

The conditional inclusion of metrics in the final response is clean. Explicitly constructing the dict from dataclass fields ensures proper JSON serialization.
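
As a side note, dataclasses.asdict would produce the same JSON-ready dict; a standalone sketch, not the repository code:

from dataclasses import dataclass, asdict

@dataclass
class GenerationMetrics:
    ttft_ms: float
    total_tokens: int
    tokens_per_second: float
    total_latency_s: float

m = GenerationMetrics(ttft_ms=42.0, total_tokens=128, tokens_per_second=35.5, total_latency_s=3.6)
print(asdict(m))  # JSON-serializable dict mirroring the explicit construction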

server/backend/mlx_runner.py (2)

23-23: LGTM!

Import added correctly to support the new metrics emission feature.


631-634: LGTM!

TTFT capture logic correctly measures time to first visible token. Placing it after stop token checks ensures we're measuring meaningful latency.

tiles/src/runtime/mlx.rs (3)

22-28: LGTM!

The BenchmarkMetrics struct correctly mirrors the Python GenerationMetrics dataclass from server/schemas.py with matching field names for proper deserialization.


335-345: LGTM!

Clear and concise metrics display in REPL. The formatting provides useful performance visibility without being intrusive.


97-97: LGTM with a note.

Setting PYTHONDONTWRITEBYTECODE=1 prevents stale .pyc files as noted in the PR description. This is a reasonable trade-off - minor startup performance impact for more predictable behavior during development cycles.

Comment thread tiles/src/runtime/mlx.rs
Comment on lines +147 to +163
let modelname = modelfile.from.as_ref().unwrap();

// Start server if not running
if (ping().await).is_err() {
    println!("Starting server...");
    let _res = self.start_server_daemon().await;
    let _ = wait_until_server_is_up().await;
}

// Load model
let memory_path = get_memory_path()
    .context("Retrieving memory_path failed")
    .unwrap();
if let Err(e) = load_model(modelname, &memory_path).await {
    println!("Failed to load model: {}", e);
    return;
}

⚠️ Potential issue | 🟠 Major

Several unwrap() calls risk panics; server start failure isn't handled.

  1. Line 147: unwrap() on modelfile.from will panic if from is None.
  2. Line 152-153: If start_server_daemon fails, the error is discarded and wait_until_server_is_up() may loop indefinitely.
  3. Line 159: unwrap() on get_memory_path() will panic on failure.
🔧 Proposed fix
-        let modelname = modelfile.from.as_ref().unwrap();
+        let modelname = match modelfile.from.as_ref() {
+            Some(name) => name,
+            None => {
+                println!("Modelfile missing 'FROM' directive");
+                return;
+            }
+        };

         // Start server if not running
         if (ping().await).is_err() {
             println!("Starting server...");
-            let _res = self.start_server_daemon().await;
+            if let Err(e) = self.start_server_daemon().await {
+                println!("Failed to start server: {}", e);
+                return;
+            }
             let _ = wait_until_server_is_up().await;
         }

         // Load model
-        let memory_path = get_memory_path()
-            .context("Retrieving memory_path failed")
-            .unwrap();
+        let memory_path = match get_memory_path() {
+            Ok(path) => path,
+            Err(e) => {
+                println!("Failed to get memory path: {}", e);
+                return;
+            }
+        };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

let modelname = modelfile.from.as_ref().unwrap();

// Start server if not running
if (ping().await).is_err() {
    println!("Starting server...");
    let _res = self.start_server_daemon().await;
    let _ = wait_until_server_is_up().await;
}

// Load model
let memory_path = get_memory_path()
    .context("Retrieving memory_path failed")
    .unwrap();
if let Err(e) = load_model(modelname, &memory_path).await {
    println!("Failed to load model: {}", e);
    return;
}

After:

let modelname = match modelfile.from.as_ref() {
    Some(name) => name,
    None => {
        println!("Modelfile missing 'FROM' directive");
        return;
    }
};

// Start server if not running
if (ping().await).is_err() {
    println!("Starting server...");
    if let Err(e) = self.start_server_daemon().await {
        println!("Failed to start server: {}", e);
        return;
    }
    let _ = wait_until_server_is_up().await;
}

// Load model
let memory_path = match get_memory_path() {
    Ok(path) => path,
    Err(e) => {
        println!("Failed to get memory path: {}", e);
        return;
    }
};
if let Err(e) = load_model(modelname, &memory_path).await {
    println!("Failed to load model: {}", e);
    return;
}

Comment thread tiles/src/runtime/mlx.rs
Comment on lines 449 to +454
let v: Value = serde_json::from_str(data).unwrap();

// Check for metrics in the response
if let Some(metrics_obj) = v.get("metrics") {
    metrics = serde_json::from_value(metrics_obj.clone()).ok();
}

⚠️ Potential issue | 🟡 Minor

unwrap() on JSON parsing can panic on malformed SSE data.

If the server sends malformed JSON in the SSE stream, this will panic. Consider graceful error handling.

🔧 Proposed fix
             // Parse JSON
-            let v: Value = serde_json::from_str(data).unwrap();
+            let v: Value = match serde_json::from_str(data) {
+                Ok(v) => v,
+                Err(e) => {
+                    eprintln!("Failed to parse JSON: {}", e);
+                    continue;
+                }
+            };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

let v: Value = serde_json::from_str(data).unwrap();

// Check for metrics in the response
if let Some(metrics_obj) = v.get("metrics") {
    metrics = serde_json::from_value(metrics_obj.clone()).ok();
}

After:

let v: Value = match serde_json::from_str(data) {
    Ok(v) => v,
    Err(e) => {
        eprintln!("Failed to parse JSON: {}", e);
        continue;
    }
};

// Check for metrics in the response
if let Some(metrics_obj) = v.get("metrics") {
    metrics = serde_json::from_value(metrics_obj.clone()).ok();
}
🤖 Prompt for AI Agents
In @tiles/src/runtime/mlx.rs around lines 449 - 454, The code currently calls
serde_json::from_str(data).unwrap() (in mlx.rs) which will panic on malformed
SSE JSON; change it to handle parsing errors gracefully by replacing unwrap()
with a fallible path (e.g., match or if let Ok(v) =
serde_json::from_str::<Value>(data) { ... } else { log the parse error and
continue; }); ensure subsequent code that reads v.get("metrics") only runs when
parsing succeeded and preserve existing metrics assignment logic (the metrics =
serde_json::from_value(metrics_obj.clone()).ok() line) inside the successful
branch.

@madclaws madclaws linked an issue Jan 8, 2026 that may be closed by this pull request
Member

@madclaws madclaws left a comment


Also, we do at most relay_count relays with the model, but we are not showing the aggregate of the total communication when it comes to total_tokens, latency, etc.

This is a tad different from a normal model conversation: because this is a memory model, we keep communicating with the model until we get <reply></reply>.

Right now we are only showing the metrics for the final reply, not the aggregate.

So how should we approach this, @feynon @kshitijgetsac?

Comment thread tiles/src/runtime/mlx.rs
Comment on lines +337 to +345
if let Some(metrics) = response.metrics {
    println!(
        "\n{} {:.1} tok/s | {} tokens | {:.0}ms TTFT",
        "💡".yellow(),
        metrics.tokens_per_second,
        metrics.total_tokens,
        metrics.ttft_ms
    );
}
Member


[screenshot of the REPL metrics line]

Can we dim the token info and is it possible to reduce the size of the text relative to the model reply?

Comment on lines +617 to +618
if reasoning_parser:
    yield from reasoning_parser.finalize()
Member

@madclaws madclaws Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this check?

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StreamingReasoningParser buffers and formats this content as it streams. When generation stops (either naturally or via a stop token), we need to:

  1. Flush any buffered content: the parser may hold partial reasoning text that hasn't been yielded yet.
  2. Finalize formatting: add closing markers and format the final output properly.

If we skip finalize() when a stop token triggers an early return, buffered reasoning content would be lost, the response might be incomplete or malformed, and the user would see truncated output. The condition just ensures we only call it when the model is a reasoning model (the parser is only created for those models).
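
A minimal illustrative sketch of this flush-on-finalize pattern (not the repository's parser internals; the buffering rule here is invented):

from typing import Iterator

class StreamingReasoningParser:
    def __init__(self):
        self._buffer = ""

    def feed(self, chunk: str) -> Iterator[str]:
        # Buffer incoming text and emit only completed lines.
        self._buffer += chunk
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            yield line + "\n"

    def finalize(self) -> Iterator[str]:
        # Flush any partial content still buffered at end of generation.
        if self._buffer:
            yield self._buffer
            self._buffer = ""

def on_stop(parser):
    # The parser exists only for reasoning models, hence the truthiness check.
    if parser:
        yield from parser.finalize()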

@madclaws
Member

madclaws commented Jan 9, 2026

Should we benchmark just basic prompts that don't use memory (mem-agent is smart enough to understand it), or should we benchmark memory usage too?

@madclaws
Member

madclaws commented Jan 9, 2026

.env("PYTHONDONTWRITEBYTECODE", "1"): this part only runs in release mode, not in dev mode. Are you saying you tested this in a release build and found that the pycache was not getting cleared? @kshitijgetsac

Comment thread tiles/src/runtime/mlx.rs Outdated
Comment on lines +130 to +148
const DEFAULT_MODELFILE: &str = "FROM driaforall/mem-agent-mlx-4bit";

// Parse modelfile
let modelfile_parse_result = if let Some(modelfile_str) = run_args.modelfile_path {
    tilekit::modelfile::parse_from_file(modelfile_str.as_str())
} else {
    tilekit::modelfile::parse(DEFAULT_MODELFILE)
};

let modelfile = match modelfile_parse_result {
    Ok(mf) => mf,
    Err(_err) => {
        println!("Invalid Modelfile");
        return;
    }
};

let modelname = modelfile.from.as_ref().unwrap();

Member


There is some code duplication between bench and run; can we move it into a single function and use that?

Contributor Author


Yes @madclaws, I tried the pyc thing and the pycache wasn't getting cleared. Sure, I will change this part.

Member

@madclaws madclaws Jan 9, 2026


Ok, so if you check bundler.sh and install.sh: during packaging of the release we remove the pycache folders, and while extracting as part of installation we first remove the existing server folder too. So this won't be an issue in prod.

If you want to test this,

  • go to tiles root project directory.
  • run just bundle
  • run just install

Now you can try tiles run from anywhere.

@feynon
Member

feynon commented Jan 9, 2026

Should we benchmark just basic prompts that don't use memory (mem-agent is smart enough to understand it), or should we benchmark memory usage too?

No memory; we just need to benchmark pure token inference.

…rics

- Add `tiles bench` command for running benchmarks
- Track TTFT, tokens/sec, total tokens, and latency
- Display metrics after each REPL response
- Save benchmark results to ~/.config/tiles/benchmark_log.jsonl
- Add GenerationMetrics dataclass in Python server
@feynon
Member

feynon commented Jan 9, 2026

This is a tad different from a normal model conversation: because this is a memory model, we keep communicating with the model until we get <reply></reply>.

I think we should do the aggregate of the total communication.

@kshitijgetsac
Contributor Author

Also, for your aggregate thing: we just have to show the results for one particular chat, and that is already implemented.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @server/backend/mlx_runner.py:
- Around line 631-633: The TTFT calculation is done after the stop-token checks,
so if the first generated token is a stop token the function returns before ttft
is set; move the ttft capture (set ttft = time.time() - start_time) immediately
after the code that obtains the first generated token (i.e., right after the
token is produced/decoded) and before any stop-token checks/early returns, and
then remove the existing conditional block that sets ttft later (the ttft is
None check at the end).
- Around line 572-584: The metrics calculation uses tokens_generated before
incrementing it when an early native/chat stop token is handled, causing
total_tokens to be off by one; update the stop-token branches (the branch that
yields reasoning_parser.finalize() and the chat stop branch referenced around
the same area) to increment tokens_generated immediately upon consuming the stop
token (i.e., before computing total_latency, tokens_per_second, ttft_ms and
yielding GenerationMetrics), ensuring GenerationMetrics.ttft_ms, total_tokens,
tokens_per_second and total_latency_s reflect the consumed stop token; apply the
same increment fix in the chat stop token branch mentioned around line 615.
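
Taken together, a minimal sketch of the ordering these prompts ask for (illustrative names; a plain dict stands in for GenerationMetrics):

import time

def stream_with_metrics(tokens, stop_token="<eos>"):
    start_time = time.time()
    ttft = None
    tokens_generated = 0
    for tok in tokens:
        if ttft is None:
            ttft = time.time() - start_time  # captured before any early return
        if tok == stop_token:
            tokens_generated += 1  # count the consumed stop token first
            break
        tokens_generated += 1
        yield tok
    total = time.time() - start_time
    yield {
        "ttft_ms": (ttft or 0.0) * 1000,
        "total_tokens": tokens_generated,
        "tokens_per_second": tokens_generated / total if total > 0 else 0.0,
        "total_latency_s": total,
    }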
🧹 Nitpick comments (4)
server/schemas.py (1)

69-75: Consider adding validation for metric values.

The GenerationMetrics dataclass is well-structured, but consider adding validation to ensure metrics are non-negative (e.g., ttft_ms >= 0, total_tokens >= 0). This would catch bugs earlier if invalid values are produced.

If validation is desired, you could either:

  1. Use Pydantic's BaseModel with Field validators
  2. Add __post_init__ to the dataclass to validate values
Example with dataclass validation
@dataclass
class GenerationMetrics:
    """Benchmarking metrics for token generation."""
    ttft_ms: float  # Time to first token in milliseconds
    total_tokens: int  # Total tokens generated
    tokens_per_second: float  # Throughput
    total_latency_s: float  # End-to-end latency in seconds
    
    def __post_init__(self):
        if self.ttft_ms < 0:
            raise ValueError("ttft_ms must be non-negative")
        if self.total_tokens < 0:
            raise ValueError("total_tokens must be non-negative")
        if self.tokens_per_second < 0:
            raise ValueError("tokens_per_second must be non-negative")
        if self.total_latency_s < 0:
            raise ValueError("total_latency_s must be non-negative")
tiles/src/runtime/mlx.rs (3)

22-28: Consider making BenchmarkMetrics fields public.

The BenchmarkMetrics fields are private, which is fine for internal use within this module. However, if this struct is intended to be used as a data transfer object across module boundaries, consider making the fields pub for easier access and a more conventional Rust API for serializable data structures.

Proposed change
 #[derive(Debug, Deserialize, Serialize)]
 pub struct BenchmarkMetrics {
-    ttft_ms: f64,
-    total_tokens: i32,
-    tokens_per_second: f64,
-    total_latency_s: f64,
+    pub ttft_ms: f64,
+    pub total_tokens: i32,
+    pub tokens_per_second: f64,
+    pub total_latency_s: f64,
 }

45-59: Error details lost in parse failure message.

When parse_from_file or parse fails, the actual error is discarded and only "Invalid Modelfile" is printed. Including the underlying error would significantly improve debuggability.

Proposed fix
     match parse_result {
         Ok(mf) => Some(mf),
-        Err(_) => {
-            println!("Invalid Modelfile");
+        Err(e) => {
+            println!("Invalid Modelfile: {}", e);
             None
         }
     }

185-213: Remove redundant import.

std::io::Write is already imported at the module level (line 10), so the local import at line 207 is unnecessary.

Proposed fix
     let mut file = std::fs::OpenOptions::new()
         .create(true)
         .append(true)
         .open(&log_file)?;

-    use std::io::Write;
     writeln!(file, "{}", log_entry)?;
📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e50868 and 621a7e0.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • server/backend/mlx.py
  • server/backend/mlx_runner.py
  • server/schemas.py
  • tiles/Cargo.toml
  • tiles/src/commands/mod.rs
  • tiles/src/main.rs
  • tiles/src/runtime/cpu.rs
  • tiles/src/runtime/mlx.rs
  • tiles/src/runtime/mod.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • tiles/src/main.rs
  • tiles/Cargo.toml
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}

⚙️ CodeRabbit configuration file

Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.

Files:

  • tiles/src/runtime/cpu.rs
  • tiles/src/commands/mod.rs
  • tiles/src/runtime/mlx.rs
  • tiles/src/runtime/mod.rs
🧬 Code graph analysis (6)
tiles/src/runtime/cpu.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/mlx.rs (1)
  • bench (134-182)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
server/backend/mlx.py (1)
server/schemas.py (4)
  • ChatMessage (16-18)
  • ChatCompletionRequest (21-31)
  • downloadRequest (65-66)
  • GenerationMetrics (70-75)
server/backend/mlx_runner.py (2)
server/schemas.py (1)
  • GenerationMetrics (70-75)
server/reasoning_utils.py (1)
  • finalize (424-438)
tiles/src/commands/mod.rs (3)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mlx.rs (1)
  • bench (134-182)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
tiles/src/runtime/mlx.rs (3)
tilekit/src/modelfile.rs (2)
  • parse_from_file (246-251)
  • parse (253-258)
tiles/src/commands/mod.rs (2)
  • run (6-8)
  • bench (22-24)
tiles/src/runtime/mod.rs (2)
  • run (20-25)
  • bench (41-46)
tiles/src/runtime/mod.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mlx.rs (1)
  • bench (134-182)
🔇 Additional comments (15)
tiles/src/runtime/cpu.rs (1)

27-29: LGTM! Stub implementation follows the established pattern.

The bench method correctly mirrors the existing stub pattern for other CPURuntime methods. This is appropriate for the placeholder CPU runtime implementation.

tiles/src/commands/mod.rs (1)

22-24: LGTM! Clean delegation pattern.

The bench command function correctly follows the established pattern of delegating to the runtime layer, consistent with run, start_server, and stop_server commands.

tiles/src/runtime/mod.rs (1)

41-46: LGTM! Correct enum dispatch pattern.

The bench method properly dispatches to the concrete runtime implementations, following the same pattern as run, start_server_daemon, and stop_server_daemon.

server/schemas.py (1)

3-3: LGTM! Appropriate import for dataclass.

server/backend/mlx.py (4)

4-4: LGTM! Correct import of GenerationMetrics.


116-116: LGTM! Appropriate initialization of metrics variable.


129-132: LGTM! Correct metrics capture logic.

The implementation correctly identifies GenerationMetrics objects in the stream and stores them without emitting them as text chunks. The continue statement appropriately skips the yield for metrics objects.


165-181: LGTM! Correct conditional metrics inclusion in final response.

The implementation properly checks for the presence of metrics and conditionally includes them in the final streaming response. The field extraction is correct and matches the GenerationMetrics schema.

server/backend/mlx_runner.py (2)

652-662: LGTM—metrics correctly calculated at normal termination.

The metrics calculation here is correct for the normal generation completion path (EOS or max_tokens reached).


23-23: LGTM—import matches schema definition.

The GenerationMetrics import aligns with the schema definition in server/schemas.py.

tiles/src/runtime/mlx.rs (5)

102-102: Note: PYTHONDONTWRITEBYTECODE only set in release builds.

As mentioned in the PR discussion, this environment variable is set only when the server is started via start_server_daemon, which is skipped in debug mode (see line 275: if !cfg!(debug_assertions)). This means .pyc files are only suppressed in release builds. If this is intentional, consider adding a comment explaining the behavior; otherwise, verify whether this matches the intended design.

Based on the PR discussion around .pyc caching behavior.
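
A minimal sketch of that gating (assumed structure, not the actual start_server_daemon):

// Only release builds spawn the bundled server, so the env var has no
// effect when running in debug mode.
fn start_server_daemon() {
    if !cfg!(debug_assertions) {
        let _ = std::process::Command::new("python3")
            .env("PYTHONDONTWRITEBYTECODE", "1")
            .arg("server_entrypoint.py") // assumed path
            .spawn();
    }
}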


134-182: LGTM—benchmark implementation is clean and functional.

The bench method correctly:

  • Ensures the server is running
  • Loads the model
  • Executes a fixed benchmark prompt for consistency
  • Displays and persists metrics

Using a hardcoded simple prompt ("What is 2+2?") ensures reproducible benchmarks focused on token inference performance, aligning with the PR objectives.


425-446: LGTM—metrics correctly extracted and propagated.

The metrics are properly parsed from the streaming response JSON and passed through to ChatResponse. The optional handling ensures backward compatibility when metrics are absent.


328-337: LGTM—clear and concise metrics display.

The metrics formatting in the REPL provides immediate feedback with throughput, token count, and TTFT in an easy-to-read format.


67-69: LGTM—clean early-return pattern.

The let Some(...) else { return; } pattern is idiomatic Rust and cleanly centralizes modelfile parsing with early exit on failure.
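
For readers unfamiliar with the pattern, a tiny standalone illustration (names assumed):

fn bench(modelfile: Option<String>) {
    // Early return if parsing produced no modelfile.
    let Some(modelfile) = modelfile else {
        return;
    };
    println!("benchmarking with {modelfile}");
}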

Comment thread server/backend/mlx_runner.py
Comment on lines +615 to +627

# Yield metrics before returning
if reasoning_parser:
    yield from reasoning_parser.finalize()
total_latency = time.time() - start_time
tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
ttft_ms = (ttft * 1000) if ttft is not None else 0
yield GenerationMetrics(
    ttft_ms=ttft_ms,
    total_tokens=tokens_generated,
    tokens_per_second=tokens_per_second,
    total_latency_s=total_latency
)


⚠️ Potential issue | 🟡 Minor

Same token-count issue for chat stop tokens.

This branch has the same token-counting issue as the native stop token branch (lines 572-584). See the earlier comment for details and fix.

Comment thread server/backend/mlx_runner.py
@madclaws
Member

madclaws commented Jan 9, 2026

Also, for your aggregate thing: we just have to show the results for one particular chat, and that is already implemented.

If that's the case, then it should be mentioned, right?

@kshitijgetsac
Contributor Author

Yeah, sure, will update the description.

madclaws added a commit that referenced this pull request Jan 29, 2026
madclaws added a commit that referenced this pull request Jan 30, 2026
@madclaws
Member

Closing this, since we have included this in a PR with some extra additions. Thanks.

@madclaws madclaws closed this Jan 30, 2026

Development

Successfully merging this pull request may close these issues.

Create a benchmark for chat session

3 participants