
feat: add benchmarking harness with TTFT, throughput, and latency metrics#43

Closed
kshitijgetsac wants to merge 1 commit into tilesprivacy:main from kshitijgetsac:chat-benchmarking-v2

Conversation

@kshitijgetsac
Contributor

  • Add tiles bench command for running benchmarks
  • Track TTFT, tokens/sec, total tokens, and latency
  • Display metrics after each REPL response
  • Save benchmark results to ~/.config/tiles/benchmark_log.jsonl
  • Add GenerationMetrics dataclass in the Python server. I had a problem with the compiled Python code: the pycache wasn't getting cleared at all and the server somehow kept running the old code, so I added .env("PYTHONDONTWRITEBYTECODE", "1") to disable .pyc caching. @madclaws, if this is a problem we can discuss it.
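
For context, a minimal sketch of how that flag might be set on the spawned server process; the entrypoint path here is an assumption, not the repository's actual layout:

use std::process::Command;

// Spawn the Python server with bytecode caching disabled so stale
// .pyc files can't shadow edited source between runs.
fn spawn_server() -> std::io::Result<std::process::Child> {
    Command::new("python3")
        .arg("server_entrypoint.py") // assumed path
        .env("PYTHONDONTWRITEBYTECODE", "1")
        .spawn()
}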

@coderabbitai

coderabbitai Bot commented Jan 8, 2026

📝 Walkthrough

Adds generation-metrics support end-to-end. A new Python dataclass GenerationMetrics is emitted during token streaming in mlx_runner.py; mlx.py consumes these metric objects and attaches a metrics object to the final chat completion chunk. On the Rust side, BenchmarkMetrics and an optional metrics field were added; a new bench CLI subcommand and runtime bench methods run a benchmark prompt, collect metrics, and append them to a JSONL log.
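
As a rough, self-contained sketch of the streaming shape described above (names and logic are illustrative, not the repository code), the runner-side generator might look like:

import time
from dataclasses import dataclass
from typing import Iterator, Union

@dataclass
class GenerationMetrics:
    ttft_ms: float
    total_tokens: int
    tokens_per_second: float
    total_latency_s: float

def generate_streaming(tokens: list) -> Iterator[Union[str, GenerationMetrics]]:
    start = time.time()
    ttft = None
    for tok in tokens:
        if ttft is None:
            ttft = time.time() - start  # record time to first token
        yield tok
    total = time.time() - start
    # The final item in the stream carries the metrics.
    yield GenerationMetrics(
        ttft_ms=(ttft or 0.0) * 1000,
        total_tokens=len(tokens),
        tokens_per_second=len(tokens) / total if total > 0 else 0.0,
        total_latency_s=total,
    )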

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Server.mlx as MLX API (mlx.py)
    participant Server.runner as MLX Runner (mlx_runner.py)
    participant Model
    participant ClientStream as Client Stream

    Client->>Server.mlx: start chat stream request
    Server.mlx->>Server.runner: start generation stream
    Server.runner->>Model: request tokens (streaming)
    Model-->>Server.runner: token
    alt first token
        Server.runner->>Server.runner: record ttft
    end
    Server.runner-->>Server.mlx: token (chunks) or GenerationMetrics objects intermittently
    Server.mlx-->>ClientStream: emit token chunks (skip metric-only chunks)
    Server.runner-->>Server.mlx: final GenerationMetrics at end-of-stream
    Server.mlx->>ClientStream: final completion chunk including metrics field
sequenceDiagram
    participant UserCLI as tiles CLI
    participant Runtime
    participant MLXRuntime
    participant ServerProcess
    participant LogFile

    UserCLI->>Runtime: commands::bench(modelfile)
    Runtime->>MLXRuntime: bench(run_args)
    MLXRuntime->>ServerProcess: ensure server running & send benchmark prompt
    ServerProcess->>MLXRuntime: streaming tokens + GenerationMetrics
    MLXRuntime->>MLXRuntime: convert metrics to BenchmarkMetrics
    MLXRuntime->>LogFile: append benchmark JSONL (timestamp, model, metrics)
    MLXRuntime->>UserCLI: print metrics
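For illustration, one line of ~/.config/tiles/benchmark_log.jsonl might look like this; the field names are taken from the diagram and the BenchmarkMetrics struct reviewed below, but the exact schema, values, and timestamp format are assumptions:

{"timestamp": "2026-01-09T12:00:00Z", "model": "driaforall/mem-agent-mlx-4bit", "metrics": {"ttft_ms": 42.0, "total_tokens": 128, "tokens_per_second": 35.5, "total_latency_s": 3.6}}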

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 22.73%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title accurately describes the main feature: adding a benchmarking harness with performance metrics (TTFT, throughput, latency).
  • Description check ✅ Passed: the description clearly outlines the key changes: the new bench command, metrics tracking, REPL display, log persistence, and the Python GenerationMetrics addition.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
server/backend/mlx_runner.py (2)

426-426: Return type annotation is inconsistent with yielded values.

The method signature declares Iterator[str] but now yields both str and GenerationMetrics objects. This breaks type safety and will cause issues for callers that expect only strings (e.g., interactive_chat at line 849 prints tokens directly, which would print the raw GenerationMetrics object).

🔧 Proposed fix

Update the return type and handle metrics in callers:

-    ) -> Iterator[str]:
+    ) -> Iterator[str | GenerationMetrics]:

Or consider yielding metrics through a separate mechanism (e.g., callback, out-parameter, or wrapper class) to maintain backward compatibility.
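
A hedged sketch of that callback alternative (invented names, not the repository's API; a plain dict stands in for GenerationMetrics so the sketch stays self-contained):

import time
from typing import Callable, Iterator, Optional

MetricsCallback = Callable[[dict], None]

def generate_streaming(tokens: list, on_metrics: Optional[MetricsCallback] = None) -> Iterator[str]:
    start = time.time()
    for tok in tokens:
        yield tok  # the stream itself remains Iterator[str]
    total = time.time() - start
    if on_metrics is not None:
        # Metrics travel through the callback, not the token stream.
        on_metrics({"total_tokens": len(tokens), "total_latency_s": total})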

Also applies to: 572-584


838-851: Callers need to handle GenerationMetrics objects in the stream.

This loop prints each yielded item directly. Since generate_streaming now yields GenerationMetrics objects, this will print raw object representations like GenerationMetrics(ttft_ms=12.5, ...) to the user.

🔧 Proposed fix
                 for token in self.generate_streaming(
                     prompt=prompt,
                     max_tokens=max_tokens,
                     temperature=temperature,
                     top_p=top_p,
                     repetition_penalty=repetition_penalty,
                     use_chat_template=False,  # Already applied in _format_conversation
                     use_chat_stop_tokens=True,  # Enable chat stop tokens in interactive mode
                     interactive=True,  # Enable full context length for interactive mode
                 ):
-                    # Stream all tokens directly (already formatted by generate_streaming)
-                    print(token, end="", flush=True)
-                    response_tokens.append(token)
+                    if isinstance(token, GenerationMetrics):
+                        # Handle metrics separately if needed
+                        continue
+                    # Stream all tokens directly (already formatted by generate_streaming)
+                    print(token, end="", flush=True)
+                    response_tokens.append(token)
🤖 Fix all issues with AI agents
In @tiles/src/runtime/mlx.rs:
- Around line 449-454: The code currently calls
serde_json::from_str(data).unwrap() (in mlx.rs) which will panic on malformed
SSE JSON; change it to handle parsing errors gracefully by replacing unwrap()
with a fallible path (e.g., match or if let Ok(v) =
serde_json::from_str::<Value>(data) { ... } else { log the parse error and
continue; }); ensure subsequent code that reads v.get("metrics") only runs when
parsing succeeded and preserve existing metrics assignment logic (the metrics =
serde_json::from_value(metrics_obj.clone()).ok() line) inside the successful
branch.
🧹 Nitpick comments (4)
tiles/src/runtime/cpu.rs (1)

27-29: Consider providing a cleaner error for unsupported platforms.

The unimplemented!() macro will panic if tiles bench is invoked on non-macOS systems (where CPURuntime is selected). A friendlier approach would print a message and return gracefully, similar to how the MLX bench handles errors.

♻️ Suggested improvement
     pub async fn bench(&self, _run_args: super::RunArgs) {
-        unimplemented!()
+        eprintln!("Benchmarking is not yet supported on this platform (CPU runtime).");
     }
server/backend/mlx_runner.py (1)

572-584: Consider extracting duplicated metrics calculation logic.

The same metrics calculation pattern is repeated in three places (native stop token path, chat stop token path, and end of generation). This increases maintenance burden and risk of inconsistencies.

♻️ Suggested refactor
+    def _create_generation_metrics(self, start_time: float, tokens_generated: int, ttft: Optional[float]) -> GenerationMetrics:
+        total_latency = time.time() - start_time
+        tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
+        ttft_ms = (ttft * 1000) if ttft is not None else 0
+        return GenerationMetrics(
+            ttft_ms=ttft_ms,
+            total_tokens=tokens_generated,
+            tokens_per_second=tokens_per_second,
+            total_latency_s=total_latency
+        )

Then replace the inline calculations with calls to this helper.

Also applies to: 615-627, 652-662

tiles/src/runtime/mlx.rs (2)

215-215: Remove redundant import.

std::io::Write is already imported at line 10. This inline import is unnecessary.

♻️ Proposed fix
     // Append to log file
     let mut file = std::fs::OpenOptions::new()
         .create(true)
         .append(true)
         .open(&log_file)?;

-    use std::io::Write;
     writeln!(file, "{}", log_entry)?;

129-145: Consider extracting shared modelfile parsing logic.

The modelfile parsing code (lines 129-145) is nearly identical to lines 48-64 in run(). Extracting a helper method would reduce duplication.

♻️ Suggested refactor
fn parse_modelfile(modelfile_path: Option<&String>) -> Result<Modelfile, ()> {
    const DEFAULT_MODELFILE: &str = "FROM driaforall/mem-agent-mlx-4bit";
    
    let result = if let Some(path) = modelfile_path {
        tilekit::modelfile::parse_from_file(path.as_str())
    } else {
        tilekit::modelfile::parse(DEFAULT_MODELFILE)
    };
    
    match result {
        Ok(mf) => Ok(mf),
        Err(_) => {
            println!("Invalid Modelfile");
            Err(())
        }
    }
}

Then use self.parse_modelfile(run_args.modelfile_path.as_ref()) in both methods.

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe14a70 and 2e50868.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • server/backend/mlx.py
  • server/backend/mlx_runner.py
  • server/schemas.py
  • tiles/Cargo.toml
  • tiles/src/commands/mod.rs
  • tiles/src/main.rs
  • tiles/src/runtime/cpu.rs
  • tiles/src/runtime/mlx.rs
  • tiles/src/runtime/mod.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}

⚙️ CodeRabbit configuration file

Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.

Files:

  • tiles/src/commands/mod.rs
  • tiles/src/runtime/mod.rs
  • tiles/src/main.rs
  • tiles/src/runtime/cpu.rs
  • tiles/Cargo.toml
  • tiles/src/runtime/mlx.rs
🧬 Code graph analysis (7)
tiles/src/commands/mod.rs (3)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mlx.rs (1)
  • bench (129-190)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
tiles/src/runtime/mod.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mlx.rs (1)
  • bench (129-190)
tiles/src/main.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/mlx.rs (1)
  • bench (129-190)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
tiles/src/runtime/cpu.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/mlx.rs (1)
  • bench (129-190)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
server/backend/mlx_runner.py (2)
server/schemas.py (1)
  • GenerationMetrics (70-75)
server/reasoning_utils.py (1)
  • finalize (424-438)
server/backend/mlx.py (1)
server/schemas.py (1)
  • GenerationMetrics (70-75)
tiles/src/runtime/mlx.rs (4)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
tilekit/src/modelfile.rs (2)
  • parse_from_file (246-251)
  • parse (253-258)
🔇 Additional comments (14)
tiles/Cargo.toml (1)

17-17: LGTM!

Adding chrono for timestamp handling in benchmark logging is appropriate. The default features should suffice for the benchmarking use case.
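
For reference, a minimal sketch of the kind of timestamp call chrono enables; the exact call site in the benchmark logger is an assumption:

// Produce an RFC 3339 timestamp suitable for a JSONL log entry.
fn rfc3339_now() -> String {
    chrono::Utc::now().to_rfc3339()
}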

server/schemas.py (1)

69-75: LGTM!

Using a plain @dataclass for internal metrics transfer is a good choice—it avoids Pydantic overhead for a structure that doesn't require request validation. The field types and documentation are clear.

tiles/src/commands/mod.rs (1)

21-24: LGTM!

The bench command handler follows the established pattern of delegating to the runtime layer. Clean and consistent with the other command functions.

tiles/src/main.rs (2)

25-29: LGTM!

The Bench subcommand is well-integrated with the CLI. The doc comment clearly describes the purpose.


79-85: Reasonable approach for initial implementation.

Setting relay_count: 0 with an explanatory comment is clear. If bench-specific options are needed later (warmup iterations, repetitions, etc.), consider introducing a BenchFlags struct similar to RunFlags.
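
Illustratively, such a struct could look like the following; the field names are assumptions, not existing code:

// Hypothetical bench-specific options, mirroring the RunFlags pattern.
pub struct BenchFlags {
    pub warmup_iterations: u32, // untimed runs before measurement
    pub repetitions: u32,       // timed runs to aggregate
}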

tiles/src/runtime/mod.rs (1)

40-46: LGTM!

The bench dispatch follows the established pattern for runtime method delegation. Returning () is consistent with how errors are handled internally in the MLX implementation.

server/backend/mlx.py (3)

4-4: LGTM!

Import of GenerationMetrics is correctly added to support the new metrics flow.


116-132: Clean integration of metrics collection into the streaming loop.

Using isinstance() to differentiate the final metrics object from string tokens is straightforward and effective. The continue ensures metrics aren't accidentally emitted as content.


165-181: Metrics propagation looks good.

The conditional inclusion of metrics in the final response is clean. Explicitly constructing the dict from dataclass fields ensures proper JSON serialization.
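
As a side note, dataclasses.asdict would produce the same JSON-ready dict; a standalone sketch, not the repository code:

from dataclasses import dataclass, asdict

@dataclass
class GenerationMetrics:
    ttft_ms: float
    total_tokens: int
    tokens_per_second: float
    total_latency_s: float

m = GenerationMetrics(ttft_ms=42.0, total_tokens=128, tokens_per_second=35.5, total_latency_s=3.6)
print(asdict(m))  # JSON-serializable dict mirroring the explicit construction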

server/backend/mlx_runner.py (2)

23-23: LGTM!

Import added correctly to support the new metrics emission feature.


631-634: LGTM!

TTFT capture logic correctly measures time to first visible token. Placing it after stop token checks ensures we're measuring meaningful latency.

tiles/src/runtime/mlx.rs (3)

22-28: LGTM!

The BenchmarkMetrics struct correctly mirrors the Python GenerationMetrics dataclass from server/schemas.py with matching field names for proper deserialization.


335-345: LGTM!

Clear and concise metrics display in REPL. The formatting provides useful performance visibility without being intrusive.


97-97: LGTM with a note.

Setting PYTHONDONTWRITEBYTECODE=1 prevents stale .pyc files as noted in the PR description. This is a reasonable trade-off - minor startup performance impact for more predictable behavior during development cycles.

Comment thread tiles/src/runtime/mlx.rs
Comment on lines +147 to +163
let modelname = modelfile.from.as_ref().unwrap();

// Start server if not running
if (ping().await).is_err() {
    println!("Starting server...");
    let _res = self.start_server_daemon().await;
    let _ = wait_until_server_is_up().await;
}

// Load model
let memory_path = get_memory_path()
    .context("Retrieving memory_path failed")
    .unwrap();
if let Err(e) = load_model(modelname, &memory_path).await {
    println!("Failed to load model: {}", e);
    return;
}

⚠️ Potential issue | 🟠 Major

Several unwrap() calls risk panics; server start failure isn't handled.

  1. Line 147: unwrap() on modelfile.from will panic if from is None.
  2. Line 152-153: If start_server_daemon fails, the error is discarded and wait_until_server_is_up() may loop indefinitely.
  3. Line 159: unwrap() on get_memory_path() will panic on failure.
🔧 Proposed fix
-        let modelname = modelfile.from.as_ref().unwrap();
+        let modelname = match modelfile.from.as_ref() {
+            Some(name) => name,
+            None => {
+                println!("Modelfile missing 'FROM' directive");
+                return;
+            }
+        };

         // Start server if not running
         if (ping().await).is_err() {
             println!("Starting server...");
-            let _res = self.start_server_daemon().await;
+            if let Err(e) = self.start_server_daemon().await {
+                println!("Failed to start server: {}", e);
+                return;
+            }
             let _ = wait_until_server_is_up().await;
         }

         // Load model
-        let memory_path = get_memory_path()
-            .context("Retrieving memory_path failed")
-            .unwrap();
+        let memory_path = match get_memory_path() {
+            Ok(path) => path,
+            Err(e) => {
+                println!("Failed to get memory path: {}", e);
+                return;
+            }
+        };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

let modelname = modelfile.from.as_ref().unwrap();

// Start server if not running
if (ping().await).is_err() {
    println!("Starting server...");
    let _res = self.start_server_daemon().await;
    let _ = wait_until_server_is_up().await;
}

// Load model
let memory_path = get_memory_path()
    .context("Retrieving memory_path failed")
    .unwrap();
if let Err(e) = load_model(modelname, &memory_path).await {
    println!("Failed to load model: {}", e);
    return;
}

After:

let modelname = match modelfile.from.as_ref() {
    Some(name) => name,
    None => {
        println!("Modelfile missing 'FROM' directive");
        return;
    }
};

// Start server if not running
if (ping().await).is_err() {
    println!("Starting server...");
    if let Err(e) = self.start_server_daemon().await {
        println!("Failed to start server: {}", e);
        return;
    }
    let _ = wait_until_server_is_up().await;
}

// Load model
let memory_path = match get_memory_path() {
    Ok(path) => path,
    Err(e) => {
        println!("Failed to get memory path: {}", e);
        return;
    }
};
if let Err(e) = load_model(modelname, &memory_path).await {
    println!("Failed to load model: {}", e);
    return;
}

Comment thread tiles/src/runtime/mlx.rs
Comment on lines 449 to +454
let v: Value = serde_json::from_str(data).unwrap();

// Check for metrics in the response
if let Some(metrics_obj) = v.get("metrics") {
    metrics = serde_json::from_value(metrics_obj.clone()).ok();
}

⚠️ Potential issue | 🟡 Minor

unwrap() on JSON parsing can panic on malformed SSE data.

If the server sends malformed JSON in the SSE stream, this will panic. Consider graceful error handling.

🔧 Proposed fix
             // Parse JSON
-            let v: Value = serde_json::from_str(data).unwrap();
+            let v: Value = match serde_json::from_str(data) {
+                Ok(v) => v,
+                Err(e) => {
+                    eprintln!("Failed to parse JSON: {}", e);
+                    continue;
+                }
+            };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

let v: Value = serde_json::from_str(data).unwrap();

// Check for metrics in the response
if let Some(metrics_obj) = v.get("metrics") {
    metrics = serde_json::from_value(metrics_obj.clone()).ok();
}

After:

let v: Value = match serde_json::from_str(data) {
    Ok(v) => v,
    Err(e) => {
        eprintln!("Failed to parse JSON: {}", e);
        continue;
    }
};

// Check for metrics in the response
if let Some(metrics_obj) = v.get("metrics") {
    metrics = serde_json::from_value(metrics_obj.clone()).ok();
}
🤖 Prompt for AI Agents
In @tiles/src/runtime/mlx.rs around lines 449 - 454, The code currently calls
serde_json::from_str(data).unwrap() (in mlx.rs) which will panic on malformed
SSE JSON; change it to handle parsing errors gracefully by replacing unwrap()
with a fallible path (e.g., match or if let Ok(v) =
serde_json::from_str::<Value>(data) { ... } else { log the parse error and
continue; }); ensure subsequent code that reads v.get("metrics") only runs when
parsing succeeded and preserve existing metrics assignment logic (the metrics =
serde_json::from_value(metrics_obj.clone()).ok() line) inside the successful
branch.

@madclaws madclaws linked an issue Jan 8, 2026 that may be closed by this pull request
Member

@madclaws madclaws left a comment


Also, we do at most relay_count relays with the model, but we are not showing the aggregate of the total communication when it comes to total_tokens, latency, etc.

This is a tad different from a normal model conversation: because this is a memory model, we keep communicating with the model until we get <reply></reply>.

Right now we are only showing the metrics for the final reply, not the aggregate.

So how should we approach this, @feynon @kshitijgetsac?

Comment thread tiles/src/runtime/mlx.rs
Comment on lines +337 to +345
if let Some(metrics) = response.metrics {
    println!(
        "\n{} {:.1} tok/s | {} tokens | {:.0}ms TTFT",
        "💡".yellow(),
        metrics.tokens_per_second,
        metrics.total_tokens,
        metrics.ttft_ms
    );
}
Member


[screenshot of the REPL metrics line]

Can we dim the token info and is it possible to reduce the size of the text relative to the model reply?

Comment on lines +617 to +618
if reasoning_parser:
    yield from reasoning_parser.finalize()
Member

@madclaws madclaws Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this check?

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StreamingReasoningParser buffers and formats this content as it streams. When generation stops (either naturally or via a stop token), we need to:

  1. Flush any buffered content: the parser may hold partial reasoning text that hasn't been yielded yet.
  2. Finalize formatting: add closing markers and format the final output properly.

If we skip finalize() when a stop token triggers an early return, buffered reasoning content would be lost, the response might be incomplete or malformed, and the user would see truncated output. The condition just ensures we only call it when the model is a reasoning model (the parser is only created for those models).
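
A minimal illustrative sketch of this flush-on-finalize pattern (not the repository's parser internals; the buffering rule here is invented):

from typing import Iterator

class StreamingReasoningParser:
    def __init__(self):
        self._buffer = ""

    def feed(self, chunk: str) -> Iterator[str]:
        # Buffer incoming text and emit only completed lines.
        self._buffer += chunk
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            yield line + "\n"

    def finalize(self) -> Iterator[str]:
        # Flush any partial content still buffered at end of generation.
        if self._buffer:
            yield self._buffer
            self._buffer = ""

def on_stop(parser):
    # The parser exists only for reasoning models, hence the truthiness check.
    if parser:
        yield from parser.finalize()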

@madclaws
Member

madclaws commented Jan 9, 2026

Should we benchmark just basic prompts that don't use memory (mem-agent is smart enough to understand it), or should we benchmark memory usage too?

@madclaws
Member

madclaws commented Jan 9, 2026

.env("PYTHONDONTWRITEBYTECODE", "1"): this part only runs in release mode, not in dev mode. Are you saying you tested this in a release build and found that the pycache was not getting cleared? @kshitijgetsac

Comment thread tiles/src/runtime/mlx.rs Outdated
Comment on lines +130 to +148
const DEFAULT_MODELFILE: &str = "FROM driaforall/mem-agent-mlx-4bit";

// Parse modelfile
let modelfile_parse_result = if let Some(modelfile_str) = run_args.modelfile_path {
    tilekit::modelfile::parse_from_file(modelfile_str.as_str())
} else {
    tilekit::modelfile::parse(DEFAULT_MODELFILE)
};

let modelfile = match modelfile_parse_result {
    Ok(mf) => mf,
    Err(_err) => {
        println!("Invalid Modelfile");
        return;
    }
};

let modelname = modelfile.from.as_ref().unwrap();

Member


There is some code duplication between bench and run; can we move it into a single function and use that?

Contributor Author


Yes @madclaws, I tried the pyc thing and the pycache wasn't getting cleared. Sure, I will change this part.

Member

@madclaws madclaws Jan 9, 2026


Ok, so if you check bundler.sh and install.sh: during packaging of the release we remove the pycache folders, and while extracting as part of installation we first remove the existing server folder too. So this won't be an issue in prod.

If you want to test this,

  • go to tiles root project directory.
  • run just bundle
  • run just install

Now you can try tiles run from anywhere.

@feynon
Member

feynon commented Jan 9, 2026

Should we benchmark just basic prompts that don't use memory (mem-agent is smart enough to understand it), or should we benchmark memory usage too?

No memory; we just need to benchmark pure token inference.

…rics

- Add `tiles bench` command for running benchmarks
- Track TTFT, tokens/sec, total tokens, and latency
- Display metrics after each REPL response
- Save benchmark results to ~/.config/tiles/benchmark_log.jsonl
- Add GenerationMetrics dataclass in Python server
@feynon
Member

feynon commented Jan 9, 2026

This is a tad different from a normal model conversation: because this is a memory model, we keep communicating with the model until we get <reply></reply>.

I think we should do the aggregate of the total communication.

@kshitijgetsac
Contributor Author

Also, for your aggregate thing: we just have to show the results for one particular chat, and that is already implemented.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @server/backend/mlx_runner.py:
- Around line 631-633: The TTFT calculation is done after the stop-token checks,
so if the first generated token is a stop token the function returns before ttft
is set; move the ttft capture (set ttft = time.time() - start_time) immediately
after the code that obtains the first generated token (i.e., right after the
token is produced/decoded) and before any stop-token checks/early returns, and
then remove the existing conditional block that sets ttft later (the ttft is
None check at the end).
- Around line 572-584: The metrics calculation uses tokens_generated before
incrementing it when an early native/chat stop token is handled, causing
total_tokens to be off by one; update the stop-token branches (the branch that
yields reasoning_parser.finalize() and the chat stop branch referenced around
the same area) to increment tokens_generated immediately upon consuming the stop
token (i.e., before computing total_latency, tokens_per_second, ttft_ms and
yielding GenerationMetrics), ensuring GenerationMetrics.ttft_ms, total_tokens,
tokens_per_second and total_latency_s reflect the consumed stop token; apply the
same increment fix in the chat stop token branch mentioned around line 615.
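
Taken together, a minimal sketch of the ordering these prompts ask for (illustrative names; a plain dict stands in for GenerationMetrics):

import time

def stream_with_metrics(tokens, stop_token="<eos>"):
    start_time = time.time()
    ttft = None
    tokens_generated = 0
    for tok in tokens:
        if ttft is None:
            ttft = time.time() - start_time  # captured before any early return
        if tok == stop_token:
            tokens_generated += 1  # count the consumed stop token first
            break
        tokens_generated += 1
        yield tok
    total = time.time() - start_time
    yield {
        "ttft_ms": (ttft or 0.0) * 1000,
        "total_tokens": tokens_generated,
        "tokens_per_second": tokens_generated / total if total > 0 else 0.0,
        "total_latency_s": total,
    }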
🧹 Nitpick comments (4)
server/schemas.py (1)

69-75: Consider adding validation for metric values.

The GenerationMetrics dataclass is well-structured, but consider adding validation to ensure metrics are non-negative (e.g., ttft_ms >= 0, total_tokens >= 0). This would catch bugs earlier if invalid values are produced.

If validation is desired, you could either:

  1. Use Pydantic's BaseModel with Field validators
  2. Add __post_init__ to the dataclass to validate values
Example with dataclass validation
@dataclass
class GenerationMetrics:
    """Benchmarking metrics for token generation."""
    ttft_ms: float  # Time to first token in milliseconds
    total_tokens: int  # Total tokens generated
    tokens_per_second: float  # Throughput
    total_latency_s: float  # End-to-end latency in seconds
    
    def __post_init__(self):
        if self.ttft_ms < 0:
            raise ValueError("ttft_ms must be non-negative")
        if self.total_tokens < 0:
            raise ValueError("total_tokens must be non-negative")
        if self.tokens_per_second < 0:
            raise ValueError("tokens_per_second must be non-negative")
        if self.total_latency_s < 0:
            raise ValueError("total_latency_s must be non-negative")
tiles/src/runtime/mlx.rs (3)

22-28: Consider making BenchmarkMetrics fields public.

The BenchmarkMetrics fields are private, which is fine for internal use within this module. However, if this struct is intended to be used as a data transfer object across module boundaries, consider making the fields pub for easier access and a more conventional Rust API for serializable data structures.

Proposed change
 #[derive(Debug, Deserialize, Serialize)]
 pub struct BenchmarkMetrics {
-    ttft_ms: f64,
-    total_tokens: i32,
-    tokens_per_second: f64,
-    total_latency_s: f64,
+    pub ttft_ms: f64,
+    pub total_tokens: i32,
+    pub tokens_per_second: f64,
+    pub total_latency_s: f64,
 }

45-59: Error details lost in parse failure message.

When parse_from_file or parse fails, the actual error is discarded and only "Invalid Modelfile" is printed. Including the underlying error would significantly improve debuggability.

Proposed fix
     match parse_result {
         Ok(mf) => Some(mf),
-        Err(_) => {
-            println!("Invalid Modelfile");
+        Err(e) => {
+            println!("Invalid Modelfile: {}", e);
             None
         }
     }

185-213: Remove redundant import.

std::io::Write is already imported at the module level (line 10), so the local import at line 207 is unnecessary.

Proposed fix
     let mut file = std::fs::OpenOptions::new()
         .create(true)
         .append(true)
         .open(&log_file)?;

-    use std::io::Write;
     writeln!(file, "{}", log_entry)?;
📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e50868 and 621a7e0.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • server/backend/mlx.py
  • server/backend/mlx_runner.py
  • server/schemas.py
  • tiles/Cargo.toml
  • tiles/src/commands/mod.rs
  • tiles/src/main.rs
  • tiles/src/runtime/cpu.rs
  • tiles/src/runtime/mlx.rs
  • tiles/src/runtime/mod.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • tiles/src/main.rs
  • tiles/Cargo.toml
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}

⚙️ CodeRabbit configuration file

Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.

Files:

  • tiles/src/runtime/cpu.rs
  • tiles/src/commands/mod.rs
  • tiles/src/runtime/mlx.rs
  • tiles/src/runtime/mod.rs
🧬 Code graph analysis (6)
tiles/src/runtime/cpu.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/mlx.rs (1)
  • bench (134-182)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
server/backend/mlx.py (1)
server/schemas.py (4)
  • ChatMessage (16-18)
  • ChatCompletionRequest (21-31)
  • downloadRequest (65-66)
  • GenerationMetrics (70-75)
server/backend/mlx_runner.py (2)
server/schemas.py (1)
  • GenerationMetrics (70-75)
server/reasoning_utils.py (1)
  • finalize (424-438)
tiles/src/commands/mod.rs (3)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mlx.rs (1)
  • bench (134-182)
tiles/src/runtime/mod.rs (1)
  • bench (41-46)
tiles/src/runtime/mlx.rs (3)
tilekit/src/modelfile.rs (2)
  • parse_from_file (246-251)
  • parse (253-258)
tiles/src/commands/mod.rs (2)
  • run (6-8)
  • bench (22-24)
tiles/src/runtime/mod.rs (2)
  • run (20-25)
  • bench (41-46)
tiles/src/runtime/mod.rs (3)
tiles/src/commands/mod.rs (1)
  • bench (22-24)
tiles/src/runtime/cpu.rs (1)
  • bench (27-29)
tiles/src/runtime/mlx.rs (1)
  • bench (134-182)
🔇 Additional comments (15)
tiles/src/runtime/cpu.rs (1)

27-29: LGTM! Stub implementation follows the established pattern.

The bench method correctly mirrors the existing stub pattern for other CPURuntime methods. This is appropriate for the placeholder CPU runtime implementation.

tiles/src/commands/mod.rs (1)

22-24: LGTM! Clean delegation pattern.

The bench command function correctly follows the established pattern of delegating to the runtime layer, consistent with run, start_server, and stop_server commands.

tiles/src/runtime/mod.rs (1)

41-46: LGTM! Correct enum dispatch pattern.

The bench method properly dispatches to the concrete runtime implementations, following the same pattern as run, start_server_daemon, and stop_server_daemon.

server/schemas.py (1)

3-3: LGTM! Appropriate import for dataclass.

server/backend/mlx.py (4)

4-4: LGTM! Correct import of GenerationMetrics.


116-116: LGTM! Appropriate initialization of metrics variable.


129-132: LGTM! Correct metrics capture logic.

The implementation correctly identifies GenerationMetrics objects in the stream and stores them without emitting them as text chunks. The continue statement appropriately skips the yield for metrics objects.


165-181: LGTM! Correct conditional metrics inclusion in final response.

The implementation properly checks for the presence of metrics and conditionally includes them in the final streaming response. The field extraction is correct and matches the GenerationMetrics schema.

server/backend/mlx_runner.py (2)

652-662: LGTM—metrics correctly calculated at normal termination.

The metrics calculation here is correct for the normal generation completion path (EOS or max_tokens reached).


23-23: LGTM—import matches schema definition.

The GenerationMetrics import aligns with the schema definition in server/schemas.py.

tiles/src/runtime/mlx.rs (5)

102-102: Note: PYTHONDONTWRITEBYTECODE only set in release builds.

As mentioned in the PR discussion, this environment variable is set only when the server is started via start_server_daemon, which is skipped in debug mode (see line 275: if !cfg!(debug_assertions)). This means .pyc files are only suppressed in release builds. If this is intentional, consider adding a comment explaining the behavior; otherwise, verify whether this matches the intended design.

Based on the PR discussion around .pyc caching behavior.
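
A minimal sketch of that gating (assumed structure, not the actual start_server_daemon):

// Only release builds spawn the bundled server, so the env var has no
// effect when running in debug mode.
fn start_server_daemon() {
    if !cfg!(debug_assertions) {
        let _ = std::process::Command::new("python3")
            .env("PYTHONDONTWRITEBYTECODE", "1")
            .arg("server_entrypoint.py") // assumed path
            .spawn();
    }
}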


134-182: LGTM—benchmark implementation is clean and functional.

The bench method correctly:

  • Ensures the server is running
  • Loads the model
  • Executes a fixed benchmark prompt for consistency
  • Displays and persists metrics

Using a hardcoded simple prompt ("What is 2+2?") ensures reproducible benchmarks focused on token inference performance, aligning with the PR objectives.


425-446: LGTM—metrics correctly extracted and propagated.

The metrics are properly parsed from the streaming response JSON and passed through to ChatResponse. The optional handling ensures backward compatibility when metrics are absent.


328-337: LGTM—clear and concise metrics display.

The metrics formatting in the REPL provides immediate feedback with throughput, token count, and TTFT in an easy-to-read format.


67-69: LGTM—clean early-return pattern.

The let Some(...) else { return; } pattern is idiomatic Rust and cleanly centralizes modelfile parsing with early exit on failure.
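
For readers unfamiliar with the pattern, a tiny standalone illustration (names assumed):

fn bench(modelfile: Option<String>) {
    // Early return if parsing produced no modelfile.
    let Some(modelfile) = modelfile else {
        return;
    };
    println!("benchmarking with {modelfile}");
}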

Comment thread server/backend/mlx_runner.py
Comment on lines +615 to +627

# Yield metrics before returning
if reasoning_parser:
    yield from reasoning_parser.finalize()
total_latency = time.time() - start_time
tokens_per_second = tokens_generated / total_latency if total_latency > 0 else 0
ttft_ms = (ttft * 1000) if ttft is not None else 0
yield GenerationMetrics(
    ttft_ms=ttft_ms,
    total_tokens=tokens_generated,
    tokens_per_second=tokens_per_second,
    total_latency_s=total_latency
)


⚠️ Potential issue | 🟡 Minor

Same token-count issue for chat stop tokens.

This branch has the same token-counting issue as the native stop token branch (lines 572-584). See the earlier comment for details and fix.

Comment thread server/backend/mlx_runner.py
@madclaws
Member

madclaws commented Jan 9, 2026

Also, for your aggregate thing: we just have to show the results for one particular chat, and that is already implemented.

If that's the case, then it should be mentioned, right?

@kshitijgetsac
Contributor Author

Yeah, sure, will update the description.

madclaws added a commit that referenced this pull request Jan 29, 2026
madclaws added a commit that referenced this pull request Jan 30, 2026
@madclaws
Member

Closing this, since we have included this in a PR with some extra additions. Thanks.

@madclaws madclaws closed this Jan 30, 2026

Development

Successfully merging this pull request may close these issues.

Create a benchmark for chat session

3 participants