fix(parity): CLI/server same 1024 max_tokens default + --serve --permissive (#130)
Three CLI/server divergences fixed per the final plan on #130:
1. defaultMaxResponseTokens raised from 512 to 1024 - both CLI and server
read from the same BodyLimits constant (compiler-enforced parity).
2. ServerConfig gains a permissive field; --serve --permissive makes the
server use .permissiveContentTransformations (matches CLI --permissive).
3. ContextConfig.permissive now propagated on the server path so the
summarize strategy respects the flag.
CLI outputReserve also switched from magic 512 to BodyLimits.defaultOutputReserveTokens.
README updated to reflect actual parity (1024 default, --permissive on both surfaces).
Drafted by the apfel bug-solver routine. Needs review + local test run
by @franzenzenhofer on a Mac with Apple Intelligence before merging.
https://claude.ai/code/session_01AcZ95u48A7CNmuPQZwWgbj
@@ -248,19 +248,19 @@ Full API spec: [openai/openai-openapi](https://github.com/openai/openai-openapi)
## Default response cap (`max_tokens`)
- When a `/v1/chat/completions` request **omits `max_tokens`**, the server applies a default cap of **512 tokens**. Best practice: **always set `max_tokens` explicitly** to a value that matches your use case.
+ When `max_tokens` is **omitted**, both the CLI and the server apply a default cap of **1024 tokens**. Best practice: **always set `max_tokens` explicitly** to a value that matches your use case.
### Why a default exists
- The on-device model has a **4096-token context window** that holds input *and* output combined. With no cap, generation runs until that window overflows, which produces an unrecoverable `[context overflow]` error after ~50 seconds of wasted generation - the client gets nothing usable. The 512-token default matches the output budget the context trimmer already reserves, so a typical short prompt gets a usable reply in ~1 second instead of hanging.
+ The on-device model has a **4096-token context window** that holds input *and* output combined. With no cap, generation runs until that window overflows, which produces an unrecoverable `[context overflow]` error after ~50 seconds of wasted generation - the client gets nothing usable. The 1024-token default covers typical structured JSON and short-to-medium replies while leaving 3072 tokens for input.
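The window arithmetic behind the new default can be sketched in a few lines (the constant values come from the text above; the names are illustrative, loosely mirroring the `BodyLimits` constants mentioned in the commit message):

```python
# Token budget for the on-device model: input and output share one window.
CONTEXT_WINDOW = 4096        # total tokens, input + output combined
DEFAULT_MAX_RESPONSE = 1024  # default cap applied when max_tokens is omitted

# With the default cap reserved for output, this much is left for input.
input_budget = CONTEXT_WINDOW - DEFAULT_MAX_RESPONSE
print(input_budget)  # 3072
```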
### When the cap is hit
- The response sets `finish_reason: "length"` (per the OpenAI spec) so the client can detect a truncated reply. If 512 tokens is too short for your prompt, raise it explicitly - up to whatever leaves room for your input inside the 4096-token window.
+ The response sets `finish_reason: "length"` (per the OpenAI spec) so the client can detect a truncated reply. If 1024 tokens is too short for your prompt, raise it explicitly - up to whatever leaves room for your input inside the 4096-token window.
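Client-side, the truncation check is a one-liner over the choices array. A minimal sketch (the response dict below is illustrative, shaped like an OpenAI chat completion, not captured from a real server):

```python
def is_truncated(completion: dict) -> bool:
    """True when any choice stopped because the max_tokens cap was hit."""
    return any(c.get("finish_reason") == "length" for c in completion["choices"])

# Illustrative response in the OpenAI chat-completion shape.
response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "First 1024 tokens of ..."},
            "finish_reason": "length",
        }
    ]
}

if is_truncated(response):
    # Retry with an explicit, larger max_tokens - up to whatever the
    # 4096-token window allows after the input is counted.
    print("truncated: retry with a higher max_tokens")
```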
### Examples
- Without `max_tokens` (default 512 applied, fast and bounded):
+ Without `max_tokens` (default 1024 applied, fast and bounded):
- The CLI (`apfel "prompt"`) does **not** apply this default - it streams to stdout with no server in front of it, so a runaway response is visible in real time and you can `Ctrl-C`. Use `--max-tokens N` if you want a hard cap.
+ The CLI and the server apply the **same** 1024-token default from `BodyLimits.defaultMaxResponseTokens`. Override with `--max-tokens N` (CLI) or `"max_tokens": N` in the request body (server).
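A request body that overrides the default explicitly might look like this (the prompt and the 256 value are illustrative; only `max_tokens`, the model name, and the endpoint come from the surrounding text):

```python
import json

# Explicit cap: overrides the 1024-token default the server would apply.
payload = {
    "model": "apple-foundationmodel",
    "messages": [{"role": "user", "content": "Summarize this in one line."}],
    "max_tokens": 256,
}
print(json.dumps(payload))
```

POST this body to `/v1/chat/completions`; omitting the `max_tokens` key yields the 1024-token default. On the CLI the equivalent override is `--max-tokens 256`.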
## Limitations
@@ -302,7 +302,7 @@ The CLI (`apfel "prompt"`) does **not** apply this default - it streams to stdou
| Model | One model (`apple-foundationmodel`), not configurable |
- | Guardrails | Apple's safety system may block benign prompts. `--permissive` reduces false positives ([docs/PERMISSIVE.md](docs/PERMISSIVE.md)) |
+ | Guardrails | Apple's safety system may block benign prompts. `--permissive` reduces false positives on both CLI and server ([docs/PERMISSIVE.md](docs/PERMISSIVE.md)) |
| Speed | On-device, not cloud-scale - a few seconds per response |
| No embeddings / vision | Not available on-device |