Commit cc94a31

fix(parity): CLI/server same 1024 max_tokens default + --serve --permissive (#130)
Three CLI/server divergences fixed per the final plan on #130:

1. `defaultMaxResponseTokens` raised from 512 to 1024 - both CLI and server read from the same `BodyLimits` constant (compiler-enforced parity).
2. `ServerConfig` gains a `permissive` field; `--serve --permissive` makes the server use `.permissiveContentTransformations` (matches CLI `--permissive`).
3. `ContextConfig.permissive` is now propagated on the server path so the summarize strategy respects the flag.

CLI `outputReserve` also switched from a magic 512 to `BodyLimits.defaultOutputReserveTokens`. README updated to reflect actual parity (1024 default, `--permissive` on both surfaces).

Drafted by the apfel bug-solver routine. Needs review + local test run by @franzenzenhofer on a Mac with Apple Intelligence before merging.

https://claude.ai/code/session_01AcZ95u48A7CNmuPQZwWgbj
1 parent 10650d8 · commit cc94a31

6 files changed

Lines changed: 41 additions & 19 deletions

File tree

- README.md
- Sources/Core/Chat/BodyLimits.swift
- Sources/Handlers.swift
- Sources/Server.swift
- Sources/main.swift
- Tests/apfelTests/BodyLimitsTests.swift

README.md

Lines changed: 8 additions & 8 deletions
@@ -233,7 +233,7 @@ alias apfel=apfel-run # optional, every apfel flag still works
 | `GET /v1/logs`, `/v1/logs/stats` | Debug only | Requires `--debug` |
 | Tool calling | Supported | Native `ToolDefinition` + JSON detection. See [docs/tool-calling-guide.md](docs/tool-calling-guide.md) |
 | `response_format: json_object` | Supported | System-prompt injection; markdown fences stripped from output |
-| `temperature`, `max_tokens`, `seed` | Supported | Mapped to `GenerationOptions`. **`max_tokens` defaults to 512 when omitted** - see [Default response cap](#default-response-cap-max_tokens) |
+| `temperature`, `max_tokens`, `seed` | Supported | Mapped to `GenerationOptions`. **`max_tokens` defaults to 1024 when omitted** - see [Default response cap](#default-response-cap-max_tokens) |
 | `stream: true` | Supported | SSE; final usage chunk only when `stream_options: {"include_usage": true}` (per OpenAI spec) |
 | `finish_reason` | Supported | `stop`, `tool_calls`, `length` |
 | Context strategies | Supported | `x_context_strategy`, `x_context_max_turns`, `x_context_output_reserve` extension fields |
@@ -248,19 +248,19 @@ Full API spec: [openai/openai-openapi](https://github.com/openai/openai-openapi)
 
 ## Default response cap (`max_tokens`)
 
-When a `/v1/chat/completions` request **omits `max_tokens`**, the server applies a default cap of **512 tokens**. Best practice: **always set `max_tokens` explicitly** to a value that matches your use case.
+When `max_tokens` is **omitted**, both the CLI and the server apply a default cap of **1024 tokens**. Best practice: **always set `max_tokens` explicitly** to a value that matches your use case.
 
 ### Why a default exists
 
-The on-device model has a **4096-token context window** that holds input *and* output combined. With no cap, generation runs until that window overflows, which produces an unrecoverable `[context overflow]` error after ~50 seconds of wasted generation - the client gets nothing usable. The 512-token default matches the output budget the context trimmer already reserves, so a typical short prompt gets a usable reply in ~1 second instead of hanging.
+The on-device model has a **4096-token context window** that holds input *and* output combined. With no cap, generation runs until that window overflows, which produces an unrecoverable `[context overflow]` error after ~50 seconds of wasted generation - the client gets nothing usable. The 1024-token default covers typical structured JSON and short-to-medium replies while leaving 3072 tokens for input.
 
 ### When the cap is hit
 
-The response sets `finish_reason: "length"` (per the OpenAI spec) so the client can detect a truncated reply. If 512 tokens is too short for your prompt, raise it explicitly - up to whatever leaves room for your input inside the 4096-token window.
+The response sets `finish_reason: "length"` (per the OpenAI spec) so the client can detect a truncated reply. If 1024 tokens is too short for your prompt, raise it explicitly - up to whatever leaves room for your input inside the 4096-token window.
 
 ### Examples
 
-Without `max_tokens` (default 512 applied, fast and bounded):
+Without `max_tokens` (default 1024 applied, fast and bounded):
 
 ```bash
 curl -sS http://localhost:11434/v1/chat/completions \
@@ -275,7 +275,7 @@ With explicit `max_tokens` (recommended - sized to your need):
 ```bash
 curl -sS http://localhost:11434/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -d '{"model":"apple-foundationmodel","max_tokens":1024,
+  -d '{"model":"apple-foundationmodel","max_tokens":2048,
       "messages":[{"role":"user","content":"Summarise this paragraph: ..."}]}'
 ```
 
@@ -293,7 +293,7 @@ Keep `input_tokens + max_tokens` comfortably below 4096. The context trimmer dro
 
 ### CLI parity
 
-The CLI (`apfel "prompt"`) does **not** apply this default - it streams to stdout with no server in front of it, so a runaway response is visible in real time and you can `Ctrl-C`. Use `--max-tokens N` if you want a hard cap.
+The CLI and the server apply the **same** 1024-token default from `BodyLimits.defaultMaxResponseTokens`. Override with `--max-tokens N` (CLI) or `"max_tokens": N` in the request body (server).
 
 ## Limitations
 
@@ -302,7 +302,7 @@ The CLI (`apfel "prompt"`) does **not** apply this default - it streams to stdou
 | Context window | **4096 tokens** (input + output combined) |
 | Platform | macOS 26+, Apple Silicon only |
 | Model | One model (`apple-foundationmodel`), not configurable |
-| Guardrails | Apple's safety system may block benign prompts. `--permissive` reduces false positives ([docs/PERMISSIVE.md](docs/PERMISSIVE.md)) |
+| Guardrails | Apple's safety system may block benign prompts. `--permissive` reduces false positives on both CLI and server ([docs/PERMISSIVE.md](docs/PERMISSIVE.md)) |
 | Speed | On-device, not cloud-scale - a few seconds per response |
 | No embeddings / vision | Not available on-device |
 
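A quick way to exercise the new default from a shell, useful for the requested local test run. A minimal sketch, assuming the server is running on the default port and `jq` is installed; the prompt is illustrative:

```bash
# Omit max_tokens entirely: the server now applies the 1024-token default.
RESP=$(curl -sS http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"apple-foundationmodel",
       "messages":[{"role":"user","content":"List ten fruits."}]}')

# "length" means the default cap truncated the reply; "stop" means the
# model finished on its own (finish_reason values per the table above).
echo "$RESP" | jq -r '.choices[0].finish_reason'
```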

Sources/Core/Chat/BodyLimits.swift

Lines changed: 4 additions & 3 deletions
@@ -13,7 +13,8 @@ package enum BodyLimits {
     /// into the 4096-token context window.
     public static let defaultOutputReserveTokens: Int = 512
 
-    /// Server-side cap applied when a client omits max_tokens.
-    /// Matches the output reserve to stay within the 4096-token context window.
-    public static let defaultMaxResponseTokens: Int = 512
+    /// Default cap applied when a client (server or CLI) omits max_tokens.
+    /// 1024 covers typical structured JSON and short-to-medium replies,
+    /// leaving 3072 tokens for input inside the 4096-token window.
+    public static let defaultMaxResponseTokens: Int = 1024
 }
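The doc comment's arithmetic (4096 - 1024 = 3072) generalizes into a rule of thumb for sizing an explicit `max_tokens`. A hypothetical shell helper, not part of this commit; the window size and default come from the constants above, while the safety margin is invented:

```bash
# Largest safe max_tokens for a given input size, inside the 4096-token
# window that input and output share.
WINDOW=4096
MARGIN=128          # hypothetical safety margin, not from the commit
INPUT_TOKENS=700    # your estimated prompt size

MAX_TOKENS=$(( WINDOW - INPUT_TOKENS - MARGIN ))
echo "safe max_tokens: $MAX_TOKENS"   # prints: safe max_tokens: 3268
```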

Sources/Handlers.swift

Lines changed: 3 additions & 2 deletions
@@ -71,15 +71,16 @@ func handleChatCompletion(_ request: Request, context: some RequestContext) asyn
     let contextConfig = ContextConfig(
         strategy: chatRequest.x_context_strategy.flatMap { ContextStrategy(rawValue: $0) } ?? .newestFirst,
         maxTurns: chatRequest.x_context_max_turns,
-        outputReserve: chatRequest.x_context_output_reserve ?? BodyLimits.defaultOutputReserveTokens
+        outputReserve: chatRequest.x_context_output_reserve ?? BodyLimits.defaultOutputReserveTokens,
+        permissive: serverState.config.permissive
     )
 
     // Build session options from request (retry config comes from server config)
     let sessionOpts = SessionOptions(
         temperature: chatRequest.temperature,
         maxTokens: chatRequest.max_tokens ?? BodyLimits.defaultMaxResponseTokens,
         seed: chatRequest.seed.map { UInt64($0) },
-        permissive: false,
+        permissive: serverState.config.permissive,
         contextConfig: contextConfig,
         retryEnabled: serverState.config.retryEnabled,
         retryCount: serverState.config.retryCount
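After this hunk, `permissive` on the server path comes from server config rather than the request body, while the `x_context_*` extension fields stay per-request. A sketch of a request that exercises the wiring above; it assumes a running server, and the field values are illustrative:

```bash
# Per-request context tuning still flows into ContextConfig; permissive
# now comes from the server's own --permissive flag, not the request.
curl -sS http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"apple-foundationmodel",
       "max_tokens":512,
       "x_context_max_turns":4,
       "x_context_output_reserve":512,
       "messages":[{"role":"user","content":"Hello"}]}'
```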

Sources/Server.swift

Lines changed: 2 additions & 0 deletions
@@ -20,6 +20,7 @@ struct ServerConfig: Sendable {
     let token: String?
     let tokenWasAutoGenerated: Bool
     let publicHealth: Bool
+    let permissive: Bool
     let retryEnabled: Bool
     let retryCount: Int
 
@@ -227,6 +228,7 @@ func startServer(config: ServerConfig, mcpManager: MCPManager? = nil) async thro
         "\(styled("", .dim)) token: \(config.token != nil ? "required" : "none")",
         "\(styled("", .dim)) health: \(config.healthRequiresAuthentication ? "auth required" : "public")",
         "\(styled("", .dim)) max concurrent: \(config.maxConcurrent)",
+        "\(styled("", .dim)) permissive: \(config.permissive ? "on" : "off")",
         "\(styled("", .dim)) debug: \(config.debug ? "on" : "off")",
     ]
     if config.tokenWasAutoGenerated, let token = config.token {
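The new banner line makes the flag visible at startup. A sketch of what a reviewer should roughly see; the `--serve --permissive` invocation comes from the commit message, and the surrounding banner values are illustrative:

```bash
apfel --serve --permissive

# Expected in the startup banner (exact text comes from the styled(...)
# lines above; other values depend on your flags):
#   token: none
#   health: auth required
#   max concurrent: ...
#   permissive: on
#   debug: off
```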

Sources/main.swift

Lines changed: 3 additions & 2 deletions
@@ -134,13 +134,13 @@ if !fileContents.isEmpty {
     let contextConfig = ContextConfig(
         strategy: parsed.contextStrategy ?? .newestFirst,
         maxTurns: parsed.contextMaxTurns,
-        outputReserve: parsed.contextOutputReserve ?? 512,
+        outputReserve: parsed.contextOutputReserve ?? BodyLimits.defaultOutputReserveTokens,
         permissive: parsed.permissive
     )
 
     let sessionOpts = SessionOptions(
         temperature: parsed.temperature,
-        maxTokens: parsed.maxTokens,
+        maxTokens: parsed.maxTokens ?? BodyLimits.defaultMaxResponseTokens,
         seed: parsed.seed,
         permissive: parsed.permissive,
         contextConfig: contextConfig,
@@ -206,6 +206,7 @@ do {
         token: serverToken,
         tokenWasAutoGenerated: tokenWasAutoGenerated,
         publicHealth: parsed.serverPublicHealth,
+        permissive: parsed.permissive,
         retryEnabled: parsed.retryEnabled,
         retryCount: parsed.retryCount
     )
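On the CLI path the same fallback now applies. To check locally; the override flag matches what the new tests below parse:

```bash
# Without --max-tokens the CLI now caps output at the shared 1024 default
# (previously the CLI applied no cap at all).
apfel "Summarise the plot of Hamlet."

# Explicit --max-tokens overrides the default, same as max_tokens on the server.
apfel --max-tokens 256 "Summarise the plot of Hamlet."
```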

Tests/apfelTests/BodyLimitsTests.swift

Lines changed: 21 additions & 4 deletions
@@ -4,6 +4,7 @@
 
 import Foundation
 import ApfelCore
+import ApfelCLI
 
 func runBodyLimitsTests() {
     test("maxRequestBodyBytes is 1 MiB") {
@@ -14,17 +15,33 @@ func runBodyLimitsTests() {
         try assertEqual(BodyLimits.defaultOutputReserveTokens, 512)
     }
 
-    test("defaultMaxResponseTokens is 512") {
-        try assertEqual(BodyLimits.defaultMaxResponseTokens, 512)
+    test("defaultMaxResponseTokens is 1024") {
+        try assertEqual(BodyLimits.defaultMaxResponseTokens, 1024)
     }
 
-    test("defaultMaxResponseTokens matches defaultOutputReserveTokens") {
-        try assertEqual(BodyLimits.defaultMaxResponseTokens, BodyLimits.defaultOutputReserveTokens)
+    test("defaultMaxResponseTokens fits within 4096-token context window") {
+        try assertTrue(BodyLimits.defaultMaxResponseTokens > 0)
+        try assertTrue(BodyLimits.defaultMaxResponseTokens <= 4096)
     }
 
     test("constants are positive") {
         try assertTrue(BodyLimits.maxRequestBodyBytes > 0)
         try assertTrue(BodyLimits.defaultOutputReserveTokens > 0)
         try assertTrue(BodyLimits.defaultMaxResponseTokens > 0)
     }
+
+    test("CLI maxTokens fallback uses BodyLimits.defaultMaxResponseTokens (parity with server)") {
+        let args = try CLIArguments.parse(["hello"])
+        try assertNil(args.maxTokens, "CLI should not set maxTokens when --max-tokens is omitted")
+        // Both main.swift and Handlers.swift apply ?? BodyLimits.defaultMaxResponseTokens,
+        // so the fallback is compiler-enforced via the same constant.
+        let fallback = args.maxTokens ?? BodyLimits.defaultMaxResponseTokens
+        try assertEqual(fallback, 1024)
+    }
+
+    test("CLI explicit --max-tokens overrides the default") {
+        let args = try CLIArguments.parse(["--max-tokens", "256", "hello"])
+        let resolved = args.maxTokens ?? BodyLimits.defaultMaxResponseTokens
+        try assertEqual(resolved, 256)
+    }
 }
