fix(parity): CLI/server same 1024 max_tokens default + --serve --permissive (#130)
Three CLI/server divergences fixed per the final plan on #130:
1. defaultMaxResponseTokens raised from 512 to 1024 - both CLI and server
read from the same BodyLimits constant (compiler-enforced parity).
2. ServerConfig gains a permissive field; --serve --permissive makes the
server use .permissiveContentTransformations (matches CLI --permissive).
3. ContextConfig.permissive now propagated on the server path so the
summarize strategy respects the flag.
CLI outputReserve also switched from magic 512 to BodyLimits.defaultOutputReserveTokens.
README updated to reflect actual parity (1024 default, --permissive on both surfaces).
Drafted by the apfel bug-solver routine. Needs review + local test run
by @franzenzenhofer on a Mac with Apple Intelligence before merging.
https://claude.ai/code/session_01AcZ95u48A7CNmuPQZwWgbj
@@ -248,19 +248,19 @@ Full API spec: [openai/openai-openapi](https://github.com/openai/openai-openapi)
## Default response cap (`max_tokens`)
- When a `/v1/chat/completions` request **omits `max_tokens`**, the server applies a default cap of **512 tokens**. Best practice: **always set `max_tokens` explicitly** to a value that matches your use case.
+ When `max_tokens` is **omitted**, both the CLI and the server apply a default cap of **1024 tokens**. Best practice: **always set `max_tokens` explicitly** to a value that matches your use case.
### Why a default exists
- The on-device model has a **4096-token context window** that holds input *and* output combined. With no cap, generation runs until that window overflows, which produces an unrecoverable `[context overflow]` error after ~50 seconds of wasted generation - the client gets nothing usable. The 512-token default matches the output budget the context trimmer already reserves, so a typical short prompt gets a usable reply in ~1 second instead of hanging.
+ The on-device model has a **4096-token context window** that holds input *and* output combined. With no cap, generation runs until that window overflows, which produces an unrecoverable `[context overflow]` error after ~50 seconds of wasted generation - the client gets nothing usable. The 1024-token default covers typical structured JSON and short-to-medium replies while leaving 3072 tokens for input.
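The window arithmetic behind the new default can be sketched in a few lines (the constant values come from the text above; the names are illustrative, loosely mirroring the `BodyLimits` constants mentioned in the commit message):

```python
# Token budget for the on-device model: input and output share one window.
CONTEXT_WINDOW = 4096        # total tokens, input + output combined
DEFAULT_MAX_RESPONSE = 1024  # default cap applied when max_tokens is omitted

# With the default cap reserved for output, this much is left for input.
input_budget = CONTEXT_WINDOW - DEFAULT_MAX_RESPONSE
print(input_budget)  # 3072
```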
### When the cap is hit
- The response sets `finish_reason: "length"` (per the OpenAI spec) so the client can detect a truncated reply. If 512 tokens is too short for your prompt, raise it explicitly - up to whatever leaves room for your input inside the 4096-token window.
+ The response sets `finish_reason: "length"` (per the OpenAI spec) so the client can detect a truncated reply. If 1024 tokens is too short for your prompt, raise it explicitly - up to whatever leaves room for your input inside the 4096-token window.
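Client-side, the truncation check is a one-liner over the choices array. A minimal sketch (the response dict below is illustrative, shaped like an OpenAI chat completion, not captured from a real server):

```python
def is_truncated(completion: dict) -> bool:
    """True when any choice stopped because the max_tokens cap was hit."""
    return any(c.get("finish_reason") == "length" for c in completion["choices"])

# Illustrative response in the OpenAI chat-completion shape.
response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "First 1024 tokens of ..."},
            "finish_reason": "length",
        }
    ]
}

if is_truncated(response):
    # Retry with an explicit, larger max_tokens - up to whatever the
    # 4096-token window allows after the input is counted.
    print("truncated: retry with a higher max_tokens")
```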
### Examples
- Without `max_tokens` (default 512 applied, fast and bounded):
+ Without `max_tokens` (default 1024 applied, fast and bounded):
- The CLI (`apfel "prompt"`) does **not** apply this default - it streams to stdout with no server in front of it, so a runaway response is visible in real time and you can `Ctrl-C`. Use `--max-tokens N` if you want a hard cap.
+ The CLI and the server apply the **same** 1024-token default from `BodyLimits.defaultMaxResponseTokens`. Override with `--max-tokens N` (CLI) or `"max_tokens": N` in the request body (server).
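A request body that overrides the default explicitly might look like this (the prompt and the 256 value are illustrative; only `max_tokens`, the model name, and the endpoint come from the surrounding text):

```python
import json

# Explicit cap: overrides the 1024-token default the server would apply.
payload = {
    "model": "apple-foundationmodel",
    "messages": [{"role": "user", "content": "Summarize this in one line."}],
    "max_tokens": 256,
}
print(json.dumps(payload))
```

POST this body to `/v1/chat/completions`; omitting the `max_tokens` key yields the 1024-token default. On the CLI the equivalent override is `--max-tokens 256`.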
## Limitations
@@ -302,7 +302,7 @@ The CLI (`apfel "prompt"`) does **not** apply this default - it streams to stdou
| Model | One model (`apple-foundationmodel`), not configurable |
- | Guardrails | Apple's safety system may block benign prompts. `--permissive` reduces false positives ([docs/PERMISSIVE.md](docs/PERMISSIVE.md)) |
+ | Guardrails | Apple's safety system may block benign prompts. `--permissive` reduces false positives on both CLI and server ([docs/PERMISSIVE.md](docs/PERMISSIVE.md)) |
| Speed | On-device, not cloud-scale - a few seconds per response |
| No embeddings / vision | Not available on-device |