33 changes: 11 additions & 22 deletions README.md
@@ -233,7 +233,7 @@ alias apfel=apfel-run # optional, every apfel flag still works
| `GET /v1/logs`, `/v1/logs/stats` | Debug only | Requires `--debug` |
| Tool calling | Supported | Native `ToolDefinition` + JSON detection. See [docs/tool-calling-guide.md](docs/tool-calling-guide.md) |
| `response_format: json_object` | Supported | System-prompt injection; markdown fences stripped from output |
| `temperature`, `max_tokens`, `seed` | Supported | Mapped to `GenerationOptions`. **`max_tokens` defaults to 1024 when omitted** (CLI + server share the constant) - see [Default response cap](#default-response-cap-max_tokens) |
| `temperature`, `max_tokens`, `seed` | Supported | Mapped to `GenerationOptions`. Omitting `max_tokens` uses the remaining context window (drop-in OpenAI semantics) - see [Default response cap](#default-response-cap-max_tokens) |
| `stream: true` | Supported | SSE; final usage chunk only when `stream_options: {"include_usage": true}` (per OpenAI spec) |
| `finish_reason` | Supported | `stop`, `tool_calls`, `length` |
| Context strategies | Supported | `x_context_strategy`, `x_context_max_turns`, `x_context_output_reserve` extension fields |
@@ -248,35 +248,24 @@ Full API spec: [openai/openai-openapi](https://github.com/openai/openai-openapi)

## Default response cap (`max_tokens`)

When `max_tokens` is not specified, **both the CLI and the OpenAI-compatible server** apply a default cap of **1024 tokens**. Same value, single source of truth ([`BodyLimits.defaultMaxResponseTokens`](https://github.com/Arthur-Ficial/apfel/blob/main/Sources/Core/Chat/BodyLimits.swift)). Best practice: **always set `max_tokens` explicitly** to a value that matches your use case.
When `max_tokens` is omitted, **CLI and OpenAI-compatible server behave identically**: the value flows through as `nil` and the model uses whatever room is left in the 4096-token context window. This is drop-in OpenAI semantics - no arbitrary fallback constant.

### Why a default exists

The on-device model has a **4096-token context window** that holds input *and* output combined. With no cap, generation runs until that window overflows, which produces an unrecoverable `[context overflow]` error after ~50 seconds of wasted generation - the client gets nothing usable. 1024 tokens covers typical structured JSON, short-to-medium chat replies, and most single-pass code blocks, while leaving 3072 tokens of the window for input.

### When the cap is hit

The response sets `finish_reason: "length"` (per the OpenAI spec) so the client can detect a truncated reply. If 1024 tokens is too short, raise it explicitly with `max_tokens` (or `--max-tokens` on the CLI) - up to whatever leaves room for your input inside the 4096-token window.
The on-device model has a **4096-token context window** that holds input *and* output combined. If generation runs into the ceiling, the response ends cleanly with `finish_reason: "length"` and the partial content is returned (server: HTTP 200; CLI: exit 0 with a stderr warning). Pass `max_tokens` explicitly when you want a tighter latency budget or a known cap for your client.
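
A client can detect the truncation from the standard OpenAI response shape alone. A minimal Swift sketch - the response body below is illustrative, not captured from a real run:

```swift
import Foundation

// Illustrative response body; a real client reads this from the server.
let body = """
{"choices":[{"message":{"role":"assistant","content":"partial reply..."},"finish_reason":"length"}]}
"""

struct Completion: Decodable {
    struct Choice: Decodable {
        struct Message: Decodable { let content: String }
        let message: Message
        let finish_reason: String
    }
    let choices: [Choice]
}

do {
    let completion = try JSONDecoder().decode(Completion.self, from: Data(body.utf8))
    if let choice = completion.choices.first, choice.finish_reason == "length" {
        // Truncated at the context window: keep the partial content and
        // retry with a trimmed prompt or continue the conversation.
        print("truncated reply:", choice.message.content)
    }
} catch {
    print("decode failed:", error)
}
```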

### Examples

Without `max_tokens` (default 1024 applied, fast and bounded):

```bash
# Omitted: uses remaining window, finish_reason: "stop" or "length"
curl -sS http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"apple-foundationmodel",
"messages":[{"role":"user","content":"Reply SKIP, MOVE, or RENAME."}]}'
# ~1s, returns a short reply, finish_reason: "stop" or "length"
```

With explicit `max_tokens` (recommended - sized to your need):

```bash
# Explicit cap (recommended for tight latency budgets)
curl -sS http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"apple-foundationmodel","max_tokens":1024,
"messages":[{"role":"user","content":"Summarise this paragraph: ..."}]}'
-d '{"model":"apple-foundationmodel","max_tokens":128,
"messages":[{"role":"user","content":"Summarise: ..."}]}'
```

### Picking a value
@@ -287,16 +276,16 @@ curl -sS http://localhost:11434/v1/chat/completions \
| One-line instruction | 64 - 128 |
| Short paragraph | 256 - 512 |
| Long paragraph / structured JSON | 1024 - 2048 |
| As long as the context window allows | 4096 minus your input token count |
| As long as the context window allows | omit it |

Keep `input_tokens + max_tokens` comfortably below 4096. The context trimmer drops oldest history first to fit the input, but if the requested `max_tokens` leaves no room for any output, generation overflows and the request fails with `[context overflow]`. The validator only rejects non-positive values (`max_tokens <= 0`), so sizing is your responsibility.
Keep `input_tokens + max_tokens` comfortably below 4096. If the prompt itself exceeds the window, generation cannot start and the request fails with `[context overflow]` (HTTP 400 / CLI exit 4). The validator rejects non-positive values (`max_tokens <= 0`).
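
If you want a pre-flight check on the client side, the budget rule reduces to one comparison. A sketch - `fitsWindow` is a hypothetical helper, not part of apfel's API; 4096 is simply the documented window size:

```swift
/// Rough client-side budget check mirroring the rule above.
/// Illustrative only; the server's context trimmer is the real authority.
func fitsWindow(inputTokens: Int, maxTokens: Int?, window: Int = 4096) -> Bool {
    guard let maxTokens else {
        // max_tokens omitted: any prompt short of the window leaves room.
        return inputTokens < window
    }
    return inputTokens + maxTokens < window
}

print(fitsWindow(inputTokens: 3500, maxTokens: 1024)) // false - would overflow
print(fitsWindow(inputTokens: 3500, maxTokens: nil))  // true  - output uses the rest
```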

### CLI parity

The CLI applies the **same 1024-token default** as the server. The two surfaces read from the same `BodyLimits.defaultMaxResponseTokens` constant - they cannot drift. Override with `--max-tokens N` or `APFEL_MAX_TOKENS=N`.
CLI and server share one rule: omitted means use the remaining window; there is no constant to drift. Override with `--max-tokens N` or `APFEL_MAX_TOKENS=N`.

```bash
apfel "Reply SKIP." # default cap (1024) applies
apfel "Reply SKIP." # uses remaining window
apfel --max-tokens 64 "Reply SKIP." # explicit cap
APFEL_MAX_TOKENS=2048 apfel "..." # via env var
```
4 changes: 4 additions & 0 deletions Sources/CLI.swift
@@ -69,6 +69,10 @@ func singlePrompt(_ prompt: String, systemPrompt: String?, stream: Bool, options
metadata: .init(onDevice: true, version: version))
print(jsonString(obj), terminator: "")
}

if result.finishReason == .length {
printStderr("\(styled("apfel:", .yellow)) response truncated at the context window (finish_reason=length). Pass --max-tokens to control the cap explicitly.")
}
}

// MARK: - Interactive Chat
19 changes: 12 additions & 7 deletions Sources/Core/Chat/BodyLimits.swift
@@ -13,11 +13,16 @@ package enum BodyLimits {
/// into the 4096-token context window.
public static let defaultOutputReserveTokens: Int = 512

/// Default cap applied to model responses when neither the CLI nor the
/// HTTP client provides max_tokens. Sized to cover typical short-to-medium
/// chat replies and structured JSON output, while leaving 3072 tokens of
/// the 4096-token context window for input.
/// Read by both surfaces (CLI: main.swift, server: Handlers.swift) so the
/// two stay in lock-step.
public static let defaultMaxResponseTokens: Int = 1024
// No fallback for max_tokens lives here on purpose. Omitted max_tokens
// flows through as nil; FoundationModels uses whatever room is left in
// the 4096-token window. Output-side overflow is handled by
// StreamErrorResolver as finish_reason: "length", so no arbitrary cap
// is needed. Drop-in OpenAI semantics, full window utilisation.

/// Vestigial in v1.3.3+. Omitted max_tokens now flows through as nil and
/// FoundationModels uses the remaining context window; this constant is
/// no longer consulted anywhere in the codebase. Kept solely for ApfelCore
/// API stability for one release, slated for removal in 2.0.0.
@available(*, deprecated, message: "No longer used. Omitted max_tokens flows through as nil; FoundationModels uses the remaining 4096-token context window. Output-side overflow is surfaced as finish_reason: \"length\". Will be removed in 2.0.0.")
public static let defaultMaxResponseTokens: Int = 0
}
61 changes: 61 additions & 0 deletions Sources/Core/Chat/StreamOutcome.swift
@@ -0,0 +1,61 @@
// ============================================================================
// StreamOutcome.swift — Pure decision logic for handling errors thrown
// during a streaming model response.
//
// On Apple's on-device FoundationModels, hitting the 4096-token context
// ceiling after producing some content surfaces as a thrown error rather
// than as a natural EOS. That throw is morally equivalent to OpenAI's
// finish_reason: "length", not to a server error. This resolver makes the
// distinction:
//
// - prev empty + contextOverflow -> the prompt itself is too big.
// Genuine 400, propagate.
// - prev non-empty + contextOverflow -> the model ran out of room while
// generating. Treat as a graceful
// truncation; emit finish_reason:
// "length" and hand the partial
// content back to the caller.
// - any other error -> propagate.
//
// With this resolver in place, the historical 1024 max_tokens default is no
// longer load-bearing: omitted max_tokens flows through as nil, the model
// uses the remaining context window, and any overflow surfaces cleanly.
// ============================================================================

import Foundation

public struct StreamOutcome: Sendable, Equatable, Hashable {
public let content: String
public let finishReason: FinishReason

public init(content: String, finishReason: FinishReason) {
self.content = content
self.finishReason = finishReason
}
}

public enum StreamErrorResolution: Sendable {
/// The model produced partial content before the throw. Treat as a clean
/// length-finish; do not propagate the error.
case truncated(String)
/// Genuine error. Propagate.
case fatal(ApfelError)
}

public enum StreamErrorResolver {
/// Decide how a stream-time error should be handled.
///
/// - parameter prev: the content accumulated by the streaming loop before the throw.
/// Empty means no tokens were emitted (the throw is prompt-side).
/// - parameter error: the classified error.
/// - returns: `.truncated(prev)` only when the error is a context overflow AND
/// `prev` is non-empty. Everything else is fatal.
public static func resolve(prev: String, error: ApfelError) -> StreamErrorResolution {
switch error {
case .contextOverflow where !prev.isEmpty:
return .truncated(prev)
default:
return .fatal(error)
}
}
}
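
For orientation, here is how the resolver would slot into a streaming loop's catch path - a sketch, not the real call site; the `deltas` stream stands in for apfel's internal FoundationModels plumbing, while `StreamErrorResolver`, `StreamOutcome`, and `ApfelError.classify` are the types from this PR:

```swift
// Hypothetical wrapper showing where StreamErrorResolver fits.
func collectWithGracefulTruncation(
    deltas: AsyncThrowingStream<String, Error>
) async throws -> StreamOutcome {
    var prev = ""
    do {
        for try await delta in deltas {
            prev += delta
        }
        return StreamOutcome(content: prev, finishReason: .stop)
    } catch {
        switch StreamErrorResolver.resolve(prev: prev, error: ApfelError.classify(error)) {
        case .truncated(let partial):
            // Overflow after some output was produced: graceful length-finish.
            return StreamOutcome(content: partial, finishReason: .length)
        case .fatal(let classified):
            // No output yet, or a different failure entirely: propagate.
            throw classified
        }
    }
}
```
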
9 changes: 8 additions & 1 deletion Sources/Core/ToolCallHandler.swift
@@ -17,9 +17,16 @@ public struct ToolDef: Sendable {
package struct ProcessPromptResult: Sendable {
public let content: String
public let toolLog: [ToolLogEntry]
public let finishReason: FinishReason

public init(content: String, toolLog: [ToolLogEntry], finishReason: FinishReason) {
self.content = content; self.toolLog = toolLog; self.finishReason = finishReason
}

/// Pre-1.3.3 initialiser preserved for source compatibility. Delegates to
/// the three-argument init with `finishReason: .stop`.
public init(content: String, toolLog: [ToolLogEntry]) {
self.content = content; self.toolLog = toolLog
self.init(content: content, toolLog: toolLog, finishReason: .stop)
}
}
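
A quick in-package sketch of the compatibility path (the type is `package`-scoped, so this only compiles inside ApfelCore; the literal values are illustrative):

```swift
// Pre-1.3.3 call shape still compiles; finishReason defaults to .stop.
let legacy = ProcessPromptResult(content: "ok", toolLog: [])
assert(legacy.finishReason == .stop)

// New call sites state the finish reason explicitly.
let truncated = ProcessPromptResult(
    content: "partial output", toolLog: [], finishReason: .length)
assert(truncated.finishReason == .length)
```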

53 changes: 42 additions & 11 deletions Sources/Handlers.swift
@@ -77,7 +77,7 @@ func handleChatCompletion(_ request: Request, context: some RequestContext) asyn
// Build session options from request (retry config comes from server config)
let sessionOpts = SessionOptions(
temperature: chatRequest.temperature,
maxTokens: chatRequest.max_tokens ?? BodyLimits.defaultMaxResponseTokens,
maxTokens: chatRequest.max_tokens,
seed: chatRequest.seed.map { UInt64($0) },
permissive: serverState.config.permissive,
contextConfig: contextConfig,
@@ -329,11 +329,12 @@ private func nonStreamingResponse(
events: [String]
) async throws -> (response: Response, trace: ChatRequestTrace) {
let nsRetryMax = serverState.config.retryEnabled ? serverState.config.retryCount : 0
let rawContent: String
let outcome: StreamOutcome
do {
rawContent = try await withRetry(maxRetries: nsRetryMax) {
let result = try await session.respond(to: prompt, options: genOpts)
return result.content
// Route non-streaming through collectStream so output-side context
// overflow surfaces as a graceful length-finish on this path too.
outcome = try await withRetry(maxRetries: nsRetryMax) {
try await collectStream(session, prompt: prompt, printDelta: false, options: genOpts)
}
} catch {
let classified = ApfelError.classify(error)
@@ -355,6 +356,7 @@
event: "model error: \(classified.cliLabel)"
)
}
let rawContent = outcome.content

// Detect tool calls in response
let toolCalls = ToolCallHandler.detectToolCall(in: rawContent)
@@ -371,11 +373,9 @@ }
}

let completionTokens = await TokenCounter.shared.count(deliveredContent)
let finishReason = FinishReasonResolver.resolve(
hasToolCalls: toolCalls != nil,
completionTokens: completionTokens,
maxTokens: genOpts.maximumResponseTokens
).openAIValue
// collectStream already resolved .stop vs .length (cap-hit and output-side
// overflow); only override here when tool calls are detected.
let finishReason = (toolCalls != nil ? FinishReason.toolCalls : outcome.finishReason).openAIValue

let payload = ChatCompletionResponse(
id: id,
@@ -536,7 +536,38 @@ private func streamingResponse(
await eventBox.append("stream cancelled by client")
} catch {
let classified = ApfelError.classify(error)
if case .refusal(let explanation) = classified {
// Output-side context overflow with content already streamed is
// a graceful length-finish, not an error. See StreamErrorResolver.
if case .truncated(let truncatedContent) = StreamErrorResolver.resolve(prev: prev, error: classified) {
completionTokens = await TokenCounter.shared.count(truncatedContent)
let lengthChunk = ChatCompletionChunk(
id: id, object: "chat.completion.chunk", created: created, model: modelName,
choices: [.init(
index: 0,
delta: .init(role: nil, content: nil, tool_calls: nil),
finish_reason: FinishReason.length.openAIValue,
logprobs: nil
)],
usage: nil
)
let lengthLine = sseDataLine(lengthChunk)
responseLines?.append(lengthLine.trimmingCharacters(in: .whitespacesAndNewlines))
continuation.yield(ByteBuffer(string: lengthLine))

if includeUsage {
let usageChunk = sseUsageChunk(
id: id, created: created,
promptTokens: promptTokens, completionTokens: completionTokens
)
let usageLine = sseDataLine(usageChunk)
responseLines?.append(usageLine.trimmingCharacters(in: .whitespacesAndNewlines))
continuation.yield(ByteBuffer(string: usageLine))
}

continuation.yield(ByteBuffer(string: sseDone))
responseLines?.append("data: [DONE]")
await eventBox.append("stream truncated by context, finish_reason=length total_chars=\(truncatedContent.count)")
} else if case .refusal(let explanation) = classified {
// OpenAI wire format: stream a refusal delta, then a final
// chunk with finish_reason=content_filter, then [DONE].
let refusalLine = sseDataLine(sseRefusalChunk(id: id, created: created, refusal: explanation))