33 changes: 11 additions & 22 deletions README.md
@@ -233,7 +233,7 @@ alias apfel=apfel-run # optional, every apfel flag still works
| `GET /v1/logs`, `/v1/logs/stats` | Debug only | Requires `--debug` |
| Tool calling | Supported | Native `ToolDefinition` + JSON detection. See [docs/tool-calling-guide.md](docs/tool-calling-guide.md) |
| `response_format: json_object` | Supported | System-prompt injection; markdown fences stripped from output |
| `temperature`, `max_tokens`, `seed` | Supported | Mapped to `GenerationOptions`. **`max_tokens` defaults to 1024 when omitted** (CLI + server share the constant) - see [Default response cap](#default-response-cap-max_tokens) |
| `temperature`, `max_tokens`, `seed` | Supported | Mapped to `GenerationOptions`. Omitting `max_tokens` uses the remaining context window (drop-in OpenAI semantics) - see [Default response cap](#default-response-cap-max_tokens) |
| `stream: true` | Supported | SSE; final usage chunk only when `stream_options: {"include_usage": true}` (per OpenAI spec) |
| `finish_reason` | Supported | `stop`, `tool_calls`, `length` |
| Context strategies | Supported | `x_context_strategy`, `x_context_max_turns`, `x_context_output_reserve` extension fields |
@@ -248,35 +248,24 @@ Full API spec: [openai/openai-openapi](https://github.com/openai/openai-openapi)

## Default response cap (`max_tokens`)

When `max_tokens` is not specified, **both the CLI and the OpenAI-compatible server** apply a default cap of **1024 tokens**. Same value, single source of truth ([`BodyLimits.defaultMaxResponseTokens`](https://github.com/Arthur-Ficial/apfel/blob/main/Sources/Core/Chat/BodyLimits.swift)). Best practice: **always set `max_tokens` explicitly** to a value that matches your use case.
When `max_tokens` is omitted, **CLI and OpenAI-compatible server behave identically**: the value flows through as `nil` and the model uses whatever room is left in the 4096-token context window. This is drop-in OpenAI semantics - no arbitrary fallback constant.

### Why a default exists

The on-device model has a **4096-token context window** that holds input *and* output combined. With no cap, generation runs until that window overflows, which produces an unrecoverable `[context overflow]` error after ~50 seconds of wasted generation - the client gets nothing usable. 1024 tokens covers typical structured JSON, short-to-medium chat replies, and most single-pass code blocks, while leaving 3072 tokens of the window for input.

### When the cap is hit

The response sets `finish_reason: "length"` (per the OpenAI spec) so the client can detect a truncated reply. If 1024 tokens is too short, raise it explicitly with `max_tokens` (or `--max-tokens` on the CLI) - up to whatever leaves room for your input inside the 4096-token window.
The on-device model has a **4096-token context window** that holds input *and* output combined. If generation runs into the ceiling, the response ends cleanly with `finish_reason: "length"` and the partial content is returned (server: HTTP 200; CLI: exit 0 with a stderr warning). Pass `max_tokens` explicitly when you want a tighter latency budget or a known cap for your client.
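
A client can detect the truncation from the standard OpenAI response shape alone. A minimal Swift sketch - the response body below is illustrative, not captured from a real run:

```swift
import Foundation

// Illustrative response body; a real client reads this from the server.
let body = """
{"choices":[{"message":{"role":"assistant","content":"partial reply..."},"finish_reason":"length"}]}
"""

struct Completion: Decodable {
    struct Choice: Decodable {
        struct Message: Decodable { let content: String }
        let message: Message
        let finish_reason: String
    }
    let choices: [Choice]
}

do {
    let completion = try JSONDecoder().decode(Completion.self, from: Data(body.utf8))
    if let choice = completion.choices.first, choice.finish_reason == "length" {
        // Truncated at the context window: keep the partial content and
        // retry with a trimmed prompt or continue the conversation.
        print("truncated reply:", choice.message.content)
    }
} catch {
    print("decode failed:", error)
}
```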

### Examples

Without `max_tokens` (default 1024 applied, fast and bounded):

```bash
# Omitted: uses remaining window, finish_reason: "stop" or "length"
curl -sS http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"apple-foundationmodel",
"messages":[{"role":"user","content":"Reply SKIP, MOVE, or RENAME."}]}'
# ~1s, returns a short reply, finish_reason: "stop" or "length"
```

With explicit `max_tokens` (recommended - sized to your need):

```bash
# Explicit cap (recommended for tight latency budgets)
curl -sS http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"apple-foundationmodel","max_tokens":1024,
"messages":[{"role":"user","content":"Summarise this paragraph: ..."}]}'
-d '{"model":"apple-foundationmodel","max_tokens":128,
"messages":[{"role":"user","content":"Summarise: ..."}]}'
```

### Picking a value
@@ -287,16 +276,16 @@ curl -sS http://localhost:11434/v1/chat/completions \
| One-line instruction | 64 - 128 |
| Short paragraph | 256 - 512 |
| Long paragraph / structured JSON | 1024 - 2048 |
| As long as the context window allows | 4096 minus your input token count |
| As long as the context window allows | omit it |

Keep `input_tokens + max_tokens` comfortably below 4096. The context trimmer drops oldest history first to fit the input, but if the requested `max_tokens` leaves no room for any output, generation overflows and the request fails with `[context overflow]`. The validator only rejects non-positive values (`max_tokens <= 0`), so sizing is your responsibility.
Keep `input_tokens + max_tokens` comfortably below 4096. If the prompt itself exceeds the window, generation cannot start and the request fails with `[context overflow]` (HTTP 400 / CLI exit 4). The validator rejects non-positive values (`max_tokens <= 0`).
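
If you want a pre-flight check on the client side, the budget rule reduces to one comparison. A sketch - `fitsWindow` is a hypothetical helper, not part of apfel's API; 4096 is simply the documented window size:

```swift
/// Rough client-side budget check mirroring the rule above.
/// Illustrative only; the server's context trimmer is the real authority.
func fitsWindow(inputTokens: Int, maxTokens: Int?, window: Int = 4096) -> Bool {
    guard let maxTokens else {
        // max_tokens omitted: any prompt short of the window leaves room.
        return inputTokens < window
    }
    return inputTokens + maxTokens < window
}

print(fitsWindow(inputTokens: 3500, maxTokens: 1024)) // false - would overflow
print(fitsWindow(inputTokens: 3500, maxTokens: nil))  // true  - output uses the rest
```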

### CLI parity

The CLI applies the **same 1024-token default** as the server. The two surfaces read from the same `BodyLimits.defaultMaxResponseTokens` constant - they cannot drift. Override with `--max-tokens N` or `APFEL_MAX_TOKENS=N`.
CLI and server share one rule: omitted means use the remaining window; there is no constant to drift. Override with `--max-tokens N` or `APFEL_MAX_TOKENS=N`.

```bash
apfel "Reply SKIP." # default cap (1024) applies
apfel "Reply SKIP." # uses remaining window
apfel --max-tokens 64 "Reply SKIP." # explicit cap
APFEL_MAX_TOKENS=2048 apfel "..." # via env var
```
4 changes: 4 additions & 0 deletions Sources/CLI.swift
@@ -69,6 +69,10 @@ func singlePrompt(_ prompt: String, systemPrompt: String?, stream: Bool, options
metadata: .init(onDevice: true, version: version))
print(jsonString(obj), terminator: "")
}

if result.finishReason == .length {
printStderr("\(styled("apfel:", .yellow)) response truncated at the context window (finish_reason=length). Pass --max-tokens to control the cap explicitly.")
}
}

// MARK: - Interactive Chat
19 changes: 12 additions & 7 deletions Sources/Core/Chat/BodyLimits.swift
@@ -13,11 +13,16 @@ package enum BodyLimits {
/// into the 4096-token context window.
public static let defaultOutputReserveTokens: Int = 512

/// Default cap applied to model responses when neither the CLI nor the
/// HTTP client provides max_tokens. Sized to cover typical short-to-medium
/// chat replies and structured JSON output, while leaving 3072 tokens of
/// the 4096-token context window for input.
/// Read by both surfaces (CLI: main.swift, server: Handlers.swift) so the
/// two stay in lock-step.
public static let defaultMaxResponseTokens: Int = 1024
// No fallback for max_tokens lives here on purpose. Omitted max_tokens
// flows through as nil; FoundationModels uses whatever room is left in
// the 4096-token window. Output-side overflow is handled by
// StreamErrorResolver as finish_reason: "length", so no arbitrary cap
// is needed. Drop-in OpenAI semantics, full window utilisation.

/// Vestigial in v1.3.3+. Omitted max_tokens now flows through as nil and
/// FoundationModels uses the remaining context window; this constant is
/// no longer consulted anywhere in the codebase. Kept solely for ApfelCore
/// API stability for one release, slated for removal in 2.0.0.
@available(*, deprecated, message: "No longer used. Omitted max_tokens flows through as nil; FoundationModels uses the remaining 4096-token context window. Output-side overflow is surfaced as finish_reason: \"length\". Will be removed in 2.0.0.")
public static let defaultMaxResponseTokens: Int = 0
}
61 changes: 61 additions & 0 deletions Sources/Core/Chat/StreamOutcome.swift
@@ -0,0 +1,61 @@
// ============================================================================
// StreamOutcome.swift — Pure decision logic for handling errors thrown
// during a streaming model response.
//
// On Apple's on-device FoundationModels, hitting the 4096-token context
// ceiling after producing some content surfaces as a thrown error rather
// than as a natural EOS. That throw is morally equivalent to OpenAI's
// finish_reason: "length", not to a server error. This resolver makes the
// distinction:
//
// - prev empty + contextOverflow -> the prompt itself is too big.
// Genuine 400, propagate.
// - prev non-empty + contextOverflow -> the model ran out of room while
// generating. Treat as a graceful
// truncation; emit finish_reason:
// "length" and hand the partial
// content back to the caller.
// - any other error -> propagate.
//
// With this resolver in place, the historical 1024 max_tokens default is no
// longer load-bearing: omitted max_tokens flows through as nil, the model
// uses the remaining context window, and any overflow surfaces cleanly.
// ============================================================================

import Foundation

public struct StreamOutcome: Sendable, Equatable, Hashable {
public let content: String
public let finishReason: FinishReason

public init(content: String, finishReason: FinishReason) {
self.content = content
self.finishReason = finishReason
}
}

public enum StreamErrorResolution: Sendable {
/// The model produced partial content before the throw. Treat as a clean
/// length-finish; do not propagate the error.
case truncated(String)
/// Genuine error. Propagate.
case fatal(ApfelError)
}

public enum StreamErrorResolver {
/// Decide how a stream-time error should be handled.
///
/// - parameter prev: the content accumulated by the streaming loop before the throw.
/// Empty means no tokens were emitted (the throw is prompt-side).
/// - parameter error: the classified error.
/// - returns: `.truncated(prev)` only when the error is a context overflow AND
/// `prev` is non-empty. Everything else is fatal.
public static func resolve(prev: String, error: ApfelError) -> StreamErrorResolution {
switch error {
case .contextOverflow where !prev.isEmpty:
return .truncated(prev)
default:
return .fatal(error)
}
}
}
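
For orientation, here is how the resolver would slot into a streaming loop's catch path - a sketch, not the real call site; the `deltas` stream stands in for apfel's internal FoundationModels plumbing, while `StreamErrorResolver`, `StreamOutcome`, and `ApfelError.classify` are the types from this PR:

```swift
// Hypothetical wrapper showing where StreamErrorResolver fits.
func collectWithGracefulTruncation(
    deltas: AsyncThrowingStream<String, Error>
) async throws -> StreamOutcome {
    var prev = ""
    do {
        for try await delta in deltas {
            prev += delta
        }
        return StreamOutcome(content: prev, finishReason: .stop)
    } catch {
        switch StreamErrorResolver.resolve(prev: prev, error: ApfelError.classify(error)) {
        case .truncated(let partial):
            // Overflow after some output was produced: graceful length-finish.
            return StreamOutcome(content: partial, finishReason: .length)
        case .fatal(let classified):
            // No output yet, or a different failure entirely: propagate.
            throw classified
        }
    }
}
```
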
9 changes: 8 additions & 1 deletion Sources/Core/ToolCallHandler.swift
@@ -17,9 +17,16 @@ public struct ToolDef: Sendable {
package struct ProcessPromptResult: Sendable {
public let content: String
public let toolLog: [ToolLogEntry]
public let finishReason: FinishReason

public init(content: String, toolLog: [ToolLogEntry], finishReason: FinishReason) {
self.content = content; self.toolLog = toolLog; self.finishReason = finishReason
}

/// Pre-1.3.3 initialiser preserved for source compatibility. Delegates to
/// the three-argument init with `finishReason: .stop`.
public init(content: String, toolLog: [ToolLogEntry]) {
self.content = content; self.toolLog = toolLog
self.init(content: content, toolLog: toolLog, finishReason: .stop)
}
}
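
A quick in-package sketch of the compatibility path (the type is `package`-scoped, so this only compiles inside ApfelCore; the literal values are illustrative):

```swift
// Pre-1.3.3 call shape still compiles; finishReason defaults to .stop.
let legacy = ProcessPromptResult(content: "ok", toolLog: [])
assert(legacy.finishReason == .stop)

// New call sites state the finish reason explicitly.
let truncated = ProcessPromptResult(
    content: "partial output", toolLog: [], finishReason: .length)
assert(truncated.finishReason == .length)
```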

53 changes: 42 additions & 11 deletions Sources/Handlers.swift
@@ -77,7 +77,7 @@ func handleChatCompletion(_ request: Request, context: some RequestContext) asyn
// Build session options from request (retry config comes from server config)
let sessionOpts = SessionOptions(
temperature: chatRequest.temperature,
maxTokens: chatRequest.max_tokens ?? BodyLimits.defaultMaxResponseTokens,
maxTokens: chatRequest.max_tokens,
seed: chatRequest.seed.map { UInt64($0) },
permissive: serverState.config.permissive,
contextConfig: contextConfig,
@@ -329,11 +329,12 @@ private func nonStreamingResponse(
events: [String]
) async throws -> (response: Response, trace: ChatRequestTrace) {
let nsRetryMax = serverState.config.retryEnabled ? serverState.config.retryCount : 0
let rawContent: String
let outcome: StreamOutcome
do {
rawContent = try await withRetry(maxRetries: nsRetryMax) {
let result = try await session.respond(to: prompt, options: genOpts)
return result.content
// Route non-streaming through collectStream so output-side context
// overflow surfaces as a graceful length-finish on this path too.
outcome = try await withRetry(maxRetries: nsRetryMax) {
try await collectStream(session, prompt: prompt, printDelta: false, options: genOpts)
}
} catch {
let classified = ApfelError.classify(error)
@@ -355,6 +356,7 @@
event: "model error: \(classified.cliLabel)"
)
}
let rawContent = outcome.content

// Detect tool calls in response
let toolCalls = ToolCallHandler.detectToolCall(in: rawContent)
@@ -371,11 +373,9 @@ }
}

let completionTokens = await TokenCounter.shared.count(deliveredContent)
let finishReason = FinishReasonResolver.resolve(
hasToolCalls: toolCalls != nil,
completionTokens: completionTokens,
maxTokens: genOpts.maximumResponseTokens
).openAIValue
// collectStream already resolved .stop vs .length (cap-hit and output-side
// overflow); only override here when tool calls are detected.
let finishReason = (toolCalls != nil ? FinishReason.toolCalls : outcome.finishReason).openAIValue

let payload = ChatCompletionResponse(
id: id,
@@ -536,7 +536,38 @@ private func streamingResponse(
await eventBox.append("stream cancelled by client")
} catch {
let classified = ApfelError.classify(error)
if case .refusal(let explanation) = classified {
// Output-side context overflow with content already streamed is
// a graceful length-finish, not an error. See StreamErrorResolver.
if case .truncated(let truncatedContent) = StreamErrorResolver.resolve(prev: prev, error: classified) {
completionTokens = await TokenCounter.shared.count(truncatedContent)
let lengthChunk = ChatCompletionChunk(
id: id, object: "chat.completion.chunk", created: created, model: modelName,
choices: [.init(
index: 0,
delta: .init(role: nil, content: nil, tool_calls: nil),
finish_reason: FinishReason.length.openAIValue,
logprobs: nil
)],
usage: nil
)
let lengthLine = sseDataLine(lengthChunk)
responseLines?.append(lengthLine.trimmingCharacters(in: .whitespacesAndNewlines))
continuation.yield(ByteBuffer(string: lengthLine))

if includeUsage {
let usageChunk = sseUsageChunk(
id: id, created: created,
promptTokens: promptTokens, completionTokens: completionTokens
)
let usageLine = sseDataLine(usageChunk)
responseLines?.append(usageLine.trimmingCharacters(in: .whitespacesAndNewlines))
continuation.yield(ByteBuffer(string: usageLine))
}

continuation.yield(ByteBuffer(string: sseDone))
responseLines?.append("data: [DONE]")
await eventBox.append("stream truncated by context, finish_reason=length total_chars=\(truncatedContent.count)")
} else if case .refusal(let explanation) = classified {
// OpenAI wire format: stream a refusal delta, then a final
// chunk with finish_reason=content_filter, then [DONE].
let refusalLine = sseDataLine(sseRefusalChunk(id: id, created: created, refusal: explanation))