
fix: raise default n_ctx from 4096 to 8192 and fix UTF-8 token boundary #187

Merged
Michael-A-Kuykendall merged 1 commit into Michael-A-Kuykendall:main from LopezNuance:fix/kvcache-ctx-and-utf8
Mar 25, 2026
Conversation

@LopezNuance
Contributor

Summary

Fixes two bugs found during multi-round LLM deliberation experiments with thinking models (qwen3:8b, cogito:8b) on the Shimmy v1.9.0 GPU build, over several hundred experiment runs.

Fix 1 — n_ctx default 4096 → 8192

model_registry.rs (3 locations) and main.rs (5 locations) hardcode ctx_len=4096. With thinking models a single deliberation round exhausts the KV cache:

system prompt (~80t) + task (~200t) + prior draft (~1610t)
+ transcript (~500t) + CoT chain (~1000t) + output (2048t) = 5438t > 4096

This causes NoKvCacheSlot errors that surface as HTTP 502 Bad Gateway. Fixed to 8192 in all six locations. A follow-up improvement would be to read context_length from the GGUF metadata via llama_model_meta_val_str so each model uses its own native default.
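The fallback logic for that follow-up could look like the sketch below. This is a hypothetical helper, not code from this PR: `effective_ctx_len` and its `meta_context_length` parameter stand in for whatever string `llama_model_meta_val_str` returns for the architecture's `context_length` key.

```rust
/// Hypothetical helper: prefer the model's own context_length from GGUF
/// metadata, fall back to the new 8192 default when it is absent or invalid.
fn effective_ctx_len(meta_context_length: Option<&str>) -> u32 {
    const DEFAULT_N_CTX: u32 = 8192; // raised from 4096 by this PR

    meta_context_length
        .and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_N_CTX)
}

fn main() {
    // A model advertising a 32768-token native context keeps it.
    assert_eq!(effective_ctx_len(Some("32768")), 32768);
    // Missing or unparseable metadata falls back to the new default.
    assert_eq!(effective_ctx_len(None), 8192);
    assert_eq!(effective_ctx_len(Some("not-a-number")), 8192);
}
```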

Fix 2 — UTF-8 token boundary crash

engine/llama.rs generation loop called token_to_str(token, Special::Plaintext)? which calls String::from_utf8(bytes)?. Byte-level tokenizers (qwen3, qwen2.5, deepseek, and most multilingual models) emit individual bytes as separate tokens — the character 你 (U+4F60) arrives as three consecutive tokens [0xE4, 0xBD, 0xA0]. from_utf8 on a single-byte token fails with FromUtf8Error, the ? propagates it, and the server returns 502.

Fixed to:

let piece = self.model.token_to_bytes(token, Special::Plaintext)
    .map(|b| String::from_utf8_lossy(&b).into_owned())
    .unwrap_or_default();

from_utf8_lossy accepts partial sequences; the complete character is reconstructed correctly as bytes accumulate across tokens.
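A minimal standalone reproduction of the failure mode (not Shimmy's actual generation loop, which streams tokens from llama.cpp):

```rust
fn main() {
    // 你 (U+4F60) encodes to three UTF-8 bytes; a byte-level tokenizer can
    // emit each byte as its own token.
    let token_bytes: [&[u8]; 3] = [&[0xE4], &[0xBD], &[0xA0]];

    // Old behavior: strict from_utf8 on a single-byte token returns
    // FromUtf8Error, which the `?` propagated into a 502 upstream.
    assert!(String::from_utf8(token_bytes[0].to_vec()).is_err());

    // New behavior: bytes accumulate across tokens, and lossy conversion
    // never fails; once all three bytes arrive the character is intact.
    let mut buf: Vec<u8> = Vec::new();
    for b in token_bytes {
        buf.extend_from_slice(b);
    }
    assert_eq!(String::from_utf8_lossy(&buf), "你");
}
```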

Test plan

  • Regression test tests/regression/issue_182_kvcache_ctx_default.rs — 4 tests verifying 8192 default and that the failing scenario (5438 tokens) fits within 8192 but not 4096
  • Regression test tests/regression/issue_183_utf8_token_boundary.rs — 5 tests verifying multi-byte character reconstruction, partial sequence tolerance, ASCII passthrough, empty token handling
  • cargo test --test regression passes (all existing + new tests)
  • Validated on RTX 4090 with qwen3:8b, cogito:8b, gemma3:1b — KV cache confirmed at 8192 cells in Shimmy log, no NoKvCacheSlot errors across 33-task tier-3 evaluation suite
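The core budget assertion from the first regression test can be sketched like this (names hypothetical; the actual test file is tests/regression/issue_182_kvcache_ctx_default.rs):

```rust
fn main() {
    // Token budget from the failing deliberation round described above.
    let budget: u32 = 80 + 200 + 1610 + 500 + 1000 + 2048;
    assert_eq!(budget, 5438);
    assert!(budget > 4096, "overflows the old ctx_len default");
    assert!(budget <= 8192, "fits within the new ctx_len default");
}
```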

Context

These fixes came out of running ACMT (Author-Critic MetaTransformer), a multi-model LLM deliberation pipeline that uses Shimmy as its exclusive inference backend. Shimmy's speed and GGUF-native loading make it ideal for multi-model chained inference workloads.

Scott Johnson (@LopezNuance)

Fixes two bugs found during multi-round LLM deliberation experiments with
qwen3:8b, cogito:8b, and gemma3:1b on Shimmy v1.9.0 GPU build.

## Fix 1 — n_ctx default 4096 → 8192 (issue Michael-A-Kuykendall#182)

model_registry.rs (3 locations) and main.rs (5 locations) hardcode
ctx_len=4096.  With thinking models (qwen3, cogito, deepseek-r1) a single
deliberation round exhausts the KV cache:

  system prompt (~80t) + task (~200t) + prior draft (~1610t)
  + transcript (~500t) + CoT chain (~1000t) + output (2048t) = 5438t > 4096

This causes NoKvCacheSlot errors that surface as HTTP 502 Bad Gateway.

Fixed to 8192 in all six locations.  A follow-up improvement would be to
read context_length from the GGUF metadata via llama_model_meta_val_str
so each model uses its own native default.

Regression test: tests/regression/issue_182_kvcache_ctx_default.rs

## Fix 2 — UTF-8 token boundary crash (issue Michael-A-Kuykendall#183)

engine/llama.rs generation loop called:
  token_to_str(token, Special::Plaintext)?

token_to_str calls String::from_utf8(bytes)?.  Byte-level tokenizers
(qwen3, qwen2.5, deepseek, and most multilingual models) emit individual
bytes as separate tokens — the character 你 (U+4F60) arrives as three
consecutive tokens [0xE4, 0xBD, 0xA0].  from_utf8 on a single-byte token
fails with FromUtf8Error, the ? propagates it, and the server returns 502.

Fixed to:
  token_to_bytes(token, Special::Plaintext)
      .map(|b| String::from_utf8_lossy(&b).into_owned())
      .unwrap_or_default()

from_utf8_lossy accepts partial sequences; the complete character is
reconstructed correctly as bytes accumulate across tokens.

Regression test: tests/regression/issue_183_utf8_token_boundary.rs

Signed-off-by: Scott Johnson <m6gmjmjwfw@liamekaens.com>
Signed-off-by: scott <scott@procyon.here>
@Michael-A-Kuykendall (Owner) left a comment


Approved and welcome to the Shimmy Team!

@Michael-A-Kuykendall merged commit 0db9b4a into Michael-A-Kuykendall:main Mar 25, 2026
1 of 4 checks passed
