Sequential Hidden Decoding
WeChat AI, Tencent
Sequential Hidden Decoding prepares n independent embedding matrices that encode the same token sequence n times, interleaves the results, and feeds the n×-length sequence into a single shared Transformer. Only the last embedding of each token computes the next-token loss; the preceding n−1 embeddings serve as implicit reasoning steps in a continuous latent space.
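The encode-and-interleave step above can be sketched in plain Python (our illustration, not the released code; table contents and dimensions are toy values):

```python
import random

# n independent embedding tables encode the same token sequence, and the
# results are interleaved into an n-times-longer sequence for the Transformer.
random.seed(0)
vocab, d_model, n = 100, 8, 4
tables = [[[random.random() for _ in range(d_model)] for _ in range(vocab)]
          for _ in range(n)]  # n independent embedding matrices

tokens = [5, 17, 42]  # a 3-token sequence

# Interleave: [e_1(t_0), e_2(t_0), ..., e_n(t_0), e_1(t_1), ..., e_n(t_1), ...]
interleaved = [tables[k][t] for t in tokens for k in range(n)]

# Only the last of each token's n embeddings feeds the next-token loss;
# the preceding n-1 positions act as latent reasoning steps.
loss_positions = [i for i in range(len(interleaved)) if i % n == n - 1]

print(len(interleaved))  # 12 internal positions for a 3-token sequence
print(loss_positions)    # [3, 7, 11]
```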
Evaluated on Qwen3-8B-Base with progressive Sequential Hidden Decoding scaling (non-thinking, base model):
| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|---|---|---|---|---|---|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| Hellaswag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |
Note: All released models are base models (not instruction-tuned). They are intended for benchmarking, text completion, and as foundations for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.
All models share the same 8B Transformer backbone — only the Embedding parameters grow:
| Model | Scale | Embedding Params | Training Tokens | Link |
|---|---|---|---|---|
| Sequential-Hidden-Decoding-8B-n2 | 2× | 1.9B | 75B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n4 | 4× | 3.1B | 150B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n8 | 8× | 5.6B | 187B | HuggingFace |
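For intuition, the embedding-parameter counts above are consistent with Qwen3-8B's vocabulary size (151,936) and hidden size (4096) if each model holds n input tables plus one untied output head. That decomposition is our assumption, not something stated in the table:

```python
# Sanity check of the table's embedding-parameter counts (assumed breakdown:
# n input embedding tables plus one untied output head).
vocab, hidden = 151_936, 4096  # Qwen3-8B vocabulary and hidden size
per_table = vocab * hidden     # ~0.62B parameters per embedding matrix

for n, reported in [(2, 1.9), (4, 3.1), (8, 5.6)]:
    total = (n + 1) * per_table / 1e9  # n input tables + 1 output head
    print(f"n={n}: {total:.1f}B (table reports {reported}B)")
```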
To install, either pull the prebuilt Docker image:

```bash
docker pull aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126
```

or install our SGLang fork from source:

```bash
git clone https://github.com/exlaw/sglang.git
cd sglang
pip install -e "python[all]"
```

Alternatively, apply the patch on top of upstream SGLang:

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout 4efe844a2
git apply /path/to/hidden_decoding.patch
pip install -e "python[all]"
```

Launch the server:
```bash
python -m sglang.launch_server \
  --model-path /path/to/Sequential-Hidden-Decoding-8B-n2 \
  --trust-remote-code \
  --tp-size 1 \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.82 \
  --max-running-requests 32 \
  --context-length 131072 \
  --cuda-graph-max-bs 128 \
  --cuda-graph-bs 1 2 4 8 16 32 64 128
```

Or launch via Docker:
```bash
docker run --gpus all -p 30000:30000 \
  -v /path/to/models:/models \
  aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126 \
  python -m sglang.launch_server \
    --model-path /models/Sequential-Hidden-Decoding-8B-n2 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128
```

Note: Sequential Hidden Decoding models process n×-length sequences internally, so `--chunked-prefill-size -1` (disable chunked prefill), `--attention-backend fa3`, and reduced batch sizes are important for stability and performance. Adjust `--tp-size` for multi-GPU setups.
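As one hedged example of the multi-GPU adjustment, a 2-GPU launch changes only the tensor-parallel degree (assuming both GPUs are visible to the process; remaining flags as in the single-GPU command):

```shell
python -m sglang.launch_server \
  --model-path /path/to/Sequential-Hidden-Decoding-8B-n2 \
  --trust-remote-code \
  --tp-size 2 \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --attention-backend fa3
```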
Query the model:
These are base models — use the `/v1/completions` endpoint, not chat completions:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.completions.create(
    model="/path/to/Sequential-Hidden-Decoding-8B-n2",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)
```

The patch adds the `qwen3_scale_seq` model architecture and modifies the scheduler, batch manager, and CUDA graph runner to handle the expanded sequence length.
If you find this work useful, please cite our blog post:
```bibtex
@article{hidden_decoding_2026,
  title = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
  year  = {2026},
  url   = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}
```

Contact: Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)
This project is released under the License Terms of Sequential-Hidden-Decoding. The dependent open-source models and software components remain licensed under their respective original licenses — see the LICENSE file for details.
