Sequential Hidden Decoding

Scaling Sequence Length in Pretraining

WeChat AI, Tencent

Blog   License   Models


Scale sequence length by n× with only Embedding parameters — same Transformer, more compute per token


Key Idea

Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
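The expansion step can be sketched in a few lines. This is an illustrative reconstruction only (the actual implementation lives in the `qwen3_scale_seq` patch); the function name and array layout here are our own:

```python
import numpy as np

def expand_sequence(token_ids, embeddings):
    """Encode the same token sequence once per embedding matrix and interleave.

    token_ids:  (T,) int array of token ids
    embeddings: list of n (V, d) embedding matrices, one per hidden-decoding step
    returns:    (n*T, d) interleaved input, plus a boolean loss mask that is
                True only at the last (n-th) embedding of each token
    """
    n = len(embeddings)
    T = len(token_ids)
    d = embeddings[0].shape[1]
    expanded = np.empty((n * T, d))
    for t, tok in enumerate(token_ids):
        for k, E in enumerate(embeddings):
            # k-th latent reasoning step for token t
            expanded[t * n + k] = E[tok]
    loss_mask = np.zeros(n * T, dtype=bool)
    loss_mask[n - 1 :: n] = True  # only the last embedding per token is supervised
    return expanded, loss_mask
```

The Transformer then runs unchanged over the `n*T`-length input; positions where the mask is False act as latent computation steps and contribute no next-token loss.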


Results

Evaluated on Qwen3-8B-Base with progressive Sequential Hidden Decoding scaling (non-thinking, base model):

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|---|---|---|---|---|---|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| Hellaswag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |

Models

Note: All released models are base models (not instruction-tuned). They are intended for benchmarking, text completion, and as foundations for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.

All models share the same 8B Transformer backbone — only the Embedding parameters grow:

| Model | Embedding Params | Training Tokens | Link |
|---|---|---|---|
| Sequential-Hidden-Decoding-8B-n2 | 1.9B | 75B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n4 | 3.1B | 150B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n8 | 5.6B | 187B | HuggingFace |
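The Embedding Params column is consistent with n input embedding matrices plus one output head at Qwen3-8B-Base dimensions (vocab 151,936, hidden 4,096, from the public model config). Note that this decomposition is our inference, not something the repository states:

```python
# Assumed Qwen3-8B-Base dimensions (not stated in this repo)
VOCAB, HIDDEN = 151_936, 4_096

def embedding_params(n):
    # n input embedding matrices + 1 output projection, each (VOCAB, HIDDEN)
    return (n + 1) * VOCAB * HIDDEN

for n in (2, 4, 8):
    print(f"n={n}: {embedding_params(n) / 1e9:.1f}B")
# → 1.9B, 3.1B, 5.6B, matching the table above
```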

Installation (Inference)

Option 1: Docker Image (Recommended)

```shell
docker pull aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126
```

Option 2: Use the Forked SGLang

```shell
git clone https://github.com/exlaw/sglang.git
cd sglang
pip install -e "python[all]"
```

Option 3: Apply Patch Manually

Apply the patch on top of SGLang:

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout 4efe844a2
git apply /path/to/hidden_decoding.patch
pip install -e "python[all]"
```

Serving

Launch the server:

```shell
python -m sglang.launch_server \
    --model-path /path/to/Sequential-Hidden-Decoding-8B-n2 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128
```
Or with Docker:

```shell
docker run --gpus all -p 30000:30000 \
  -v /path/to/models:/models \
  aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126 \
  python -m sglang.launch_server \
    --model-path /models/Sequential-Hidden-Decoding-8B-n2 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128
```

Note: Sequential Hidden Decoding models process n×-length sequences internally, so --chunked-prefill-size -1 (disable chunked prefill), --attention-backend fa3, and reduced batch sizes are important for stability and performance. Adjust --tp-size for multi-GPU setups.

Query the model:

These are base models — use the /v1/completions endpoint, not chat completions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="/path/to/Sequential-Hidden-Decoding-8B-n2",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)
```

Patch Contents

The patch adds the qwen3_scale_seq model architecture and modifies the scheduler, batch manager, and CUDA graph runner to handle the expanded sequence length.


Citation

If you find this work useful, please cite our blog post:

```bibtex
@article{hidden_decoding_2026,
  title   = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
  year    = {2026},
  url     = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}
```

Contact

Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)

License

This project is released under the License Terms of Sequential-Hidden-Decoding. The dependent open-source models and software components remain licensed under their respective original licenses — see the LICENSE file for details.
