Sequential Hidden Decoding

Scaling Sequence Length in Pretraining

WeChat AI, Tencent

Blog   License   Models


Scale sequence length by n× with only Embedding parameters — same Transformer, more compute per token


Key Idea

Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
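The expansion step can be sketched in a few lines. This is an illustrative reconstruction only (the actual implementation lives in the `qwen3_scale_seq` patch); the function name and array layout here are our own:

```python
import numpy as np

def expand_sequence(token_ids, embeddings):
    """Encode the same token sequence once per embedding matrix and interleave.

    token_ids:  (T,) int array of token ids
    embeddings: list of n (V, d) embedding matrices, one per hidden-decoding step
    returns:    (n*T, d) interleaved input, plus a boolean loss mask that is
                True only at the last (n-th) embedding of each token
    """
    n = len(embeddings)
    T = len(token_ids)
    d = embeddings[0].shape[1]
    expanded = np.empty((n * T, d))
    for t, tok in enumerate(token_ids):
        for k, E in enumerate(embeddings):
            # k-th latent reasoning step for token t
            expanded[t * n + k] = E[tok]
    loss_mask = np.zeros(n * T, dtype=bool)
    loss_mask[n - 1 :: n] = True  # only the last embedding per token is supervised
    return expanded, loss_mask
```

The Transformer then runs unchanged over the `n*T`-length input; positions where the mask is False act as latent computation steps and contribute no next-token loss.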


Results

Evaluated on Qwen3-8B-Base with progressive Sequential Hidden Decoding scaling (non-thinking, base model):

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|---|---|---|---|---|---|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| Hellaswag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |

Models

Note: All released models are base models (not instruction-tuned). They are intended for benchmarking, text completion, and as foundations for downstream fine-tuning (SFT / RLHF). For conversational or instruction-following use cases, please fine-tune on your own data.

All models share the same 8B Transformer backbone — only the Embedding parameters grow:

| Model | Embedding Params | Training Tokens | Link |
|---|---|---|---|
| Sequential-Hidden-Decoding-8B-n2 | 1.9B | 75B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n4 | 3.1B | 150B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n8 | 5.6B | 187B | HuggingFace |
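The Embedding Params column is consistent with n input embedding matrices plus one output head at Qwen3-8B-Base dimensions (vocab 151,936, hidden 4,096, from the public model config). Note that this decomposition is our inference, not something the repository states:

```python
# Assumed Qwen3-8B-Base dimensions (not stated in this repo)
VOCAB, HIDDEN = 151_936, 4_096

def embedding_params(n):
    # n input embedding matrices + 1 output projection, each (VOCAB, HIDDEN)
    return (n + 1) * VOCAB * HIDDEN

for n in (2, 4, 8):
    print(f"n={n}: {embedding_params(n) / 1e9:.1f}B")
# → 1.9B, 3.1B, 5.6B, matching the table above
```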

Installation (Inference)

Option 1: Docker Image (Recommended)

```shell
docker pull aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126
```

Option 2: Use the Forked SGLang

```shell
git clone https://github.com/exlaw/sglang.git
cd sglang
pip install -e "python[all]"
```

Option 3: Apply Patch Manually

Apply the patch on top of SGLang:

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout 4efe844a2
git apply /path/to/hidden_decoding.patch
pip install -e "python[all]"
```

Serving

Launch the server:

```shell
python -m sglang.launch_server \
    --model-path /path/to/Sequential-Hidden-Decoding-8B-n2 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128
```
Or with Docker:

```shell
docker run --gpus all -p 30000:30000 \
  -v /path/to/models:/models \
  aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126 \
  python -m sglang.launch_server \
    --model-path /models/Sequential-Hidden-Decoding-8B-n2 \
    --trust-remote-code \
    --tp-size 1 \
    --port 30000 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128
```

Note: Sequential Hidden Decoding models process n×-length sequences internally, so --chunked-prefill-size -1 (disable chunked prefill), --attention-backend fa3, and reduced batch sizes are important for stability and performance. Adjust --tp-size for multi-GPU setups.

Query the model:

These are base models — use the /v1/completions endpoint, not chat completions:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.completions.create(
    model="/path/to/Sequential-Hidden-Decoding-8B-n2",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)
```

Patch Contents

The patch adds the qwen3_scale_seq model architecture and modifies the scheduler, batch manager, and CUDA graph runner to handle the expanded sequence length.


Citation

If you find this work useful, please cite our blog post:

```bibtex
@article{hidden_decoding_2026,
  title   = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
  year    = {2026},
  url     = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}
```

Contact

Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)

License

This project is released under the License Terms of Sequential-Hidden-Decoding. The dependent open-source models and software components remain licensed under their respective original licenses — see the LICENSE file for details.
