MiniCPM5-1B is a standard `LlamaForCausalLM`, so it loads directly via `AutoModelForCausalLM`: no custom modeling code, no `trust_remote_code`.
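
One way to see why `trust_remote_code` is not needed is to look at a checkpoint's `config.json`: checkpoints that depend on custom modeling code carry an `auto_map` entry pointing at Python files in the repo, while stock checkpoints only name a built-in class under `architectures`. The sketch below assumes a config dict of that shape. The helper name `needs_remote_code` and both sample dicts are illustrative and are not copied from the MiniCPM5-1B repo.

```python
import json
from pathlib import Path


def needs_remote_code(config: dict) -> bool:
    """Return True if the config points at custom modeling code via `auto_map`."""
    return bool(config.get("auto_map"))


def load_config(path: str) -> dict:
    """Read a local config.json (for example, one downloaded from the Hub)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Illustrative configs, not taken from any real repository.
stock = {"architectures": ["LlamaForCausalLM"], "model_type": "llama"}
custom = {
    "architectures": ["MyCustomForCausalLM"],
    "auto_map": {"AutoModelForCausalLM": "modeling_custom.MyCustomForCausalLM"},
}

print(needs_remote_code(stock))   # False
print(needs_remote_code(custom))  # True
```

For a downloaded checkpoint, `needs_remote_code(load_config("config.json"))` runs the same check against the real file.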

    pip install -U "transformers>=5.6,<6" "torch>=2.11" accelerate  # latest (CUDA 13.x driver hosts)
    # pip install -U "transformers==4.57.3" "torch==2.7.1" accelerate  # fallback for CUDA 12.x driver hosts

Load the model in bf16 and generate through the chat template:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "openbmb/MiniCPM5-1B"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    ).eval()

    messages = [{"role": "user", "content": "Explain what GQA is in one sentence."}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=True,  # set False for fast / no-think mode
        return_tensors="pt",
        return_dict=True,  # explicit: input_ids + attention_mask as a BatchEncoding
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1024,
            do_sample=True,
            temperature=0.9,
            top_p=0.95,
        )
    # Decode only the newly generated tokens, not the prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))

The whole 1.08B-parameter model takes under 4.5 GB in fp32 (1.08B × 4 bytes ≈ 4.3 GB), so it can run on CPU alone (laptops, CI machines, no-GPU sanity checks):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_path = "openbmb/MiniCPM5-1B"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float32,  # bf16 also works on AVX-512 BF16 / AMX hosts
        device_map="cpu",
    ).eval()

    messages = [{"role": "user", "content": "Explain what GQA is in one sentence."}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=False,  # no-think mode is recommended for CPU latency
        return_tensors="pt",
        return_dict=True,  # explicit: input_ids + attention_mask as a BatchEncoding
    )

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
        )
    # Decode only the newly generated tokens, not the prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))

Recommended sampling settings per mode:

| Mode | `enable_thinking` | `temperature` | `top_p` | When to use |
|---|---|---|---|---|
| Think | `True` | 0.9 | 0.95 | hard reasoning, math, code, multi-step |
| No-think | `False` | 0.7 | 0.95 | fast assistant, latency-bound |

`generation_config.json` is tuned for think mode by default, so pass the no-think values explicitly when `enable_thinking=False`.
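
To keep the template flag and the sampling values from drifting apart, the two modes in the table can be collected in one small helper. This is a minimal sketch: the function name `sampling_kwargs` is hypothetical, and the values are just the ones listed in the table.

```python
def sampling_kwargs(think: bool) -> dict:
    """Return generate() sampling arguments for think or no-think mode."""
    return {
        "do_sample": True,
        "temperature": 0.9 if think else 0.7,
        "top_p": 0.95,
    }


think = False
# The same flag drives the chat template and the sampling settings.
chat_template_kwargs = {"add_generation_prompt": True, "enable_thinking": think}

print(sampling_kwargs(think))
# {'do_sample': True, 'temperature': 0.7, 'top_p': 0.95}
```

The result can be unpacked straight into generation, for example `model.generate(**inputs, max_new_tokens=512, **sampling_kwargs(think))`.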
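
LoRA adapters attach through PEFT (`pip install peft`):

    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        "openbmb/MiniCPM5-1B",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    # "<your_lora_dir>" is a placeholder for the directory holding your adapter.
    model = PeftModel.from_pretrained(base, "<your_lora_dir>").eval()

Adapters trained against this base load directly with no surgery.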
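
Before loading, it can help to confirm that an adapter was in fact trained against this base. PEFT writes an `adapter_config.json` with a `base_model_name_or_path` field, so a rough check is to compare that field with the model ID. This is a sketch under assumptions: the helper names are hypothetical, the temporary directory stands in for a real adapter directory, and an adapter trained from a local copy of the weights will record a filesystem path instead of the Hub ID.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def adapter_base(adapter_dir: str) -> Optional[str]:
    """Return the base model recorded in a PEFT adapter directory, if any."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("base_model_name_or_path")


def check_adapter(adapter_dir: str, expected_base: str = "openbmb/MiniCPM5-1B") -> bool:
    """True when the adapter was saved against the expected base model ID."""
    return adapter_base(adapter_dir) == expected_base


# Demo with a stand-in adapter directory (not a real trained adapter).
with tempfile.TemporaryDirectory() as tmp:
    fake_config = {"peft_type": "LORA", "base_model_name_or_path": "openbmb/MiniCPM5-1B"}
    (Path(tmp) / "adapter_config.json").write_text(json.dumps(fake_config), encoding="utf-8")
    matches = check_adapter(tmp)
    mismatch = check_adapter(tmp, expected_base="some-org/other-model")

print(matches)   # True
print(mismatch)  # False
```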