A collection of Claude Code skills for deploying and invoking FastDeploy – the PaddlePaddle-based LLM/VLM inference and deployment toolkit.
FastDeploy (v2.4) is an inference and deployment toolkit for large language models (LLM) and visual language models (VLM) based on PaddlePaddle. It provides production-ready, out-of-the-box deployment with:
- PD Disaggregated Deployment – Separate Prefill/Decode for higher throughput
- Unified KV Cache Transmission – High-performance NVLink/RDMA transport
- OpenAI API Compatible – One-command deployment, vLLM-compatible interface
- Comprehensive Quantization – W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more
- Advanced Acceleration – Speculative decoding (MTP/Ngram/Suffix), Chunked Prefill
- Multi-Hardware – NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi
This repository provides modular, reusable Claude Code agent skills for installing, deploying, and operating FastDeploy services. Each skill is a self-contained directory whose SKILL.md documents and automates a specific task.
Important: FastDeploy packages are NOT available on PyPI. All skills use the official PaddlePaddle package index for installation.
- OS: Linux (x86_64)
- Python: 3.10 – 3.12
```
fastdeploy-skills/
├── skills/
│   ├── fastdeploy-deploy-simple/      # Local pip installation + server deployment
│   │   ├── SKILL.md                   # Skill documentation
│   │   └── scripts/
│   │       └── quickstart.sh          # Install, start, test, stop helper script
│   ├── fastdeploy-deploy-docker/      # Docker-based deployment
│   │   └── SKILL.md
│   ├── fastdeploy-offline-inference/  # Offline batch inference with LLM Python API
│   │   └── SKILL.md
│   └── fastdeploy-advanced-features/  # Quantization, PD disaggregation, speculative decoding, Router
│       └── SKILL.md
└── README.md
```
Install FastDeploy and deploy an OpenAI-compatible server locally.
Features:
- Auto-detect hardware (NVIDIA CUDA SM80/90, SM86/89, Kunlunxin XPU, Hygon DCU, CPU)
- Install PaddlePaddle + FastDeploy from the official package index
- Start/stop/restart the server via `scripts/quickstart.sh`
- Test the `/v1/chat/completions` and `/v1/models` endpoints
- Virtual environment support
Quick Start:
```bash
# Clone and copy the skill
git clone https://github.com/PaddlePaddle/fastdeploy-skills.git
cp -r fastdeploy-skills/skills/fastdeploy-deploy-simple ~/.claude/skills/

# Use in Claude Code
/fastdeploy-deploy-simple
```

Or with natural language:

Install FastDeploy and start a server with ERNIE-4.5-0.3B-Paddle on port 8180
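The endpoints the skill tests follow the OpenAI chat format. A minimal sketch of the request body sent to `/v1/chat/completions` (the model name and port come from the example above; the helper function itself is illustrative, not part of the skill):

```python
import json

def build_chat_request(model, user_message, max_tokens=100):
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("baidu/ERNIE-4.5-0.3B-Paddle", "Hello!")
# This JSON would be POSTed to http://localhost:8180/v1/chat/completions
print(json.dumps(payload, indent=2))
```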
Deploy FastDeploy using the official Docker images or build from source.
Features:
- Pull and run `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0`
- Kunlunxin XPU Docker image support
- Multi-GPU tensor parallel deployment
- Docker Compose example
- Build from source with `dockerfiles/Dockerfile.gpu`
- Custom Dockerfile with optional dependencies
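The Docker Compose option listed above might look roughly like this (an illustrative sketch, not the file shipped with the skill; the image tag, port, and command mirror the quick start):

```yaml
services:
  fastdeploy:
    image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0
    command: >
      bash -c "export ENABLE_V1_KVCACHE_SCHEDULER=1 &&
      python -m fastdeploy.entrypoints.openai.api_server
      --model baidu/ERNIE-4.5-0.3B-Paddle --port 8180"
    ports:
      - "8180:8180"
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```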
Quick Start:
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8180:8180 --ipc=host \
  ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0 \
  bash -c "export ENABLE_V1_KVCACHE_SCHEDULER=1 && \
    python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle --port 8180"
```

Offline batch inference using the FastDeploy LLM Python API – no HTTP server required.
Features:
- `LLM.chat` for chat models (recommended)
- `LLM.generate` for base/completion models
- Multimodal inference (images, video) with VLM models
- Reasoning/thinking models (ERNIE-4.5-VL-Thinking, ERNIE-4.5-21B-A3B-Thinking)
- `SamplingParams` reference
- Output metrics (latency, token counts)
Quick Start:
```python
from fastdeploy import LLM, SamplingParams

llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=8192)
outputs = llm.chat(
    messages=[[{"role": "user", "content": "Hello!"}]],
    sampling_params=SamplingParams(max_tokens=100)
)
print(outputs[0].outputs.text)
```

Advanced production features for FastDeploy v2.4.
Features:
- Quantization: WINT4, WINT8, Block-wise FP8, MixQuant
- Prefix Caching: GPU + CPU (swap-space) caching
- Speculative Decoding: MTP, Ngram, Suffix Decoding, Hybrid MTP+Ngram
- Chunked Prefill: Dynamic chunking for long inputs
- PD Disaggregated Deployment: Router-based Prefill/Decode separation
- Load-Balancing Router (`fd-router`): Golang router with multiple scheduling policies
- API Authentication: `--api-key` and `FD_API_KEY`
- Structured Output: JSON mode with guided decoding
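As an illustration of the last two items, an authenticated JSON-mode request might be assembled like this (a hedged sketch: the `Authorization: Bearer` header follows the OpenAI convention, and `response_format` is the common OpenAI-compatible JSON-mode field; check the skill's SKILL.md for the exact fields FastDeploy expects):

```python
import json

API_KEY = "my-secret-key"  # must match the server's --api-key / FD_API_KEY

headers = {
    "Authorization": f"Bearer {API_KEY}",  # OpenAI-style bearer auth (assumption)
    "Content-Type": "application/json",
}

body = {
    "model": "baidu/ERNIE-4.5-0.3B-Paddle",
    "messages": [{"role": "user", "content": "Return a JSON object with keys name and age."}],
    # Common JSON-mode switch; FastDeploy's guided-decoding options may add more fields.
    "response_format": {"type": "json_object"},
}

print(headers["Authorization"])
print(json.dumps(body)[:60])
```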
```bash
# Copy all skills to the global Claude Code skills directory
cp -r skills/fastdeploy-deploy-simple ~/.claude/skills/
cp -r skills/fastdeploy-deploy-docker ~/.claude/skills/
cp -r skills/fastdeploy-offline-inference ~/.claude/skills/
cp -r skills/fastdeploy-advanced-features ~/.claude/skills/

# Copy to a project-level skills directory
mkdir -p .claude/skills/
cp -r skills/fastdeploy-deploy-simple .claude/skills/
```

Then invoke the skill in Claude Code:

```
/fastdeploy-deploy-simple
```
Or with natural language:
Deploy FastDeploy with ERNIE-4.5-0.3B-Paddle on port 8180
Run offline batch inference using FastDeploy on a list of prompts
Configure PD disaggregated deployment with the FastDeploy Router
| Model | Precisions | Notes |
|---|---|---|
| `baidu/ERNIE-4.5-0.3B-Paddle` | BF16 | Lightweight, quick start |
| `baidu/ERNIE-4.5-21B-A3B-Paddle` | BF16, WINT4, WINT8 | Mid-size MoE |
| `baidu/ERNIE-4.5-21B-A3B-Thinking` | BF16 | Reasoning model |
| `baidu/ERNIE-4.5-300B-A47B-Paddle` | BF16, WINT4, WINT8, FP8 | Large MoE |
| `Qwen/qwen3-8B` | BF16, WINT8, FP8 | |
| `Qwen/Qwen3-30B-A3B` | BF16, WINT4, FP8 | MoE |
| `Qwen/qwen2.5-7B` | BF16, WINT8, FP8 | |
| `unsloth/DeepSeek-V3-0324-BF16` | BF16, WINT4 | |
| `zai-org/GLM-4.5-Air` | BF16, wfp8afp8 | |
| Model | Precisions | Notes |
|---|---|---|
| `baidu/ERNIE-4.5-VL-28B-A3B-Paddle` | BF16, WINT4, WINT8 | |
| `baidu/ERNIE-4.5-VL-28B-A3B-Thinking` | BF16 | Reasoning + Vision |
| `baidu/ERNIE-4.5-VL-424B-A47B-Paddle` | BF16, WINT4, WINT8 | Large VLM |
| `PaddlePaddle/PaddleOCR-VL` | BF16, WINT4, WINT8 | OCR-specialized |
| `Qwen/Qwen2.5-VL-7B-Instruct` | BF16, WINT4, FP8 | |
Models auto-download from AIStudio (default), ModelScope, or HuggingFace. See the supported models list for full coverage.
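For scripts that wrap FastDeploy, it can help to validate these settings before launching. A hedged sketch using the `FD_MODEL_SOURCE` and `FD_MODEL_CACHE` variables documented here (the helper and the fallback cache path are assumptions, not FastDeploy APIs):

```python
import os

# Allowed values documented for FD_MODEL_SOURCE; AISTUDIO is the default.
VALID_SOURCES = {"AISTUDIO", "MODELSCOPE", "HUGGINGFACE"}

def resolve_model_source(env):
    """Return (source, cache_dir), falling back to documented defaults."""
    source = env.get("FD_MODEL_SOURCE", "AISTUDIO").upper()
    if source not in VALID_SOURCES:
        raise ValueError(f"FD_MODEL_SOURCE must be one of {sorted(VALID_SOURCES)}, got {source!r}")
    # Fallback cache location is a guess for this sketch, not a FastDeploy default.
    cache = env.get("FD_MODEL_CACHE", os.path.expanduser("~/.cache/fastdeploy"))
    return source, cache

print(resolve_model_source({"FD_MODEL_SOURCE": "ModelScope", "FD_MODEL_CACHE": "/ssd1/models"}))
```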
```bash
# Configure download source and cache
export FD_MODEL_SOURCE=AISTUDIO   # AISTUDIO | MODELSCOPE | HUGGINGFACE
export FD_MODEL_CACHE=/ssd1/models
```

When adding new skills:
- Create a directory under `skills/` (e.g., `skills/your-skill/`)
- Add a `SKILL.md` with YAML frontmatter:

  ```yaml
  ---
  name: your-skill
  description: Brief description of what this skill does
  ---
  ```

- Add optional `scripts/`, `references/`, and `assets/` directories
- Update this README with your skill documentation
Licensed under the Apache License 2.0. See LICENSE.