A collection of Claude Code skills for deploying and invoking FastDeploy – the PaddlePaddle-based LLM/VLM inference and deployment toolkit.
FastDeploy (v2.4) is an inference and deployment toolkit for large language models (LLM) and visual language models (VLM) based on PaddlePaddle. It provides production-ready, out-of-the-box deployment with:
- PD Disaggregated Deployment – Separate Prefill/Decode for higher throughput
- Unified KV Cache Transmission – High-performance NVLink/RDMA transport
- OpenAI API Compatible – One-command deployment, vLLM-compatible interface
- Comprehensive Quantization – W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more
- Advanced Acceleration – Speculative decoding (MTP/Ngram/Suffix), Chunked Prefill
- Multi-Hardware – NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi
This repository provides modular, reusable Claude Code agent skills for installing, deploying, and operating FastDeploy services. Each skill is a self-contained directory whose SKILL.md documents and automates a specific task.
Important: FastDeploy packages are NOT available on PyPI. All skills use the official PaddlePaddle package index for installation.
- OS: Linux (x86_64)
- Python: 3.10 – 3.12
```
fastdeploy-skills/
├── skills/
│   ├── fastdeploy-deploy-simple/      # Local pip installation + server deployment
│   │   ├── SKILL.md                   # Skill documentation
│   │   └── scripts/
│   │       └── quickstart.sh          # Install, start, test, stop helper script
│   ├── fastdeploy-deploy-docker/      # Docker-based deployment
│   │   └── SKILL.md
│   ├── fastdeploy-offline-inference/  # Offline batch inference with LLM Python API
│   │   └── SKILL.md
│   └── fastdeploy-advanced-features/  # Quantization, PD disaggregation, speculative decoding, Router
│       └── SKILL.md
└── README.md
```
Install FastDeploy and deploy an OpenAI-compatible server locally.
Features:
- Auto-detect hardware (NVIDIA CUDA SM80/90, SM86/89, Kunlunxin XPU, Hygon DCU, CPU)
- Install PaddlePaddle + FastDeploy from the official package index
- Start/stop/restart the server via `scripts/quickstart.sh`
- Test the `/v1/chat/completions` and `/v1/models` endpoints
- Virtual environment support
Quick Start:
```bash
# Clone and copy the skill
git clone https://github.com/PaddlePaddle/fastdeploy-skills.git
cp -r fastdeploy-skills/skills/fastdeploy-deploy-simple ~/.claude/skills/

# Use in Claude Code
/fastdeploy-deploy-simple
```

Or with natural language:

Install FastDeploy and start a server with ERNIE-4.5-0.3B-Paddle on port 8180
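The endpoints the skill tests follow the OpenAI chat format. A minimal sketch of the request body sent to `/v1/chat/completions` (the model name and port come from the example above; the helper function itself is illustrative, not part of the skill):

```python
import json

def build_chat_request(model, user_message, max_tokens=100):
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("baidu/ERNIE-4.5-0.3B-Paddle", "Hello!")
# This JSON would be POSTed to http://localhost:8180/v1/chat/completions
print(json.dumps(payload, indent=2))
```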
Deploy FastDeploy using the official Docker images or build from source.
Features:
- Pull and run `ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0`
- Kunlunxin XPU Docker image support
- Multi-GPU tensor parallel deployment
- Docker Compose example
- Build from source with `dockerfiles/Dockerfile.gpu`
- Custom Dockerfile with optional dependencies
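The Docker Compose option listed above might look roughly like this (an illustrative sketch, not the file shipped with the skill; the image tag, port, and command mirror the quick start):

```yaml
services:
  fastdeploy:
    image: ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0
    command: >
      bash -c "export ENABLE_V1_KVCACHE_SCHEDULER=1 &&
      python -m fastdeploy.entrypoints.openai.api_server
      --model baidu/ERNIE-4.5-0.3B-Paddle --port 8180"
    ports:
      - "8180:8180"
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```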
Quick Start:
```bash
docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0
docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8180:8180 --ipc=host \
  ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0 \
  bash -c "export ENABLE_V1_KVCACHE_SCHEDULER=1 && \
    python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle --port 8180"
```

Offline batch inference using the FastDeploy LLM Python API – no HTTP server required.
Features:
- `LLM.chat` for chat models (recommended)
- `LLM.generate` for base/completion models
- Multimodal inference (images, video) with VLM models
- Reasoning/thinking models (ERNIE-4.5-VL-Thinking, ERNIE-4.5-21B-A3B-Thinking)
- `SamplingParams` reference
- Output metrics (latency, token counts)
Quick Start:
```python
from fastdeploy import LLM, SamplingParams

llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=8192)
outputs = llm.chat(
    messages=[[{"role": "user", "content": "Hello!"}]],
    sampling_params=SamplingParams(max_tokens=100)
)
print(outputs[0].outputs.text)
```

Advanced production features for FastDeploy v2.4.
Features:
- Quantization: WINT4, WINT8, Block-wise FP8, MixQuant
- Prefix Caching: GPU + CPU (swap-space) caching
- Speculative Decoding: MTP, Ngram, Suffix Decoding, Hybrid MTP+Ngram
- Chunked Prefill: Dynamic chunking for long inputs
- PD Disaggregated Deployment: Router-based Prefill/Decode separation
- Load-Balancing Router (`fd-router`): Golang router with multiple scheduling policies
- API Authentication: `--api-key` and `FD_API_KEY`
- Structured Output: JSON mode with guided decoding
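As an illustration of the last two items, an authenticated JSON-mode request might be assembled like this (a hedged sketch: the `Authorization: Bearer` header follows the OpenAI convention, and `response_format` is the common OpenAI-compatible JSON-mode field; check the skill's SKILL.md for the exact fields FastDeploy expects):

```python
import json

API_KEY = "my-secret-key"  # must match the server's --api-key / FD_API_KEY

headers = {
    "Authorization": f"Bearer {API_KEY}",  # OpenAI-style bearer auth (assumption)
    "Content-Type": "application/json",
}

body = {
    "model": "baidu/ERNIE-4.5-0.3B-Paddle",
    "messages": [{"role": "user", "content": "Return a JSON object with keys name and age."}],
    # Common JSON-mode switch; FastDeploy's guided-decoding options may add more fields.
    "response_format": {"type": "json_object"},
}

print(headers["Authorization"])
print(json.dumps(body)[:60])
```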
```bash
# Copy all skills to the global Claude Code skills directory
cp -r skills/fastdeploy-deploy-simple ~/.claude/skills/
cp -r skills/fastdeploy-deploy-docker ~/.claude/skills/
cp -r skills/fastdeploy-offline-inference ~/.claude/skills/
cp -r skills/fastdeploy-advanced-features ~/.claude/skills/

# Copy to a project-level skills directory
mkdir -p .claude/skills/
cp -r skills/fastdeploy-deploy-simple .claude/skills/
```

Then invoke the skill in Claude Code:

```
/fastdeploy-deploy-simple
```
Or with natural language:
Deploy FastDeploy with ERNIE-4.5-0.3B-Paddle on port 8180
Run offline batch inference using FastDeploy on a list of prompts
Configure PD disaggregated deployment with the FastDeploy Router
| Model | Precisions | Notes |
|---|---|---|
| `baidu/ERNIE-4.5-0.3B-Paddle` | BF16 | Lightweight, quick start |
| `baidu/ERNIE-4.5-21B-A3B-Paddle` | BF16, WINT4, WINT8 | Mid-size MoE |
| `baidu/ERNIE-4.5-21B-A3B-Thinking` | BF16 | Reasoning model |
| `baidu/ERNIE-4.5-300B-A47B-Paddle` | BF16, WINT4, WINT8, FP8 | Large MoE |
| `Qwen/qwen3-8B` | BF16, WINT8, FP8 | |
| `Qwen/Qwen3-30B-A3B` | BF16, WINT4, FP8 | MoE |
| `Qwen/qwen2.5-7B` | BF16, WINT8, FP8 | |
| `unsloth/DeepSeek-V3-0324-BF16` | BF16, WINT4 | |
| `zai-org/GLM-4.5-Air` | BF16, wfp8afp8 | |
| Model | Precisions | Notes |
|---|---|---|
| `baidu/ERNIE-4.5-VL-28B-A3B-Paddle` | BF16, WINT4, WINT8 | |
| `baidu/ERNIE-4.5-VL-28B-A3B-Thinking` | BF16 | Reasoning + Vision |
| `baidu/ERNIE-4.5-VL-424B-A47B-Paddle` | BF16, WINT4, WINT8 | Large VLM |
| `PaddlePaddle/PaddleOCR-VL` | BF16, WINT4, WINT8 | OCR-specialized |
| `Qwen/Qwen2.5-VL-7B-Instruct` | BF16, WINT4, FP8 | |
Models auto-download from AIStudio (default), ModelScope, or HuggingFace. See the supported models list for full coverage.
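For scripts that wrap FastDeploy, it can help to validate these settings before launching. A hedged sketch using the `FD_MODEL_SOURCE` and `FD_MODEL_CACHE` variables documented here (the helper and the fallback cache path are assumptions, not FastDeploy APIs):

```python
import os

# Allowed values documented for FD_MODEL_SOURCE; AISTUDIO is the default.
VALID_SOURCES = {"AISTUDIO", "MODELSCOPE", "HUGGINGFACE"}

def resolve_model_source(env):
    """Return (source, cache_dir), falling back to documented defaults."""
    source = env.get("FD_MODEL_SOURCE", "AISTUDIO").upper()
    if source not in VALID_SOURCES:
        raise ValueError(f"FD_MODEL_SOURCE must be one of {sorted(VALID_SOURCES)}, got {source!r}")
    # Fallback cache location is a guess for this sketch, not a FastDeploy default.
    cache = env.get("FD_MODEL_CACHE", os.path.expanduser("~/.cache/fastdeploy"))
    return source, cache

print(resolve_model_source({"FD_MODEL_SOURCE": "ModelScope", "FD_MODEL_CACHE": "/ssd1/models"}))
```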
```bash
# Configure download source and cache
export FD_MODEL_SOURCE=AISTUDIO   # AISTUDIO | MODELSCOPE | HUGGINGFACE
export FD_MODEL_CACHE=/ssd1/models
```

When adding new skills:
- Create a directory under `skills/` (e.g., `skills/your-skill/`)
- Add a `SKILL.md` with YAML frontmatter:

  ```yaml
  ---
  name: your-skill
  description: Brief description of what this skill does
  ---
  ```

- Add optional `scripts/`, `references/`, and `assets/` directories
- Update this README with your skill documentation
Licensed under the Apache License 2.0. See LICENSE.