
FastDeploy Skills

A collection of Claude Code skills for deploying and invoking FastDeploy — the PaddlePaddle-based LLM/VLM inference and deployment toolkit.

Overview

FastDeploy (v2.4) is an inference and deployment toolkit for large language models (LLM) and visual language models (VLM) based on PaddlePaddle. It provides production-ready, out-of-the-box deployment with:

  • 🚀 PD Disaggregated Deployment — Separate Prefill/Decode for higher throughput
  • 🔄 Unified KV Cache Transmission — High-performance NVLink/RDMA transport
  • 🤝 OpenAI API Compatible — One-command deployment, vLLM interface compatible
  • 🧮 Comprehensive Quantization — W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more
  • ⏩ Advanced Acceleration — Speculative decoding (MTP/Ngram/Suffix), Chunked Prefill
  • 🖥️ Multi-Hardware — NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi

This repository provides modular, reusable Claude Code agent skills for installing, deploying, and operating FastDeploy services. Each skill is a self-contained directory whose SKILL.md documents and automates one specific task.

Important: FastDeploy packages are NOT available on PyPI. All skills use the official PaddlePaddle package index for installation.

Requirements

  • OS: Linux (X86_64)
  • Python: 3.10 – 3.12

Project Structure

fastdeploy-skills/
├── skills/
│   ├── fastdeploy-deploy-simple/        # Local pip installation + server deployment
│   │   ├── SKILL.md                     # Skill documentation
│   │   └── scripts/
│   │       └── quickstart.sh            # Install, start, test, stop helper script
│   ├── fastdeploy-deploy-docker/        # Docker-based deployment
│   │   └── SKILL.md
│   ├── fastdeploy-offline-inference/    # Offline batch inference with LLM Python API
│   │   └── SKILL.md
│   └── fastdeploy-advanced-features/    # Quantization, PD disaggregation, speculative decoding, Router
│       └── SKILL.md
└── README.md

Skills

fastdeploy-deploy-simple

Install FastDeploy and deploy an OpenAI-compatible server locally.

Features:

  • Auto-detect hardware (NVIDIA CUDA SM80/90, SM86/89, Kunlunxin XPU, Hygon DCU, CPU)
  • Install PaddlePaddle + FastDeploy from the official package index
  • Start/stop/restart server via scripts/quickstart.sh
  • Test /v1/chat/completions and /v1/models endpoints
  • Virtual environment support
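
The endpoint tests in the bullets above reduce to an ordinary OpenAI-style HTTP request. A minimal Python sketch, assuming a server is already listening locally — the model name and port are the quick-start defaults from this README, not requirements:

```python
import json

# Base URL of the locally deployed server (quick-start default port).
BASE_URL = "http://localhost:8180/v1"

# Standard OpenAI-style chat completion request body.
payload = {
    "model": "baidu/ERNIE-4.5-0.3B-Paddle",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

def chat_completion(base_url: str = BASE_URL) -> dict:
    """POST the payload to /chat/completions and return the parsed JSON reply."""
    import urllib.request
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a server running, the assistant's reply is at:
#   chat_completion()["choices"][0]["message"]["content"]
```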

Quick Start:

# Clone and copy the skill
git clone https://github.com/PaddlePaddle/fastdeploy-skills.git
cp -r fastdeploy-skills/skills/fastdeploy-deploy-simple ~/.claude/skills/

# Use in Claude Code
/fastdeploy-deploy-simple

Or with natural language:

Install FastDeploy and start a server with ERNIE-4.5-0.3B-Paddle on port 8180

fastdeploy-deploy-docker

Deploy FastDeploy using the official Docker images or build from source.

Features:

  • Pull and run ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0
  • Kunlunxin XPU Docker image support
  • Multi-GPU tensor parallel deployment
  • Docker Compose example
  • Build from source with dockerfiles/Dockerfile.gpu
  • Custom Dockerfile with optional dependencies

Quick Start:

docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0

docker run --rm --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8180:8180 --ipc=host \
  ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.4.0 \
  bash -c "export ENABLE_V1_KVCACHE_SCHEDULER=1 && \
    python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-0.3B-Paddle --port 8180"

fastdeploy-offline-inference

Offline batch inference using the FastDeploy LLM Python API — no HTTP server required.

Features:

  • LLM.chat for chat models (recommended)
  • LLM.generate for base/completion models
  • Multimodal inference (images, video) with VLM models
  • Reasoning/thinking models (ERNIE-4.5-VL-Thinking, ERNIE-4.5-21B-A3B-Thinking)
  • SamplingParams reference
  • Output metrics (latency, token counts)

Quick Start:

from fastdeploy import LLM, SamplingParams

llm = LLM(model="baidu/ERNIE-4.5-0.3B-Paddle", max_model_len=8192)
outputs = llm.chat(
    messages=[[{"role": "user", "content": "Hello!"}]],
    sampling_params=SamplingParams(max_tokens=100)
)
print(outputs[0].outputs.text)
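
The SamplingParams object above accepts the usual decoding controls. A hedged sketch of commonly used arguments — these names follow the vLLM-style convention that FastDeploy mirrors; verify them against your installed version's SamplingParams signature:

```python
# Typical sampling arguments (vLLM-style names; check your installed
# FastDeploy version's SamplingParams signature before relying on them).
sampling_kwargs = {
    "temperature": 0.7,  # randomness of token sampling
    "top_p": 0.9,        # nucleus-sampling probability cutoff
    "max_tokens": 256,   # cap on generated tokens
}

# Usage: SamplingParams(**sampling_kwargs)
```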

fastdeploy-advanced-features

Advanced production features for FastDeploy v2.4.

Features:

  • Quantization: WINT4, WINT8, Block-wise FP8, MixQuant
  • Prefix Caching: GPU + CPU (swap-space) caching
  • Speculative Decoding: MTP, Ngram, Suffix Decoding, Hybrid MTP+Ngram
  • Chunked Prefill: Dynamic chunking for long inputs
  • PD Disaggregated Deployment: Router-based Prefill/Decode separation
  • Load-Balancing Router (fd-router): Golang router with multiple scheduling policies
  • API Authentication: --api-key and FD_API_KEY
  • Structured Output: JSON mode with guided decoding

Using Skills with Claude Code

Install globally

# Copy all skills to global Claude Code skills directory
cp -r skills/fastdeploy-deploy-simple ~/.claude/skills/
cp -r skills/fastdeploy-deploy-docker ~/.claude/skills/
cp -r skills/fastdeploy-offline-inference ~/.claude/skills/
cp -r skills/fastdeploy-advanced-features ~/.claude/skills/

Install per-project

# Copy to project-level skills directory
mkdir -p .claude/skills/
cp -r skills/fastdeploy-deploy-simple .claude/skills/

Use in Claude Code

/fastdeploy-deploy-simple

Or with natural language:

Deploy FastDeploy with ERNIE-4.5-0.3B-Paddle on port 8180
Run offline batch inference using FastDeploy on a list of prompts
Configure PD disaggregated deployment with the FastDeploy Router

Supported Models

Large Language Models (LLM)

| Model | Precisions | Notes |
| --- | --- | --- |
| baidu/ERNIE-4.5-0.3B-Paddle | BF16 | Lightweight, quick start |
| baidu/ERNIE-4.5-21B-A3B-Paddle | BF16, WINT4, WINT8 | Mid-size MoE |
| baidu/ERNIE-4.5-21B-A3B-Thinking | BF16 | Reasoning model |
| baidu/ERNIE-4.5-300B-A47B-Paddle | BF16, WINT4, WINT8, FP8 | Large MoE |
| Qwen/qwen3-8B | BF16, WINT8, FP8 | |
| Qwen/Qwen3-30B-A3B | BF16, WINT4, FP8 | MoE |
| Qwen/qwen2.5-7B | BF16, WINT8, FP8 | |
| unsloth/DeepSeek-V3-0324-BF16 | BF16, WINT4 | |
| zai-org/GLM-4.5-Air | BF16, wfp8afp8 | |

Multimodal Language Models (VLM)

| Model | Precisions | Notes |
| --- | --- | --- |
| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | BF16, WINT4, WINT8 | |
| baidu/ERNIE-4.5-VL-28B-A3B-Thinking | BF16 | Reasoning + Vision |
| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | BF16, WINT4, WINT8 | Large VLM |
| PaddlePaddle/PaddleOCR-VL | BF16, WINT4, WINT8 | OCR-specialized |
| Qwen/Qwen2.5-VL-7B-Instruct | BF16, WINT4, FP8 | |

Models auto-download from AIStudio (default), ModelScope, or HuggingFace. See the supported models documentation for the full list.

# Configure download source and cache
export FD_MODEL_SOURCE=AISTUDIO   # AISTUDIO | MODELSCOPE | HUGGINGFACE
export FD_MODEL_CACHE=/ssd1/models

Contributing

When adding new skills:

  1. Create a directory under skills/ (e.g., skills/your-skill/)
  2. Add a SKILL.md with YAML frontmatter:
    ---
    name: your-skill
    description: Brief description of what this skill does
    ---
  3. Add optional scripts/, references/, and assets/ directories
  4. Update this README with your skill documentation

License

Licensed under the Apache License 2.0. See LICENSE.

Resources