# PicoLM

Minimal CUDA inference engine for LLMs on consumer GPUs.

Built for OpenCLAW — robotics inference on a single card.

## Quick Start

```shell
pip install -e .

# Run inference (BF16, greedy decode)
python main.py \
  --model-path ~/models/Llama-3.2-1B \
  --model-type llama-3.2-1b \
  --prompt "Hello, world!"
```

## Supported Models

Dense:

| Model        | Params | VRAM (BF16) |
|--------------|--------|-------------|
| Llama 3.2 1B | 1.2B   | ~2.4 GB     |
| Llama 3.2 3B | 3.2B   | ~6.4 GB     |
| Qwen3 0.6B   | 0.6B   | ~1.2 GB     |
| Qwen3 1.7B   | 1.7B   | ~3.4 GB     |
| Qwen3 4B     | 4B     | ~8 GB       |
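The VRAM column above follows from BF16 using 2 bytes per parameter (activations and KV cache add more on top). A minimal sanity-check sketch; the helper function is illustrative, not part of PicoLM:

```python
def bf16_weight_vram_gb(n_params_billion: float) -> float:
    """Rough BF16 weight footprint: 2 bytes per parameter, in decimal GB."""
    bytes_total = n_params_billion * 1e9 * 2
    return bytes_total / 1e9

# Reproduces the table's rule of thumb:
print(bf16_weight_vram_gb(1.2))  # Llama 3.2 1B -> 2.4
print(bf16_weight_vram_gb(3.2))  # Llama 3.2 3B -> 6.4
```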

## Precision Roadmap

| Precision  | GPU               | Status  |
|------------|-------------------|---------|
| BF16       | sm_80+            | MVP     |
| FP8 (E4M3) | sm_89+ (RTX 40x0) | planned |
| FP4 (E2M1) | sm_120 (RTX 50x0) | planned |

There is no quantization pipeline: PicoLM loads pre-quantized NVFP4 checkpoints directly from HuggingFace.
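The roadmap above ties each precision to a minimum compute capability. A hypothetical dispatch helper (not PicoLM's actual API) that picks the lowest-bit precision a given sm version supports could look like:

```python
# Minimum compute capability per precision, per the roadmap table.
# Ordered from most to least aggressive precision.
PRECISION_MIN_SM = [
    ("fp4_e2m1", 120),  # RTX 50x0 (Blackwell)
    ("fp8_e4m3", 89),   # RTX 40x0 (Ada)
    ("bf16", 80),       # Ampere and newer
]

def best_precision(sm: int) -> str:
    """Return the lowest-bit precision this GPU can run."""
    for name, min_sm in PRECISION_MIN_SM:
        if sm >= min_sm:
            return name
    raise ValueError(f"sm_{sm} is below the sm_80 minimum")

print(best_precision(86))   # RTX 30x0 -> bf16
print(best_precision(89))   # RTX 40x0 -> fp8_e4m3
print(best_precision(120))  # RTX 50x0 -> fp4_e2m1
```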

## Architecture

Python + CUDA hybrid: Python controls flow, CUDA does compute.

```
picolm/
├── model.py          # ModelConfig + Transformer forward pass
├── weights.py        # safetensors loader + HF weight mapping
├── tokenizer.py      # HuggingFace AutoTokenizer wrapper
├── generate.py       # generation loop (greedy/sampling)
└── kernels/
    ├── gemm_sm12x.cuh  # TMA warp-specialized GEMM (from PTX-Forge)
    ├── gemm_sm8x.cuh   # cooperative ldgsts GEMM (from PTX-Forge)
    └── gemm_api.cuh    # GEMM dispatch API
main.py               # CLI entry point
```
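`generate.py` owns the generation loop. The greedy path can be sketched as below; the `step` callable and the toy model are stand-ins, not PicoLM's real interfaces:

```python
from typing import Callable, List

def greedy_decode(
    step: Callable[[List[int]], List[float]],  # token ids -> next-token logits
    prompt_ids: List[int],
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Greedy decoding: repeatedly append the argmax token until EOS."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = step(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy "model": always predicts (last token + 1) mod vocab size.
def toy_step(ids: List[int]) -> List[float]:
    vocab = 8
    logits = [0.0] * vocab
    logits[(ids[-1] + 1) % vocab] = 1.0
    return logits

print(greedy_decode(toy_step, [0], eos_id=3, max_new_tokens=10))
# -> [0, 1, 2, 3] (stops at EOS)
```

In the real engine the argmax runs on the GPU and `step` wraps the Transformer forward pass, but the control flow is the same.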

## Non-Goals

- General-purpose serving (use vLLM/ollama for that)
- Quantization (use TensorRT Model Optimizer, then load the checkpoint here)
- Multi-GPU / tensor parallelism

## License

MIT
