Minimal CUDA inference engine for LLMs on consumer GPUs.
Built for OpenCLAW — robotics inference on a single card.
```bash
pip install -e .

# Run inference (BF16, greedy decode)
python main.py \
  --model-path ~/models/Llama-3.2-1B \
  --model-type llama-3.2-1b \
  --prompt "Hello, world!"
```

Dense:
| Model | Params | VRAM (BF16) |
|---|---|---|
| Llama 3.2 1B | 1.2B | ~2.4 GB |
| Llama 3.2 3B | 3.2B | ~6.4 GB |
| Qwen3 0.6B | 0.6B | ~1.2 GB |
| Qwen3 1.7B | 1.7B | ~3.4 GB |
| Qwen3 4B | 4B | ~8 GB |
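The VRAM column follows the usual rule of thumb of 2 bytes per parameter for BF16 weights (activations and KV cache excluded). A tiny helper, hypothetical and not part of picolm, reproduces the table's estimates:

```python
def bf16_vram_gb(params_billion: float) -> float:
    """Rough BF16 weight footprint in GB: 2 bytes per parameter.

    Excludes activations and KV cache, so treat it as a lower bound.
    """
    return params_billion * 2


# Llama 3.2 1B (1.2B params) -> ~2.4 GB, matching the table
print(bf16_vram_gb(1.2))
```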

Precision:
| Precision | GPU | Status |
|---|---|---|
| BF16 | sm_80+ | MVP |
| FP8 (E4M3) | sm_89+ (RTX 40x0) | planned |
| FP4 (E2M1) | sm_120 (RTX 50x0) | planned |
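The sm_80/89/120 thresholds in the table translate directly into a dispatch check. A minimal sketch (hypothetical helper, not picolm's actual API) maps a CUDA compute capability to the precisions above:

```python
def supported_precisions(major: int, minor: int) -> list[str]:
    """Map a CUDA compute capability (e.g. 8, 9 for sm_89) to the
    precision tiers in the table above. FP8/FP4 are planned, not shipped."""
    sm = major * 10 + minor
    precisions = []
    if sm >= 80:
        precisions.append("bf16")
    if sm >= 89:
        precisions.append("fp8_e4m3")  # planned
    if sm >= 120:
        precisions.append("fp4_e2m1")  # planned
    return precisions


# RTX 30x0 (sm_86): BF16 only
print(supported_precisions(8, 6))
```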
No quantization pipeline — loads pre-quantized checkpoints from HuggingFace (NVFP4 format).
Python + CUDA hybrid — Python controls flow, CUDA does compute:

```
picolm/
├── model.py            # ModelConfig + Transformer forward pass
├── weights.py          # safetensors loader + HF weight mapping
├── tokenizer.py        # HuggingFace AutoTokenizer wrapper
├── generate.py         # generation loop (greedy/sampling)
└── kernels/
    ├── gemm_sm12x.cuh  # TMA warp-specialized GEMM (from PTX-Forge)
    ├── gemm_sm8x.cuh   # cooperative ldgsts GEMM (from PTX-Forge)
    └── gemm_api.cuh    # GEMM dispatch API
main.py                 # CLI entry point
```
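The control flow Python owns is easiest to see in the decode loop. This is a minimal greedy-decode sketch, not generate.py's actual code: `forward` stands in for the CUDA Transformer forward pass, and all names are illustrative.

```python
def greedy_decode(forward, prompt_ids, eos_id, max_new_tokens=32):
    """Greedy decoding: each step, append the argmax of the final-position
    logits and stop at EOS.

    `forward(ids)` stands in for the Transformer forward pass (the CUDA
    side); it returns one row of logits per token position.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        last_logits = forward(ids)[-1]          # logits for the next token
        next_id = max(range(len(last_logits)), key=last_logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

Sampling (temperature / top-k) would replace the `max` with a draw from the softmaxed logits; the surrounding loop is unchanged.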
Non-goals:
- General-purpose serving (use vLLM/ollama for that)
- Quantization (use TensorRT Model Optimizer, then load the checkpoint here)
- Multi-GPU / tensor parallelism
License: MIT