This tutorial walks through the complete workflow for quantizing a YOLO26n model for deployment on ESP32-P4 and ESP32-S3 using ESP-PPQ.
The pipeline leverages the YOLO26n architecture, optimized for embedded deployment, and applies a three-stage hardware-aware quantization process.
- NMS-Free Inference: the One2One detection head means no NMS post-processing is required on the MCU.
- RegMax=1: Eliminates the DFL layer entirely, reducing compute by ~30%.
- INT8 + INT16 Hybrid: Sensitive layers (box/class heads, neck exits) run in INT16; backbone in INT8.
- INT16 Step-Interpolated LUT for Swish: Swish activations are replaced by compact INT16 Look-Up Tables with configurable step interpolation, enabling hardware-accurate emulation on ESP32-P4/S3 (see the sketch after this list).
- Generic: Supports any input resolution (160–640) and any dataset (COCO, Roboflow, custom).
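As a concrete illustration of the LUT idea, here is a minimal numpy sketch (assumptions: symmetric INT16 quantization on both sides of the activation, and an illustrative function name; the repo's fusion pass may differ in rounding and clipping details):

```python
# Sketch: tabulate Swish at every `step`-th INT16 code so the activation
# becomes a table lookup on-device. `build_swish_lut` is illustrative;
# step=32 is the pipeline's default per the feature list above.
import numpy as np

def build_swish_lut(in_scale: float, out_scale: float, step: int = 32) -> np.ndarray:
    codes = np.arange(-32768, 32768, step, dtype=np.int64)
    x = codes * in_scale                          # dequantize the sampled codes
    y = x * 0.5 * (1.0 + np.tanh(x / 2.0))        # Swish = x * sigmoid(x), stable form
    q = np.round(y / out_scale)                   # requantize the result to INT16
    return np.clip(q, -32768, 32767).astype(np.int16)

lut = build_swish_lut(in_scale=0.004, out_scale=0.004, step=32)
print(lut.shape)  # (2048,) -- 65536 / 32 entries
```

Inputs that fall between sampled codes are recovered by interpolating between adjacent table entries, which is what "step interpolation" refers to.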
```bash
pip install -r requirements.txt
pip install roboflow  # only needed for the Roboflow notebook
```

Two notebooks cover the full pipeline:
quantize_yolo26_roboflow.ipynb
- Paste your Roboflow API key & dataset URL.
- Runs fine-tuning (4 epochs by default) on your custom dataset.
- Runs the full PTQ → TQT → LUT quantization pipeline.
- Exports the optimized `.espdl` model.
Supports any Roboflow dataset with any number of classes and any image resolution.
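For reference, the notebook's dataset cell presumably boils down to something like the following sketch using the public `roboflow` client (the API key, workspace, project, and version below are placeholders):

```python
# Sketch: pull a Roboflow dataset for fine-tuning. The slugs and version
# number are placeholders -- substitute your own from the dataset URL.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("your-project")
dataset = project.version(1).download("yolov8")  # yields a data.yaml the pipeline can consume

print(dataset.location)  # local folder containing data.yaml
```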
quantize_yolo26_coco.ipynb
Reproduce our official mAP benchmarks or build a generic 80-class object detector.
The pipeline runs sequentially across the following stages:
| Step | Name | Description |
|---|---|---|
| 1 | ONNX Export | Exports PyTorch weights with the RegMax=1 patch applied (removes DFL layer). |
| 2 | PTQ Calibration | Feeds calibration images through the graph to determine per-layer dynamic ranges (see the sketch after this table). |
| 3 | TQT (Trained Quantization Threshold) | Block-by-block scale optimization using a reconstruction loss; fast, with no full backpropagation required. |
| 4 | Passive + Alignment Passes | Derives bias/passive scales; aligns elementwise ops (Add, Concat) to a common quantization scale. |
| 5 | INT16 Step-Interpolated LUT Fusion | Converts INT16 Swish activations into compact Look-Up Tables (step size configurable, default=32) for hardware-accurate emulation on the ESP-DL accelerator. |
| 6 | Graph Surgery | Splits Concat output nodes into 6 discrete tensors (one2one_p3_box, one2one_p3_cls, …). |
| 7 | .espdl Export | Writes the final deployment model with LUT tables embedded. |
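To make stage 2 concrete, here is a minimal numpy sketch of range calibration (assumptions: symmetric per-tensor quantization and simple min/max tracking; ESP-PPQ's actual observers are more sophisticated):

```python
# Sketch of PTQ range calibration (stage 2): track a layer's dynamic range
# over calibration batches, then derive a symmetric quantization scale.
# This illustrates the idea, not esp-ppq's actual observer code.
import numpy as np

class MinMaxObserver:
    def __init__(self):
        self.max_abs = 0.0

    def observe(self, activations: np.ndarray) -> None:
        self.max_abs = max(self.max_abs, float(np.abs(activations).max()))

    def scale(self, bits: int = 8) -> float:
        qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 32767 for INT16
        return self.max_abs / qmax if self.max_abs else 1.0

# One observer per layer; feed every calibration image through the graph.
observer = MinMaxObserver()
for batch in [np.random.randn(1, 3, 512, 512) for _ in range(8)]:  # stand-in for real calibration images
    observer.observe(batch)                  # in practice: observe the layer's output activations
print("derived INT8 scale:", observer.scale(8))
```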
After graph surgery and before export, Cell 10.1 runs `eval_espdl_model` to visualize model predictions on a test image using bit-exact ESP-DL-emulated preprocessing. The annotated output is saved to `results/`.
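In practice, "bit-exact emulated preprocessing" means quantizing the input exactly as the device runtime will. A sketch of the idea follows; the resize method, normalization constants, and function name are assumptions here (check the repo's `espdl_preprocess` for the actual details):

```python
# Sketch: preprocess exactly as the device will, including the final
# quantization to the model's input scale, so host-side evaluation sees
# the same integer tensor the ESP-DL runtime would.
import numpy as np
from PIL import Image

def emulated_preprocess(path: str, img_sz: int, input_scale: float) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((img_sz, img_sz))
    x = np.asarray(img, dtype=np.float32) / 255.0        # assumed [0, 1] normalization
    x = x.transpose(2, 0, 1)[None]                       # NCHW layout
    q = np.clip(np.round(x / input_scale), -128, 127)    # quantize like the INT8 runtime
    return q.astype(np.int8)
```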
After running a notebook, the output is saved under:
```
output/
├── coco_512_s8_p4/
│   ├── yolo26n_512_s8_p4.espdl   ← FIRMWARE DEPLOYMENT MODEL
│   ├── yolo26n_512_s8_p4.info    ← Per-layer debug info (~15 MB)
│   ├── yolo26n_512_s8_p4.json    ← Quantization scales/config
│   └── yolo26n_export.onnx       ← Intermediate ONNX (pre-quantization)
└── lego_512_s8_p4/
    └── yolo26n_lego_512_s8_p4.espdl
```

Naming convention: `<model>_<img_sz>_s8_<platform>.*`
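If you script around the outputs, a hypothetical helper (not part of the repo) can split a filename back into the convention's fields:

```python
# Hypothetical helper: parse a <model>_<img_sz>_s8_<platform> filename stem.
def parse_espdl_name(stem: str) -> dict:
    parts = stem.split("_")
    return {
        "model": "_".join(parts[:-3]),   # model names may themselves contain "_"
        "img_sz": int(parts[-3]),
        "quant": parts[-2],              # always "s8" in this pipeline
        "platform": parts[-1],           # "p4" or "s3"
    }

assert parse_espdl_name("yolo26n_512_s8_p4") == {
    "model": "yolo26n", "img_sz": 512, "quant": "s8", "platform": "p4",
}
```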
Once you have your `.espdl` file:

1. Copy it to your firmware project:

   ```bash
   cp output/coco_512_s8_p4/yolo26n_512_s8_p4.espdl \
      ../../../../examples/yolo26_detect/main/models/p4/
   ```

2. Update CMake: edit `examples/yolo26_detect/main/CMakeLists.txt` to select the new model file.

3. Build & flash:

   ```bash
   idf.py build flash monitor
   ```
| File | Description |
|---|---|
| `export.py` | ONNX export with ESP-DL graph patches (RegMax=1, Attention static reshape, Detect head). |
| `dataset.py` | Calibration dataloader; works with COCO, Roboflow, or any `data.yaml` dataset. |
| `notebook_helpers.py` | Core helpers: `extract_model_meta`, `prepare_onnx`, `prune_graph_safely`, `espdl_preprocess`, `eval_espdl_model`. |
| `trainer.py` | `QATTrainer`; mAP evaluation using the quantized graph (emulates ESP-DL hardware). |
| `validator.py` | Validation loop utilities. |
| `utils.py` | `seed_everything`, `register_mod_op`, `get_exclusive_ancestors`. |
| `esp_ppq_patch.py` | Runtime patches for ESP-PPQ: `OnnxParser`, `Slice`, `Gather` backends. |
| `esp_ppq_patch_2.py` | `AddLUTPattern.export` patch for correct LUT step propagation. |
Provides `EspdlLUTFusionPass`, which converts INT16 Swish ops into compact step-interpolated Look-Up Tables (step size controlled via `INT16_LUT_STEP`, default 32), and `HardwareAwareEspdlExporter`, which writes the `.espdl` output with the LUT tables embedded.
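Why the step must propagate correctly to the exporter: both the table length and the index arithmetic depend on it, so a mismatched step misindexes every lookup. A sketch of that arithmetic (the actual `AddLUTPattern.export` logic may differ):

```python
# Sketch of the indexing arithmetic controlled by the LUT step.
INT16_LUT_STEP = 32                     # the pipeline's default, per the docs above

def lut_entries(step: int) -> int:
    return 65536 // step                # 2048 base entries for step=32

def lut_index(x_q: int, step: int) -> tuple:
    """Map an INT16 code to (table index, interpolation remainder)."""
    offset = x_q + 32768                # shift [-32768, 32767] to [0, 65535]
    return offset // step, offset % step

print(lut_entries(INT16_LUT_STEP))      # 2048
print(lut_index(-32768, 32))            # (0, 0)
print(lut_index(32767, 32))             # (2047, 31)
```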