A minimal and educational implementation of DETR, built using PyTorch. This project recreates the core ideas behind DETR for object detection in a simple and readable way, making it suitable for learning and experimentation.
Goals:
- Understand and reimplement the core ideas of DETR
- Explore transformer-based object detection
- Apply the model to Pascal VOC-style datasets
- Share a clean and working implementation for others to learn from
This project is not:
- A full-fledged implementation with all the bells and whistles
- A high-performance model for production use
- A replacement for the original DETR or other advanced object detectors
The model follows these steps:
- Use a ResNet backbone to extract features
- Project the features and add a 2D positional encoding
- Use transformer encoder-decoder with learnable queries
- Predict object class and bounding boxes
- Use Hungarian matching for bipartite target assignment
- Optimize classification and bounding box regression losses
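The Hungarian matching step above assigns each ground-truth object to exactly one query so that the total cost is minimal. The sketch below illustrates the idea in pure Python with a brute-force search over permutations, which is only practical for a handful of objects; it is not this repo's code. Real implementations typically call `scipy.optimize.linear_sum_assignment` instead. The function names, the cost weights, and the omission of a GIoU term are assumptions made for illustration.

```python
from itertools import permutations


def l1_box_cost(pred_box, gt_box):
    """L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


def match_queries_to_targets(pred_probs, pred_boxes, gt_labels, gt_boxes,
                             class_weight=1.0, box_weight=5.0):
    """Return the lowest-cost one-to-one assignment as (query_idx, target_idx) pairs.

    Assumes there are at least as many queries as targets.
    The weights are illustrative values, not taken from this repo.
    """
    num_queries, num_targets = len(pred_probs), len(gt_labels)
    # cost[q][t] is low when query q gives a high probability to target t's
    # class and predicts a box close to target t's box.
    cost = [
        [
            -class_weight * pred_probs[q][gt_labels[t]]
            + box_weight * l1_box_cost(pred_boxes[q], gt_boxes[t])
            for t in range(num_targets)
        ]
        for q in range(num_queries)
    ]
    best_total, best_pairs = float("inf"), []
    # Each permutation picks a distinct query for every target.
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if total < best_total:
            best_total = total
            best_pairs = [(q, t) for t, q in enumerate(queries)]
    return sorted(best_pairs)


# Three queries, two ground-truth objects (class 0 and class 1).
pred_probs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
pred_boxes = [[0.5, 0.5, 0.2, 0.2], [0.2, 0.2, 0.1, 0.1], [0.8, 0.8, 0.3, 0.3]]
gt_labels = [0, 1]
gt_boxes = [[0.2, 0.2, 0.1, 0.1], [0.5, 0.5, 0.2, 0.2]]
print(match_queries_to_targets(pred_probs, pred_boxes, gt_labels, gt_boxes))
```

Queries left unmatched (query 2 in this example) are trained to predict the "no object" class.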
Install dependencies with:
pip install -r requirements.txt
This repo uses the Pascal VOC 2007 dataset. You can download it from:
or run download_voc2007.py to fetch the dataset automatically.
After downloading, convert the annotations to YOLO format with the convert_voc_to_yolo_format.py script and arrange the files like this:
./VOC2007/
└── VOCdevkit/
    └── VOC2007/
        ├── JPEGImages/
        └── labels/
We will use the labels directory for training. VOC2007 has 20 object classes:
VOC_CLASSES = {
    0: 'aeroplane', 1: 'bicycle', 2: 'bird', 3: 'boat', 4: 'bottle',
    5: 'bus', 6: 'car', 7: 'cat', 8: 'chair', 9: 'cow',
    10: 'diningtable', 11: 'dog', 12: 'horse', 13: 'motorbike', 14: 'person',
    15: 'pottedplant', 16: 'sheep', 17: 'sofa', 18: 'train', 19: 'tvmonitor'
}
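For readers unfamiliar with the YOLO label format, the sketch below shows roughly what the conversion involves: each VOC box given as pixel corners becomes one text line of the form `class_id cx cy w h`, with coordinates normalized by the image size. This is an illustration under assumptions, not the contents of convert_voc_to_yolo_format.py; the function names and the six-decimal output format are hypothetical.

```python
import xml.etree.ElementTree as ET

# Same order as the class indices listed above.
VOC_CLASS_NAMES = [
    'aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
    'bus', 'car', 'cat', 'chair', 'cow',
    'diningtable', 'dog', 'horse', 'motorbike', 'person',
    'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor',
]


def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corner coordinates to normalized (cx, cy, w, h)."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return cx, cy, w, h


def voc_xml_to_yolo_lines(xml_text):
    """Parse one VOC annotation (as XML text) into YOLO label lines."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.find("width").text)
    img_h = float(size.find("height").text)
    lines = []
    for obj in root.findall("object"):
        class_id = VOC_CLASS_NAMES.index(obj.find("name").text.strip())
        box = obj.find("bndbox")
        corners = [float(box.find(tag).text)
                   for tag in ("xmin", "ymin", "xmax", "ymax")]
        cx, cy, w, h = voc_box_to_yolo(*corners, img_w, img_h)
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


# A made-up annotation used only to demonstrate the conversion.
sample = """
<annotation>
  <size><width>500</width><height>375</height></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>100</xmin><ymin>75</ymin><xmax>300</xmax><ymax>225</ymax></bndbox>
  </object>
</annotation>
"""
print(voc_xml_to_yolo_lines(sample))  # ['11 0.400000 0.400000 0.400000 0.400000']
```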
To train the model, run the train.py script. You can adjust hyperparameters, model configuration, and training settings as needed.
python train.py
For inference, modify and run the inference.py script as needed. You can adjust the model checkpoint, input image, and output directory.
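As a rough picture of what happens after the forward pass, the sketch below turns per-query outputs into final detections: apply a softmax, drop queries whose best object class falls below a confidence threshold, and rescale the normalized (cx, cy, w, h) boxes to pixel corners. It assumes the DETR convention that the last logit is the "no object" class; the function names and the 0.7 threshold are hypothetical and may not match inference.py.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cxcywh_to_xyxy_pixels(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def filter_detections(logits_per_query, boxes_per_query, img_w, img_h,
                      threshold=0.7):
    """Keep queries whose best object class scores at least `threshold`."""
    detections = []
    for logits, box in zip(logits_per_query, boxes_per_query):
        probs = softmax(logits)
        object_probs = probs[:-1]  # assumption: last index is "no object"
        score = max(object_probs)
        if score >= threshold:
            class_id = object_probs.index(score)
            detections.append(
                (class_id, score, cxcywh_to_xyxy_pixels(box, img_w, img_h))
            )
    return detections


# Two queries over two object classes plus "no object".
logits = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
boxes = [[0.5, 0.5, 0.2, 0.4], [0.1, 0.1, 0.1, 0.1]]
for class_id, score, xyxy in filter_detections(logits, boxes, 100, 200):
    print(class_id, round(score, 3), xyxy)
```

Only the first query survives here; the second puts most of its probability on "no object" and is discarded.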
- The model works well for simple scenes with few objects.
- Performance drops in crowded scenes.
- You can tweak transformer layers, query count, backbone, or loss weights for better results.
I would welcome feedback on how to improve the model's performance and generalization, specifically in crowded scenes, or help explaining the model's behavior in such cases. Feel free to open an issue or PR.
Inspired by the original DETR paper and its official implementation.