Simple DETR (DEtection TRansformer)

A minimal and educational implementation of DETR, built using PyTorch. This project recreates the core ideas behind DETR for object detection in a simple and readable way, making it suitable for learning and experimentation.

🚀 Goals

Understand and reimplement the core ideas of DETR
Explore transformer-based object detection
Apply the model to Pascal VOC-style datasets
Share a clean and working implementation for others to learn from

👀 What this is NOT:

A full-fledged implementation with all bells and whistles
A high-performance model for production use
A replacement for the original DETR or other advanced object detectors

🧪 Overall Approach

Use a ResNet backbone to extract features
Project features & add 2D positional encoding
Use transformer encoder-decoder with learnable queries
Predict object class and bounding boxes
Use Hungarian matching for bipartite target assignment
Optimize classification and bounding box regression losses
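
The matching step above can be sketched without any dependencies. This is a minimal illustration, not the code in loss.py: the function names, the box_weight value, and the toy inputs are assumptions, and an exhaustive search over permutations stands in for the Hungarian algorithm (it returns the same optimal assignment, but is only practical for a handful of queries). In practice a solver such as scipy.optimize.linear_sum_assignment would be used, and the original DETR cost also includes a generalized IoU term that is omitted here.

```python
from itertools import permutations


def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two (cx, cy, w, h) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred_probs, pred_box, gt_class, gt_box, box_weight=5.0):
    """Cost of assigning one prediction to one ground-truth object.

    Lower is better: a high probability for the correct class and a nearby
    box both reduce the cost. box_weight is an illustrative value.
    """
    return -pred_probs[gt_class] + box_weight * l1_distance(pred_box, gt_box)


def bipartite_match(pred_probs, pred_boxes, gt_classes, gt_boxes):
    """Return (query_index, target_index) pairs with minimal total cost.

    Brute-force stand-in for Hungarian matching. Assumes there are at
    least as many queries as ground-truth objects.
    """
    best_cost, best_pairs = float("inf"), []
    for query_ids in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            match_cost(pred_probs[q], pred_boxes[q], gt_classes[t], gt_boxes[t])
            for t, q in enumerate(query_ids)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(q, t) for t, q in enumerate(query_ids)]
    return sorted(best_pairs)


# Toy example: 3 queries, 2 real classes plus a trailing "no object" slot.
pred_probs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.05, 0.05, 0.9]]
pred_boxes = [(0.5, 0.5, 0.2, 0.2), (0.2, 0.3, 0.1, 0.1), (0.9, 0.9, 0.1, 0.1)]
gt_classes = [0, 1]
gt_boxes = [(0.2, 0.3, 0.1, 0.1), (0.5, 0.5, 0.2, 0.2)]

print(bipartite_match(pred_probs, pred_boxes, gt_classes, gt_boxes))
```

Queries left unmatched (query 2 in the toy example) are the ones trained to predict "no object".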

📦 Requirements

Install dependencies with the command below (note that the file in this repository is spelled requirments.txt):

pip install -r requirments.txt

📁 Dataset

This repo uses the Pascal VOC 2007 dataset. You can download it from the VOC2007 Download Page, or run download_voc2007.py to fetch the dataset automatically.

After downloading, convert the annotations to YOLO format with the convert_voc_to_yolo_format.py script and arrange the files like this:

./VOC2007/
└── VOCdevkit/
    └── VOC2007/
        ├── JPEGImages/
        └── labels/
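
For reference, the core of a VOC-to-YOLO conversion is a small piece of arithmetic: a pixel-space corner box becomes a normalized center box. The sketch below shows that step only, under the usual YOLO convention of one "class_id cx cy w h" line per object. The function names are hypothetical, and convert_voc_to_yolo_format.py may differ in details such as rounding or XML parsing.

```python
def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_width, img_height):
    """Convert a pixel corner box to normalized (cx, cy, w, h) in [0, 1]."""
    cx = (xmin + xmax) / 2.0 / img_width
    cy = (ymin + ymax) / 2.0 / img_height
    w = (xmax - xmin) / img_width
    h = (ymax - ymin) / img_height
    return cx, cy, w, h


def format_yolo_line(class_id, box):
    """Render one label line: class id followed by the four box values."""
    cx, cy, w, h = box
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# A 200x200 pixel box inside a 500x400 image, labeled with class id 14.
box = voc_box_to_yolo(100, 50, 300, 250, img_width=500, img_height=400)
print(format_yolo_line(14, box))  # 14 0.400000 0.375000 0.400000 0.500000
```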

We will use the labels directory for training. VOC2007 has 20 object classes:

VOC_CLASSES = {
    0: 'aeroplane', 1: 'bicycle', 2: 'bird', 3: 'boat', 4: 'bottle',
    5: 'bus', 6: 'car', 7: 'cat', 8: 'chair', 9: 'cow',
    10: 'diningtable', 11: 'dog', 12: 'horse', 13: 'motorbike', 14: 'person',
    15: 'pottedplant', 16: 'sheep', 17: 'sofa', 18: 'train', 19: 'tvmonitor'
}
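
As a quick sanity check on converted labels, a line from the labels directory can be mapped back to a class name and a pixel box using the table above. This is a sketch that assumes each label line has the form "class_id cx cy w h" with normalized values. parse_yolo_line is a hypothetical helper, not a function from datasets.py.

```python
VOC_CLASSES = {
    0: 'aeroplane', 1: 'bicycle', 2: 'bird', 3: 'boat', 4: 'bottle',
    5: 'bus', 6: 'car', 7: 'cat', 8: 'chair', 9: 'cow',
    10: 'diningtable', 11: 'dog', 12: 'horse', 13: 'motorbike', 14: 'person',
    15: 'pottedplant', 16: 'sheep', 17: 'sofa', 18: 'train', 19: 'tvmonitor'
}


def parse_yolo_line(line, img_width, img_height):
    """Parse one label line into (class_name, (xmin, ymin, xmax, ymax)) in pixels."""
    fields = line.split()
    class_id = int(fields[0])
    cx, cy, w, h = (float(value) for value in fields[1:5])
    xmin = (cx - w / 2.0) * img_width
    ymin = (cy - h / 2.0) * img_height
    xmax = (cx + w / 2.0) * img_width
    ymax = (cy + h / 2.0) * img_height
    return VOC_CLASSES[class_id], (xmin, ymin, xmax, ymax)


name, corners = parse_yolo_line("14 0.4 0.375 0.4 0.5", img_width=500, img_height=400)
print(name, [round(value) for value in corners])
```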

🤸‍♂️ Training

To train the model, run the train.py script. You can adjust hyperparameters, model configuration, and training settings as needed.

python train.py

🔍 Inference

For inference, modify and run the inference.py script as needed. You can adjust the model checkpoint, input image, and output directory.
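
DETR-style models emit one prediction per query, so inference usually ends with a filtering step that discards queries predicting "no object". The sketch below illustrates that step in plain Python. It assumes the last logit of each query is the "no object" class and uses an illustrative score threshold of 0.7. The function names are hypothetical and inference.py may handle this differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_predictions(class_logits, boxes, score_threshold=0.7):
    """Keep (class_id, score, box) for queries confident in a real class.

    Assumes the final logit of every query is the "no object" class.
    """
    kept = []
    for logits, box in zip(class_logits, boxes):
        probs = softmax(logits)[:-1]  # drop the "no object" probability
        score = max(probs)
        if score >= score_threshold:
            kept.append((probs.index(score), score, box))
    return kept


# Two queries, two real classes plus "no object" as the last logit.
class_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
boxes = [(0.5, 0.5, 0.2, 0.2), (0.9, 0.9, 0.1, 0.1)]

for class_id, score, box in filter_predictions(class_logits, boxes):
    print(class_id, round(score, 3), box)
```

Only the first query survives here, because the second one puts almost all of its probability on "no object".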

📌 Notes and Observations

The model works well for simple scenes with few objects.
Performance drops in crowded scenes.
You can tweak the number of transformer layers, query count, backbone, or loss weights to try for better results.

I would welcome feedback on how to improve performance and generalization, specifically for crowded scenes, or help explaining the model's behavior in such cases. Feel free to open an issue or PR.

📚 Acknowledgements

Inspired by the original DETR paper and its official implementation.

MjdMahasneh/Simple-DETR-From-Scratch
