This project implements DETR (Detection Transformer) for object detection on the VisDrone2019 dataset. DETR is an end-to-end object detection model that uses Transformers to directly predict object bounding boxes and classes.
Sample detection result showing cars, people, and other objects with confidence scores
VisDrone2019 is a large-scale benchmark for drone-based computer vision tasks, containing images captured by various drone platforms. The dataset includes 11 object categories commonly found in aerial imagery:
- ignored-regions - Areas to be ignored during evaluation
- pedestrian - People walking
- people - Stationary people
- bicycle - Bicycles
- car - Cars
- van - Vans
- truck - Trucks
- tricycle - Tricycles
- awning-tricycle - Covered tricycles
- bus - Buses
- motor - Motorcycles
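For reference, the categories above can be kept in an id-to-name mapping. The integer ids below follow the common VisDrone convention (0 = ignored-regions through 10 = motor), but treat the exact ids as an assumption and verify them against your own annotation files:

```python
# Hypothetical id -> name mapping for the VisDrone categories listed above.
# Ids follow the common VisDrone convention; verify against your annotations.
VISDRONE_CLASSES = {
    0: "ignored-regions",
    1: "pedestrian",
    2: "people",
    3: "bicycle",
    4: "car",
    5: "van",
    6: "truck",
    7: "tricycle",
    8: "awning-tricycle",
    9: "bus",
    10: "motor",
}

def class_name(category_id: int) -> str:
    """Return a readable name for a VisDrone category id."""
    return VISDRONE_CLASSES.get(category_id, "unknown")
```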
| Split | Images | Annotations | Size |
|---|---|---|---|
| Train | 6,471 | 390,651 | 64MB |
| Val | 548 | 33,910 | 7.2MB |
| Test | 1,610 | 75,102 | 14MB |
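The split sizes in the table above can be sanity-checked directly from the converted annotation files. This sketch assumes the standard COCO JSON layout (top-level "images" and "annotations" arrays); the exact file paths under VisDrone/VisDrone_COCO/ are assumptions to adjust to your tree:

```python
# Sketch: count images and annotations in one COCO-format annotation file.
import json

def summarize_split(ann_file: str) -> tuple[int, int]:
    """Return (num_images, num_annotations) for a COCO-format JSON file."""
    with open(ann_file) as f:
        data = json.load(f)
    return len(data["images"]), len(data["annotations"])
```

Running this on the train split's annotation JSON should report 6,471 images and 390,651 annotations if the conversion matches the table.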
```
detr_for_VisDrone/
├── VisDrone/                  # VisDrone dataset & annotations
│   ├── VisDrone_COCO/         # COCO format dataset
│   ├── VisDrone2019-DET-*/    # Original dataset splits
│   └── *.json                 # COCO format annotations
├── outputs/                   # Training outputs & checkpoints
├── visualization_results/     # Visualization outputs
├── models/                    # DETR model implementations
├── datasets/                  # Dataset loading & evaluation
├── util/                      # Utility functions
├── main.py                    # Main training script
├── simple_visualize.py        # Simple visualization
├── visualize_results.py       # Advanced visualization
├── test_*.py                  # Testing scripts
├── requirements.txt           # Dependencies
└── README.md                  # This file
```
```bash
# Install PyTorch
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install dependencies
pip install -r requirements.txt
pip install 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
```

Train DETR on the VisDrone dataset:
```bash
# Single GPU training
python main.py --coco_path ./VisDrone/VisDrone_COCO --output_dir ./outputs/visdrone_detr_300ep --epochs 300

# Multi-GPU training (8 GPUs)
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --coco_path ./VisDrone/VisDrone_COCO \
    --output_dir ./outputs/visdrone_detr_300ep \
    --epochs 300
```

Evaluate the trained model on the validation set:
```bash
python main.py --batch_size 2 --no_aux_loss --eval \
    --resume ./outputs/visdrone_detr_300ep/checkpoint.pth \
    --coco_path ./VisDrone/VisDrone_COCO
```

Generate detection visualizations:
```bash
# Simple visualization
python simple_visualize.py

# Advanced visualization with more options
python visualize_results.py

# Test on test set
python test_on_test_set.py

# Comprehensive testing
python test_all.py
```

Our DETR model was trained for 100+ epochs on VisDrone with the following configuration and results:
- Learning Rate: 1e-4 (transformer), 1e-5 (backbone)
- Batch Size: 2 per GPU
- Optimizer: AdamW
- Training Progress: ~100 epochs completed
- Training Loss: 39.22 → 18.00 (54% reduction)
- Validation Loss: 44.38 → 19.05 (57% reduction)
- Classification Error: 52% → 30% (training), 60% → 35% (validation)
| Metric | Epoch 0 | Epoch 99 | Improvement |
|---|---|---|---|
| Train Loss | 39.22 | 18.00 | -54% |
| Val Loss | 44.38 | 19.05 | -57% |
| Train Class Error | 52.0% | 30.0% | -42% |
| Val Class Error | 60.5% | 35.0% | -42% |
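The per-epoch numbers behind the table above can be pulled from the training log. The reference DETR implementation appends one JSON object per epoch to `<output_dir>/log.txt`; the key names used here ("train_loss", "test_loss") are assumptions, so check one line of your own log to confirm them:

```python
# Sketch: parse a JSON-lines training log and extract one metric over epochs.
import json

def load_log(path: str) -> list[dict]:
    """Parse a JSON-lines training log into a list of per-epoch dicts."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

def metric_curve(records: list[dict], key: str) -> list[float]:
    """Collect one metric across epochs, skipping epochs where it is absent."""
    return [r[key] for r in records if key in r]
```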
- Backbone: ResNet-50
- Transformer: 6 encoder + 6 decoder layers
- Object Queries: 100
- Hidden Dimension: 256
- Feed-forward Dimension: 2048
- Attention Heads: 8
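The architecture settings above correspond to argument names used by the reference DETR code. The following is a partial sketch, not a drop-in config: the real build_model() consumes many more fields (loss coefficients, dataset_file, device, ...), so treat this namespace as documentation of the mapping only:

```python
# Sketch: architecture hyperparameters as DETR-style argument names.
# Partial and illustrative -- build_model() needs additional fields.
from argparse import Namespace

arch_args = Namespace(
    backbone="resnet50",      # Backbone: ResNet-50
    enc_layers=6,             # Transformer: 6 encoder layers
    dec_layers=6,             # Transformer: 6 decoder layers
    num_queries=100,          # Object queries
    hidden_dim=256,           # Hidden dimension
    dim_feedforward=2048,     # Feed-forward dimension
    nheads=8,                 # Attention heads
)
```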
```python
import torch
from models import build_model

# Create model
args = create_args()  # Define your arguments
model, criterion, postprocessors = build_model(args)

# Load checkpoint
checkpoint = torch.load('outputs/visdrone_detr_300ep/checkpoint.pth')
model.load_state_dict(checkpoint['model'])
model.eval()
```

```python
from PIL import Image
import torchvision.transforms as T

# Load and preprocess image
image = Image.open('your_image.jpg').convert('RGB')
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Run inference
with torch.no_grad():
    outputs = model(transform(image).unsqueeze(0))
```

Key training parameters can be modified via command-line arguments to main.py:
```bash
# Learning rates
--lr 1e-4              # Transformer learning rate
--lr_backbone 1e-5     # Backbone learning rate

# Training schedule
--epochs 300           # Total epochs
--lr_drop 200          # Epoch at which the LR is dropped

# Loss weights
--bbox_loss_coef 5     # L1 bounding box loss weight
--giou_loss_coef 2     # GIoU loss weight
--eos_coef 0.1         # Classification weight of the no-object class
```

The VisDrone dataset has been converted to COCO format for compatibility with DETR:
- Original Format: VisDrone annotation format (txt files)
- Converted Format: COCO JSON format
- Directory Structure: Standard COCO layout (train2017/, val2017/, annotations/)
See VisDrone/VisDrone_COCO_README.md for detailed dataset information.
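The txt-to-COCO conversion described above can be sketched per annotation line. Each line of a VisDrone-DET annotation file has the form `bbox_left,bbox_top,bbox_width,bbox_height,score,category,truncation,occlusion` (standard VisDrone format, but verify against your files); COCO happens to use the same `[x, y, width, height]` bbox convention, so most fields map directly. The image and annotation ids here are illustrative:

```python
# Sketch: convert one VisDrone-DET annotation line to a COCO annotation dict.
def parse_visdrone_line(line: str, image_id: int, ann_id: int) -> dict:
    """Map one comma-separated VisDrone line onto COCO annotation fields."""
    x, y, w, h, score, category, truncation, occlusion = map(
        int, line.strip().split(",")[:8]
    )
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category,
        "bbox": [x, y, w, h],   # COCO also uses [x, y, width, height]
        "area": w * h,
        "iscrowd": 0,
    }
```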
Issue: np.float deprecated in newer NumPy versions
Solution: Added compatibility code in datasets/coco_eval.py
Issue: High GPU memory usage with large batch sizes
Solution: Use smaller batch sizes (batch_size=2) or gradient accumulation

Issue: Long training times may require interruption
Solution: Model automatically saves checkpoints for resuming training
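The gradient-accumulation workaround mentioned above can be sketched with a toy model (a plain Linear layer stands in for DETR; all names are illustrative). Scaling each mini-batch loss by 1/accum_steps and stepping the optimizer only every accum_steps batches approximates training with a batch accum_steps times larger:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)      # tiny stand-in for the DETR model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(2, 4), torch.randn(2, 2)) for _ in range(8)]

accum_steps = 4                    # effective batch = 2 * 4 = 8 samples
init_weight = model.weight.detach().clone()

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()     # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:   # step only every accum_steps batches
        optimizer.step()
        optimizer.zero_grad()
```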
To continue training from a checkpoint:
```bash
python main.py --resume ./outputs/visdrone_detr_300ep/checkpoint.pth \
    --coco_path ./VisDrone/VisDrone_COCO \
    --output_dir ./outputs/visdrone_detr_300ep \
    --epochs 300
```

This project is based on the original DETR implementation by Facebook Research. See the LICENSE file for details.
- DETR: End-to-End Object Detection with Transformers
- VisDrone Dataset
- PyTorch and torchvision teams
For questions or issues, please open a GitHub issue or contact the maintainer.
Happy Detecting!
