This project implements a comprehensive computer vision perception system for detecting, tracking, and analyzing objects in images and videos. It's designed as a learning resource for understanding how modern perception systems work, combining multiple vision tasks into a cohesive pipeline.
## Table of Contents

- [Architecture Overview](#architecture-overview)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Core Modules Explained](#core-modules-explained)
- [Perception Pipeline](#perception-pipeline)
- [Configuration System](#configuration-system)
- [Visualization](#visualization)
- [Examples and Demos](#examples-and-demos)
- [Advanced Topics](#advanced-topics)
- [Common Issues and Solutions](#common-issues-and-solutions)
- [Design Principles and Concepts](#design-principles-and-concepts)
## Architecture Overview

The perception system follows a modular pipeline architecture:
- Input Sources: Camera feeds, video files, or images
- Object Detection: Identifies and localizes objects in each frame
- Object Tracking: Associates detections across frames to maintain object identity
- Depth Estimation: Infers distance information from monocular imagery
- Object Fusion: Combines outputs from different modules to create a unified representation
- Visualization: Renders results for human interpretation
The system is designed with these key principles:
- Modularity: Each component has a specific responsibility
- Extensibility: Easy to add or replace components
- Configuration-driven: Behavior can be modified without code changes
- Pythonic implementation: Clear, readable code with thorough documentation
## Installation

### Prerequisites

- Python 3.8+
- CUDA-capable GPU (recommended for real-time performance)

### Setup

1. Clone the repository:

```bash
git clone https://github.com/akramsalim/perception-kit.git
cd perception-kit
```

2. Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

## Quick Start

Process a single image:

```bash
python example_perception.py --config config/perception_config.yaml --image data/images/example.jpg
```

Process a video file and save the annotated output:

```bash
python example_perception.py --config config/perception_config.yaml --video data/videos/example.mp4 --output output.mp4
```

Run on a live camera feed:

```bash
python example_perception.py --config config/perception_config.yaml --camera 0
```

Explore the full pipeline interactively:

```bash
jupyter notebook notebooks/demo_full_pipeline.ipynb
```

## Core Modules Explained

### Object Detection

**Purpose:** Identify and localize objects in images.
**Key files:**

- `perception/detection/detector.py`: Abstract base class defining the detector interface
- `perception/detection/yolo_detector.py`: Implementation using YOLOv5
**How it works:**
The detection module uses a neural network (in this case YOLOv5) to process images and identify objects. Each detection includes:
- A bounding box (x1, y1, x2, y2)
- A class label (e.g., "person", "car")
- A confidence score
**Design considerations:**

- **Abstract base class pattern**: The `Detector` abstract class defines a common interface that all detector implementations must follow, allowing different detection algorithms to be swapped in seamlessly.
- **Pretrained models**: YOLOv5 is used because it offers an excellent balance of speed and accuracy. The system downloads pretrained weights automatically, making it easy to get started.
- **Configuration-driven**: Detection parameters (confidence thresholds, model size, etc.) are configurable through YAML files.
- **GPU acceleration**: The detector automatically uses the GPU, if available, for faster inference.
**Implementation details:**

```python
# YOLODetector loads a pretrained model from torch hub
self.model = torch.hub.load('ultralytics/yolov5', model_name, pretrained=True)

# Detection produces a list of dictionaries with standardized keys
detections = [
    {
        'box': [x1, y1, x2, y2],
        'score': confidence,
        'class_id': class_id,
        'class_name': class_name,
        'center': [(x1+x2)/2, (y1+y2)/2],
        'dimensions': [x2-x1, y2-y1]
    },
    # ... more detections
]
```

**Customization options:**
- Change model size ('n', 's', 'm', 'l', 'x') for different speed/accuracy tradeoffs
- Adjust confidence thresholds
- Filter specific object classes
- Use custom-trained YOLO models
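For example, here is a minimal sketch of a stricter, class-filtered detector configuration. The `confidence_threshold` and `filter_classes` keys mirror the YAML config shown later; the `model_name` key and the import path are assumptions based on the file layout above, not confirmed constructor details:

```python
import cv2

# Import path assumed from the key-files list above
from perception.detection.yolo_detector import YOLODetector

# Hypothetical config dict; see config/perception_config.yaml for the real keys
det_config = {
    'model_name': 'yolov5m',              # 'n'/'s'/'m'/'l'/'x' trade speed for accuracy
    'confidence_threshold': 0.4,          # stricter than the 0.25 default
    'filter_classes': ['person', 'car'],  # ignore all other classes
}

detector = YOLODetector(config=det_config)
frame = cv2.imread('data/images/example.jpg')
detections = detector.detect(frame)
```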
### Object Tracking

**Purpose:** Associate detections across frames to maintain object identity and estimate motion.
**Key files:**

- `perception/tracking/tracker.py`: Abstract base class defining the tracker interface
- `perception/tracking/sort_tracker.py`: Implementation using the SORT algorithm
**How it works:**
The tracking module takes detections from consecutive frames and:
- Associates new detections with existing tracks
- Updates track states using Kalman filtering
- Manages track creation, confirmation, and deletion
- Estimates object velocity
**Design considerations:**

- **Kalman filtering**: The tracker uses Kalman filters to model object motion and to handle occlusions and missed detections.
- **Hungarian algorithm**: Used for optimal assignment between predictions and detections.
- **Track management**: Rules for creating, confirming, and deleting tracks handle noise and temporary occlusions.
- **State representation**: Tracks maintain position, velocity, and age information.
**Implementation details:**

```python
# Each track uses a Kalman filter to model motion
self.kf = KalmanFilter(dim_x=8, dim_z=4)  # State: [x1, y1, x2, y2, vx, vy, vw, vh]

# The tracking process:
# 1. Predict new locations using the Kalman filter
# 2. Associate detections to tracks using IOU and the Hungarian algorithm
matched, unmatched_dets, unmatched_trks = self._associate_detections_to_trackers(...)

# 3. Update matched tracks with new detections
for m in matched:
    self.trackers[m[1]].update(detection_boxes[m[0]])

# 4. Create new tracks for unmatched detections
for i in unmatched_dets:
    trk = KalmanBoxTracker(detection_boxes[i])
    self.trackers.append(trk)

# Track results include unique IDs and velocity estimates
```

**Algorithmic insights:**
- SORT algorithm: Simple Online and Realtime Tracking balances efficiency and accuracy
- IOU-based matching: Uses spatial overlap to associate detections with tracks
- Constant velocity model: The Kalman filter assumes objects move with constant velocity
- Track confirmation logic: Requires multiple consecutive detections to confirm a track
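To make the IOU-based matching concrete, here is a small, self-contained sketch of the overlap score used when deciding whether a detection matches a predicted track box. This is standard IOU, not the project's exact helper:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    # Intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Pairs scoring below iou_threshold (0.3 in the default config) stay unmatched
print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # ~0.14
```

The Hungarian algorithm then solves the optimal assignment over the full IOU matrix (`scipy.optimize.linear_sum_assignment` is the usual tool for this).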
### Depth Estimation

**Purpose:** Infer depth/distance information from monocular images.
**Key files:**

- `perception/depth/depth_estimator.py`: Abstract base class for depth estimators
- `perception/depth/monodepth.py`: Implementation using the MiDaS neural network
**How it works:**
The depth estimation module uses a neural network trained to predict relative depth from a single image. This is challenging without stereo vision, but modern networks like MiDaS can produce remarkably accurate depth maps by learning depth cues from large datasets.
**Design considerations:**

- **Model selection**: MiDaS is chosen for its robust performance across diverse scenes.
- **Efficient inference**: The implementation includes optimization options for real-time performance.
- **Depth normalization**: Output is normalized to a consistent range for easier interpretation.
- **Integration with object data**: Helper methods calculate object distances from the depth map.
**Implementation details:**

```python
# Load MiDaS model from torch hub
midas = torch.hub.load("intel-isl/MiDaS", model_type)

# The depth estimation process:
# 1. Apply input transforms
input_batch = self.transform(frame_rgb).to(self.device)

# 2. Run inference
with torch.no_grad():
    prediction = self.model(input_batch)

# 3. Post-process and normalize
depth_map = (depth_map - depth_min) / depth_range  # 0-1 range

# Helper method calculates object distance from the depth map region
def get_object_distance(self, depth_map, box):
    x1, y1, x2, y2 = [int(v) for v in box]  # unpack the bounding box
    depth_region = depth_map[y1:y2, x1:x2]
    median_depth = np.median(depth_region)  # median is robust to outliers
    return float(median_depth)
```

**Limitations and considerations:**
- Monocular depth is relative, not absolute (without calibration)
- Performance varies based on scene complexity and lighting
- The depth maps are most accurate for general scene structure rather than precise measurements
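One practical workaround for the relative-depth limitation is to anchor the map with a single known distance. Below is a rough sketch of that idea; it assumes depth values scale approximately linearly with true distance, which only loosely holds for learned monocular depth, so treat the result as an estimate:

```python
import numpy as np

def scale_relative_depth(depth_map, ref_box, ref_distance_m):
    """Scale a normalized relative depth map using one object of known distance.

    Assumes depth values are roughly proportional to true distance, which is
    only an approximation for learned monocular depth.
    """
    x1, y1, x2, y2 = [int(v) for v in ref_box]
    ref_relative = np.median(depth_map[y1:y2, x1:x2])
    scale = ref_distance_m / (ref_relative + 1e-9)
    return depth_map * scale  # now in (approximate) meters
```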
### Object Fusion

**Purpose:** Combine information from different perception modules into a unified representation.
**Key files:**

- `perception/fusion/object_fusion.py`: Implementation of object-level fusion
**How it works:**
The fusion module takes outputs from detection, tracking, and depth estimation to create enriched object representations that include:
- Detection information (class, confidence)
- Tracking information (ID, velocity)
- Depth information (distance)
- Additional attributes (color, estimated physical size)
**Design considerations:**

- **Complementary information**: Each module provides different aspects of scene understanding.
- **Attribute extraction**: The fusion module adds derived attributes like dominant color.
- **3D position estimation**: Converts 2D positions and depth to approximate 3D coordinates.
- **Unified representation**: Creates a standardized format for downstream components.
**Implementation details:**

```python
# The fusion process enriches objects with additional information
def fuse_objects(self, objects, depth_map, frame):
    fused_objects = []
    for obj in objects:
        fused_obj = obj.copy()

        # Add distance information from the depth map
        if depth_map is not None:
            distance = self._calculate_object_distance(depth_map, obj['box'])
            fused_obj['distance'] = distance

            # Estimate 3D position
            position_3d = self._estimate_3d_position(obj['center'], distance)
            fused_obj['position_3d'] = position_3d

        # Extract object attributes like color
        if frame is not None:
            color = self._extract_dominant_color(frame, obj['box'])
            fused_obj['color'] = color

        # Estimate physical size using distance and pixel dimensions
        if 'distance' in fused_obj and 'dimensions' in obj:
            physical_size = self._calculate_physical_size(obj['dimensions'], fused_obj['distance'])
            fused_obj['physical_size'] = physical_size

        fused_objects.append(fused_obj)
    return fused_objects
```

**Key fusion concepts:**
- Object-centric fusion: Focused on creating rich object representations
- Attribute extraction: Derives additional information from raw sensor data
- Distance estimation: Uses depth map regions to estimate object distance
- 3D projection: Simple pinhole camera model for 3D positioning
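The pinhole back-projection behind the 3D positioning step can be written out directly: given a pixel (u, v) at depth Z, the camera-frame coordinates follow from inverting u = fx·X/Z + cx and v = fy·Y/Z + cy. The intrinsics below are placeholder values for illustration, not the project's calibration:

```python
def backproject(u, v, z, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0):
    """Back-project pixel (u, v) at depth z into 3D camera coordinates
    using the pinhole model (fx, fy: focal lengths; cx, cy: principal point)."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# Example: object centered at pixel (800, 400), estimated 5 m away
print(backproject(800, 400, 5.0))  # (0.8, 0.2, 5.0)
```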
## Perception Pipeline

**Purpose:** Coordinate the flow of data between the different perception modules.

**Key file:** `pipeline/perception_pipeline.py`
**How it works:**
The pipeline is the central coordinator that:
- Receives input frames
- Passes data through each module in sequence
- Collects and structures the results
- Provides visualization and timing functions
**Design considerations:**

- **Sequential processing**: Each module builds upon the results of previous modules.
- **Result container**: The `PerceptionResult` class organizes outputs from all modules.
- **Performance monitoring**: Processing times are tracked for each component.
- **Flexible configuration**: Components can be enabled or disabled as needed.
**Implementation details:**

```python
def process_frame(self, frame: np.ndarray, timestamp: float = None) -> PerceptionResult:
    """Process a single frame through the perception pipeline."""
    # Create result container
    result = PerceptionResult()
    result.frame_id = self.frame_id
    result.timestamp = timestamp or time.time()

    # 1. Object Detection
    detections = self.detector.detect(frame)
    result.detections = detections

    # 2. Object Tracking (if available)
    if self.tracker:
        tracks = self.tracker.track(detections, frame, result.timestamp)
        result.tracks = tracks

    # 3. Depth Estimation (if available)
    if self.depth_estimator:
        depth_map = self.depth_estimator.estimate_depth(frame)
        result.depth_map = depth_map

    # 4. Object Fusion (if available)
    if self.fusion:
        objects_to_fuse = result.tracks if result.tracks else result.detections
        fused_objects = self.fusion.fuse_objects(objects_to_fuse, result.depth_map, frame)
        result.fused_objects = fused_objects

    # Increment frame counter
    self.frame_id += 1
    return result
```

**Pipeline concepts:**
- Frame-by-frame processing: Each frame is processed independently
- Module dependencies: Later modules depend on outputs from earlier ones
- Optional components: The pipeline works even if some modules are disabled
- Result aggregation: All module outputs are collected in a single result object
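As a concrete illustration of the performance-monitoring idea, each stage can be wrapped with simple wall-clock timing. The helper below is illustrative, not the pipeline's actual instrumentation:

```python
import time

timings = {}

def timed(name, fn, *args, **kwargs):
    """Run fn and record its elapsed wall-clock time under `name`."""
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    timings[name] = time.perf_counter() - start
    return out

# detections = timed('detection', detector.detect, frame)
# tracks = timed('tracking', tracker.track, detections, frame, time.time())
# print(timings)  # e.g. {'detection': 0.021, 'tracking': 0.003}
```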
## Configuration System

**Purpose:** Allow customization of system behavior without code changes.
**Key files:**

- `config/perception_config.yaml`: Main configuration file
- `config/models_config.yaml`: Model-specific parameters
**How it works:**
The configuration system uses YAML files to define parameters for all modules. These values are loaded at runtime and passed to the appropriate components.
**Design considerations:**

- **Hierarchical structure**: Organized by module for clarity.
- **Default values**: The code provides sensible defaults for all parameters.
- **Configuration overrides**: Loaded configs override the defaults.
- **Documentation**: Comments explain the purpose and options for each parameter.
**Example configuration:**

```yaml
# Detection configuration
detection:
  model: "yolo"
  confidence_threshold: 0.25
  nms_threshold: 0.45
  filter_classes: []  # Empty list means detect all classes

# Tracking configuration
tracking:
  enabled: true
  algorithm: "sort"
  max_age: 30
  min_hits: 3
  iou_threshold: 0.3
```

**Configuration patterns:**
- Feature flags: Enable/disable components (e.g., tracking.enabled)
- Algorithm selection: Choose between implementation options
- Threshold tuning: Adjust sensitivity parameters
- Resource control: Parameters to manage computational resources
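Here is a minimal sketch of the load-and-override pattern this describes, using PyYAML. The `DEFAULTS` dict and the shallow per-section merge are illustrative simplifications, not the project's actual loader:

```python
import yaml

# Illustrative defaults; the real defaults live in the component code
DEFAULTS = {
    'detection': {'model': 'yolo', 'confidence_threshold': 0.25},
    'tracking': {'enabled': True, 'algorithm': 'sort'},
}

def load_config(path):
    """Merge a YAML file over hard-coded defaults (shallow, per section)."""
    with open(path) as f:
        loaded = yaml.safe_load(f) or {}
    config = {}
    for section, defaults in DEFAULTS.items():
        config[section] = {**defaults, **loaded.get(section, {})}
    return config

# config = load_config('config/perception_config.yaml')
# config['detection']['confidence_threshold']  # file value if set, else 0.25
```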
## Visualization

**Purpose:** Render perception results for human interpretation.

**Key file:** `perception/utils/visualization.py`
**How it works:**
The visualization module creates visual representations of perception results, including:
- Bounding boxes for detected objects
- Tracking IDs and history trails
- Color-coded depth maps
- Performance metrics display
**Design considerations:**

- **Customizable visualization**: Options to show or hide different result types.
- **Intuitive color coding**: Consistent color schemes for different object classes and tracks.
- **Information density**: Balances showing enough information without cluttering the display.
- **Track history**: Shows the motion paths of tracked objects.
**Implementation details:**

```python
def visualize_results(self, frame, result, show_detections=True,
                      show_tracks=True, show_depth=True):
    # Create a copy of the frame
    vis_frame = frame.copy()

    # 1. Show depth map if available
    if show_depth and result.depth_map is not None:
        vis_frame = self.overlay_depth_map(vis_frame, result.depth_map)

    # 2. Show object detections and tracks
    for obj in result.fused_objects:
        # Show detection box
        if show_detections and 'box' in obj:
            vis_frame = self.draw_box(vis_frame, obj)
        # Show tracking information
        if show_tracks and 'track_id' in obj:
            vis_frame = self.draw_track(vis_frame, obj)

    # 3. Add performance overlay
    if hasattr(result, 'processing_time'):
        fps = 1.0 / result.processing_time
        cv2.putText(vis_frame, f"FPS: {fps:.1f}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

    return vis_frame
```

**Visualization techniques:**
- Color generation: Consistent colors for each tracking ID/class
- Track visualization: Drawing lines showing object motion history
- Depth colorization: Using color maps to visualize depth
- Information overlay: Adding text with object properties and system metrics
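The consistent-color trick usually reduces to a deterministic map from track ID to color. A minimal sketch using golden-ratio hue spacing (illustrative, not the project's exact palette code):

```python
import colorsys

def track_color(track_id):
    """Deterministic, well-spread BGR color for a track ID (golden-ratio hues)."""
    hue = (track_id * 0.618033988749895) % 1.0  # golden ratio conjugate
    r, g, b = colorsys.hsv_to_rgb(hue, 0.9, 0.95)
    return int(b * 255), int(g * 255), int(r * 255)  # OpenCV expects BGR

# The same ID always maps to the same color; nearby IDs get distinct hues
print(track_color(1), track_color(2))
```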
## Examples and Demos

**Purpose:** Demonstrate system capabilities and provide usage examples.
**Key files:**

- `example_perception.py`: Command-line application
- `notebooks/demo_full_pipeline.ipynb`: Interactive Jupyter notebook
**How they work:**
The examples show how to:
- Configure the perception system
- Process images, videos, or camera feeds
- Visualize and interpret results
- Customize components for different scenarios
**Design considerations:**

- **Progressive introduction**: The notebook builds understanding incrementally.
- **Interactive exploration**: Allows experimenting with different parameters.
- **Real-world usage**: The command-line tool shows how to use the system in applications.
- **Configurability**: The examples demonstrate configuration options.
**Example usage from code:**

```python
# Create components
detector = YOLODetector(config=det_config)
tracker = SORTTracker(config=track_config)
depth_estimator = MonocularDepthEstimator(config=depth_config)
fusion = ObjectFusion(config=fusion_config)

# Create pipeline
pipeline = PerceptionPipeline(
    detector=detector,
    tracker=tracker,
    depth_estimator=depth_estimator,
    fusion=fusion
)

# Process a frame
result = pipeline.process_frame(frame)

# Visualize results
vis_frame = pipeline.visualize(frame, result)
```

## Advanced Topics

### Performance Optimization

The system includes several optimization strategies:
- **Model selection**: Different model sizes balance speed and accuracy.
- **GPU acceleration**: All neural networks use the GPU for faster inference when available.
- **Half-precision**: Option to use FP16 for faster GPU computation.
- **Frame skipping**: Option to process every Nth frame to increase throughput.
- **Component selection**: Disable unnecessary modules for faster processing.
**Example optimization code:**

```python
# Use half-precision for GPU acceleration
if self.config['optimize'] and self.config['device'] == 'cuda':
    midas = midas.to(memory_format=torch.channels_last)
    midas = midas.half()  # Use half precision

# Process frames at a reduced rate
if frame_idx % self.config['skip_frames'] != 0:
    continue  # Skip this frame
```

### Extending the System

The system is designed for extension. Here are some ways to extend it:
- **New detector implementations**: Add support for different detection models.
- **Advanced tracking algorithms**: Implement DeepSORT or other tracking methods.
- **Sensor fusion**: Add support for multiple cameras or other sensor types.
- **Additional perception tasks**: Integrate segmentation, pose estimation, etc.
- **Custom visualization**: Create domain-specific visualizations.
**Extension pattern:**

```python
# Adding a new detector implementation
class CustomDetector(Detector):
    """Custom detector implementation."""

    def initialize(self) -> None:
        # Initialize your custom detector (load weights, set device, etc.)
        pass

    def detect(self, frame: np.ndarray) -> List[Dict]:
        # Implement detection logic and return standardized detection dicts
        detections: List[Dict] = []
        return detections
```
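Because the pipeline depends only on the `Detector` interface, a custom implementation plugs in exactly like the built-in one:

```python
# Drop-in replacement: reuse the other components from the earlier example
pipeline = PerceptionPipeline(
    detector=CustomDetector(config={}),
    tracker=tracker,
    depth_estimator=depth_estimator,
    fusion=fusion
)
```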
## Common Issues and Solutions

- **GPU memory errors:**
  - Try a smaller model size
  - Reduce batch size or input resolution
  - Enable optimization flags
- **Slow performance:**
  - Check GPU utilization
  - Consider frame skipping
  - Disable unnecessary modules
- **Poor detection quality:**
  - Adjust confidence thresholds
  - Try different model sizes
  - Consider domain-specific fine-tuning
- **Tracking issues:**
  - Adjust the IOU threshold for better association
  - Modify max_age and min_hits for track management
  - Ensure detection quality is sufficient
- **Depth estimation problems:**
  - Try different MiDaS model variants
  - Be aware of limitations in certain scenes
  - Consider using depth only for relative measurements
## Design Principles and Concepts

The system demonstrates several key perception engineering principles:
- **Modularity and abstraction**: Clean interfaces between components.
- **Pipeline architecture**: Sequential processing with well-defined data flow.
- **Uncertainty handling**: Confidence scores, probabilistic tracking, etc.
- **Temporal integration**: Using information across frames for better results.
- **Multi-task perception**: Combining detection, tracking, and depth for richer understanding.
**Core computer vision concepts demonstrated:**

- **Object detection**: Finding and classifying objects in images.
- **Multi-object tracking**: Maintaining identity across frames.
- **Monocular depth estimation**: Inferring depth from a single camera.
- **Visual odometry**: Estimating motion from visual cues.
- **Feature extraction**: Deriving additional attributes from visual data.
**Software patterns used:**

- **Abstract base classes**: Defining interfaces for implementations.
- **Factory pattern**: Creating appropriate components based on configuration (see the sketch below).
- **Strategy pattern**: Swappable algorithms with the same interface.
- **Observer pattern**: Visualization observes pipeline results.
- **Dependency injection**: Components receive their dependencies from outside.
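As an illustration of the factory pattern, component creation can be driven by the config's algorithm selection. The registry contents below are illustrative (only the YOLO detector ships with the project), and `create_detector` is a hypothetical helper, not an existing function:

```python
# Map config names to detector classes; extend as new detectors are added
DETECTOR_REGISTRY = {
    'yolo': YOLODetector,
    # 'custom': CustomDetector,
}

def create_detector(config):
    """Instantiate the detector named by config['model'] (factory pattern)."""
    name = config.get('model', 'yolo')
    try:
        return DETECTOR_REGISTRY[name](config=config)
    except KeyError:
        raise ValueError(f"Unknown detector '{name}'")

# detector = create_detector({'model': 'yolo', 'confidence_threshold': 0.25})
```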
This README serves both as documentation and as a learning resource for understanding perception system design and implementation.