NVIDIA-accelerated, deep-learned stereo disparity estimation
Learn how to use this package by watching our on-demand webinar: Using ML Models in ROS 2 to Robustly Estimate Distance to Obstacles
Deep Neural Network (DNN)–based stereo models have become essential for depth estimation because they overcome many of the fundamental limitations of classical and geometry-based stereo algorithms.
Traditional stereo matching relies on explicitly finding pixel correspondences between left and right images using handcrafted features. While effective in well-textured, ideal conditions, these approaches often fail in “ill-posed” regions such as areas with reflections, specular highlights, texture-less surfaces, repetitive patterns, occlusions, or even minor camera calibration errors. In such cases, classical algorithms may produce incomplete or inaccurate depth maps, or be forced to discard information entirely, especially when context-dependent filtering is not possible.
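The handcrafted correspondence search described above can be sketched in a few lines. The following is an illustrative winner-take-all matcher over per-pixel absolute differences (window aggregation and SGM's smoothness terms are omitted for brevity); it is not the algorithm used by this package:

```python
import numpy as np

def wta_disparity(left, right, max_disp=8):
    """Winner-take-all disparity: for each pixel, pick the candidate
    disparity whose absolute-difference matching cost is lowest."""
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf, dtype=np.float32)
    for d in range(max_disp):
        # A point at column x in the left image appears at x - d in the right.
        cost[d, :, d:] = np.abs(left[:, d:].astype(np.float32)
                                - right[:, :w - d].astype(np.float32))
    return cost.argmin(axis=0)
```

On texture-less or repetitive regions, many candidate disparities produce nearly identical costs, which is exactly where this kind of handcrafted matching breaks down.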
DNN-based stereo methods learn rich, hierarchical feature representations and context-aware matching costs directly from data. These models leverage semantic understanding and global scene context to infer depth, even in challenging environments where traditional correspondence measures break down. Through training, DNNs can implicitly account for real-world imperfections such as:
- calibration errors
- exposure differences
- hardware noise

Training also increases a DNN's ability to recognize and handle difficult regions such as reflections or transparent surfaces, resulting in more robust, accurate, and dense depth predictions.
These advances are critical for robotics and autonomous systems, enabling applications where both speed and accuracy of depth perception are essential, such as:
- precise robotic arm manipulation
- reliable obstacle avoidance and navigation
- robust target tracking in dynamic or cluttered environments
DNN-based stereo methods consistently outperform classical techniques, making them the preferred choice for modern depth perception tasks.
The figure above illustrates this advantage by comparing the output of a classical stereo algorithm, Semi-Global Matching (SGM), with two DNN-based methods, ESS and FoundationStereo. SGM produces a noisy, error-prone disparity map, while ESS and FoundationStereo produce much smoother and more accurate results. A closer look shows that FoundationStereo is the most accurate of the three, handling the plant in the distance and the railings on the left with smoother, more consistent estimates.
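For reference, a disparity map from any of these methods converts to metric depth with the standard rectified-stereo relation, depth = focal length × baseline / disparity (a general geometric formula, not specific to this package):

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in meters from disparity in pixels, given the camera's
    focal length in pixels and the stereo baseline in meters."""
    # Larger disparity means a closer object; disparity -> 0 means far away.
    return (focal_px * baseline_m) / disparity_px
```

For example, a 50 px disparity with a 500 px focal length and a 10 cm baseline corresponds to 1 m of depth.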
DNN-based stereo systems begin by passing the left and right images through shared convolutional backbones to extract multi-scale feature maps that encode both texture and semantic information. These feature maps are then compared across candidate disparities by constructing a learnable cost volume, which represents the matching likelihood of each pixel at each disparity. Successive 3D convolution (or 2D convolution plus aggregation) stages regularize and refine this cost volume, integrating strong local cues (edges, textures) with global scene context (object shapes, layout priors) to resolve ambiguities. Finally, a soft-argmax or classification layer converts the refined cost volume into a dense disparity map, often followed by lightweight refinement modules that enforce sub-pixel accuracy and learned priors, such as smoothness within objects and sharp transitions at boundaries. The result is a coherent estimate that gracefully handles challenging scenarios where classical algorithms falter.
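The cost-volume and soft-argmax steps above can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names, assuming feature maps have already been extracted; real networks such as ESS build and refine the volume with learned 3D convolutions rather than a raw correlation:

```python
import numpy as np

def cost_volume(feat_l, feat_r, max_disp):
    """Correlation cost volume of shape (D, H, W): similarity of left
    features against right features shifted by each candidate disparity."""
    c, h, w = feat_l.shape
    vol = np.zeros((max_disp, h, w), dtype=np.float32)
    for d in range(max_disp):
        # Dot product of feature vectors at each candidate correspondence.
        vol[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :w - d]).sum(axis=0)
    return vol

def soft_argmax(vol):
    """Softmax-weighted expectation over the disparity axis, yielding a
    dense, differentiable, sub-pixel disparity map."""
    e = np.exp(vol - vol.max(axis=0, keepdims=True))
    p = e / e.sum(axis=0, keepdims=True)
    d = np.arange(vol.shape[0], dtype=np.float32)[:, None, None]
    return (p * d).sum(axis=0)
```

Because the expectation is differentiable, gradients flow through the disparity estimate during training, which is what lets the network learn context-aware matching end to end.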
This package is powered by NVIDIA Isaac Transport for ROS (NITROS), which leverages type adaptation and negotiation to optimize message formats and dramatically accelerate communication between participating nodes.
| Sample Graph | Input Size | AGX Thor | x86_64 w/ RTX 5090 |
|---|---|---|---|
| DNN Stereo Disparity Node Full | 576p | 178 fps<br>22 ms @ 30Hz | 350 fps<br>5.6 ms @ 30Hz |
| DNN Stereo Disparity Node Light | 288p | 350 fps<br>9.4 ms @ 30Hz | 350 fps<br>5.0 ms @ 30Hz |
| DNN Stereo Disparity Graph Full | 576p | 73.6 fps<br>29 ms @ 30Hz | 348 fps<br>8.5 ms @ 30Hz |
| DNN Stereo Disparity Graph Light | 288p | 219 fps<br>17 ms @ 30Hz | 350 fps<br>7.3 ms @ 30Hz |
Please visit the Isaac ROS Documentation to learn how to use this repository.
Update 2025-10-24: Added FoundationStereo package
