
Conversation

@shivin101

No description provided.

Shivin Yadav and others added 30 commits October 24, 2025 10:07
- Add comprehensive profiling to SLAM system with record_function markers
- Add profiling to pipeline execution with component-wise tracking
- Add environment variable VIPE_PROFILE=1 to enable/disable profiling
- Add profiler statistics formatter with structured console output
- Track key components: SLAM loop, bundle adjustment, depth estimation
- Default to disabled to avoid performance overhead in production
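A minimal sketch of the opt-in profiling pattern these commits describe (the `profile_section` helper and its wiring are illustrative, not the PR's exact code):

```python
import os
from contextlib import nullcontext

from torch.profiler import record_function

# Profiling is opt-in via VIPE_PROFILE=1, so production runs pay no overhead.
PROFILING_ENABLED = os.environ.get("VIPE_PROFILE", "0") == "1"

def profile_section(name: str):
    """Return a record_function marker when profiling is on, else a no-op."""
    return record_function(name) if PROFILING_ENABLED else nullcontext()

# Usage inside the SLAM loop, one marker per tracked component:
with profile_section("slam/bundle_adjustment"):
    ...  # run bundle adjustment
```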
- PyTorch does not support float16 operations on CPU
- Load models with float32 on CPU for Modal snapshot
- Convert to float16 on GPU for efficiency
- Add device parameter to MogeModel.__init__ (defaults to 'cuda')
- Update modal_helpers to load MogeModel on CPU during snapshot phase
- Ensure input tensors are moved to correct device in estimate method
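A sketch of the device-aware loading the two commits above describe, using a stand-in module (`DepthStub` is hypothetical; the real change is to `MogeModel.__init__`):

```python
import torch

class DepthStub(torch.nn.Module):  # stand-in for MogeModel
    def __init__(self, device: str = "cuda"):
        super().__init__()
        self.device = torch.device(device)
        # CPU PyTorch lacks float16 kernels, so snapshot loading stays fp32.
        dtype = torch.float32 if self.device.type == "cpu" else torch.float16
        self.backbone = torch.nn.Linear(128, 128).to(self.device, dtype)

    def estimate(self, image: torch.Tensor) -> torch.Tensor:
        # Move inputs to the model's device/dtype before inference.
        image = image.to(self.device, next(self.parameters()).dtype)
        return self.backbone(image)

# Snapshot phase: build on CPU in fp32; after restore, convert to fp16 on GPU.
model = DepthStub(device="cpu")
if torch.cuda.is_available():
    model = model.to("cuda", torch.float16)
    model.device = torch.device("cuda")
```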
- Add prebuild_slam_components_cpu() to build SLAM on CPU during snapshot
- Add move_slam_to_gpu() to transfer SLAM components to GPU after restore
- Update DefaultAnnotationPipeline to accept prebuilt_slam_system parameter
- Use pre-built SLAM system when available, avoiding rebuild on every request
- Add camera_type parameter to prebuild_slam_components_cpu()
- Fix missing camera_type error during SLAM component initialization
- Pass camera_type from pipeline.init config to SLAM build
- Change default width from 854 to 848 (848 % 8 == 0)
- GraphBuffer requires dimensions divisible by 8
- Fixes AssertionError during SLAM component prebuild
- Add SE3.Identity(1) rig initialization
- Required by _build_components_cpu() which accesses self.rig
- Rig will be overridden during actual run() with real values
- Fixes AttributeError: 'SLAMSystem' object has no attribute 'rig'
- Set has_init_pose=False as default for prebuild
- Required by SLAMFrontend.__init__() which accesses args.has_init_pose
- Will be overridden during actual run() based on video attributes
- Fixes ConfigAttributeError: Missing key has_init_pose
Problem: SLAM GraphBuffer is video-specific (dimensions), so reusing
entire pipeline/SLAM system across videos causes dimension mismatches.

Solution: Only preload heavy models (DroidNet) in snapshot, create
fresh SLAM system for each video with correct dimensions.

Changes:
- Replace prebuild_slam_components_cpu() with preload_slam_models_cpu()
- Only load DroidNet model, not GraphBuffer or other video-specific components
- SLAMSystem now accepts preloaded_models parameter
- Pipeline creates fresh SLAM system per video using preloaded models
- Remove pipeline reuse optimization (video-specific state)

This fixes: RuntimeError: expanded size mismatch (848 vs 592)
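A rough sketch of the split this commit lands (function and parameter names come from the message above; the stub bodies are assumptions):

```python
import torch

class SLAMSystem:  # stand-in: the real class owns video-specific buffers
    def __init__(self, width: int, height: int, preloaded_models: dict | None = None):
        assert width % 8 == 0 and height % 8 == 0, "GraphBuffer needs /8 dims"
        self.models = preloaded_models or {}
        # GraphBuffer and other per-video state get built fresh here, sized
        # to this video, which avoids the 848-vs-592 expansion mismatch.

def preload_slam_models_cpu() -> dict:
    """Snapshot phase: load only video-agnostic weights (DroidNet) on CPU."""
    droidnet = torch.nn.Linear(8, 8)  # stand-in for the real DroidNet
    return {"droidnet": droidnet}

models = preload_slam_models_cpu()          # once, in the snapshot
if torch.cuda.is_available():
    for m in models.values():
        m.to("cuda")                        # once, after restore
slam = SLAMSystem(848, 480, preloaded_models=models)  # fresh per video
```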
Implement CPU pre-loading and GPU transfer for UniDepthV2 and TrackAnything
models to optimize Modal cold start times.

Changes:
- UniDepthV2: Add device parameter to support CPU/GPU placement
- TrackAnything: Add preloaded model support for SAM and AOT
- Pipeline: Pass pre-loaded models to processors
- Modal helpers: Add load/transfer functions for UniDepthV2 and TrackAnything

Benefits:
- UniDepthV2 loaded once and shared between SLAM and AdaptiveDepthProcessor
- ~15s model pre-loading in Modal snapshot
- ~2s GPU transfer after snapshot restore
- Eliminates duplicate model loading

Tested:
- CPU loading: ✓ UniDepthV2 (11.6s), TrackAnything (5.5s)
- GPU transfer: ✓ All models (~1-1.2s each)
- End-to-end pipeline integration: ✓ Working correctly
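The load/transfer helper pair in modal_helpers, sketched with stand-in modules (names and bodies here are illustrative; the real helpers load UniDepthV2, SAM, and AOT weights):

```python
import time

import torch

def load_models_cpu() -> dict:
    """Snapshot phase: instantiate every heavy model once, on CPU."""
    return {"unidepth": torch.nn.Linear(64, 64),  # stand-ins for the
            "sam": torch.nn.Linear(64, 64)}       # real model classes

def transfer_models_gpu(models: dict) -> dict:
    """Restore phase: a one-time .to('cuda') per model (~1-1.2s each)."""
    for name, m in models.items():
        t0 = time.perf_counter()
        m.to("cuda")
        print(f"{name}: GPU transfer took {time.perf_counter() - t0:.2f}s")
    return models
```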
…atibility

- DroidNet.load_weights() now detects the device from model parameters
- Uses map_location in torch.load() to properly deserialize on CPU or GPU
- Fixes Modal snapshot creation error where CUDA checkpoint couldn't load on CPU

This enables proper CPU-only loading during Modal's snapshot phase.
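The gist of the fix, as a sketch (the class shell is illustrative; `load_weights` mirrors the behavior described above):

```python
import torch

class DroidNet(torch.nn.Module):  # illustrative shell around the real model
    def __init__(self):
        super().__init__()
        self.update = torch.nn.Linear(8, 8)

    def load_weights(self, checkpoint_path: str) -> None:
        # Infer the target device from the model's own parameters so a
        # CUDA-saved checkpoint deserializes cleanly on a CPU-only snapshot.
        device = next(self.parameters()).device
        state_dict = torch.load(checkpoint_path, map_location=device)
        self.load_state_dict(state_dict)
```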
Critical optimization fix:
- SLAM system now uses preloaded UniDepthV2 for keyframe depth instead of creating a new instance
- Pipeline passes UniDepthV2 in preloaded_models dict to SLAM system
- Eliminates duplicate UniDepthV2 loading (was loaded twice before)

This significantly reduces cold start time and memory usage by truly sharing
the UniDepthV2 model between SLAM keyframe depth estimation and post-processing
depth alignment.
- Removed single binary format
- Added save_manifest() to create manifest.json with metadata
- Manifest includes file paths, formats, dtypes, shapes
- Updated save_pose_artifacts() to include poses_inv
- Separate files: rgb (mp4), depth (zipped fp16), poses, intrinsics
- Metadata: depth_range, fps, fov, resolution, aspect_ratio
- Changed pose files: pose.npz -> poses.bin + poses_inv.bin
- Changed intrinsics: intrinsics.npz -> intrinsics.bin
- Added {}_meta.json files with shape, dtype, frame_indices
- All binary files use float32, little-endian (matching JS TypedArrays)
- Updated manifest.json to reference .bin files and metadata
- Updated file structure (npz -> bin files)
- Added JavaScript usage examples with fetch/TypedArray
- Updated Python examples for binary format
- Added byte_order and meta_file references
- Emphasized JS compatibility benefits
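For reference, reading one of these artifacts back in Python (the JS side is analogous with fetch + Float32Array); `read_artifact_bin` and any meta fields beyond `shape` are illustrative:

```python
import json

import numpy as np

def read_artifact_bin(bin_path: str, meta_path: str) -> np.ndarray:
    """Load poses.bin / intrinsics.bin using its {}_meta.json sidecar."""
    with open(meta_path) as f:
        meta = json.load(f)
    # All .bin files are little-endian float32 ('<f4'), the natural layout
    # for a JavaScript Float32Array on the frontend.
    data = np.fromfile(bin_path, dtype="<f4")
    return data.reshape(meta["shape"])

# poses = read_artifact_bin("poses.bin", "poses_meta.json")
```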
- Add quantize_depth, max_depth, depth_image_format config options
- Quantize depth to uint16 and save as PNG/WebP images for better compression
- Backward compatible with fp16 binary format
- Update save/read_depth_artifacts to handle both formats
- Add pack_depth_24bit() and unpack_depth_24bit() helper functions
- Auto-compute min_depth from actual depth data for better precision
- Replace uint16 (65K levels) with 24-bit RGB (16.7M levels) encoding
- Update save_depth_artifacts() to use 24-bit RGB encoding
- Update read_depth_artifacts() with backward compatibility for uint16
- Update save_manifest() to reflect quantized_rgb24 format
- Update config and docstrings to mention 24-bit RGB encoding
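One plausible implementation of the pack/unpack pair (the helper names are from the commit; the exact byte layout below is an assumption):

```python
import numpy as np

def pack_depth_24bit(depth, min_depth: float, max_depth: float) -> np.ndarray:
    """Quantize metric depth into a 24-bit code spread across R, G, B."""
    norm = np.clip((depth - min_depth) / (max_depth - min_depth), 0.0, 1.0)
    code = (norm * (2**24 - 1)).astype(np.uint32)
    return np.stack([(code >> 16) & 0xFF,   # R: high byte
                     (code >> 8) & 0xFF,    # G: middle byte
                     code & 0xFF],          # B: low byte
                    axis=-1).astype(np.uint8)

def unpack_depth_24bit(rgb, min_depth: float, max_depth: float) -> np.ndarray:
    r, g, b = (rgb[..., i].astype(np.uint32) for i in range(3))
    code = (r << 16) | (g << 8) | b
    return code / (2**24 - 1) * (max_depth - min_depth) + min_depth
```

16.7M levels over, say, a 0-100m range gives ~6µm steps, versus ~1.5mm for uint16.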
- Add save_rgb_frames_as_webp() to save RGB frames as individual WebP images
- Add save_depth_frames_individual() to save depth frames without zipping
- Add save_individual_files and rgb_webp_quality to config
- Simplify pack/unpack depth functions and remove verbose logging
- pack_depth_24bit returns RGB but OpenCV expects BGR
- Add cv2.cvtColor to convert RGB->BGR before encoding
- Add cv2.cvtColor to convert BGR->RGB after decoding
- Fixes massive quantization errors (40m) caused by swapped channels
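The round-trip, sketched (PNG shown because it is lossless, so the assertion holds exactly):

```python
import cv2
import numpy as np

rgb = np.random.randint(0, 256, (480, 848, 3), dtype=np.uint8)  # packed depth

# OpenCV encodes/decodes in BGR order; without these conversions the R and B
# depth bytes swap, which is what produced the ~40m quantization errors.
ok, buf = cv2.imencode(".png", cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))
decoded = cv2.cvtColor(cv2.imdecode(buf, cv2.IMREAD_UNCHANGED), cv2.COLOR_BGR2RGB)
assert np.array_equal(rgb, decoded)
```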
- Remove 24-bit RGB quantization (pack_depth_24bit, unpack_depth_24bit)
- Remove save_depth_frames_individual function
- Replace save_depth_artifacts with single-file zlib implementation
- Replace read_depth_artifacts with simplified zlib reader
- Update depth file extension from .zip to .bin.zlib
- Simplify save_artifacts and save_manifest signatures
- Remove quantize_depth, max_depth, depth_image_format parameters
- Add zlib_level parameter (default: 6)

Format: fp16 + zlib in single binary file
Structure: [4-byte metadata length][JSON metadata][zlib-compressed data]
Benefits: 4-5x faster encoding, 2x faster decoding, simpler code
File size: ~116MB for 81 frames @ 704x1280 (1.2x compression)
Precision: fp16 (~5mm mean error, ~15mm max error)
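A sketch of the writer/reader for this container (the framing matches the structure above; the JSON field names are assumptions):

```python
import json
import struct
import zlib

import numpy as np

def save_depth_zlib(path: str, depth: np.ndarray, zlib_level: int = 6) -> None:
    """[4-byte little-endian metadata length][JSON metadata][zlib fp16 data]"""
    payload = zlib.compress(depth.astype("<f2").tobytes(), zlib_level)
    meta = json.dumps({"shape": list(depth.shape), "dtype": "float16"}).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(meta)))
        f.write(meta)
        f.write(payload)

def read_depth_zlib(path: str) -> np.ndarray:
    with open(path, "rb") as f:
        (meta_len,) = struct.unpack("<I", f.read(4))
        meta = json.loads(f.read(meta_len))
        raw = zlib.decompress(f.read())
    return np.frombuffer(raw, dtype="<f2").reshape(meta["shape"])
```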
- Replace save_rgb_artifacts (H.264 video) with save_rgb_frames_as_webp
- Save RGB as individual WebP frames in rgb/<artifact_name>/ directory
- Add rgb_quality parameter to save_artifacts (default: 95)
- Update manifest to reflect webp_frames format with directory structure
- Fix depth file extension bug: remove with_suffix() call that created .bin.bin.zlib
- Depth now correctly saved as .bin.zlib (path already has correct extension)
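A sketch of the per-frame WebP writer (directory layout from the commit; the file-name pattern is an assumption):

```python
from pathlib import Path

import cv2

def save_rgb_frames_as_webp(frames, out_dir: str, quality: int = 95) -> None:
    """Save each RGB frame as rgb/<artifact_name>/<index>.webp."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, frame in enumerate(frames):
        bgr = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)  # imwrite expects BGR
        cv2.imwrite(str(out / f"{i:05d}.webp"), bgr,
                    [cv2.IMWRITE_WEBP_QUALITY, quality])
```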
afridi-morphic and others added 19 commits November 5, 2025 14:59
OpenCV's imencode expects BGR format, so we need to convert
from RGB to BGR before encoding to WebP/PNG
…aving

- Changed from online=True to online=False caching to process all frames upfront
- Added clear_processor_models() to ProcessedVideoStream to free depth models
- Delete slam_pipeline and clear processor models before saving artifacts
- Prevents OOM by ensuring GPU models are unloaded before artifact I/O

This fixes the CUDA OOM error that occurred during artifact saving when
the stream tried to lazily load more frames while models were still in memory.
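The teardown step, sketched (attribute names are assumptions; the point is dropping references before artifact I/O):

```python
import gc

import torch

def clear_processor_models(processors) -> None:
    """Drop GPU model references, then return freed memory to the allocator."""
    for p in processors:
        p.model = None           # assumed attribute holding the depth model
    gc.collect()                 # collect the now-unreferenced modules
    torch.cuda.empty_cache()     # release cached blocks back to the driver

# del slam_pipeline; clear_processor_models(...); then save artifacts.
```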
- Changed back to online=True caching (lazy/streaming mode)
- Clear SLAM pipeline immediately after getting slam_output
- Eager caching all 618 frames at once causes OOM
- Online caching processes frames lazily, more memory efficient

Note: ViPE processes ALL frames in the video, not limited to 200
- Trigger eager caching with explicit iteration to populate cache
- Clear all processor models (GeometryCrafter, UniDepthV2, PriorDA) after caching
- Save artifacts from CPU cache without GPU models loaded
- Reduces GPU memory pressure during artifact saving by ~15-20GB
- Add zlib_level parameter to save_manifest() to track compression state
- Update manifest to report 'fp16_raw_per_frame' when zlib_level=0
- Update manifest to report 'fp16_zlib_per_frame' with level when compressed
- Ensure depth saving correctly skips zlib.compress() when zlib_level=0
- Support for raw fp16 binary depth files (no compression) for faster saving and simpler frontend decoding
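The conditional encode, as a one-function sketch (`encode_depth` is an illustrative name):

```python
import zlib

import numpy as np

def encode_depth(depth: np.ndarray, zlib_level: int = 6) -> bytes:
    """zlib_level=0 writes raw fp16 bytes: faster to save, and the frontend
    can read the buffer directly instead of inflating it first."""
    raw = depth.astype("<f2").tobytes()
    return raw if zlib_level == 0 else zlib.compress(raw, zlib_level)
```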
