
Conversation

@shivin101

No description provided.

Shivin Yadav and others added 30 commits October 24, 2025 10:07
- Add comprehensive profiling to SLAM system with record_function markers
- Add profiling to pipeline execution with component-wise tracking
- Add environment variable VIPE_PROFILE=1 to enable/disable profiling
- Add profiler statistics formatter with structured console output
- Track key components: SLAM loop, bundle adjustment, depth estimation
- Default to disabled to avoid performance overhead in production
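A minimal sketch of the opt-in profiling pattern these commits describe (the `profile_section` helper and its wiring are illustrative, not the PR's exact code):

```python
import os
from contextlib import nullcontext

from torch.profiler import record_function

# Profiling is opt-in via VIPE_PROFILE=1, so production runs pay no overhead.
PROFILING_ENABLED = os.environ.get("VIPE_PROFILE", "0") == "1"

def profile_section(name: str):
    """Return a record_function marker when profiling is on, else a no-op."""
    return record_function(name) if PROFILING_ENABLED else nullcontext()

# Usage inside the SLAM loop, one marker per tracked component:
with profile_section("slam/bundle_adjustment"):
    ...  # run bundle adjustment
```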
- PyTorch does not support float16 operations on CPU
- Load models with float32 on CPU for Modal snapshot
- Convert to float16 on GPU for efficiency
- Add device parameter to MogeModel.__init__ (defaults to 'cuda')
- Update modal_helpers to load MogeModel on CPU during snapshot phase
- Ensure input tensors are moved to correct device in estimate method
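A sketch of the device-aware loading the two commits above describe, using a stand-in module (`DepthStub` is hypothetical; the real change is to `MogeModel.__init__`):

```python
import torch

class DepthStub(torch.nn.Module):  # stand-in for MogeModel
    def __init__(self, device: str = "cuda"):
        super().__init__()
        self.device = torch.device(device)
        # CPU PyTorch lacks float16 kernels, so snapshot loading stays fp32.
        dtype = torch.float32 if self.device.type == "cpu" else torch.float16
        self.backbone = torch.nn.Linear(128, 128).to(self.device, dtype)

    def estimate(self, image: torch.Tensor) -> torch.Tensor:
        # Move inputs to the model's device/dtype before inference.
        image = image.to(self.device, next(self.parameters()).dtype)
        return self.backbone(image)

# Snapshot phase: build on CPU in fp32; after restore, convert to fp16 on GPU.
model = DepthStub(device="cpu")
if torch.cuda.is_available():
    model = model.to("cuda", torch.float16)
    model.device = torch.device("cuda")
```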
- Add prebuild_slam_components_cpu() to build SLAM on CPU during snapshot
- Add move_slam_to_gpu() to transfer SLAM components to GPU after restore
- Update DefaultAnnotationPipeline to accept prebuilt_slam_system parameter
- Use pre-built SLAM system when available, avoiding rebuild on every request
- Add camera_type parameter to prebuild_slam_components_cpu()
- Fix missing camera_type error during SLAM component initialization
- Pass camera_type from pipeline.init config to SLAM build
- Change default width from 854 to 848 (848 % 8 == 0)
- GraphBuffer requires dimensions divisible by 8
- Fixes AssertionError during SLAM component prebuild
- Add SE3.Identity(1) rig initialization
- Required by _build_components_cpu() which accesses self.rig
- Rig will be overridden during actual run() with real values
- Fixes AttributeError: 'SLAMSystem' object has no attribute 'rig'
- Set has_init_pose=False as default for prebuild
- Required by SLAMFrontend.__init__() which accesses args.has_init_pose
- Will be overridden during actual run() based on video attributes
- Fixes ConfigAttributeError: Missing key has_init_pose
Problem: SLAM GraphBuffer is video-specific (dimensions), so reusing
entire pipeline/SLAM system across videos causes dimension mismatches.

Solution: Only preload heavy models (DroidNet) in snapshot, create
fresh SLAM system for each video with correct dimensions.

Changes:
- Replace prebuild_slam_components_cpu() with preload_slam_models_cpu()
- Only load DroidNet model, not GraphBuffer or other video-specific components
- SLAMSystem now accepts preloaded_models parameter
- Pipeline creates fresh SLAM system per video using preloaded models
- Remove pipeline reuse optimization (video-specific state)

This fixes: RuntimeError: expanded size mismatch (848 vs 592)
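A rough sketch of the split this commit lands (function and parameter names come from the message above; the stub bodies are assumptions):

```python
import torch

class SLAMSystem:  # stand-in: the real class owns video-specific buffers
    def __init__(self, width: int, height: int, preloaded_models: dict | None = None):
        assert width % 8 == 0 and height % 8 == 0, "GraphBuffer needs /8 dims"
        self.models = preloaded_models or {}
        # GraphBuffer and other per-video state get built fresh here, sized
        # to this video, which avoids the 848-vs-592 expansion mismatch.

def preload_slam_models_cpu() -> dict:
    """Snapshot phase: load only video-agnostic weights (DroidNet) on CPU."""
    droidnet = torch.nn.Linear(8, 8)  # stand-in for the real DroidNet
    return {"droidnet": droidnet}

models = preload_slam_models_cpu()          # once, in the snapshot
if torch.cuda.is_available():
    for m in models.values():
        m.to("cuda")                        # once, after restore
slam = SLAMSystem(848, 480, preloaded_models=models)  # fresh per video
```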
Implement CPU pre-loading and GPU transfer for UniDepthV2 and TrackAnything
models to optimize Modal cold start times.

Changes:
- UniDepthV2: Add device parameter to support CPU/GPU placement
- TrackAnything: Add preloaded model support for SAM and AOT
- Pipeline: Pass pre-loaded models to processors
- Modal helpers: Add load/transfer functions for UniDepthV2 and TrackAnything

Benefits:
- UniDepthV2 loaded once and shared between SLAM and AdaptiveDepthProcessor
- ~15s model pre-loading in Modal snapshot
- ~2s GPU transfer after snapshot restore
- Eliminates duplicate model loading

Tested:
- CPU loading: ✓ UniDepthV2 (11.6s), TrackAnything (5.5s)
- GPU transfer: ✓ All models (~1-1.2s each)
- End-to-end pipeline integration: ✓ Working correctly
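The load/transfer helper pair in modal_helpers, sketched with stand-in modules (names and bodies here are illustrative; the real helpers load UniDepthV2, SAM, and AOT weights):

```python
import time

import torch

def load_models_cpu() -> dict:
    """Snapshot phase: instantiate every heavy model once, on CPU."""
    return {"unidepth": torch.nn.Linear(64, 64),  # stand-ins for the
            "sam": torch.nn.Linear(64, 64)}       # real model classes

def transfer_models_gpu(models: dict) -> dict:
    """Restore phase: a one-time .to('cuda') per model (~1-1.2s each)."""
    for name, m in models.items():
        t0 = time.perf_counter()
        m.to("cuda")
        print(f"{name}: GPU transfer took {time.perf_counter() - t0:.2f}s")
    return models
```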
…atibility

- DroidNet.load_weights() now detects the device from model parameters
- Uses map_location in torch.load() to properly deserialize on CPU or GPU
- Fixes Modal snapshot creation error where CUDA checkpoint couldn't load on CPU

This enables proper CPU-only loading during Modal's snapshot phase.
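The gist of the fix, as a sketch (the class shell is illustrative; `load_weights` mirrors the behavior described above):

```python
import torch

class DroidNet(torch.nn.Module):  # illustrative shell around the real model
    def __init__(self):
        super().__init__()
        self.update = torch.nn.Linear(8, 8)

    def load_weights(self, checkpoint_path: str) -> None:
        # Infer the target device from the model's own parameters so a
        # CUDA-saved checkpoint deserializes cleanly on a CPU-only snapshot.
        device = next(self.parameters()).device
        state_dict = torch.load(checkpoint_path, map_location=device)
        self.load_state_dict(state_dict)
```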
Critical optimization fix:
- SLAM system now uses preloaded UniDepthV2 for keyframe depth instead of creating a new instance
- Pipeline passes UniDepthV2 in preloaded_models dict to SLAM system
- Eliminates duplicate UniDepthV2 loading (was loaded twice before)

This significantly reduces cold start time and memory usage by truly sharing
the UniDepthV2 model between SLAM keyframe depth estimation and post-processing
depth alignment.
- Removed single binary format
- Added save_manifest() to create manifest.json with metadata
- Manifest includes file paths, formats, dtypes, shapes
- Updated save_pose_artifacts() to include poses_inv
- Separate files: rgb (mp4), depth (zipped fp16), poses, intrinsics
- Metadata: depth_range, fps, fov, resolution, aspect_ratio
- Changed pose files: pose.npz -> poses.bin + poses_inv.bin
- Changed intrinsics: intrinsics.npz -> intrinsics.bin
- Added {}_meta.json files with shape, dtype, frame_indices
- All binary files use float32, little-endian (matching JS TypedArrays)
- Updated manifest.json to reference .bin files and metadata
- Updated file structure (npz -> bin files)
- Added JavaScript usage examples with fetch/TypedArray
- Updated Python examples for binary format
- Added byte_order and meta_file references
- Emphasized JS compatibility benefits
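For reference, reading one of these artifacts back in Python (the JS side is analogous with fetch + Float32Array); `read_artifact_bin` and any meta fields beyond `shape` are illustrative:

```python
import json

import numpy as np

def read_artifact_bin(bin_path: str, meta_path: str) -> np.ndarray:
    """Load poses.bin / intrinsics.bin using its {}_meta.json sidecar."""
    with open(meta_path) as f:
        meta = json.load(f)
    # All .bin files are little-endian float32 ('<f4'), the natural layout
    # for a JavaScript Float32Array on the frontend.
    data = np.fromfile(bin_path, dtype="<f4")
    return data.reshape(meta["shape"])

# poses = read_artifact_bin("poses.bin", "poses_meta.json")
```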
- Add quantize_depth, max_depth, depth_image_format config options
- Quantize depth to uint16 and save as PNG/WebP images for better compression
- Backward compatible with fp16 binary format
- Update save/read_depth_artifacts to handle both formats
- Add pack_depth_24bit() and unpack_depth_24bit() helper functions
- Auto-compute min_depth from actual depth data for better precision
- Replace uint16 (65K levels) with 24-bit RGB (16.7M levels) encoding
- Update save_depth_artifacts() to use 24-bit RGB encoding
- Update read_depth_artifacts() with backward compatibility for uint16
- Update save_manifest() to reflect quantized_rgb24 format
- Update config and docstrings to mention 24-bit RGB encoding
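One plausible implementation of the pack/unpack pair (the helper names are from the commit; the exact byte layout below is an assumption):

```python
import numpy as np

def pack_depth_24bit(depth, min_depth: float, max_depth: float) -> np.ndarray:
    """Quantize metric depth into a 24-bit code spread across R, G, B."""
    norm = np.clip((depth - min_depth) / (max_depth - min_depth), 0.0, 1.0)
    code = (norm * (2**24 - 1)).astype(np.uint32)
    return np.stack([(code >> 16) & 0xFF,   # R: high byte
                     (code >> 8) & 0xFF,    # G: middle byte
                     code & 0xFF],          # B: low byte
                    axis=-1).astype(np.uint8)

def unpack_depth_24bit(rgb, min_depth: float, max_depth: float) -> np.ndarray:
    r, g, b = (rgb[..., i].astype(np.uint32) for i in range(3))
    code = (r << 16) | (g << 8) | b
    return code / (2**24 - 1) * (max_depth - min_depth) + min_depth
```

16.7M levels over, say, a 0-100m range gives ~6µm steps, versus ~1.5mm for uint16.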
- Add save_rgb_frames_as_webp() to save RGB frames as individual WebP images
- Add save_depth_frames_individual() to save depth frames without zipping
- Add save_individual_files and rgb_webp_quality to config
- Simplify pack/unpack depth functions and remove verbose logging
- pack_depth_24bit returns RGB but OpenCV expects BGR
- Add cv2.cvtColor to convert RGB->BGR before encoding
- Add cv2.cvtColor to convert BGR->RGB after decoding
- Fixes massive quantization errors (40m) caused by swapped channels
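The round-trip, sketched (PNG shown because it is lossless, so the assertion holds exactly):

```python
import cv2
import numpy as np

rgb = np.random.randint(0, 256, (480, 848, 3), dtype=np.uint8)  # packed depth

# OpenCV encodes/decodes in BGR order; without these conversions the R and B
# depth bytes swap, which is what produced the ~40m quantization errors.
ok, buf = cv2.imencode(".png", cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))
decoded = cv2.cvtColor(cv2.imdecode(buf, cv2.IMREAD_UNCHANGED), cv2.COLOR_BGR2RGB)
assert np.array_equal(rgb, decoded)
```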
- Remove 24-bit RGB quantization (pack_depth_24bit, unpack_depth_24bit)
- Remove save_depth_frames_individual function
- Replace save_depth_artifacts with single-file zlib implementation
- Replace read_depth_artifacts with simplified zlib reader
- Update depth file extension from .zip to .bin.zlib
- Simplify save_artifacts and save_manifest signatures
- Remove quantize_depth, max_depth, depth_image_format parameters
- Add zlib_level parameter (default: 6)

Format: fp16 + zlib in single binary file
Structure: [4-byte metadata length][JSON metadata][zlib-compressed data]
Benefits: 4-5x faster encoding, 2x faster decoding, simpler code
File size: ~116MB for 81 frames @ 704x1280 (1.2x compression)
Precision: fp16 (~5mm mean error, ~15mm max error)
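A sketch of the writer/reader for this container (the framing matches the structure above; the JSON field names are assumptions):

```python
import json
import struct
import zlib

import numpy as np

def save_depth_zlib(path: str, depth: np.ndarray, zlib_level: int = 6) -> None:
    """[4-byte little-endian metadata length][JSON metadata][zlib fp16 data]"""
    payload = zlib.compress(depth.astype("<f2").tobytes(), zlib_level)
    meta = json.dumps({"shape": list(depth.shape), "dtype": "float16"}).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(meta)))
        f.write(meta)
        f.write(payload)

def read_depth_zlib(path: str) -> np.ndarray:
    with open(path, "rb") as f:
        (meta_len,) = struct.unpack("<I", f.read(4))
        meta = json.loads(f.read(meta_len))
        raw = zlib.decompress(f.read())
    return np.frombuffer(raw, dtype="<f2").reshape(meta["shape"])
```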
- Replace save_rgb_artifacts (H.264 video) with save_rgb_frames_as_webp
- Save RGB as individual WebP frames in rgb/<artifact_name>/ directory
- Add rgb_quality parameter to save_artifacts (default: 95)
- Update manifest to reflect webp_frames format with directory structure
- Fix depth file extension bug: remove with_suffix() call that created .bin.bin.zlib
- Depth now correctly saved as .bin.zlib (path already has correct extension)
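A sketch of the per-frame WebP writer (directory layout from the commit; the file-name pattern is an assumption):

```python
from pathlib import Path

import cv2

def save_rgb_frames_as_webp(frames, out_dir: str, quality: int = 95) -> None:
    """Save each RGB frame as rgb/<artifact_name>/<index>.webp."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, frame in enumerate(frames):
        bgr = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)  # imwrite expects BGR
        cv2.imwrite(str(out / f"{i:05d}.webp"), bgr,
                    [cv2.IMWRITE_WEBP_QUALITY, quality])
```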
afridi-morphic and others added 19 commits November 5, 2025 14:59
OpenCV's imencode expects BGR format, so we need to convert
from RGB to BGR before encoding to WebP/PNG
…aving

- Changed from online=True to online=False caching to process all frames upfront
- Added clear_processor_models() to ProcessedVideoStream to free depth models
- Delete slam_pipeline and clear processor models before saving artifacts
- Prevents OOM by ensuring GPU models are unloaded before artifact I/O

This fixes the CUDA OOM error that occurred during artifact saving when
the stream tried to lazily load more frames while models were still in memory.
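The teardown step, sketched (attribute names are assumptions; the point is dropping references before artifact I/O):

```python
import gc

import torch

def clear_processor_models(processors) -> None:
    """Drop GPU model references, then return freed memory to the allocator."""
    for p in processors:
        p.model = None           # assumed attribute holding the depth model
    gc.collect()                 # collect the now-unreferenced modules
    torch.cuda.empty_cache()     # release cached blocks back to the driver

# del slam_pipeline; clear_processor_models(...); then save artifacts.
```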
- Changed back to online=True caching (lazy/streaming mode)
- Clear SLAM pipeline immediately after getting slam_output
- Eager caching all 618 frames at once causes OOM
- Online caching processes frames lazily, more memory efficient

Note: ViPE processes ALL frames in the video, not limited to 200
- Trigger eager caching with explicit iteration to populate cache
- Clear all processor models (GeometryCrafter, UniDepthV2, PriorDA) after caching
- Save artifacts from CPU cache without GPU models loaded
- Reduces GPU memory pressure during artifact saving by ~15-20GB
- Add zlib_level parameter to save_manifest() to track compression state
- Update manifest to report 'fp16_raw_per_frame' when zlib_level=0
- Update manifest to report 'fp16_zlib_per_frame' with level when compressed
- Ensure depth saving correctly skips zlib.compress() when zlib_level=0
- Support for raw fp16 binary depth files (no compression) for faster saving and simpler frontend decoding
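The conditional encode, as a one-function sketch (`encode_depth` is an illustrative name):

```python
import zlib

import numpy as np

def encode_depth(depth: np.ndarray, zlib_level: int = 6) -> bytes:
    """zlib_level=0 writes raw fp16 bytes: faster to save, and the frontend
    can read the buffer directly instead of inflating it first."""
    raw = depth.astype("<f2").tobytes()
    return raw if zlib_level == 0 else zlib.compress(raw, zlib_level)
```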
