
[BugFix] fix image_ids bug in distributed + packed mode #800

Open

1145284121 wants to merge 1 commit into nerfstudio-project:main from 1145284121:bugfix

Conversation


@1145284121 commented Sep 20, 2025

What does this PR do?

This PR fixes a bug where image_ids was not updated after the sparse_all_to_all operation, which causes shape-mismatch errors in isect_tiles() when running packed mode with multi-GPU training.

Problem Description

Test Command

CUDA_VISIBLE_DEVICES=1,2,3,5 python examples/simple_trainer.py default \
    --data_dir data/360_v2/garden/ --data_factor 4 \
    --result_dir ./results/garden \
    --packed

When running distributed training with packed mode on multiple GPUs, the program crashes during the isect_tiles stage with the following error:

RuntimeError: CUDA error: operation not supported on global/shared address space
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Root Cause

The issue occurs because image_ids is not updated after the sparse_all_to_all operation, so its length no longer matches the other per-Gaussian tensors such as means2d.

Specifically, the failing call site:

  torch.distributed.barrier()
  tiles_per_gauss, isect_ids, flatten_ids = isect_tiles(
      means2d,   # means2d.shape: torch.Size([67205, 2])
      radii,
      depths,
      tile_size,
      tile_width,
      tile_height,
      segmented=segmented,
      packed=packed,
      n_images=I,
      image_ids=image_ids,  # image_ids.shape : torch.Size([66317])
      gaussian_ids=gaussian_ids, # gaussian_ids.shape : torch.Size([67205])
  )
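The mismatch above can be reproduced schematically without gsplat: in packed mode, every per-Gaussian tensor passed to isect_tiles must share the same leading dimension. The checker below is illustrative only, not part of gsplat's API.

```python
# Illustrative sketch (plain Python, no gsplat/torch dependency): in packed
# mode, every per-Gaussian tensor must share the same leading dimension.
def check_packed_shapes(leading_dims):
    """leading_dims: dict mapping tensor name -> leading dimension."""
    if len(set(leading_dims.values())) > 1:
        raise ValueError(f"packed-mode shape mismatch: {leading_dims}")

# The shapes from the crash above: image_ids was not refreshed after
# sparse_all_to_all, so it lags behind means2d and gaussian_ids.
try:
    check_packed_shapes({"means2d": 67205, "image_ids": 66317, "gaussian_ids": 67205})
except ValueError as e:
    print(e)
```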

Note

When a rank's batch index is not 0, the update of image_ids must also take batch_ids into account to keep the mapping between images and their corresponding batch indices correct.

@liruilong940607
Collaborator

Thanks for identifying the bug. However, I think the proper fix is to replace all occurrences of camera_ids with image_ids in this chunk of the code:
https://github.com/1145284121/gsplat/blob/f9e61da389a7f5f98d8ccbe33d4bab302bd1aca7/gsplat/rendering.py#L532-L582

Explanation: image_ids is basically just the extension of camera_ids to the case where multiple scenes (say B) are rendered in the same pass, each with V cameras. Then camera_ids has range [0, V), and image_ids has range [0, B*V).
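That index relationship can be sketched in a few lines; the helper names below are illustrative, not gsplat identifiers.

```python
B, V = 2, 3  # B scenes rendered in one pass, each with V cameras

def to_image_id(scene_id, camera_id, V):
    # camera_ids live in [0, V); image_ids extend them to [0, B*V)
    return scene_id * V + camera_id

def from_image_id(image_id, V):
    # inverse mapping: recover (scene_id, camera_id)
    return divmod(image_id, V)

assert to_image_id(1, 2, V) == 5           # last camera of the second scene
assert from_image_id(5, V) == (1, 2)
assert all(0 <= to_image_id(b, v, V) < B * V
           for b in range(B) for v in range(V))
```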

