
[BugFix] fix image_ids bug in distributed + packed mode #800

Open

1145284121 wants to merge 1 commit into nerfstudio-project:main from 1145284121:bugfix

Conversation


@1145284121 commented Sep 20, 2025

What does this PR do?

This PR fixes a bug where image_ids was not updated after the sparse_all_to_all operation, which causes shape-mismatch errors in isect_tiles() when running packed mode with multi-GPU training.

Problem Description

Test Command

CUDA_VISIBLE_DEVICES=1,2,3,5 python examples/simple_trainer.py default \
    --data_dir data/360_v2/garden/ --data_factor 4 \
    --result_dir ./results/garden \
    --packed

When running distributed training with packed mode on multiple GPUs, the program crashes during the isect_tiles stage with the following error:

RuntimeError: CUDA error: operation not supported on global/shared address space
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Root Cause

The issue occurs because image_ids is not updated after the sparse_all_to_all operation, so its length no longer matches the other per-Gaussian tensors such as means2d.

Specifically, the failing call site:

  torch.distributed.barrier()
  tiles_per_gauss, isect_ids, flatten_ids = isect_tiles(
      means2d,   # means2d.shape: torch.Size([67205, 2])
      radii,
      depths,
      tile_size,
      tile_width,
      tile_height,
      segmented=segmented,
      packed=packed,
      n_images=I,
      image_ids=image_ids,  # image_ids.shape : torch.Size([66317])
      gaussian_ids=gaussian_ids, # gaussian_ids.shape : torch.Size([67205])
  )
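The mismatch above can be reproduced schematically without gsplat: in packed mode, every per-Gaussian tensor passed to isect_tiles must share the same leading dimension. The checker below is illustrative only, not part of gsplat's API.

```python
# Illustrative sketch (plain Python, no gsplat/torch dependency): in packed
# mode, every per-Gaussian tensor must share the same leading dimension.
def check_packed_shapes(leading_dims):
    """leading_dims: dict mapping tensor name -> leading dimension."""
    if len(set(leading_dims.values())) > 1:
        raise ValueError(f"packed-mode shape mismatch: {leading_dims}")

# The shapes from the crash above: image_ids was not refreshed after
# sparse_all_to_all, so it lags behind means2d and gaussian_ids.
try:
    check_packed_shapes({"means2d": 67205, "image_ids": 66317, "gaussian_ids": 67205})
except ValueError as e:
    print(e)
```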

Note

When a rank's batch index is not 0, the update of image_ids must also take batch_ids into account to keep the mapping between images and their corresponding batch indices correct.

@liruilong940607
Collaborator

Thanks for identifying the bug. However, I think the proper fix is to replace all occurrences of camera_ids with image_ids in this chunk of the code:
https://github.com/1145284121/gsplat/blob/f9e61da389a7f5f98d8ccbe33d4bab302bd1aca7/gsplat/rendering.py#L532-L582

Explanation: image_ids is basically just the extension of camera_ids to the case where multiple scenes (say B) are rendered in the same pass, each with V cameras. Then camera_ids has range [0, V), and image_ids has range [0, B*V).
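That index relationship can be sketched in a few lines; the helper names below are illustrative, not gsplat identifiers.

```python
B, V = 2, 3  # B scenes rendered in one pass, each with V cameras

def to_image_id(scene_id, camera_id, V):
    # camera_ids live in [0, V); image_ids extend them to [0, B*V)
    return scene_id * V + camera_id

def from_image_id(image_id, V):
    # inverse mapping: recover (scene_id, camera_id)
    return divmod(image_id, V)

assert to_image_id(1, 2, V) == 5           # last camera of the second scene
assert from_image_id(5, V) == (1, 2)
assert all(0 <= to_image_id(b, v, V) < B * V
           for b in range(B) for v in range(V))
```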

