GEMM_RS_OVERLAP's sporadic precision issue #55

@llying-001

The GEMM reduce-scatter overlap method from Triton Distributed, as integrated into this codebase, intermittently produces incorrect results on one specific machine (gpu-44 in the AAC cluster). It runs correctly on other machines, even though the hardware, ROCm version, Docker image, and code are all identical. Even on gpu-44, the behavior of the unit test case varies from one run to another.
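
To make the intermittent behavior easier to quantify, here is a minimal sketch of a repeat-and-compare harness. It is not the project's unit test. It assumes the usual tensor-parallel layout (each rank holds a K-shard of A and B, partial products are summed, and output rows are scattered evenly across ranks). All names (`reference_gemm_rs`, `count_failures`, `random_matrix`) and the tolerance `atol` are hypothetical, and the pure-Python reference stands in for whatever trusted baseline is available. The `candidate` callable is where the real overlapped kernel would be wrapped.

```python
import random


def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reference_gemm_rs(a_shards, b_shards):
    """Assumed semantics: sum per-rank partial GEMMs, then split rows across ranks."""
    world = len(a_shards)
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    m, n = len(partials[0]), len(partials[0][0])
    full = [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]
    rows = m // world  # assumes M is divisible by the world size
    return [full[r * rows:(r + 1) * rows] for r in range(world)]


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


def count_failures(candidate, a_shards, b_shards, iters=100, atol=1e-3):
    """Run `candidate` repeatedly on the same inputs; record (iteration, worst error)."""
    expected = reference_gemm_rs(a_shards, b_shards)
    failures = []
    for it in range(iters):
        got = candidate(a_shards, b_shards)
        worst = max(max_abs_diff(g, e) for g, e in zip(got, expected))
        if worst > atol:
            failures.append((it, worst))
    return failures


def random_matrix(rng, rows, cols):
    """Hypothetical helper: small random matrix for the harness."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Example inputs: 2 ranks, M=4, K=3 per rank, N=2.
rng = random.Random(0)
a_shards = [random_matrix(rng, 4, 3) for _ in range(2)]
b_shards = [random_matrix(rng, 3, 2) for _ in range(2)]
```

Running `count_failures` with the same inputs on gpu-44 and on a known-good machine would give a failure rate and the iterations at which mismatches occur, which may help tell a timing or synchronization problem apart from a deterministic numerical one.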
