
[distributed] Accuracy issue for test_composability.py #2752

@zxd1997066

🐛 Describe the bug

Split from #2371.
Please get the wheels from https://github.com/intel/torch-xpu-ops/actions/runs/21122359678, or download them with the gh CLI:

gh run download 21122359678 --repo intel/torch-xpu-ops --name Torch-XPU-Wheel-1826-21122359678-1 --dir path --pattern "*.zip"
git clone -b distributed_2.10 https://github.com/daisyden/pytorch.git
cd pytorch
pip install -r requirements.txt
pip install pytest expecttest
pytest -v test/distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_apply_optim_in_backward
pytest -v test/distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_apply_optim_in_backward_grad_as_bucket_view_false
___________________________________ TestDistBackendWithSpawn.test_ddp_apply_optim_in_backward ____________________________________
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 814, in wrapper
    self._join_processes(fn)
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1083, in _join_processes
    self._check_return_codes(fn, elapsed_time)
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1123, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 969, in run_test
    getattr(self, test_name)()
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 816, in wrapper
    fn()
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3359, in wrapper
    method(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 233, in wrapper
    return func(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 4934, in test_ddp_apply_optim_in_backward
    self._test_ddp_apply_optim_in_backward(
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 4916, in _test_ddp_apply_optim_in_backward
    self.assertEqual(
  File "/home/sdp/miniforge3/envs/xccl_test/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4354, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 16 / 3072 (0.5%)
Greatest absolute difference: 0.007080078125 at index (206, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.0726951081305742e-06 at index (90, 1) (up to 1.3e-06 allowed)
Params not equal at iteration 4
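
To help read the numbers in the assertion message, here is a minimal sketch in plain Python (no torch required) of the elementwise closeness rule that such tolerance checks commonly use, `|actual - expected| <= atol + rtol * |expected|`. The helper `is_close` and the constant names are hypothetical, introduced only for illustration; the four constants are copied from the output above. The sketch also assumes that the reported relative difference is `|actual - expected| / |expected|`, and the last step assumes the parameters are float32. Neither assumption is confirmed by the report.

```python
# Values copied from the assertion message above.
ATOL = 1e-05                            # "up to 1e-05 allowed"
RTOL = 1.3e-06                          # "up to 1.3e-06 allowed"
MAX_ABS_DIFF = 0.007080078125           # greatest absolute difference
MAX_REL_DIFF = 3.0726951081305742e-06   # greatest relative difference


def is_close(actual, expected, rtol=RTOL, atol=ATOL):
    """Hypothetical helper: the usual elementwise closeness criterion."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# No element has a relative difference above MAX_REL_DIFF, so the element
# with the greatest absolute difference must have at least this magnitude
# (assuming relative difference = abs diff / |expected|).
min_magnitude = MAX_ABS_DIFF / MAX_REL_DIFF

# Tolerance the criterion grants to an element of exactly that magnitude.
allowed_at_min = ATOL + RTOL * min_magnitude

# Assumption: float32 parameters. Adjacent float32 values in [2048, 4096)
# are 2**-12 apart, so this expresses the difference in representable steps.
diff_in_steps = MAX_ABS_DIFF / 2.0 ** -12

print(f"minimum |expected| at the worst element: {min_magnitude:.1f}")
print(f"allowed difference at that magnitude:    {allowed_at_min:.6f}")
print(f"observed difference in 2**-12 steps:     {diff_in_steps}")
```

Under these assumptions, the mismatched parameters have magnitudes in the thousands, and the gap is a small number of float32 steps rather than a gross divergence. That reading is only a hint for triage, not a diagnosis.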

Versions

https://github.com/daisyden/pytorch/tree/distributed_2.10

Labels

bug, module: distributed
