
[Issue]: Issue on simple MLP with large size inference on navi48 #1785

@thori-amd

Description

Problem Description

I have an issue with a simple PyTorch MLP program when running full HD (1920x1080) inference on navi48.

My environment:

  • OS: Windows 11 Pro 24H2
  • CPU: AMD Ryzen 9 9950X3D 16-Core Processor
  • GPU: AMD Radeon RX 9070 XT
  • GPU Driver Version: 32.0.21025.10016
  • (Get-WmiObject Win32_OperatingSystem).Version:
    • 10.0.26100
  • (Get-WmiObject win32_Processor).Name:
    • AMD Ryzen 9 9950X3D 16-Core Processor
  • (Get-WmiObject win32_VideoController).Name:
    • AMD Radeon(TM) Graphics
    • AMD Radeon RX 9070 XT
  • Python: 3.11


A minimal program that reproduces the issue:

requirements.txt:

--index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/
--extra-index-url https://pypi.org/simple

torch==2.10.0a+rocm7.10.0a20251009
tqdm
numpy
matplotlib

main.py:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")

width = 1920
height = 1080

# simple MLP
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.fc0 = nn.Linear(3, 256)
        self.fc1 = nn.Linear(256, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 256)
        self.fc4 = nn.Linear(256, 3)
    def forward(self, x):
        x = torch.relu(self.fc0(x))
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.sigmoid(self.fc4(x))
        return x

BATCH_SIZE = 1024
ITERATION = 100

# all red
target = torch.tensor([1.0, 0.0, 0.0]).to(device)
target = target.repeat(BATCH_SIZE).reshape(BATCH_SIZE, 3)

# setup training mlp
mlp = MLP().to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.1)

# train random value to red
mlp.train()
for epoch in tqdm(range(ITERATION)):
    x = torch.rand_like(target).to(device)
    optimizer.zero_grad()
    x = mlp(x)
    loss = criterion(x, target)
    loss.backward()
    optimizer.step()

# eval mlp from random image to red image
with torch.no_grad():
    mlp.eval()
    x = torch.rand(height * width, 3).to(device)
    
    img = mlp(x)
    # torch.cuda.synchronize()

    img_py = img.reshape(height, width, 3).detach().cpu().numpy()
    plt.imsave("./test-img.png", img_py)

This program trains an MLP to map randomly colored inputs to pure red, then runs inference at full HD resolution (1920x1080). The expectation is that the resulting image is entirely red.
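
To make "entirely red" checkable numerically instead of by inspecting the PNG, a small check could be appended at the end of main.py (just a sketch; the 0.05 tolerance is arbitrary):

# sketch: count pixels that deviate from pure red (1, 0, 0)
with torch.no_grad():
    red = torch.tensor([1.0, 0.0, 0.0], device=img.device)
    off = (img - red).abs().max(dim=1).values > 0.05  # per-pixel max channel error
    print(f"pixels off from red: {off.sum().item()} / {img.shape[0]}")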

However, when run on navi48, the output image is not entirely red:
[Image: navi48 output at 1920x1080]

When I run the program with device = torch.device("cpu"), I get a completely red image as expected, so this looks like a GPU-related issue.
[Image: CPU output, completely red]
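
To rule out a rendering or colormap artifact, the GPU and CPU outputs of the same weights on the same input can also be compared numerically (a sketch only; the 1e-3 threshold is arbitrary):

# sketch: compare GPU and CPU inference of the same model on the same input
with torch.no_grad():
    mlp.eval()
    x_cpu = torch.rand(height * width, 3)      # keep a CPU copy of the input
    out_gpu = mlp(x_cpu.to(device)).cpu()      # GPU pass (device as defined in main.py)
    out_cpu = mlp.to("cpu")(x_cpu)             # CPU pass with the same weights
    diff = (out_gpu - out_cpu).abs()
    print("max abs diff:", diff.max().item())
    print("rows with diff > 1e-3:", (diff.max(dim=1).values > 1e-3).sum().item())
    mlp.to(device)                             # move the model back to the GPU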

There is no issue when the inference resolution is lowered to 960x540.
[Image: 960x540 output, completely red]
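
Since 960x540 (518,400 rows) works but 1920x1080 (2,073,600 rows) does not, the failure seems to depend on the size of the single inference call. Running the large input in chunks might serve as a workaround and help locate the threshold (a sketch; the chunk size of 262144 rows is arbitrary):

# sketch: chunked inference instead of one very large call
with torch.no_grad():
    mlp.eval()
    x = torch.rand(height * width, 3).to(device)
    parts = [mlp(chunk) for chunk in torch.split(x, 262144)]  # split along dim 0
    img = torch.cat(parts, dim=0)
    img_py = img.reshape(height, width, 3).cpu().numpy()
    plt.imsave("./test-img-chunked.png", img_py)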

Operating System

Windows 11 Pro 24H2

CPU

AMD Ryzen 9 9950X3D

GPU

AMD Radeon RX 9070 XT

ROCm Version

ROCm 6.4.0

ROCm Component

No response

Steps to Reproduce

Run the above program.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
