Problem Description
A simple PyTorch MLP program produces incorrect output when running full HD (1920x1080) inference on navi48.
My environment:
- OS: Windows 11 Pro 24H2
- CPU: AMD Ryzen 9 9950X3D 16-Core Processor
- GPU: AMD Radeon RX 9070 XT
- GPU Driver Version: 32.0.21025.10016
- (Get-WmiObject Win32_OperatingSystem).Version:
  - 10.0.26100
- (Get-WmiObject win32_Processor).Name:
  - AMD Ryzen 9 9950X3D 16-Core Processor
- (Get-WmiObject win32_VideoController).Name:
  - AMD Radeon(TM) Graphics
  - AMD Radeon RX 9070 XT
- Python: 3.11
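To confirm which PyTorch build and GPU the interpreter actually picks up (the system lists two video controllers), a small check along these lines may help. This is only a sketch: `describe_torch_env` is a hypothetical helper name, and it reports what PyTorch itself exposes rather than anything from the driver.

```python
import platform

import torch


def describe_torch_env():
    """Collect the PyTorch build and device details relevant to this report."""
    info = {
        "python": platform.python_version(),
        "torch": torch.__version__,
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        "hip": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
        "devices": [],
    }
    if info["gpu_available"]:
        # ROCm builds expose AMD GPUs through the torch.cuda namespace.
        for index in range(torch.cuda.device_count()):
            info["devices"].append(torch.cuda.get_device_name(index))
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

Attaching the printed output to the report would show the ROCm/HIP version the installed wheel was built against and which device index maps to the RX 9070 XT.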
A simple reproducible program is here:
requirements.txt:
--index-url https://rocm.nightlies.amd.com/v2/gfx120X-all/
--extra-index-url https://pypi.org/simple
torch==2.10.0a+rocm7.10.0a20251009
tqdm
numpy
matplotlib
main.py:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")

width = 1920
height = 1080

# simple MLP
class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.fc0 = nn.Linear(3, 256)
        self.fc1 = nn.Linear(256, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 256)
        self.fc4 = nn.Linear(256, 3)

    def forward(self, x):
        x = torch.relu(self.fc0(x))
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = torch.sigmoid(self.fc4(x))
        return x

BATCH_SIZE = 1024
ITERATION = 100

# all red
target = torch.tensor([1.0, 0.0, 0.0]).to(device)
target = target.repeat(BATCH_SIZE).reshape(BATCH_SIZE, 3)

# setup training mlp
mlp = MLP().to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.1)

# train random value to red
mlp.train()
for epoch in tqdm(range(ITERATION)):
    x = torch.rand_like(target).to(device)
    optimizer.zero_grad()
    x = mlp(x)
    loss = criterion(x, target)
    loss.backward()
    optimizer.step()

# eval mlp from random image to red image
with torch.no_grad():
    mlp.eval()
    x = torch.rand(height * width, 3).to(device)
    img = mlp(x)
    # torch.cuda.synchronize()

img_py = img.reshape(height, width, 3).detach().cpu().numpy()
plt.imsave("./test-img.png", img_py)
This program trains an MLP to map randomly colored inputs to pure red, then runs inference at full HD resolution (1920x1080).
The expected output is an all-red image.
However, when run on navi48, I get the following image:
When I run the program with device = torch.device("cpu"), I get a completely red image as expected, so this appears to be a GPU-related issue.
There are no issues when the inference resolution is lowered to 960x540.
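To narrow down where the GPU result diverges, the comparison above could be made numeric instead of visual. The following is a minimal sketch, not part of the original program: `compare_with_cpu` and `infer_in_chunks` are hypothetical helper names, and the tolerance `atol=1e-4` is an arbitrary assumption for float32 outputs.

```python
import torch
import torch.nn as nn


def infer_in_chunks(model, x, chunk_rows):
    """Run the model on x in slices of at most chunk_rows rows and concatenate."""
    with torch.no_grad():
        outputs = [
            model(x[start:start + chunk_rows])
            for start in range(0, x.shape[0], chunk_rows)
        ]
    return torch.cat(outputs, dim=0)


def compare_with_cpu(model, x_cpu, device, atol=1e-4):
    """Compare single-call inference on `device` against a CPU reference.

    Returns (number of mismatching rows, index of the first mismatching row),
    where the index is None if every row matches. The model is left on `device`.
    """
    model.eval()
    with torch.no_grad():
        reference = model.to("cpu")(x_cpu)
        result = model.to(device)(x_cpu.to(device)).cpu()
    # isclose treats NaN as unequal, so NaN outputs are counted as mismatches.
    bad_rows = (~torch.isclose(result, reference, rtol=0.0, atol=atol)).any(dim=1)
    bad_indices = torch.nonzero(bad_rows).flatten()
    first_bad = int(bad_indices[0]) if bad_indices.numel() > 0 else None
    return int(bad_indices.numel()), first_bad
```

With the trained `mlp` from main.py, calling `compare_with_cpu(mlp, torch.rand(height * width, 3), device)` would show how many of the 1920x1080 rows differ and where the first bad row starts. Running `infer_in_chunks(mlp, x, 960 * 540)` on the same input would show whether splitting the batch into sizes that already work avoids the problem, which might help tell a size-dependent kernel issue apart from a training issue.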
Operating System
Windows 11 Pro 24H2
CPU
AMD Ryzen 9 9950X3D
GPU
AMD Radeon RX 9070 XT
ROCm Version
ROCm 6.4.0
ROCm Component
No response
Steps to Reproduce
Run the above program.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response