Often, but not always, when I get NaNs while training on a Mac (M1 Max, 2020 Mac Studio), I see the output below in Xcode (I think only when Metal API validation is on).
These errors also coincide with bad resulting data (presumably because that execution didn't finish and big blobs are left in the data).
Training continues, and sometimes I get more of these errors.
I also think it coincides with a sustained period of 100% GPU usage, so maybe the machine is overheating a bit. I'm hoping this is an error that can be caught, so that the failed step's changes can be undone or never committed, the step can be re-run, etc.
Execution of the command buffer was aborted due to an error during execution. Internal Error (0000000e:Internal Error)
Error: command buffer exited with error status.
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000000e:Internal Error)
<AGXG13XFamilyCommandBuffer: 0x14c689290>
label = <none>
device = <AGXG13XDevice: 0x13a864400>
name = Apple M1 Max
commandQueue = <AGXG13XFamilyCommandQueue: 0x14c93fa00>
label = <none>
device = <AGXG13XDevice: 0x13a864400>
name = Apple M1 Max
retainedReferences = 1
Step 101: nan (20%)
command buffer exited with error status
I haven't looked into whether this message is coming from the OS or from opensplat/libtorch, but it might be somewhere to start.
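
To make the "catch it and re-run the step" idea concrete, here is a rough sketch of what I have in mind. It is plain Python rather than opensplat/libtorch code, and every name in it (`train_with_rollback`, `flaky_step`, `max_retries`) is hypothetical. It also assumes the failure can be detected after the fact as a non-finite loss or parameter, since I don't know whether the command buffer error is surfaced as something catchable. In a real libtorch version the snapshot would presumably need to cover the parameter tensors and the optimizer state.

```python
import copy
import math


def train_with_rollback(params, step_fn, num_steps, max_retries=3):
    """Run num_steps updates, discarding any attempt that yields non-finite values.

    params  : list of floats, updated in place by step_fn (stand-in for tensors)
    step_fn : hypothetical callable (params, step, attempt) -> loss
    Returns the final params and the list of steps that never succeeded.
    """
    skipped = []
    for step in range(num_steps):
        snapshot = copy.deepcopy(params)  # last known-good state
        for attempt in range(max_retries + 1):
            loss = step_fn(params, step, attempt)
            ok = math.isfinite(loss) and all(math.isfinite(p) for p in params)
            if ok:
                break  # commit this step
            params[:] = snapshot  # undo the bad update, then retry
        else:
            # Every attempt failed; params are already rolled back.
            skipped.append(step)
    return params, skipped


def flaky_step(params, step, attempt):
    """Hypothetical step: minimizes x^2, but corrupts the first try of step 2."""
    if step == 2 and attempt == 0:
        params[0] = float("nan")  # simulate an aborted GPU execution
        return float("nan")
    params[0] -= 0.1 * 2 * params[0]  # gradient descent on loss = x^2
    return params[0] ** 2


if __name__ == "__main__":
    final, skipped = train_with_rollback([1.0], flaky_step, num_steps=5)
    print(final, skipped)
```

Taking the snapshot once per step keeps the cost to one copy per iteration. Whether that is affordable for a full set of splat parameters is something I haven't measured.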