MPS implementation command buffer errors #198

@SoylentGraham

Often, but not always, when I get NaNs while training on a Mac (M1 Max, 2020 Mac Studio), I also get the output below in Xcode (I think only when Metal API validation is on).
These errors also coincide with bad output data, presumably because that execution didn't finish and big blobs were left in the data.

Training continues, and sometimes I get more of these errors.
I also think they coincide with a sustained period of 100% GPU usage, so maybe the machine is overheating a bit, but I'm hoping this is an error that can be caught, so that the step's changes can be undone or not committed, the step re-run, etc.

Execution of the command buffer was aborted due to an error during execution. Internal Error (0000000e:Internal Error)
Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Error: 
	(null)
	Internal Error (0000000e:Internal Error)
	<AGXG13XFamilyCommandBuffer: 0x14c689290>
    label = <none> 
    device = <AGXG13XDevice: 0x13a864400>
        name = Apple M1 Max 
    commandQueue = <AGXG13XFamilyCommandQueue: 0x14c93fa00>
        label = <none> 
        device = <AGXG13XDevice: 0x13a864400>
            name = Apple M1 Max 
    retainedReferences = 1
Step 101: nan (20%)

I haven't looked into whether this message comes from the OS or from opensplat/libtorch etc., but it might be somewhere to start.
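To make the "catch it, don't commit, re-run the step" idea concrete, here is a minimal sketch of a guarded training step. It uses only the C++ standard library and is not opensplat or libtorch code: `Params`, `allFinite` and `guardedStep` are hypothetical names, and a plain `std::vector<float>` stands in for the real parameter tensors. It also does not detect the Metal command buffer error itself, only its symptom (a non-finite loss or parameter), so it would miss a failed step that leaves finite but wrong data behind.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical stand-in for model parameters; in a libtorch program
// these would be tensors rather than a flat vector.
using Params = std::vector<float>;

// Returns true if every value is finite (no NaN, no +/-Inf).
bool allFinite(const Params &p) {
    for (float v : p) {
        if (!std::isfinite(v)) return false;
    }
    return true;
}

// Runs `step` on `params`. If the returned loss or any parameter is
// non-finite, restores the snapshot taken before the step and tries
// again, for at most `maxRetries` extra attempts.
// Returns true if a step was committed, false if every attempt failed
// (in which case `params` is left at its pre-step values).
bool guardedStep(Params &params,
                 const std::function<float(Params &)> &step,
                 int maxRetries) {
    assert(maxRetries >= 0);
    const Params snapshot = params;  // last known-good state
    for (int attempt = 0; attempt <= maxRetries; ++attempt) {
        const float loss = step(params);
        if (std::isfinite(loss) && allFinite(params)) {
            return true;             // commit the step
        }
        params = snapshot;           // roll back and retry
    }
    return false;
}
```

In a real trainer the snapshot would presumably also need to cover optimizer state, and copying all parameters every step has a cost, so snapshotting every N steps might be a more practical trade-off.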
