In the field, there are a number of ways that NVMe drives can fail. There are a few things that it would be great to simulate in Propolis. I'm going to call out a few of these here:
- We would like the ability to set the drive into a CFS state, whereby that bit is set and the device refuses to process additional I/O. There are two different paths forward we would like to have on the device:
  - On reset, the device continues to basically issue an NSNR until we format it. I realize that may be a little complicated, as it requires some persistent state to be kept and tracked. I think it'd be fine if this were only in-memory state.
  - On reset, the CFS is cleared and the device begins processing again.

  The former is one we've seen more commonly on certain firmware revs, but the latter behavior also seems useful to test.
- A device starting up in a read-only media mode
- Injecting asynchronous events that cover:
  - NVM subsystem reliability, which results in read-only media and related information in the SMART / health log. Particularly, bit 3 is the most relevant in the critical warning field, but others may be interesting.
  - Transient / persistent internal errors, the latter of which results in CFS being set.
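
To make the two CFS reset behaviors concrete, here is a rough sketch of the in-memory state they would need. This is not based on Propolis's actual NVMe emulation: `FaultState`, `ResetPolicy`, `IoOutcome`, and every method name are hypothetical. It also assumes that the CFS bit itself clears on reset in both paths, with the sticky path reporting NSNR on I/O until a format.

```rust
/// What a controller reset does after CFS was injected (hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResetPolicy {
    /// CFS clears on reset and I/O resumes normally.
    ClearOnReset,
    /// After reset, I/O keeps failing with NSNR until the device is formatted.
    NsnrUntilFormat,
}

/// Outcome of submitting an I/O command to the simulated device.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IoOutcome {
    Ok,
    /// CFS is set; the controller refuses to process the command.
    Refused,
    /// The device reports NSNR (modeled as an opaque error here).
    Nsnr,
}

/// In-memory only fault state; nothing is persisted across restarts.
#[derive(Debug, Default)]
pub struct FaultState {
    cfs: bool,
    needs_format: bool,
    policy: Option<ResetPolicy>,
}

impl FaultState {
    /// Set CFS and remember how the next reset should behave.
    pub fn inject_cfs(&mut self, policy: ResetPolicy) {
        self.cfs = true;
        self.policy = Some(policy);
    }

    /// Controller reset: clears CFS (an assumption of this sketch) and,
    /// for the sticky policy, arms the "NSNR until format" condition.
    pub fn reset(&mut self) {
        if !self.cfs {
            return;
        }
        self.cfs = false;
        if self.policy.take() == Some(ResetPolicy::NsnrUntilFormat) {
            self.needs_format = true;
        }
    }

    /// Format clears the sticky NSNR condition.
    pub fn format(&mut self) {
        self.needs_format = false;
    }

    /// Decide what happens to an incoming I/O command.
    pub fn submit_io(&self) -> IoOutcome {
        if self.cfs {
            IoOutcome::Refused
        } else if self.needs_format {
            IoOutcome::Nsnr
        } else {
            IoOutcome::Ok
        }
    }

    /// Current value of the simulated CFS bit.
    pub fn cfs(&self) -> bool {
        self.cfs
    }
}
```

A test would call `inject_cfs` with one of the two policies, reset the controller, and then check what `submit_io` reports before and after `format`.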
Obviously, these are all things we likely don't want to include in the product as a whole; however, a read-only media mode triggered by Crucible going down to a single replica may be interesting.
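
As a sketch of what the single-replica idea could look like: the bit positions below follow the Critical Warning byte of the NVMe SMART / Health Information log, but the replica-count policy and the function names are made up for illustration and do not reflect any existing Propolis or Crucible interface.

```rust
/// Critical Warning bits (byte 0 of the SMART / Health Information log).
pub const CW_SPARE_BELOW_THRESHOLD: u8 = 1 << 0;
pub const CW_TEMPERATURE: u8 = 1 << 1;
pub const CW_RELIABILITY_DEGRADED: u8 = 1 << 2;
pub const CW_MEDIA_READ_ONLY: u8 = 1 << 3;

/// Hypothetical policy: report read-only media once at most one healthy
/// replica remains. The threshold is an assumption, not a product decision.
pub fn critical_warning(healthy_replicas: usize) -> u8 {
    if healthy_replicas <= 1 {
        CW_MEDIA_READ_ONLY
    } else {
        0
    }
}

/// True if the critical warning byte has the read-only media bit set.
pub fn is_read_only(critical_warning: u8) -> bool {
    critical_warning & CW_MEDIA_READ_ONLY != 0
}
```

The same byte could be driven directly by a fault-injection hook for the asynchronous event cases, independent of replica state.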