Cosmo sequencer PMBus monitoring ought to include more rails #2394

@hawkw

The other day we saw a Cosmo fail to sequence due to a timeout while waiting for the Group A rails. Looking at this sled's ringbuf, we did not see POWER_GOOD on any of the V1P5_RTC, V3P3_SP5_A1, or V1P8_SP5_A1 rails. The V3P3_SP5_A1 and V1P8_SP5_A1 rails are regulated by U116, which is an ISL68224. This is a PMBus part, and its PMBus alert pin is routed to the FPGA on PWR_CONT3_TO_FPGA1_ALERT_L, which can generate a corresponding interrupt to the SP.

Hubris could enable the PWR_CONT3_TO_FPGA_ALERT_L interrupt in the sequencer FPGA's Interrupt Enable Register (IER) and handle PMBus alerts from U116 similarly to how we handle alerts from the RAA229620A regulators for the VDDCR_CPU0_A0 and VDDCR_CPU1_A0 rails. It would be useful to be able to capture PMBus fault information from the ISL68224 as well.

Additionally, the code for monitoring the Vcore VRMs (the RAA229620As) currently only considers PMBus page 0 on those parts, which corresponds to the VDDCR_CPU0_A0 and VDDCR_CPU1_A0 rails. We probably ought to be looking at page 1 on these VRMs as well, which corresponds to VDDCR_SOC_A0 and VDDIO_SP5_A0. After #2390 merges, we will clear faults on both pages when the PMBus alert is asserted, but we will still only read the status registers and record potential faults for the page 0 rails.
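
As a rough illustration of what "looking at both pages" could mean, here is a minimal sketch that selects each PMBus page in turn and reads STATUS_WORD from it. The `PAGE` (0x00) and `STATUS_WORD` (0x79) command codes come from the PMBus specification; everything else (the `Pmbus` trait, `BusError`, `MockVrm`, and `read_status_both_pages`) is hypothetical and made up for this example. It is not the Hubris PMBus driver API or the existing sequencer code.

```rust
/// PMBus command codes, as defined by the PMBus specification.
const PAGE: u8 = 0x00;
const STATUS_WORD: u8 = 0x79;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
struct BusError;

/// Hypothetical minimal bus abstraction (NOT the Hubris PMBus API).
trait Pmbus {
    fn write_byte(&mut self, cmd: u8, val: u8) -> Result<(), BusError>;
    fn read_word(&mut self, cmd: u8) -> Result<u16, BusError>;
}

/// Reads STATUS_WORD from page 0 and page 1 of one regulator, so that a
/// fault on the page 1 rail is captured alongside the page 0 rail.
fn read_status_both_pages<B: Pmbus>(dev: &mut B) -> Result<[u16; 2], BusError> {
    let mut status = [0u16; 2];
    for page in 0..2u8 {
        // Select the page first; STATUS_WORD then refers to that page's rail.
        dev.write_byte(PAGE, page)?;
        status[page as usize] = dev.read_word(STATUS_WORD)?;
    }
    Ok(status)
}

/// Hypothetical stand-in for a two-page VRM, used only to exercise the sketch.
struct MockVrm {
    page: u8,
    status: [u16; 2],
}

impl Pmbus for MockVrm {
    fn write_byte(&mut self, cmd: u8, val: u8) -> Result<(), BusError> {
        match cmd {
            // Only PAGE writes to an existing page are accepted by the mock.
            PAGE if (val as usize) < self.status.len() => {
                self.page = val;
                Ok(())
            }
            _ => Err(BusError),
        }
    }

    fn read_word(&mut self, cmd: u8) -> Result<u16, BusError> {
        match cmd {
            STATUS_WORD => Ok(self.status[self.page as usize]),
            _ => Err(BusError),
        }
    }
}
```

With something of this shape, the alert handler could record the status for the page 1 rails (VDDCR_SOC_A0 and VDDIO_SP5_A0) next to the page 0 rails instead of clearing their faults without ever looking at them.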

Labels

cosmo (SP5 Board)
fault-management (Everything related to the Oxide's Fault Management architecture implementation)
⚠️ ereport (if you see something, say something!)
