rflash of OP940.10 resulted in factory reset of BMC. #6658

@markprez

We flashed 23 nodes from OP940.01 to OP940.10; 7 did not come back.

The BMCs were factory reset and the BMC IP configuration reverted to DHCP.

I got very inconsistent results when setting the BMC password and trying to recover the server
with xCAT commands or obmctool, until the node was physically power cycled.

Uma diagnosed this as a known problem: a full /var on the BMC triggered the factory reset.

The failure rate was high (7 of 23 nodes), so I am concerned this will be a major issue for customers.

A couple of notes:
#1 A factory reset also resets the password. If you were a customer who tried to preserve the old "0penBmc"
password, you will now be forced to set a new password before proceeding.

#2 At some point after the failure, the root account became locked:
[root@csm01 cuda-repo-rhel8-11-0-local]# ssh 10.69.200.13
root@10.69.200.13's password:

Account locked due to 5190 failed logins
Permission denied, please try again.

This prevents bmcdiscover from determining whether a DHCP-leased IP belongs to a BMC.

[root@csm01 cuda-repo-rhel8-11-0-local]# /opt/xcat/share/xcat/scripts/BMC_change_password.sh -r 10.69.200.13 -n 0penBmc123:
[root@csm01 cuda-repo-rhel8-11-0-local]# bmcdiscover --range 10.69.200.13 -u root -p 0penBmc -w -z
Warning: [csm01]: No bmc found.

The best recovery I found was:

  • Find the BMC MAC address in /var/lib/dhcpd/dhcpd.leases
  • Physically power cycle the node
  • nodeset lostnode runcmd=bmcsetup
  • /opt/xcat/share/xcat/scripts/BMC_change_password.sh -r 10.69.200.46 -n 0penBmc123
  • bmcdiscover --range 10.69.200.46 -u root -p 0penBmc123: -w -z
  • rpower /node.* boot
  • After we regain control: rmdef /node.*
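
The first recovery step, looking up which DHCP address a BMC picked up, could be scripted along these lines. This is only a sketch in Python: it assumes the standard ISC dhcpd.leases layout ("lease <ip> { ... hardware ethernet <mac>; ... }"), the function name find_lease_ips is hypothetical, and the MAC address in the sample is made up. The simple pattern stops at the first closing brace, so a lease entry with nested braces would need a real parser.

```python
import re

# One lease block in ISC dhcpd.leases syntax: "lease <ip> { ... }".
LEASE_RE = re.compile(r"lease\s+(\S+)\s*\{(.*?)\}", re.DOTALL)
# The MAC address line inside a lease block.
MAC_RE = re.compile(r"hardware\s+ethernet\s+([0-9A-Fa-f:]+)\s*;")


def find_lease_ips(leases_text, mac):
    """Return every IP leased to `mac`, in file order.

    dhcpd appends to the leases file, so the last entry is the current one.
    """
    wanted = mac.lower()
    ips = []
    for ip, body in LEASE_RE.findall(leases_text):
        match = MAC_RE.search(body)
        if match and match.group(1).lower() == wanted:
            ips.append(ip)
    return ips


# Made-up sample; in practice read /var/lib/dhcpd/dhcpd.leases instead.
SAMPLE = """
lease 10.69.200.46 {
  starts 4 2020/04/02 12:00:00;
  hardware ethernet 70:E2:84:14:0A:12;
}
"""
print(find_lease_ips(SAMPLE, "70:e2:84:14:0a:12"))
```

The last address in the returned list is the one to pass to BMC_change_password.sh and bmcdiscover.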

  1. HPC needs to track the /var full problem as a must-fix; the failure rate was high. With the California password change,
    recovery from this state is very difficult.
  2. Can xCAT check /var before flashing and take some action before this failure occurs? Is there an action that can be taken?
  3. I am not sure if we already do this, but for best results in an environment where a factory reset could occur, the
    xCAT passwd table should have the default password:

"openbmc","root","0penBmc",,,,

and the node definition should have the custom password.
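
To make the request in item 2 concrete, a pre-flash check might look roughly like the Python sketch below. It is an illustration under assumptions, not existing xCAT behavior: it assumes the output of `df -P /var` has already been collected from the BMC by some means (for example over ssh), the function names are hypothetical, and the 90% limit is an arbitrary placeholder, since the fill level at which the factory reset is triggered is not known here.

```python
def use_percent(df_output):
    """Return the capacity (Use%) from the data row of `df -P <path>` output."""
    rows = [line for line in df_output.splitlines() if line.strip()]
    if len(rows) < 2:
        raise ValueError("expected a header row and one data row")
    # POSIX `df -P` columns: filesystem, blocks, used, available,
    # capacity, mount point.
    fields = rows[-1].split()
    return int(fields[4].rstrip("%"))


def safe_to_flash(df_output, limit=90):
    """True when usage is below `limit` percent (placeholder threshold)."""
    return use_percent(df_output) < limit
```

A caller could skip or postpone the flash of any node for which safe_to_flash returns False and report it, rather than discovering the problem after the BMC has been reset.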
