We flashed 23 nodes from OP940.01 to OP940.10; 7 did not come back.
The BMCs were factory reset and the BMC IP configuration was reset to DHCP.
I got very inconsistent results when setting the BMC password and trying to recover the server
with xCAT commands or obmctool until the node was physically power cycled.
Uma diagnosed this as a known problem: a full /var on the BMC resulted in the factory reset.
The failure rate was high (7 of 23 nodes), so I am concerned this will be a major issue for customers.
A couple of notes:
#1 A factory reset will reset the password. Customers who tried to preserve the old "0penBmc"
password will now be forced to set a new password before proceeding.
#2 At some point after the failure, the root account was locked out:
[root@csm01 cuda-repo-rhel8-11-0-local]# ssh 10.69.200.13
root@10.69.200.13's password:
Account locked due to 5190 failed logins
Permission denied, please try again.
This prevents bmcdiscover from determining whether a DHCP-leased IP is a BMC:
[root@csm01 cuda-repo-rhel8-11-0-local]# /opt/xcat/share/xcat/scripts/BMC_change_password.sh -r 10.69.200.13 -n 0penBmc123:
[root@csm01 cuda-repo-rhel8-11-0-local]# bmcdiscover --range 10.69.200.13 -u root -p 0penBmc -w -z
Warning: [csm01]: No bmc found.
The best recovery I found was:
- Find the BMC's MAC address in /var/lib/dhcpd/dhcpd.leases
- Physically power cycle the node
- nodeset lostnode runcmd=bmcsetup
- /opt/xcat/share/xcat/scripts/BMC_change_password.sh -r 10.69.200.46 -n 0penBmc123
- bmcdiscover --range 10.69.200.46 -u root -p 0penBmc123: -w -z
- rpower /node.* boot
- After we regain control: rmdef /node.*
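
The first recovery step, looking up the BMC's MAC address in /var/lib/dhcpd/dhcpd.leases, can be scripted. The following is a minimal sketch, assuming the file uses the usual ISC dhcpd layout ("lease <ip> {" followed by a "hardware ethernet <mac>;" line). The function name find_lease_ip is hypothetical and is not part of xCAT.

```python
def find_lease_ip(leases_text, mac):
    """Return the IP of the last lease recorded for `mac`, or None.

    Assumes ISC dhcpd.leases syntax. dhcpd appends updated lease
    blocks to the file, so a later block for the same MAC supersedes
    an earlier one.
    """
    wanted = mac.lower()
    current_ip = None
    found = None
    for raw in leases_text.splitlines():
        parts = raw.split()
        if len(parts) >= 3 and parts[0] == "lease" and parts[2] == "{":
            # Start of a new lease block: remember which IP it describes.
            current_ip = parts[1]
        elif len(parts) >= 3 and parts[:2] == ["hardware", "ethernet"]:
            # Compare case-insensitively and ignore the trailing semicolon.
            if parts[2].rstrip(";").lower() == wanted:
                found = current_ip
    return found
```

Passing the contents of /var/lib/dhcpd/dhcpd.leases and the BMC's MAC address should yield the IP to use for BMC_change_password.sh and bmcdiscover in the later steps.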
- HPC needs to track the /var full problem as a must-fix; the failure rate was high. With the California password change,
recovery from this state, which was already very difficult, becomes even more difficult.
- Can xCAT check /var on the BMC before flashing and take some action before this failure occurs? Is there an action that can be taken?
- I am not sure whether we already do this, but for best results in an environment where a factory reset could occur, the
xCAT passwd table should hold the default password:
"openbmc","root","0penBmc",,,,
and the node definition should hold the custom password.
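
As an illustration of the pre-flash /var check asked about above, here is a minimal sketch of what such a check could look like. It assumes the output of "df -P /var" has already been collected from the BMC by some means (for example over ssh) and only parses that text. The function names and the 90 percent threshold are hypothetical choices, not existing xCAT behavior.

```python
def var_usage_percent(df_output):
    """Return the Capacity column (integer percent) from `df -P <path>` output.

    Assumes the POSIX layout: filesystem, blocks, used, available,
    capacity, mount point, with one header line and one data line.
    """
    lines = [ln for ln in df_output.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and one data line")
    fields = lines[-1].split()
    if len(fields) < 6 or not fields[4].endswith("%"):
        raise ValueError("unexpected df output: %r" % lines[-1])
    return int(fields[4][:-1])


def safe_to_flash(df_output, max_percent=90):
    """True if /var usage is at or below the (assumed) threshold."""
    return var_usage_percent(df_output) <= max_percent
```

A caller could refuse to start the flash, or warn the administrator, whenever safe_to_flash returns False. What cleanup action is appropriate on the BMC in that case is still an open question.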