Summary
The Validator Client on Linux with kernel versions 6.14.4 -> 6.14.7 will intermittently freeze, stopping it from performing its duties.
If you're running a distro which closely follows Linux mainline (such as Arch Linux or Fedora) you may be affected.
Run uname -r to check whether your kernel version is in the affected range.
Note: Ubuntu 24.04 and older use older kernels and are therefore not affected. Ubuntu 25.04 currently ships the 6.14.0 kernel, which is also unaffected, but a future package upgrade may include one of the affected kernels, so upgrade with caution.
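If you prefer to check programmatically, here is a minimal sketch. It assumes plain bash and a standard uname -r release string such as 6.14.5-arch1-1:
# Strip the distro suffix from the kernel release string
ver=$(uname -r | cut -d- -f1)
case "$ver" in
  6.14.4|6.14.5|6.14.6|6.14.7) echo "Kernel $ver is in the affected range" ;;
  *) echo "Kernel $ver is not in the affected range" ;;
esac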
Solutions and Workarounds
This bug, caused by changes to the eventpoll code, has already been patched in the Linux mainline kernel and will be fixed in 6.14.8+.
Once your distro allows you to update the kernel to 6.14.8, you can safely do so.
In the meantime, if you are running an affected kernel version, you have a few options:
Install an LTS kernel
The procedure for this will differ depending on your distro, but the example below shows the instructions for Arch Linux:
sudo pacman -S linux-lts
If you are using systemd-boot, it should automatically generate the corresponding bootloader entries.
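You can confirm the entry was generated (assuming systemd-boot is managing your boot partition) with:
bootctl list
and check that a linux-lts entry appears in the output.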
If you are using grub, you will need to regenerate them:
sudo grub-mkconfig -o /boot/grub/grub.cfg
Reboot and you should see linux-lts included in the grub menu.
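After rebooting into the LTS entry, confirm that you are no longer on an affected kernel:
uname -r
# On Arch, the LTS kernel's release string ends in -lts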
Downgrade your kernel
This will vary depending on your distro and for some distros it is very involved. Here are the instructions for Fedora 41:
# Find available kernels
sudo dnf list kernel --showduplicates
# Install a specific kernel. For example:
sudo dnf install kernel-6.11.4-301.fc41
grub entries will automatically be added, so reboot and select the new kernel from the list.
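If you plan to stay on the downgraded kernel for a while, you may also want to stop dnf from pulling an affected kernel back in on the next system update. One option, assuming the versionlock plugin is installed (the python3-dnf-plugin-versionlock package on Fedora), is:
# Pin the downgraded kernel so routine updates don't replace it
sudo dnf versionlock add kernel-6.11.4-301.fc41
# Remove the pin once a fixed kernel (6.14.8+) is available
sudo dnf versionlock delete kernel-6.11.4-301.fc41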
Run an API polling script
If you do not want to touch your kernel in case you break something, there is a simple bash script you can run instead.
Due to the internals of eventpoll, when the VC receives an API call it will wake from its freeze. Because of this, we can use a script running in the background which continuously polls the VC. Here is an example of such a script:
while sleep 5; do curl -s --fail "http://localhost:5062/lighthouse/auth" > /dev/null && echo "polled at $(date)"; done
This will keep the VC awake.
Note that running a full VC metrics server with Grafana polling the VC will also keep it awake for the same reason.
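If you want the polling loop to keep running after you log out (e.g. over SSH), one approach is to wrap it in a transient systemd unit. This is only a sketch: it assumes systemd is available, the VC API is listening on localhost:5062 as above, and the unit name lighthouse-vc-poll is arbitrary:
sudo systemd-run --unit=lighthouse-vc-poll \
  bash -c 'while sleep 5; do curl -s --fail "http://localhost:5062/lighthouse/auth" > /dev/null; done'
# Check that the unit is running
systemctl status lighthouse-vc-poll
# Stop it once you are on a fixed kernel
sudo systemctl stop lighthouse-vc-poll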
Acknowledgments
A huge thank you to the users on Discord who discovered this issue and assisted in diagnosis and testing, particularly @smooth.ninja and @ChosunOne.
See tokio-rs/tokio#7335 and #7403 for more details.
Activity
0xriazaka commented on May 6, 2025
Can i work on this?
michaelsproul commented on May 6, 2025
@0xriazaka If you can work out the root cause, please try. We won't assign it exclusively to you because we need to fix this ASAP.
michaelsproul commented on May 7, 2025
So far 3 of 3 confirmed cases occurred on Arch Linux.
I suspect it's something to do with the new kernel version.
Title changed from "Validator client deadlocks/freezes intermittently" to "Validator client deadlocks/freezes intermittently on Arch Linux"
initialized_validators #7423
j4cko commented on May 10, 2025
I, too, am experiencing this on archlinux. Running strace during the hang yields:
futex(0x7928d3c71910, FUTEX_WAIT_BITSET_PRIVATE, 0, NULL, FUTEX_BITSET_MATCH_ANY
An interesting observation is maybe the following: I am running two validator processes, one with one key and another one with two active validators. Only the one with two validators hangs every couple of hours.
Let me know if I can be of any help reproducing the issue.
michaelsproul commented on May 10, 2025
@j4cko Please try the workaround of polling the HTTP API of the VC.
We might have a build to share soon.
michaelsproul commented on May 13, 2025
Something else we could try:
We could run a background thread in parking_lot that checks periodically for deadlocks. I think all we can do if we detect one is print out the thread IDs and the backtraces. Should probably run with debug symbols in order to get the best backtraces.
keccakk commented on May 14, 2025
Can confirm this issue began for me when updating to Linux 6.14.4-arch1-1 x86_64
No issue with v7.0.0 on the old kernel, but after updating both the kernel and Lighthouse to v7.0.1 the issue started.
Can't guarantee this is related, but thought I'd mention it in case it helps narrow the issue down. I'm currently participating in the Aztec Public Testnet and running a Sepolia node using Lighthouse for my beacon node, and the Aztec node occasionally hangs in a similar way to the VC. It just runs for a bit and then freezes after a few hours. If related, makes me suspect the beacon node.
emilbayes commented on May 15, 2025
I'm having this issue as well with Gnosis mainnet. It only started happening after v7.0.x with Nethermind as the EL client. I'm also on Linux 6.14.4-arch1-2 x86_64. Probing the VC with an HTTP request wakes it up.
michaelsproul commented on May 15, 2025
@emilbayes Please try updating the kernel to 6.14.6. We've got a smaller program than the Lighthouse VC which reproduces the hang, which just uses some mutexes and some sleeps, and it hangs in the same way on 6.14.{4,5} but not yet on 6.14.6. The underlying issue seems to be an incompatibility between Tokio and the kernel, or a kernel bug, nothing Lighthouse-specific.
keccakk commented on May 15, 2025
@michaelsproul I just got another hang on 6.14.6, started working again when I started the curl cmd up again.