Validator Client intermittently freezes on Linux kernel 6.14.4 -> 6.14.7 #7403

@michaelsproul (Member)

Summary

The Validator Client on Linux with kernel versions 6.14.4 -> 6.14.7 will intermittently freeze, stopping it from performing its duties.

If you're running a distro which closely follows Linux mainline (such as Arch Linux or Fedora) you may be affected.

Run uname -r to check if your kernel version is in the affected range.
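As a convenience, the range check can be scripted. The following is a minimal sketch, not an official tool: the function name is_affected_kernel is made up for illustration, and it assumes the release string starts with a plain x.y.z version followed by a hyphenated distro suffix (as in 6.14.4-arch1-1).

```shell
# Return 0 if the given kernel release falls in the affected range 6.14.4 -> 6.14.7.
# is_affected_kernel is a hypothetical helper name, not part of Lighthouse.
is_affected_kernel() {
    # Strip everything from the first hyphen, e.g. "6.14.4-arch1-1" -> "6.14.4".
    version="${1%%-*}"
    case "$version" in
        6.14.4|6.14.5|6.14.6|6.14.7) return 0 ;;
        *) return 1 ;;
    esac
}

if is_affected_kernel "$(uname -r)"; then
    echo "Kernel $(uname -r) is in the affected range"
else
    echo "Kernel $(uname -r) is not in the affected range"
fi
```

If your distro formats its release string differently, compare the output of uname -r against the range by hand instead.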

Note: Ubuntu 24.04 and older releases ship older kernels and are not affected. Ubuntu 25.04 currently runs the 6.14.0 kernel, which is unaffected, but a future package upgrade may include one of the affected kernels, so upgrade with caution.

Solutions and Workarounds

This bug, caused by changes to the eventpoll code, has already been patched in the Linux mainline kernel and will be fixed in 6.14.8+.

Once your distro allows you to update the kernel to 6.14.8 you can safely do so.

In the meantime, if you are running an affected kernel version you have a few options:

Install an LTS kernel

The procedure differs between distros. For example, on Arch Linux:

sudo pacman -S linux-lts

If you are using systemd-boot, it should automatically generate the corresponding bootloader entries.

If you are using grub you will need to regenerate them:

sudo grub-mkconfig -o /boot/grub/grub.cfg

Reboot and you should see linux-lts included in the grub menu.

Downgrade your kernel

This varies by distro, and for some distros it is very involved. Here are the instructions for Fedora 41:

# Find available kernels
sudo dnf list kernel --showduplicates

# Install a specific kernel. For example:
sudo dnf install kernel-6.11.4-301.fc41

grub entries will automatically be added, so reboot and select the new kernel from the list.

Run an API polling script

If you would rather not touch your kernel for fear of breaking something, you can run a simple bash script instead.

Due to the internals of eventpoll, when the VC receives an API call, it will wake from its freeze.

Because of this we can use a script running in the background which continuously polls the VC. Here is an example of such a script:

while sleep 5; do curl -s --fail "http://localhost:5062/lighthouse/auth" > /dev/null && echo "polled at $(date)"; done

This will keep the VC awake.
Note that running a full VC metrics server with Grafana polling the VC will also keep it awake for the same reason.
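The one-liner above prints only when a poll succeeds, so a failing poll goes unnoticed. Below is a slightly more defensive sketch of the same idea. The function names poll_once and poll_forever and the 10 second curl timeout are illustrative choices, not part of Lighthouse; the URL is the one used above and may need adjusting to your VC's HTTP API address.

```shell
# Poll the given URL once. Prints a timestamped line on success; on failure,
# logs to stderr and returns curl's exit status.
poll_once() {
    if curl -s --fail --max-time 10 "$1" > /dev/null; then
        echo "polled at $(date)"
    else
        poll_status=$?
        echo "poll failed (curl exit $poll_status) at $(date)" >&2
        return "$poll_status"
    fi
}

# Poll URL ($1) every INTERVAL seconds ($2) until interrupted.
# A failed poll is logged but does not stop the loop.
poll_forever() {
    while sleep "$2"; do
        poll_once "$1" || true
    done
}
```

To start polling every 5 seconds, run poll_forever "http://localhost:5062/lighthouse/auth" 5 in a shell where the functions are defined. The timeout keeps a single stuck request from blocking later polls.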

Acknowledgments

A huge thank you to the users on Discord who discovered this issue and assisted in diagnosis and testing, particularly @smooth.ninja and @ChosunOne.


See tokio-rs/tokio#7335 and #7403 for more details

Activity

added the bug and val-client labels on May 6, 2025

0xriazaka commented on May 6, 2025

Can I work on this?

michaelsproul (Member, Author) commented on May 6, 2025

@0xriazaka If you can work out the root cause, please try. We won't assign it exclusively to you because we need to fix this ASAP.

michaelsproul (Member, Author) commented on May 7, 2025

So far 3 of 3 confirmed cases occurred on Arch Linux.

I suspect it's something to do with the new kernel version.

changed the title from "Validator client deadlocks/freezes intermittently" to "Validator client deadlocks/freezes intermittently on Arch Linux" on May 7, 2025

j4cko commented on May 10, 2025

I, too, am experiencing this on Arch Linux. Running strace during the hang yields:
futex(0x7928d3c71910, FUTEX_WAIT_BITSET_PRIVATE, 0, NULL, FUTEX_BITSET_MATCH_ANY

One possibly interesting observation: I am running two validator processes, one with one key and another with two active validators. Only the one with two validators hangs, and it does so every couple of hours.

Let me know if I can be of any help reproducing the issue.

michaelsproul (Member, Author) commented on May 10, 2025

@j4cko Please try the workaround of polling the HTTP API of the VC.

We might have a build to share soon.

michaelsproul (Member, Author) commented on May 13, 2025

Something else we could try:

We could run a background thread in parking_lot that checks periodically for deadlocks. I think all we can do if we detect one is print out the thread IDs and the backtraces. Should probably run with debug symbols in order to get the best backtraces.

keccakk commented on May 14, 2025

I can confirm this issue began for me after updating to Linux 6.14.4-arch1-1 x86_64.
I had no issue with v7.0.0 on the old kernel, but the issue started after I updated both the kernel and Lighthouse to v7.0.1.

I can't guarantee this is related, but thought I'd mention it in case it helps narrow the issue down. I'm currently participating in the Aztec Public Testnet and running a Sepolia node using Lighthouse for my beacon node, and the Aztec node occasionally hangs in a similar way to the VC. It runs for a few hours and then freezes. If this is related, it makes me suspect the beacon node.

emilbayes commented on May 15, 2025

I'm having this issue as well with Gnosis mainnet. It only started happening after v7.0.x with Nethermind as the EL client. I'm also on Linux 6.14.4-arch1-2 x86_64. Probing the VC with an HTTP request wakes it up.

michaelsproul (Member, Author) commented on May 15, 2025

@emilbayes Please try updating the kernel to 6.14.6. We've got a smaller program than the Lighthouse VC that reproduces the hang using just some mutexes and some sleeps. It hangs in the same way on 6.14.{4,5} but, so far, not on 6.14.6. The underlying issue seems to be an incompatibility between Tokio and the kernel, or a kernel bug, and nothing Lighthouse-specific.

keccakk commented on May 15, 2025

@michaelsproul I just got another hang on 6.14.6. It started working again when I restarted the curl command.
