fix: add timeout to sysfs writes to prevent daemon hang #1033

Open
zeeke wants to merge 1 commit into k8snetworkplumbingwg:master from zeeke:us/timeout-sysfs

Conversation

@zeeke
Member

@zeeke zeeke commented Feb 16, 2026

Kernel drivers (e.g. i40e) can block indefinitely when writing to sriov_numvfs if the device is in a bad state. For example, the following error has been hit on an Intel XXV710 NIC:

Feb 16 13:53:01 worker0 kernel: 06c73374b594186: left promiscuous mode
Feb 16 13:53:01 worker0 kernel: i40e 0000:3b:00.0: Setting MAC 5e:28:32:f0:80:20 on VF 1
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Bring down and up the VF interface to make this change effective.
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Unable to configure VFs, other operation is pending.
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Unable to configure VFs, other operation is pending.
Feb 16 13:53:02 worker0 kernel: 152a5b6a3b44739: left promiscuous mode

Replace direct os.WriteFile calls in SetSriovNumVfs with a new WriteFileWithTimeout utility that runs the write in a goroutine and returns a timeout error after 2 minutes.

@github-actions

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

@github-actions github-actions bot added the tests label Feb 16, 2026
@gemini-code-assist

Summary of Changes

Hello @zeeke, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where kernel drivers can cause the daemon to hang indefinitely when writing to sysfs files, specifically sriov_numvfs, if the device is in a bad state. To mitigate this, a new utility function WriteFileWithTimeout has been introduced, which wraps file write operations with a configurable timeout. This ensures that the application can recover gracefully instead of blocking indefinitely, improving the overall robustness and reliability of the system when interacting with potentially unresponsive kernel interfaces.

Highlights

  • Timeout Mechanism for Sysfs Writes: Introduced a sysfsWriteTimeout constant set to 2 minutes to prevent indefinite blocking during sysfs writes, particularly for sriov_numvfs.
  • New Utility Function: Implemented a new utility function, WriteFileWithTimeout, which executes file writes in a separate goroutine and returns an error if the operation exceeds the specified timeout.
  • Integration into SRIOV Configuration: Replaced direct os.WriteFile calls within the SetSriovNumVfs function with the new WriteFileWithTimeout utility to safeguard against daemon hangs caused by unresponsive kernel drivers.
  • Comprehensive Unit Tests: Added unit tests for the WriteFileWithTimeout function, covering successful writes, error propagation for invalid paths, and proper timeout behavior using a named pipe.

Changelog

  • pkg/host/internal/sriov/sriov.go
    • Defined sysfsWriteTimeout constant.
    • Replaced os.WriteFile calls with utils.WriteFileWithTimeout in SetSriovNumVfs.
  • pkg/utils/utils.go
    • Added WriteFileWithTimeout function with goroutine and select for timeout handling.
  • pkg/utils/utils_test.go
    • Added TestWriteFileWithTimeout_Success to verify successful file writing.
    • Added TestWriteFileWithTimeout_WriteError to confirm error propagation for invalid paths.
    • Added TestWriteFileWithTimeout_Timeout using a named pipe to simulate and test timeout behavior.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a timeout for sysfs write operations to prevent the daemon from hanging when a kernel driver is in a bad state. This is achieved by replacing direct os.WriteFile calls with a new utils.WriteFileWithTimeout utility function. The new utility is well-implemented and comes with comprehensive unit tests covering success, error, and timeout scenarios. My feedback includes a suggestion to improve resource management in the new timeout utility.

@coveralls

coveralls commented Feb 16, 2026

Pull Request Test Coverage Report for Build 22092217312

Details

  • 14 of 14 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.1%) to 63.561%

Totals
Change from base Build 21902252428: 0.1%
Covered Lines: 9407
Relevant Lines: 14800

💛 - Coveralls

@SchSeba
Collaborator

SchSeba commented Feb 25, 2026

Hi @zeeke can you check the tests please

@zeeke zeeke force-pushed the us/timeout-sysfs branch 2 times, most recently from 02d1571 to 3d9aa3c on February 25, 2026 14:08

Kernel drivers (e.g. i40e) can block indefinitely when writing to sriov_numvfs if the
device is in a bad state. For example, the following error has been hit on an `Intel XXV710` NIC:

```
Feb 16 13:53:01 worker0 kernel: 06c73374b594186: left promiscuous mode
Feb 16 13:53:01 worker0 kernel: i40e 0000:3b:00.0: Setting MAC 5e:28:32:f0:80:20 on VF 1
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Bring down and up the VF interface to make this change effective.
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Unable to configure VFs, other operation is pending.
Feb 16 13:53:02 worker0 kernel: i40e 0000:3b:00.0: Unable to configure VFs, other operation is pending.
Feb 16 13:53:02 worker0 kernel: 152a5b6a3b44739: left promiscuous mode
```

Replace direct `os.WriteFile` calls in SetSriovNumVfs with a new `WriteFileWithTimeout` utility that
runs the write in a goroutine and returns a timeout error after 2 minutes.

Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
@zeeke
Member Author

zeeke commented Feb 26, 2026

Unit test flake

[FAILED] Timed out after 1.001s.
  The function passed to Eventually failed at /home/runner/work/sriov-network-operator/sriov-network-operator/controllers/sriovoperatorconfig_controller_test.go:132 with:
  Expected
      <[]string | len:0, cap:0>: nil
  not to be empty
  In [BeforeEach] at: /home/runner/work/sriov-network-operator/sriov-network-operator/controllers/sriovoperatorconfig_controller_test.go:142 @ 02/25/26 17:23:03.536

is fixed in

@zeeke zeeke requested review from SchSeba and adrianchiris March 16, 2026 13:58
3 participants