Skip to content

fix: NVIDIA Docker apt-get command#138

Closed
rollandf wants to merge 66 commits intoMellanox:network-operator-26.1.xfrom
rollandf:dockerbp
Closed

fix: NVIDIA Docker apt-get command#138
rollandf wants to merge 66 commits intoMellanox:network-operator-26.1.xfrom
rollandf:dockerbp

Conversation

@rollandf
Copy link
Member

No description provided.

maze88 and others added 30 commits January 20, 2026 11:56
Signed-off-by: Michael Zeevi <mzeevi@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
…lues update PR; and misc. corrections

Signed-off-by: Michael Zeevi <mzeevi@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
- add force flag to push on PR for CI
- delete the chart directory before copy

Signed-off-by: Fred Rolland <frolland@nvidia.com>
The CI PR sent to update the SRIOV-Network Operator versions
should be based on the BASE_BRANCH

Signed-off-by: Fred Rolland <frolland@nvidia.com>
- job names kebab case (more common on gh actions).
- pr body message with link to job.
- clean extra whitespace.

Signed-off-by: Michael Zeevi <mzeevi@nvidia.com>
Many customers rely on the Ubuntu operating system, and the ability
to automatically update GRUB parameters is essential.

Signed-off-by: Aleksey Senin <alekseys@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
… Cluster

Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
almaslennikov and others added 25 commits January 20, 2026 11:56
this command does not take arguments thus failing before

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
(cherry picked from commit 3072ff0)
(cherry picked from commit d409c77)
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
(cherry picked from commit fefa935)
(cherry picked from commit 62f62c1)
Signed-off-by: Michael Zeevi <mzeevi@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>

fd
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
Signed-off-by: Michael Zeevi <mzeevi@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
Signed-off-by: Ivan Kolodiazhnyi <ikolodiazhny@nvidia.com>
… daemon CI

Signed-off-by: Michael Zeevi <mzeevi@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
When SriovNetworkNodePolicy specifies groupingPolicy: all, the operator
should create a single OVS bridge with all matching PFs as uplinks.
Previously, it was incorrectly creating separate bridges for each PF.

Changes:
- Update ApplyBridgeConfig to propagate GroupingPolicy to node state
- Add applyGroupedBridgeConfig() to create single bridge configuration
  with all matching PFs as uplinks when groupingPolicy is "all"
- Modify CreateOVSBridge() to support multiple uplinks per bridge
- Remove ConfigureBridgesGrouping() as grouped bridge creation is now
  handled entirely by the controller via ApplyBridgeConfig
- Add unit tests for grouped bridge functionality

Fixes issue where applying a policy like:
```
  bridge:
    ovs:
      bridge:
        groupingPolicy: all
```

Would create multiple bridges (br-ens2f0np0, br-ens2f1np1, etc.)
instead of a single bridge (br-<policy-name>) containing all PFs.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
it's required for mlxfwreset to work properly

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
The getCurrentBridgeState() function was only processing the first
uplink (Uplinks[0]) and ignoring the rest. This caused bridges with
multiple uplinks (created via groupingPolicy: all) to constantly
reconfigure because the current state comparison always showed a
mismatch.

Additionally, NeedToUpdateBridges was comparing the entire Bridges
struct including the GroupingPolicy field. However, GroupingPolicy is
a policy directive used during configuration and is not stored in OVS,
so it will never appear in the discovered status. This caused another
source of constant mismatch detection.

Changes:
- Modify getCurrentBridgeState() to iterate through all uplinks in
  knownConfig.Uplinks instead of only processing the first one
- Modify NeedToUpdateBridges to only compare OVS configurations,
  ignoring the GroupingPolicy field
- Add test cases for bridge with multiple uplinks
- Add test case for partial uplinks when some interfaces are missing
- Add test case for groupingPolicy difference being ignored

Fixes infinite reconciliation loop where bridge ports kept being
removed and re-added on every sync cycle.

Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
This commit addresses several issues in the sriov-network-config-daemon
relating to host configuration management and observability.

1. Fix kargs.sh for read-only root filesystems:
   The daemon now uses the host's /tmp directory (via chroot) for temporary
   workspace files. This allows the script to function correctly when the
   container is deployed with 'readOnlyRootFilesystem: true', which previously
   caused 'cp' and 'grep' failures.

2. Fix logging serialization bug:
   Updated NodeReconciler to correctly call the Name() method of the main
   plugin when logging errors. Previously, passing the function pointer directly
   caused "json: unsupported type: func() string" errors, masking the actual
   reconciliation failures.

Signed-off-by: Fred Rolland <frolland@nvidia.com>
Signed-off-by: Alexander Maslennikov <amaslennikov@nvidia.com>
- Avoid unnecessary reboots by checking Mellanox firmware multiport state.
- Ensure devlink parameter changes trigger interface reconciliation.

Signed-off-by: Fred Rolland <frolland@nvidia.com>
Signed-off-by: Michael Zeevi <mzeevi@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
@github-actions
Copy link

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@greptile-apps
Copy link

greptile-apps bot commented Jan 21, 2026

Greptile Summary

This PR removes the || true fallback operator from the apt-get update command in the NVIDIA Docker config daemon Dockerfile. This change improves build robustness by ensuring the Docker build fails explicitly if package updates fail, rather than silently continuing with the previous workaround. Since the underlying GPG and authentication issues are already handled by the -o Acquire::AllowInsecureRepositories=true and --allow-unauthenticated command-line options added in the previous commit, the || true fallback is no longer necessary and was hiding potential package update failures.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects a simple, well-justified change that improves Docker build failure handling without introducing any regressions. The change removes an unnecessary workaround operator while properly configured command options handle the underlying package repository issues. No files require special attention.
  • No files require special attention

Sequence Diagram

sequenceDiagram
    participant Docker Build
    participant apt-get update
    participant apt-get install
    participant Build Success

    Docker Build->>apt-get update: Run with AllowInsecureRepositories
    Note over apt-get update: GPG issue handled by command options
    apt-get update-->>Docker Build: Success
    Docker Build->>apt-get install: Run with --allow-unauthenticated
    apt-get install-->>Docker Build: Success
    Docker Build->>Build Success: Continue build
    Note over Docker Build: || true removed: Build will fail if update fails

Loading

@maze88 maze88 force-pushed the network-operator-26.1.x branch from 47ae0bf to f0a2814 Compare January 21, 2026 09:33
@rollandf rollandf closed this Jan 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants