
Nomad client after restart loses track of CSI volumes with staging still in use #25813


Description

@ygg-drop

Nomad version

Output from nomad version

Nomad v1.8.9+ent
BuildDate 2025-01-14T19:11:47Z
Revision fc0f34f5b196ce9fcd0c62b6a5e7ce23934826d4

Operating system and Environment details

Alpine 3.20

Issue

When using a CSI plugin with staging (e.g. CephCSI for CephFS volumes), the Nomad client correctly keeps track of which mounted CSI volumes with the multi-node-multi-writer access mode are in use by more than one allocation. When two allocations use the same volume, the staging mount is shared between them; when one is stopped, the staging mount is kept as long as the other allocation is running (i.e. only NodeUnpublishVolume is called on the volume, not NodeUnstageVolume).

However, when the Nomad client is restarted, it loses track of which CSI volumes are still in use by other allocations. In the scenario above, with two allocations using the same volume, when one allocation is stopped Nomad calls NodeUnpublishVolume followed by NodeUnstageVolume, even though the volume is still in use by the other allocation.

From the allocation's perspective, the consequences apparently depend on which task drivers are in use and how mount propagation is configured on the host. In my environment, with the Docker driver and the parent mount having shared propagation, it took a while to notice this bug: after the staging mount was unmounted in the host mount namespace, the allocation that was left running still kept the mount in its own mount namespace, so nothing actually broke from its perspective. When the second allocation was stopped, the Nomad client was able to unmount the volume properly. However, I'd expect more disastrous consequences in less favorable environments.

Reproduction steps

  1. Set up a Nomad cluster with a CSI plugin that uses staging (e.g. Ceph-CSI).
  2. Create a CSI volume with the multi-node-multi-writer access mode (e.g. a CephFS volume); see the volume sketch after this list.
  3. Run two jobs that use the same volume on the same node.
  4. Restart the Nomad client.
  5. Stop one of the jobs. Nomad will call NodeUnstageVolume on the volume despite the other job still using it.
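
For step 2, a volume specification along these lines should work, registered with nomad volume create (a minimal sketch; the volume ID and plugin ID are placeholders, not my actual config):

    # volume.hcl -- placeholder IDs, adjust for your cluster
    id        = "cephfs-shared"
    name      = "cephfs-shared"
    type      = "csi"
    plugin_id = "cephfs-csi"

    capability {
      access_mode     = "multi-node-multi-writer"
      attachment_mode = "file-system"
    }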

Expected Result

After a restart, the Nomad client should continue to track which allocations are using the same CSI volumes with staging.

Actual Result

After a restart, when more than one allocation is using the same CSI volume with staging, stopping one of the allocations can break the volume mount for the other by unmounting the staging path.

Job file (if appropriate)
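
A minimal job sketch that exercises this setup (the volume ID, node name, and image are placeholders, not my actual config):

    job "writer-a" {
      datacenters = ["dc1"]

      # Pin to a single node so both jobs land on it and share the
      # volume's staging mount ("node-1" is a placeholder node name).
      constraint {
        attribute = "${node.unique.name}"
        value     = "node-1"
      }

      group "app" {
        volume "shared" {
          type            = "csi"
          source          = "cephfs-shared"
          access_mode     = "multi-node-multi-writer"
          attachment_mode = "file-system"
        }

        task "write" {
          driver = "docker"

          config {
            image   = "alpine:3.20"
            command = "sh"
            args    = ["-c", "while true; do date >> /data/out; sleep 5; done"]
          }

          volume_mount {
            volume      = "shared"
            destination = "/data"
          }
        }
      }
    }

Running a second copy under a different job name (e.g. "writer-b") gives two allocations sharing the volume's staging mount on the same node.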

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)
