aws-ebs CSI volume gives Input/output error intermittently after stopping and rerunning allocation #25819

Open
@msherman13

Description

Nomad version

Nomad v1.8.4
BuildDate 2024-09-17T20:18:34Z
Revision 22ab32e

aws-csi plugin version

public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.41.0

Operating system and Environment details

Rocky 9.3

Issue

Intermittently (roughly 25% of the time), if we manually stop an allocation of a job that uses an AWS EBS CSI volume and Nomad then automatically creates a replacement allocation, the volume ends up in a state where any access to it returns `Input/output error`. The issue is fixed by fully stopping the job (Stop Job in the GUI) and then starting it again.

I am unsure whether this is an issue with the CSI plugin or with Nomad, but since stopping the job and then starting it fresh fixes it, I am leaning towards Nomad.

Reproduction steps

Stop the allocation and allow Nomad to create a new one.
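For concreteness, the reproduction and the workaround can be driven from the Nomad CLI. This is a sketch assuming the standard `nomad` commands; the allocation ID is the one from the logs below, and the job file name is a placeholder:

```console
$ nomad alloc stop 759d066a-b84c-825b-9d46-b8a6b41adcc2
$ # Nomad reschedules a replacement allocation; ~25% of the time its CSI
$ # volume then returns "Input/output error" on any access.

$ # Workaround: fully stop the job, then start it fresh.
$ nomad job stop <redacted-job-name>
$ nomad job run <redacted-job-name>.nomad.hcl
```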

Expected Result

The volume is mounted into the new allocation and is usable for reading and writing.

Actual Result

Any access gives `Input/output error`. Even if I SSH to the client node where the volume is mounted and try to access the volume, I get the error:

$ ls /var/nomad/client/csi/node/aws_ebs/staging/central-infra/locates-server/rw-file-system-single-node-writer
ls: cannot access '/var/nomad/client/csi/node/aws_ebs/staging/central-infra/locates-server/rw-file-system-single-node-writer': Permission denied
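One way to triage this state on the client node is to distinguish "nothing mounted at the path" from "a mount is present but the filesystem underneath is dead". A minimal sketch, assuming util-linux's `mountpoint(1)` is available; the path is the staging path from the report, and `check_csi_mount` is a hypothetical helper, not part of Nomad or the plugin:

```shell
# Hypothetical triage helper: classify a suspected stale CSI mount.
check_csi_mount() {
  local path="$1"
  if ! mountpoint -q -- "$path"; then
    # Nothing is mounted here (or the path does not exist).
    echo "NOT-MOUNTED: $path"
  elif ! stat -- "$path" >/dev/null 2>&1; then
    # A mount is present but stat fails; a bind mount over a detached
    # EBS device typically fails here with EIO, matching the error above.
    echo "STALE: $path"
  else
    echo "OK: $path"
  fi
}

check_csi_mount /var/nomad/client/csi/node/aws_ebs/staging/central-infra/locates-server/rw-file-system-single-node-writer
```

If this reports STALE for the staging path while the EBS volume shows as attached in AWS, that would point at the staging mount surviving a detach/reattach cycle rather than at the task itself.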

Relevant EBS CSI node plugin logs

The old allocation is stopped right before this time:

I0506 22:41:17.177441       1 node.go:466] "NodeUnpublishVolume: called" args="volume_id:\"vol-0944933a47e800e4b\" target_path:\"/local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer\""
I0506 22:41:17.177487       1 node.go:486] "NodeUnpublishVolume: unmounting" target="/local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.177510       1 mount_helper_common.go:93] unmounting "/local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer" (corruptedMount: false, mounterCanSkipMountPointChecks: true)
I0506 22:41:17.177522       1 mount_linux.go:412] Unmounting /local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer
I0506 22:41:17.181837       1 mount_helper_common.go:150] Deleting path "/local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.181881       1 node.go:482] "NodeUnPublishVolume: volume operation finished" volumeId="vol-0944933a47e800e4b"
I0506 22:41:17.181892       1 inflight.go:74] "Node Service: volume operation finished" key="vol-0944933a47e800e4b"
I0506 22:41:17.187149       1 node.go:300] "NodeUnstageVolume: called" args="volume_id:\"vol-0944933a47e800e4b\" staging_target_path:\"/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer\""
I0506 22:41:17.187374       1 node.go:340] "NodeUnstageVolume: unmounting" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.187387       1 mount_helper_common.go:93] unmounting "/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer" (corruptedMount: false, mounterCanSkipMountPointChecks: true)
I0506 22:41:17.187399       1 mount_linux.go:412] Unmounting /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer
I0506 22:41:17.227478       1 mount_helper_common.go:150] Deleting path "/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.227541       1 node.go:345] "NodeUnStageVolume: successfully unstaged volume" volumeID="vol-0944933a47e800e4b" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.227552       1 node.go:315] "NodeUnStageVolume: volume operation finished" volumeID="vol-0944933a47e800e4b"
I0506 22:41:17.227561       1 inflight.go:74] "Node Service: volume operation finished" key="vol-0944933a47e800e4b"

The new allocation is started right before this time:

I0506 22:41:33.299587       1 node.go:112] "NodeStageVolume: called" args="volume_id:\"vol-0944933a47e800e4b\" publish_context:{key:\"devicePath\" value:\"/dev/xvdaa\"} staging_target_path:\"/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer\" volume_capability:{mount:{} access_mode:{mode:SINGLE_NODE_WRITER}}"
I0506 22:41:33.299750       1 mount_linux.go:107] "[Debug] successfully resolved" nvmeName="nvme-Amazon_Elastic_Block_Store_vol0944933a47e800e4b" nvmeDevicePath="/dev/nvme4n1"
I0506 22:41:33.301827       1 node.go:241] "NodeStageVolume: checking if volume is already staged" device="" source="/dev/nvme4n1" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:33.301849       1 node.go:248] "NodeStageVolume: staging volume" source="/dev/nvme4n1" volumeID="vol-0944933a47e800e4b" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer" fstype="ext4"
I0506 22:41:33.355523       1 mount_linux.go:647] Attempting to mount disk /dev/nvme4n1 in ext4 format at /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer
I0506 22:41:33.355561       1 mount_linux.go:270] Mounting cmd (mount) with arguments (-t ext4 -o defaults /dev/nvme4n1 /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer)
I0506 22:41:33.380686       1 node.go:295] "NodeStageVolume: successfully staged volume" source="/dev/nvme4n1" volumeID="vol-0944933a47e800e4b" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer" fstype="ext4"
I0506 22:41:33.380700       1 node.go:190] "NodeStageVolume: volume operation finished" volumeID="vol-0944933a47e800e4b"
I0506 22:41:33.380709       1 inflight.go:74] "Node Service: volume operation finished" key="vol-0944933a47e800e4b"
I0506 22:41:33.381333       1 node.go:413] "NodePublishVolume: called" args="volume_id:\"vol-0944933a47e800e4b\" publish_context:{key:\"devicePath\" value:\"/dev/xvdaa\"} staging_target_path:\"/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer\" target_path:\"/local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer\" volume_capability:{mount:{} access_mode:{mode:SINGLE_NODE_WRITER}}"
I0506 22:41:33.381432       1 mount_linux.go:199] "NodePublishVolume: creating dir" target="/local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:33.381533       1 node.go:757] "NodePublishVolume: mounting" source="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer" target="/local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer" mountOptions=["bind"] fsType="ext4"
I0506 22:41:33.381687       1 mount_linux.go:270] Mounting cmd (mount) with arguments (-t ext4 -o bind /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer /local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer)
I0506 22:41:33.382818       1 mount_linux.go:270] Mounting cmd (mount) with arguments (-t ext4 -o bind,remount /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer /local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer)
I0506 22:41:33.383872       1 node.go:442] "NodePublishVolume: volume operation finished" volumeId="vol-0944933a47e800e4b"
I0506 22:41:33.383887       1 inflight.go:74] "Node Service: volume operation finished" key="vol-0944933a47e800e4b"

Job file (if appropriate)

job "<redacted-job-name>" {
  namespace = "<redacted-namespace>"
  node_pool = "<redacted-namespace>-nodes"
  datacenters = ["aws-us-east-1"]
  type = "service"

  group "<redacted-job-name>" {
    count = 1

    network {
      port "rpc" {}
    }

    volume "<redacted_volume_name>" {
      type = "csi"
      read_only = false
      source = "<redacted-job-name>"
      access_mode = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "<redacted-job-name>" {
      driver = "docker"

      restart {
        interval = "30m"
        attempts = 120
        delay = "15s"
        render_templates = true
      }

      volume_mount {
        volume      = "<redacted_volume_name>"
        destination = "/opt/<redacted_volume_name>"
        read_only   = false
      }

      resources {
        memory = 128
        memory_max = 64000
      }

      config {
        image = "<redacted-image-url>"
        force_pull = true
        ports = ["rpc"]
        ipc_mode = "host"
      }

      identity {
        env = true
      }

      service {
        name = "<redacted-job-name>"
        provider = "nomad"
        port = "rpc"

        check {
          type     = "http"
          interval = "3s"
          path = "/health"
          timeout  = "1s"
        }

        tags = [
          "traefik.enable=true",
          "traefik.namespace=<redacted-namespace>",
        ]
      }
    }
  }
}

Nomad Server logs (if appropriate)

May 06 22:31:40 nomad-server0.dev.org.wingspantrading.com nomad[1027]:     2025-05-06T22:31:40.465Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=central-infra volume_id=locates-server

Nomad Client logs (if appropriate)

May 06 22:41:17  nomad[1691366]:     2025-05-06T22:41:17.118Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Terminated msg="Exit Code: 0" failed=false
May 06 22:41:17  nomad[1691366]:     2025-05-06T22:41:17.140Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Killed msg="Task successfully killed" failed=false
May 06 22:41:17  nomad[1691366]:     2025-05-06T22:41:17.168Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> plugin=/usr/local/bin/nomad id=3744565
May 06 22:41:17  nomad[1691366]:     2025-05-06T22:41:17.361Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
May 06 22:41:33  nomad[1691366]:     2025-05-06T22:41:33.391Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type="Task Setup" msg="Building Task Directory" failed=false
May 06 22:41:33  nomad[1691366]:     2025-05-06T22:41:33.469Z [INFO]  agent: (runner) rendered "(dynamic)" => "/var/nomad/alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/local/config.yml"
May 06 22:41:33  nomad[1691366]:     2025-05-06T22:41:33.488Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type=Driver msg="Downloading image" failed=false
May 06 22:41:33  nomad[1691366]:     2025-05-06T22:41:33.880Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type=Started msg="Task started by client" failed=false
