Description
Nomad version
Nomad v1.8.4
BuildDate 2024-09-17T20:18:34Z
Revision 22ab32e
aws-csi plugin version
public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.41.0
Operating system and Environment details
Rocky 9.3
Issue
Intermittently (about 25% of the time) if we manually stop an allocation of a job that uses an aws ebs CSI volume, and then nomad automatically creates a new allocation of the job, the volume gets into a state where any access to it results in input/output error. The issue is fixed by fully stopping the job (Stop Job in the GUI) and then starting it.
I am unsure if this is an issue with the CSI plugin or with nomad, but since stopping the job and then starting it fresh fixes it I am leaning towards nomad.
Reproduction steps
Stop the allocation and allow nomad to create a new one.
Expected Result
Volume is mounted to the new allocation and usable for reading and writing
Actual Result
Any access gives input/output error. Even if i ssh to the client node where the volume is mounted and try to access the volume I get the error:
$ ls /var/nomad/client/csi/node/aws_ebs/staging/central-infra/locates-server/rw-file-system-single-node-writer
ls: cannot access '/var/nomad/client/csi/node/aws_ebs/staging/central-infra/locates-server/rw-file-system-single-node-writer': Permission denied
relevant ebs-nodes plugin logs
Allocation is dropped right before this time:
I0506 22:41:17.177441 1 node.go:466] "NodeUnpublishVolume: called" args="volume_id:\"vol-0944933a47e800e4b\" target_path:\"/local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer\""
I0506 22:41:17.177487 1 node.go:486] "NodeUnpublishVolume: unmounting" target="/local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.177510 1 mount_helper_common.go:93] unmounting "/local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer" (corruptedMount: false, mounterCanSkipMountPointChecks: true)
I0506 22:41:17.177522 1 mount_linux.go:412] Unmounting /local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer
I0506 22:41:17.181837 1 mount_helper_common.go:150] Deleting path "/local/csi/per-alloc/759d066a-b84c-825b-9d46-b8a6b41adcc2/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.181881 1 node.go:482] "NodeUnPublishVolume: volume operation finished" volumeId="vol-0944933a47e800e4b"
I0506 22:41:17.181892 1 inflight.go:74] "Node Service: volume operation finished" key="vol-0944933a47e800e4b"
I0506 22:41:17.187149 1 node.go:300] "NodeUnstageVolume: called" args="volume_id:\"vol-0944933a47e800e4b\" staging_target_path:\"/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer\""
I0506 22:41:17.187374 1 node.go:340] "NodeUnstageVolume: unmounting" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.187387 1 mount_helper_common.go:93] unmounting "/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer" (corruptedMount: false, mounterCanSkipMountPointChecks: true)
I0506 22:41:17.187399 1 mount_linux.go:412] Unmounting /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer
I0506 22:41:17.227478 1 mount_helper_common.go:150] Deleting path "/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.227541 1 node.go:345] "NodeUnStageVolume: successfully unstaged volume" volumeID="vol-0944933a47e800e4b" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:17.227552 1 node.go:315] "NodeUnStageVolume: volume operation finished" volumeID="vol-0944933a47e800e4b"
I0506 22:41:17.227561 1 inflight.go:74] "Node Service: volume operation finished" key="vol-0944933a47e800e4b"
Allocation is started before this time:
I0506 22:41:33.299587 1 node.go:112] "NodeStageVolume: called" args="volume_id:\"vol-0944933a47e800e4b\" publish_context:{key:\"devicePath\" value:\"/dev/xvdaa\"} staging_target_path:\"/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer\" volume_capability:{mount:{} access_mode:{mode:SINGLE_NODE_WRITER}}"
I0506 22:41:33.299750 1 mount_linux.go:107] "[Debug] successfully resolved" nvmeName="nvme-Amazon_Elastic_Block_Store_vol0944933a47e800e4b" nvmeDevicePath="/dev/nvme4n1"
I0506 22:41:33.301827 1 node.go:241] "NodeStageVolume: checking if volume is already staged" device="" source="/dev/nvme4n1" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:33.301849 1 node.go:248] "NodeStageVolume: staging volume" source="/dev/nvme4n1" volumeID="vol-0944933a47e800e4b" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer" fstype="ext4"
I0506 22:41:33.355523 1 mount_linux.go:647] Attempting to mount disk /dev/nvme4n1 in ext4 format at /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer
I0506 22:41:33.355561 1 mount_linux.go:270] Mounting cmd (mount) with arguments (-t ext4 -o defaults /dev/nvme4n1 /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer)
I0506 22:41:33.380686 1 node.go:295] "NodeStageVolume: successfully staged volume" source="/dev/nvme4n1" volumeID="vol-0944933a47e800e4b" target="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer" fstype="ext4"
I0506 22:41:33.380700 1 node.go:190] "NodeStageVolume: volume operation finished" volumeID="vol-0944933a47e800e4b"
I0506 22:41:33.380709 1 inflight.go:74] "Node Service: volume operation finished" key="vol-0944933a47e800e4b"
I0506 22:41:33.381333 1 node.go:413] "NodePublishVolume: called" args="volume_id:\"vol-0944933a47e800e4b\" publish_context:{key:\"devicePath\" value:\"/dev/xvdaa\"} staging_target_path:\"/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer\" target_path:\"/local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer\" volume_capability:{mount:{} access_mode:{mode:SINGLE_NODE_WRITER}}"
I0506 22:41:33.381432 1 mount_linux.go:199] "NodePublishVolume: creating dir" target="/local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer"
I0506 22:41:33.381533 1 node.go:757] "NodePublishVolume: mounting" source="/local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer" target="/local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer" mountOptions=["bind"] fsType="ext4"
I0506 22:41:33.381687 1 mount_linux.go:270] Mounting cmd (mount) with arguments (-t ext4 -o bind /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer /local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer)
I0506 22:41:33.382818 1 mount_linux.go:270] Mounting cmd (mount) with arguments (-t ext4 -o bind,remount /local/csi/staging/<redacted-namespace>/<redacted-job-name>/rw-file-system-single-node-writer /local/csi/per-alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/rw-file-system-single-node-writer)
I0506 22:41:33.383872 1 node.go:442] "NodePublishVolume: volume operation finished" volumeId="vol-0944933a47e800e4b"
I0506 22:41:33.383887 1 inflight.go:74] "Node Service: volume operation finished" key="vol-0944933a47e800e4b"
Job file (if appropriate)
job "<redacted-job-name>" {
namespace = "<redacted-namespace>"
node_pool = "<redacted-namespace>-nodes"
datacenters = ["aws-us-east-1"]
type = "service"
group "<redacted-job-name>" {
count = 1
network {
port "rpc" {}
}
volume "<redacted_volume_name>" {
type = "csi"
read_only = false
source = "<redacted-job-name>"
access_mode = "single-node-writer"
attachment_mode = "file-system"
}
task "<redacted-job-name>" {
driver = "docker"
restart {
interval = "30m"
attempts = 120
delay = "15s"
render_templates = true
}
volume_mount {
volume = "<redacted_volume_name>"
destination = "/opt/<redacted_volume_name>"
read_only = false
}
resources {
memory = 128
memory_max = 64000
}
config {
image = "<redacted-image-url>"
force_pull = true
ports = ["rpc"]
ipc_mode = "host"
}
identity {
env = true
}
service {
name = "<redacted-job-name>"
provider = "nomad"
port = "rpc"
check {
type = "http"
interval = "3s"
path = "/health"
timeout = "1s"
}
tags = [
"traefik.enable=true",
"traefik.namespace=<redacted-namespace>",
]
}
}
}
}
Nomad Server logs (if appropriate)
May 06 22:31:40 nomad-server0.dev.org.wingspantrading.com nomad[1027]: 2025-05-06T22:31:40.465Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=central-infra volume_id=locates-server
May 06 22:31:40 nomad-server0.dev.org.wingspantrading.com nomad[1027]: nomad.volumes_watcher: error releasing volume claims: namespace=central-infra volume_id=locates-server
Nomad Client logs (if appropriate)
May 06 22:41:17 nomad[1691366]: 2025-05-06T22:41:17.118Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Terminated msg="Exit Code: 0" failed=false
May 06 22:41:17 nomad[1691366]: client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Terminated msg="Exit Code: 0" failed=false
May 06 22:41:17 nomad[1691366]: 2025-05-06T22:41:17.140Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Killed msg="Task successfully killed" failed=false
May 06 22:41:17 nomad[1691366]: client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Killed msg="Task successfully killed" failed=false
May 06 22:41:17 nomad[1691366]: 2025-05-06T22:41:17.168Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> plugin=/usr/local/bin/nomad id=3744565
May 06 22:41:17 nomad[1691366]: client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> plugin=/usr/local/bin/nomad id=3744565
May 06 22:41:17 nomad[1691366]: 2025-05-06T22:41:17.361Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
May 06 22:41:17 nomad[1691366]: client.alloc_runner.task_runner: Task event: alloc_id=759d066a-b84c-825b-9d46-b8a6b41adcc2 task=<redacted-job-name> type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
May 06 22:41:33 nomad[1691366]: 2025-05-06T22:41:33.391Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type="Task Setup" msg="Building Task Directory" failed=false
May 06 22:41:33 nomad[1691366]: client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type="Task Setup" msg="Building Task Directory" failed=false
May 06 22:41:33 nomad[1691366]: 2025-05-06T22:41:33.469Z [INFO] agent: (runner) rendered "(dynamic)" => "/var/nomad/alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/local/config.yml"
May 06 22:41:33 nomad[1691366]: agent: (runner) rendered "(dynamic)" => "/var/nomad/alloc/0850ca8c-67e5-3ba1-8e88-a20d76964955/<redacted-job-name>/local/config.yml"
May 06 22:41:33 nomad[1691366]: 2025-05-06T22:41:33.488Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type=Driver msg="Downloading image" failed=false
May 06 22:41:33 nomad[1691366]: client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type=Driver msg="Downloading image" failed=false
May 06 22:41:33 nomad[1691366]: 2025-05-06T22:41:33.880Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type=Started msg="Task started by client" failed=false
May 06 22:41:33 nomad[1691366]: client.alloc_runner.task_runner: Task event: alloc_id=0850ca8c-67e5-3ba1-8e88-a20d76964955 task=<redacted-job-name> type=Started msg="Task started by client" failed=false
Metadata
Metadata
Assignees
Type
Projects
Status