Skip to content

ZFS bock mode devices sometimes don't appear in time (or disappear) #1505

Open
@mschiff

Description

@mschiff

Required information

  • Distribution: Gentoo
  • Distribution version: --
  • The output of "incus info":
config: {}
api_extensions:
[…]
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: root
auth_user_method: unix
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIIB+zCCAYGgAwIBAgIRAICBOYWaBvVjqdDlc8YUw/swCgYIKoZIzj0EAwMwMDEZ
    MBcGA1UEChMQTGludXggQ29udGFpbmVyczETMBEGA1UEAwwKcm9vdEBwbHV0bzAe
    Fw0yNDAxMTIxNTUzNTNaFw0zNDAxMDkxNTUzNTNaMDAxGTAXBgNVBAoTEExpbnV4
    IENvbnRhaW5lcnMxEzARBgNVBAMMCnJvb3RAcGx1dG8wdjAQBgcqhkjOPQIBBgUr
    gQQAIgNiAASHXIMZRZz07Fvdh1ehgSKNc3yk/fYbTPMGbOF3VyeDWVWEsjhHQFeW
    rJ9EPYNjogBUba3NKVaEmMAz+nKDM4TcwwLytIUAFLFyIqaRnuxFHF6RXU/g7jGV
    RzwqYxSzCcujXzBdMA4GA1UdDwEB/wQEAwIFoDATBgNVHSUEDDAKBggrBgEFBQcD
    ATAMBgNVHRMBAf8EAjAAMCgGA1UdEQQhMB+CBXBsdXRvhwR/AAABhxAAAAAAAAAA
    AAAAAAAAAAABMAoGCCqGSM49BAMDA2gAMGUCMQD9KzSklLuzuifRoJvKuwCPmn9r
    UyfCac0hEA5Kj57/hgJ0y/+n3ODpln3JuAdn95MCMBJ8b5/W6x6sxC2e+vso+lfD
    LNTgFQ6Ha6FfWEKBTGPb0KPr3XICf5IMIWb6kI3OiA==
    -----END CERTIFICATE-----
  certificate_fingerprint: a75220cf363eb0c0ffd6427f5d9d88510d5d608d335514b76548b47e37bb4e69
  driver: lxc | qemu
  driver_version: 6.0.2 | 8.2.3
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "false"
    unpriv_fscaps: "true"
  kernel_version: 6.6.62-gentoo-dist
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Gentoo
  os_version: "2.17"
  project: default
  server: incus
  server_clustered: false
  server_event_mode: full-mesh
  server_name: pluto
  server_pid: 1857954
  server_version: "6.7"
  storage: zfs
  storage_version: 2.2.5-r0-gentoo
  storage_supported_drivers:
  - name: zfs
    version: 2.2.5-r0-gentoo
    remote: false
  - name: dir
    version: "1"
    remote: false

Issue description

Here is another one on the same setup as in #1483

While running incus-benchmark launch --count 200 --parallel 11 (with --parallel >= 11 on that 12 thread machine), some instances fail with this errors:

(sysadm_r)@pluto ~ # incus-benchmark launch --count 200 --parallel 11
Test environment:
  Server backend: incus
  Server version: 6.7
  Kernel: Linux
  Kernel tecture: x86_64
  Kernel version: 6.6.62-gentoo-dist
  Storage backend: zfs
  Storage version: 2.2.5-r0-gentoo
  Container backend: lxc | qemu
  Container version: 6.0.2 | 8.2.3

Test variables:
  Container count: 200
  Container mode: unprivileged
  Startup mode: normal startup
  Image: images:ubuntu/22.04
  Batches: 18
  Batch size: 11
  Remainder: 2

[Dec 13 09:37:00.962] Found image in local store: 3bf7665e786f9baf780a5774fae5b43f5a29c79d76af1474ff6e8f212fcfa25d
[Dec 13 09:37:00.962] Batch processing start
[Dec 13 09:37:58.761] Processed 11 containers in 57.798s (0.190/s)
[Dec 13 09:38:54.180] Processed 22 containers in 113.217s (0.194/s)
[Dec 13 09:40:51.015] Processed 44 containers in 230.053s (0.191/s)
[Dec 13 09:41:25.084] Failed to launch container 'benchmark-052': Failed creating instance from image: Failed to activate volume: Failed to locate zvol for "zdata/incus/containers/benchmark-052": context deadline exceeded
[Dec 13 09:41:25.099] Failed to launch container 'benchmark-055': Failed creating instance from image: Failed to activate volume: Failed to locate zvol for "zdata/incus/containers/benchmark-055": context deadline exceeded
[Dec 13 09:44:18.373] Processed 88 containers in 437.411s (0.201/s)
[Dec 13 09:44:53.504] Failed to launch container 'benchmark-093': Failed creating instance from image: Failed to activate volume: Failed to locate zvol for "zdata/incus/containers/benchmark-093": context deadline exceeded
[Dec 13 09:45:43.469] Failed to launch container 'benchmark-107': Failed creating instance from image: Failed to mount "/dev/zvol/zdata/incus/containers/benchmark-107" on "/var/lib/incus/storage-pools/default/containers/benchmark-107" using "ext4": no such file or directory
[Dec 13 09:51:44.892] Processed 176 containers in 883.929s (0.199/s)
[Dec 13 09:53:45.504] Batch processing completed in 1004.542s
(sysadm_r)@pluto ~ #

indeed 4 instances missing:

(sysadm_r)@pluto ~ # incus list|grep benchmark| wc -l
196
(sysadm_r)@pluto ~ #

The ZFS storage itself is fine and healthy. The mentioned filesets are missing in the pool. If I increase --parallel further, I get more failing instances (both, "context deadline exceeded" and "no such file or directory")

Steps to reproduce

  1. Have zfs storage backend (here 2x nvme as system and zfs special device and 2x spinning rust for data)
  2. run above incus-benchmark command with --parallel matching (alomost) system cpu threads

Information to attach

I also see following errors, but also for instances that did not fail in the end:

Dez 13 09:43:25 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't open blockdev
Dez 13 09:43:25 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't lookup blockdev
Dez 13 09:43:26 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't lookup blockdev
Dez 13 09:43:26 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't lookup blockdev
Dez 13 09:43:27 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-086: Can't lookup blockdev
Dez 13 09:45:08 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't open blockdev
Dez 13 09:45:09 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:09 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:10 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:10 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:11 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:12 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:12 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:13 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:13 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:14 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:14 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:15 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:15 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:16 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:16 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:17 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:17 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:18 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:18 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-107: Can't lookup blockdev
Dez 13 09:45:36 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-110: Can't open blockdev
Dez 13 09:52:19 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-183: Can't open blockdev
Dez 13 09:52:34 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-195: Can't open blockdev
Dez 13 09:52:34 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-198: Can't open blockdev
Dez 13 09:52:34 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-195: Can't lookup blockdev
Dez 13 09:52:35 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-198: Can't lookup blockdev
Dez 13 09:52:35 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-195: Can't lookup blockdev
Dez 13 09:52:35 pluto kernel: /dev/zvol/zdata/incus/containers/benchmark-198: Can't lookup blockdev

To me this looks like load issues and incusd not waiting long enough for devices to appear? Because it eventually gave up on 107 (does not exist), but not for example on 086 (exists and is running)

Here is some of the the output of incus monitor --pretty, but from another run.

benchmark-055 failed with "no such file or directory":

(sysadm_r)@pluto ~ # grep benchmark-055 incus-monitor.log 
time="2024-12-13T10:35:54+01:00" level=info msg="ID: 7e31f71c-9194-40da-aaa8-eca2192195c2, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:35:54.267475495 +0100 CET" Err= Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-055]]" Status=Pending StatusCode=Pending UpdatedAt="2024-12-13 10:35:54.267475495 +0100 CET"
time="2024-12-13T10:35:54+01:00" level=info msg="ID: 7e31f71c-9194-40da-aaa8-eca2192195c2, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:35:54.267475495 +0100 CET" Err= Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-055]]" Status=Running StatusCode=Running UpdatedAt="2024-12-13 10:35:54.267475495 +0100 CET"
time="2024-12-13T10:35:54+01:00" level=debug msg="Instance operation lock created" action=create instance=benchmark-055 project=default reusable=false
time="2024-12-13T10:35:54+01:00" level=info msg="Creating instance" ephemeral=false instance=benchmark-055 instanceType=container project=default
time="2024-12-13T10:35:54+01:00" level=debug msg="Adding device" device=eth0 instance=benchmark-055 instanceType=container project=default type=nic
time="2024-12-13T10:35:54+01:00" level=info msg="Action: instance-created, Source: /1.0/instances/benchmark-055, Requestor: unix/root (@)" location=none storage-pool=default type=container
time="2024-12-13T10:35:54+01:00" level=info msg="Created instance" ephemeral=false instance=benchmark-055 instanceType=container project=default
time="2024-12-13T10:35:54+01:00" level=debug msg="Adding device" device=root instance=benchmark-055 instanceType=container project=default type=disk
time="2024-12-13T10:35:54+01:00" level=debug msg="CreateInstanceFromImage started" driver=zfs instance=benchmark-055 pool=default project=default
time="2024-12-13T10:35:56+01:00" level=debug msg="Activated ZFS volume" dev=zdata/incus/containers/benchmark-055 driver=zfs pool=default volName=benchmark-055
time="2024-12-13T10:36:14+01:00" level=debug msg="Deactivated ZFS volume" dev=zdata/incus/containers/benchmark-055 driver=zfs pool=default volName=benchmark-055
time="2024-12-13T10:36:15+01:00" level=debug msg="Removing device" device=root instance=benchmark-055 instanceType=container project=default type=disk
time="2024-12-13T10:36:15+01:00" level=debug msg="CreateInstanceFromImage finished" driver=zfs instance=benchmark-055 pool=default project=default
time="2024-12-13T10:36:15+01:00" level=debug msg="Instance operation lock finished" action=create err="<nil>" instance=benchmark-055 project=default reusable=false
time="2024-12-13T10:36:15+01:00" level=debug msg="Removing device" device=eth0 instance=benchmark-055 instanceType=container project=default type=nic
time="2024-12-13T10:36:15+01:00" level=info msg="ID: 7e31f71c-9194-40da-aaa8-eca2192195c2, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:35:54.267475495 +0100 CET" Err="Failed creating instance from image: Failed to mount \"/dev/zvol/zdata/incus/containers/benchmark-055\" on \"/var/lib/incus/storage-pools/default/containers/benchmark-055\" using \"ext4\": no such file or directory" Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-055]]" Status=Failure StatusCode=Failure UpdatedAt="2024-12-13 10:35:54.267475495 +0100 CET"
time="2024-12-13T10:36:15+01:00" level=debug msg="Failure for operation" class=task description="Creating instance" err="Failed creating instance from image: Failed to mount \"/dev/zvol/zdata/incus/containers/benchmark-055\" on \"/var/lib/incus/storage-pools/default/containers/benchmark-055\" using \"ext4\": no such file or directory" operation=7e31f71c-9194-40da-aaa8-eca2192195c2 project=default
(sysadm_r)@pluto ~ # 

benchmark-082 failed with context deadline exceeded

(sysadm_r)@pluto ~ # grep benchmark-082 incus-monitor .log
time="2024-12-13T10:38:30+01:00" level=info msg="ID: 69bfdd5e-d71c-4503-a18d-56fac3a0f59c, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:38:30.739073919 +0100 CET" Err= Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-082]]" Status=Pending StatusCode=Pending UpdatedAt="2024-12-13 10:38:30.739073919 +0100 CET"
time="2024-12-13T10:38:30+01:00" level=info msg="ID: 69bfdd5e-d71c-4503-a18d-56fac3a0f59c, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:38:30.739073919 +0100 CET" Err= Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-082]]" Status=Running StatusCode=Running UpdatedAt="2024-12-13 10:38:30.739073919 +0100 CET"
time="2024-12-13T10:38:30+01:00" level=debug msg="Instance operation lock created" action=create instance=benchmark-082 project=default reusable=false
time="2024-12-13T10:38:30+01:00" level=info msg="Creating instance" ephemeral=false instance=benchmark-082 instanceType=container project=default
time="2024-12-13T10:38:31+01:00" level=debug msg="Adding device" device=eth0 instance=benchmark-082 instanceType=container project=default type=nic
time="2024-12-13T10:38:31+01:00" level=debug msg="Adding device" device=root instance=benchmark-082 instanceType=container project=default type=disk
time="2024-12-13T10:38:31+01:00" level=info msg="Action: instance-created, Source: /1.0/instances/benchmark-082, Requestor: unix/root (@)" location=none storage-pool=default type=container
time="2024-12-13T10:38:31+01:00" level=info msg="Created instance" ephemeral=false instance=benchmark-082 instanceType=container project=default
time="2024-12-13T10:38:31+01:00" level=debug msg="CreateInstanceFromImage started" driver=zfs instance=benchmark-082 pool=default project=default
time="2024-12-13T10:39:11+01:00" level=debug msg="Removing device" device=eth0 instance=benchmark-082 instanceType=container project=default type=nic
time="2024-12-13T10:39:11+01:00" level=debug msg="CreateInstanceFromImage finished" driver=zfs instance=benchmark-082 pool=default project=default
time="2024-12-13T10:39:11+01:00" level=debug msg="Instance operation lock finished" action=create err="<nil>" instance=benchmark-082 project=default reusable=false
time="2024-12-13T10:39:11+01:00" level=debug msg="Removing device" device=root instance=benchmark-082 instanceType=container project=default type=disk
time="2024-12-13T10:39:11+01:00" level=info msg="ID: 69bfdd5e-d71c-4503-a18d-56fac3a0f59c, Class: task, Description: Creating instance" CreatedAt="2024-12-13 10:38:30.739073919 +0100 CET" Err="Failed creating instance from image: Failed to activate volume: Failed to locate zvol for \"zdata/incus/containers/benchmark-082\": context deadline exceeded" Location=none MayCancel=false Metadata="map[]" Resources="map[instances:[/1.0/instances/benchmark-082]]" Status=Failure StatusCode=Failure UpdatedAt="2024-12-13 10:38:30.739073919 +0100 CET"
time="2024-12-13T10:39:11+01:00" level=debug msg="Failure for operation" class=task description="Creating instance" err="Failed creating instance from image: Failed to activate volume: Failed to locate zvol for \"zdata/incus/containers/benchmark-082\": context deadline exceeded" operation=69bfdd5e-d71c-4503-a18d-56fac3a0f59c project=default
(sysadm_r)@pluto ~ # 

Metadata

Metadata

Assignees

No one assigned

    Labels

    IncompleteWaiting on more information from reporter

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions