multipath-tools v0.1.0: segfault in ld-musl-x86_64.so.1 when processing FC paths + ExtensionServiceConfig writes empty configFile #1116

Description

@mcrespov

support-bundle.zip

Environment

  • Talos Linux: v1.13.4
  • Kubernetes: v1.36.1
  • Extension: siderolabs/multipath-tools v0.1.0
  • Hardware: Dell PowerEdge R660 (3 nodes: 1 control plane + 2 workers)
  • HBA: Emulex LPe35002-M2-D 2-Port 32Gb Fibre Channel Adapter
  • Storage: Pure Storage FA-C70R5, Purity//FA 6.10.4
  • Connectivity: Fibre Channel on a separate FC fabric; Ethernet uses LACP bonding (4x Broadcom 25Gb ports per node)
  • Zoning: verified redundant (Pure GUI shows "Redundant" status, 2 WWN per host, paths on CT0 and CT1)

Issue 1: multipathd segfaults in ld-musl-x86_64.so.1 when FC paths are present

After approximately 60-90 seconds of runtime, multipathd crashes with a segfault. The crash is consistently reproducible on all three nodes. The faulting instruction lies in ld-musl-x86_64.so.1, which in musl serves as both the dynamic linker and libc itself, and the crash occurs regardless of the path_checker used (tested with both directio and tur).

Kernel log output (same pattern on all nodes):

multipathd[20353]: segfault at 7f8512486b38 ip 00007f85127c5b8b sp 00007ffe9843e780 error 4 in ld-musl-x86_64.so.1[60b8b,7f8512779000+58000] likely on CPU 89 (core 7, socket 1)

The service restarts automatically and the crash repeats indefinitely. During the brief window before the crash, multipathd is visible and running, but all four SCSI paths remain in orphan state and no dm device is ever created.
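The kernel segfault line quoted above can be decoded mechanically, which may help when comparing crashes across nodes. The following is a minimal sketch, not part of the original report: the function name decode_segfault and the regular expression are illustrative, and the error-code interpretation uses the standard x86 page-fault bits (bit 0 = page present, bit 1 = write access, bit 2 = user mode). The first bracketed value (60b8b) is kept as reported; reading it as an offset into the file is an assumption.

```python
import re

# Kernel log line copied verbatim from the report.
LINE = (
    "multipathd[20353]: segfault at 7f8512486b38 ip 00007f85127c5b8b "
    "sp 00007ffe9843e780 error 4 in "
    "ld-musl-x86_64.so.1[60b8b,7f8512779000+58000] "
    "likely on CPU 89 (core 7, socket 1)"
)

# Illustrative pattern for the "name[offset,base+size]" form seen in this log.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) in "
    r"(?P<obj>[^\[]+)\[(?P<off>[0-9a-f]+),"
    r"(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault line into numeric fields and error-code bits."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    addr = int(m["addr"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    err = int(m["err"], 16)
    return {
        "object": m["obj"],
        "ip_minus_base": ip - base,            # offset of ip inside the mapping
        "reported_offset": int(m["off"], 16),  # value printed by the kernel
        "ip_inside_mapping": base <= ip < base + size,
        "fault_addr_inside_mapping": base <= addr < base + size,
        "page_present": bool(err & 1),         # 0 means the page was not mapped
        "write_access": bool(err & 2),         # 0 means the access was a read
        "user_mode": bool(err & 4),            # 1 means the fault came from user space
    }


if __name__ == "__main__":
    for key, value in decode_segfault(LINE).items():
        print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

For the line in this report, the sketch yields an instruction offset of 0x4cb8b inside the mapping and decodes error 4 as a user-mode read of a page that was not present. Resolving that offset to a function would require the exact ld-musl-x86_64.so.1 binary shipped in the extension.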

multipathd show paths output (captured before crash):

hcil     dev dev_t pri dm_st chk_st dev_st  next_check
12:0:0:1 sda 8:0   50  undef undef  unknown orphan
12:0:1:1 sdb 8:16  50  undef undef  unknown orphan
13:0:0:1 sdc 8:32  50  undef undef  unknown orphan
13:0:1:1 sdd 8:48  50  undef undef  unknown orphan
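To track whether paths ever leave the orphan state between restarts, the captured table can be parsed with a few lines of Python. This is a rough sketch that works on text already captured from multipathd show paths; the helper names parse_show_paths and orphan_paths are hypothetical. It splits on whitespace, which holds for this capture but may not hold for output where a column is empty or contains spaces.

```python
# Output copied from the report; the parser only handles this simple layout.
SHOW_PATHS = """\
hcil     dev dev_t pri dm_st chk_st dev_st  next_check
12:0:0:1 sda 8:0   50  undef undef  unknown orphan
12:0:1:1 sdb 8:16  50  undef undef  unknown orphan
13:0:0:1 sdc 8:32  50  undef undef  unknown orphan
13:0:1:1 sdd 8:48  50  undef undef  unknown orphan
"""


def parse_show_paths(text):
    """Turn the whitespace-separated table into a list of dicts keyed by header."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def orphan_paths(rows):
    """Return device names whose next_check column reads 'orphan'."""
    return [row["dev"] for row in rows if row.get("next_check") == "orphan"]


if __name__ == "__main__":
    rows = parse_show_paths(SHOW_PATHS)
    print("orphan paths:", orphan_paths(rows))
```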

The kernel detects all four paths correctly at boot (two per HBA port, two HBA ports per node):

scsi 13:0:0:1: Direct-Access     PURE     FlashArray       8888 PQ: 0 ANSI: 6
scsi 13:0:1:1: Direct-Access     PURE     FlashArray       8888 PQ: 0 ANSI: 6
scsi 14:0:0:1: Direct-Access     PURE     FlashArray       8888 PQ: 0 ANSI: 6
scsi 14:0:1:1: Direct-Access     PURE     FlashArray       8888 PQ: 0 ANSI: 6

udev correctly populates all relevant attributes on each device:

DEVTYPE=disk
SUBSYSTEM=block
DM_MULTIPATH_DEVICE_PATH=1
ID_SCSI=1
ID_SERIAL=3624a93707e521182588644d300011b2d
ID_WWN=0x624a93707e521182
ID_WWN_VENDOR_EXTENSION=0x588644d300011b2d
ID_WWN_WITH_EXTENSION=0x624a93707e521182588644d300011b2d
ID_SCSI_SERIAL=7E521182588644D300011B2D
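The identifiers above are mutually consistent, which matters because the configuration below groups paths by ID_WWN_WITH_EXTENSION. The sketch below checks three relationships that hold in this capture; they are observations about these specific values, not guaranteed rules for every array, and the function name check_identity is illustrative.

```python
# Attribute values copied from the udev output in the report.
UDEV = {
    "ID_SERIAL": "3624a93707e521182588644d300011b2d",
    "ID_WWN": "0x624a93707e521182",
    "ID_WWN_VENDOR_EXTENSION": "0x588644d300011b2d",
    "ID_WWN_WITH_EXTENSION": "0x624a93707e521182588644d300011b2d",
    "ID_SCSI_SERIAL": "7E521182588644D300011B2D",
}


def strip_0x(value):
    """Drop a leading 0x prefix and normalise to lower case."""
    value = value.lower()
    return value[2:] if value.startswith("0x") else value


def check_identity(props):
    """Report whether the identifiers agree with each other."""
    wwn = strip_0x(props["ID_WWN"])
    ext = strip_0x(props["ID_WWN_VENDOR_EXTENSION"])
    full = strip_0x(props["ID_WWN_WITH_EXTENSION"])
    return {
        # ID_WWN followed by the vendor extension gives the full identifier.
        "wwn_plus_extension": wwn + ext == full,
        # In this capture ID_SERIAL is the full identifier with a leading "3".
        "serial_is_3_plus_wwn": props["ID_SERIAL"].lower() == "3" + full,
        # In this capture the SCSI serial is the tail of the full identifier.
        "scsi_serial_is_tail": full.endswith(props["ID_SCSI_SERIAL"].lower()),
    }


if __name__ == "__main__":
    print(check_identity(UDEV))
```

Running the same check against the properties of sda, sdb, sdc, and sdd would confirm that all four paths report an identical ID_WWN_WITH_EXTENSION, which is the precondition for multipathd to group them into one map.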

Issue 2: ExtensionServiceConfig creates configFile with 0 bytes

When providing a custom multipath.conf via ExtensionServiceConfig, the file is created in the container overlay at the correct path but its content is never written. The file is consistently 0 bytes.

Observed filesystem state inside the overlay:

-rw-r--r--    1 root     root             0 Jun 15 07:33 multipath.conf

The ExtensionServiceConfig spec is correctly stored in Talos (verified via talosctl get extensionserviceconfigs -o yaml) and contains the full configuration, but it does not reach the file on disk.

Workaround applied: a privileged DaemonSet with an initContainer writes the configuration directly to the overlay path /system/overlays/usr-local-lib-containers-multipathd-diff/etc/multipath/multipath.conf. After this workaround, multipathd show config local confirms the custom configuration is being read. However, the segfault (Issue 1) persists regardless of the configuration provided.
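A small check can distinguish the four states relevant here: file missing, file empty (the state described in Issue 2), file present but different from the spec, and file matching the spec. This is a sketch only. It assumes it runs somewhere the overlay path is readable, for example inside the privileged pod used for the workaround with the host path mounted, and the function name config_state is hypothetical.

```python
from pathlib import Path

# Overlay path taken from the report; reading it requires access to the host filesystem.
OVERLAY_CONF = (
    "/system/overlays/usr-local-lib-containers-multipathd-diff"
    "/etc/multipath/multipath.conf"
)


def config_state(path, expected):
    """Classify the on-disk config as 'missing', 'empty', 'differs' or 'match'."""
    p = Path(path)
    if not p.exists():
        return "missing"
    data = p.read_text()
    if data == "":
        return "empty"  # the 0-byte state described in Issue 2
    return "match" if data == expected else "differs"


if __name__ == "__main__":
    # 'expected' would normally be the content stored in the ExtensionServiceConfig.
    print(config_state(OVERLAY_CONF, expected="defaults {\n}\n"))
```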

Issue 3: required ExtensionServiceConfig name (multipathd) is not documented

The ExtensionServiceConfig name must be multipathd, not multipath. With name: multipath, Talos stores the configuration without error, but the extension service never picks it up. This took significant time to diagnose and is not documented anywhere; a note in the extension README would help future users.

Configuration used

ExtensionServiceConfig:

apiVersion: v1alpha1
kind: ExtensionServiceConfig
name: multipathd
configFiles:
    - content: |
        defaults {
            polling_interval        5
            path_grouping_policy    multibus
            uid_attribute           ID_WWN_WITH_EXTENSION
            failback                immediate
            no_path_retry           0
            user_friendly_names     no
            find_multipaths         no
        }
        blacklist {
            devnode "^nvme0n1$"
            devnode "^sr[0-9]*"
            devnode "^nbd[0-9]*"
        }
        devices {
            device {
                vendor                  "PURE"
                product                 "FlashArray"
                path_selector           "service-time 0"
                path_grouping_policy    multibus
                path_checker            tur
                fast_io_fail_tmo        10
                dev_loss_tmo            60
                no_path_retry           0
                failback                immediate
            }
        }
      mountPath: /etc/multipath/multipath.conf
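Because both the name and the mountPath are easy to get wrong, the document can be generated from a plain multipath.conf instead of being indented by hand. The following is a minimal sketch that reproduces the layout used above; the function name render_extension_service_config and its defaults are illustrative and do not come from Talos tooling.

```python
import textwrap


def render_extension_service_config(
    conf_text,
    name="multipathd",  # must match the extension service name (see Issue 3)
    mount_path="/etc/multipath/multipath.conf",
):
    """Wrap a multipath.conf string in an ExtensionServiceConfig document."""
    # Indent the config by 8 spaces so it sits inside the YAML block scalar.
    body = textwrap.indent(conf_text.strip("\n") + "\n", " " * 8)
    return (
        "apiVersion: v1alpha1\n"
        "kind: ExtensionServiceConfig\n"
        f"name: {name}\n"
        "configFiles:\n"
        "    - content: |\n"
        f"{body}"
        f"      mountPath: {mount_path}\n"
    )


if __name__ == "__main__":
    sample = "defaults {\n    find_multipaths no\n}\n"
    print(render_extension_service_config(sample), end="")
```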

machine.udev rules (added to ensure udev notifies multipathd of block devices):

machine:
    udev:
        rules:
            - SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_WWN}!="", ENV{DM_MULTIPATH_DEVICE_PATH}="1"
            - ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_WWN}!="", RUN+="/sbin/multipath -v 0 -r"
            - ACTION=="add", SUBSYSTEM=="scsi", ENV{DEVTYPE}=="scsi_device", RUN+="/sbin/multipath -v 0"

machine.kernel.modules:

machine:
    kernel:
        modules:
            - name: dm_multipath

Additional findings

The extension container runs as an overlay at /usr/local/lib/containers/multipathd/ with its writable layer at /system/overlays/usr-local-lib-containers-multipathd-diff/. Files from configFiles appear to belong under /etc/multipath/ (a subdirectory) rather than directly under /etc/: the mountPath that reaches the location multipathd reads is /etc/multipath/multipath.conf, not /etc/multipath.conf.

The extension was also tested on Talos v1.13.3, where the segfault was more aggressive: it occurred immediately on startup and caused a rapid restart loop. On v1.13.4 the service runs for approximately 60-90 seconds before crashing, which suggests a partial improvement, but the underlying issue remains.

Expected behavior

multipathd should run stably and group the four FC paths into a single dm device, so that /dev/disk/by-id/wwn-0x624a93707e521182588644d300011b2d resolves to a dm device rather than a raw SCSI disk.
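The expected state can be verified by resolving the by-id symlink and checking whether it ends at a device-mapper node (named dm-N) instead of an sdX disk. This is a minimal sketch, assuming it runs in an environment that can see the node's /dev; the function name resolves_to_dm is hypothetical.

```python
import os

# by-id link named in the expected behavior.
BY_ID_LINK = "/dev/disk/by-id/wwn-0x624a93707e521182588644d300011b2d"


def resolves_to_dm(link_path):
    """Return True if the symlink ultimately points at a dm-N device node."""
    target = os.path.realpath(link_path)
    return os.path.basename(target).startswith("dm-")


if __name__ == "__main__":
    print(BY_ID_LINK, "->", os.path.realpath(BY_ID_LINK))
    print("multipath device:", resolves_to_dm(BY_ID_LINK))
```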

Actual behavior

multipathd crashes repeatedly with a segfault in ld-musl-x86_64.so.1 before completing path grouping. No dm device is ever created.
