FDP GC crash in select_victim_ru() when removing RU from full_ru_list #186

@HXuanlin

Description

Describe the bug
I encountered a crash when running FEMU in black-box FDP mode under heavy write / GC pressure. The qemu-system-x86_64 process crashed with SIGSEGV in the FDP GC path.

The crash happens in select_victim_ru() when FDP GC falls back to selecting a victim RU from full_ru_list and removes it with QTAILQ_REMOVE().

I compared my local ftl.c with the upstream version. The only meaningful local change is that I disabled several FDP_TRACE logs to reduce log size. The FDP GC logic (fdp_advance_ru_pointer(), do_gc_fdp_style(), select_victim_ru(), and the full_ru_list handling) appears unchanged.

Environment

  • Host OS: Ubuntu 22.04
  • Kernel version: 6.8.0-107-generic
  • FEMU version/commit: c966d34
  • FEMU mode: BlackBox SSD with FDP enabled
  • FDP configuration:
    • fdp=on
    • fdp.nruh=8
    • fdp.nrg=1
    • fdp.nru=256
  • Device size: 12288 MB
  • Guest OS/image: Ubuntu 24.04 qcow2 image

To Reproduce
Steps to reproduce the behavior:

  1. Use the upstream run-blackbox-fdp.sh script from commit c966d341a13795ef917702756c6fd727aeb2bbef.

  2. Start FEMU with the following command:

    stdbuf -oL -eL ./run-blackbox-fdp.sh 2>&1 | tee ~/femu-fdp-$(date +%F-%H%M%S).log
  3. The script starts FEMU in black-box SSD mode with FDP enabled. The effective QEMU command line shown in the coredump includes the following key options:

fdp=on
fdp.nruh=8
fdp.nrg=1
fdp.nru=256

devsz_mb=12288
femu_mode=1
secsz=512
secs_per_pg=8
pgs_per_blk=256
blks_per_pl=256
pls_per_lun=1
luns_per_ch=8
nchs=8

gc_thres_pcent=50
gc_thres_pcent_high=75
  4. Run a heavy write workload inside the guest so that the FDP device reaches high GC pressure / RU exhaustion.

  5. The host-side qemu-system-x86_64 process crashes with SIGSEGV.

Expected behavior
FEMU should not crash when the FDP device is under high write or GC pressure.

Even if there are no free RUs available, the FDP GC path should handle the situation gracefully, for example by returning an error, stalling/retrying the write path, or reporting device-full / no-free-RU conditions, rather than causing a segmentation fault in the QEMU process.
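To illustrate the "return an error, stall/retry" option: a minimal sketch of what a non-crashing GC round could look like. All names here (gc_one_round, ssd_state, select_victim) are hypothetical stand-ins, not FEMU's actual types or conventions; the point is only that a NULL victim should map to a retryable error instead of a dereference.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for FEMU's structures (the real ones differ). */
struct ru { int id; };
struct ssd_state {
    struct ru *(*select_victim)(struct ssd_state *);
};

/* Sketch: instead of dereferencing a missing victim, propagate an
 * error code so the write path can stall and retry, or report a
 * device-full / no-free-RU condition. */
static int gc_one_round(struct ssd_state *ssd)
{
    struct ru *victim = ssd->select_victim(ssd);
    if (!victim) {
        return -EBUSY;   /* no reclaimable RU right now: caller retries */
    }
    /* ... migrate valid pages out of 'victim', then recycle it ... */
    return 0;
}
```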

Error logs
The coredump shows that qemu-system-x86_64 crashed with SIGSEGV:

sudo coredumpctl list | tail -n 50

TIME                            PID UID GID SIG     COREFILE  EXE                                                      SIZE
Mon 2026-05-04 19:30:06 CST 1904205   0   0 SIGSEGV truncated /home/dell/femu-work/FEMU/build-femu/qemu-system-x86_64 17.3M

The GDB backtrace points to select_victim_ru():

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055c5ed859ee1 in select_victim_ru (
    force=false,
    ruhid=<error reading variable: Cannot access memory at address 0x7326a9cfc6e0>,
    rgid=<error reading variable: Cannot access memory at address 0x7326a9cfc728>,
    ssd=0x55c60e0ebce0
) at ../hw/femu/bbssd/ftl.c:1639

1639                    QTAILQ_REMOVE(&rm->full_ru_list, cand, entry);

(gdb) bt
#0  0x000055c5ed859ee1 in select_victim_ru (
    force=false,
    ruhid=<error reading variable: Cannot access memory at address 0x7326a9cfc6e0>,
    rgid=<error reading variable: Cannot access memory at address 0x7326a9cfc728>,
    ssd=0x55c60e0ebce0
) at ../hw/femu/bbssd/ftl.c:1639

#1  do_gc_fdp_style (
    ssd=0x55c60e0ebce0,
    rgid=<error reading variable: Cannot access memory at address 0x7326a9cfc728>,
    ruhid=<optimized out>,
    force=<optimized out>
) at ../hw/femu/bbssd/ftl.c:1818

Backtrace stopped: Cannot access memory at address 0x7326a9cfc798

The relevant code path is:

if (!victim_ru) {
    FemuReclaimUnit *cand;
    QTAILQ_FOREACH(cand, &rm->full_ru_list, entry) {
        bool is_active = false;
        for (uint16_t ri = 0; ri < (uint16_t)ssd->nruhs; ri++) {
            if (ssd->ruhs[ri].curr_ru == cand ||
                ssd->ruhs[ri].gc_ru == cand) {
                is_active = true;
                break;
            }
        }
        if (!is_active) {
            victim_ru = cand;
            QTAILQ_REMOVE(&rm->full_ru_list, cand, entry);
            rm->full_ru_cnt--;
            break;
        }
    }
}
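Note that the remove-while-iterating in the snippet above is legal on its own, since the loop breaks immediately after QTAILQ_REMOVE; so one possibility is that the list itself is already corrupted when the loop runs (e.g. an RU removed or freed elsewhere while still linked into full_ru_list). A defensive pattern is to track list membership explicitly and assert on it before removing. The sketch below uses the BSD <sys/queue.h> TAILQ macros, which QEMU's QTAILQ macros mirror (QEMU also provides QTAILQ_FOREACH_SAFE); the ru_t type, the on_full_list flag, and pick_victim are hypothetical, not FEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* glibc's <sys/queue.h> lacks the _SAFE variant; this is the standard
 * portable definition (it caches the next node before the body runs). */
#ifndef TAILQ_FOREACH_SAFE
#define TAILQ_FOREACH_SAFE(var, head, field, tvar)                \
    for ((var) = TAILQ_FIRST(head);                               \
         (var) && ((tvar) = TAILQ_NEXT((var), field), 1);         \
         (var) = (tvar))
#endif

/* Hypothetical stand-in for FemuReclaimUnit. */
typedef struct ru {
    int id;
    bool on_full_list;            /* guards against double removal */
    TAILQ_ENTRY(ru) entry;
} ru_t;

TAILQ_HEAD(ru_list, ru);

/* Defensive victim selection: skip active RUs, assert that a candidate
 * really is a list member before unlinking it, and clear the membership
 * flag on removal so a stale pointer cannot be removed twice. */
static ru_t *pick_victim(struct ru_list *full, bool (*is_active)(ru_t *))
{
    ru_t *cand, *next;
    TAILQ_FOREACH_SAFE(cand, full, entry, next) {
        if (is_active && is_active(cand)) {
            continue;
        }
        assert(cand->on_full_list);   /* would catch corruption early */
        TAILQ_REMOVE(full, cand, entry);
        cand->on_full_list = false;
        return cand;
    }
    return NULL;                      /* no victim: caller must handle */
}
```

Running FEMU with such assertions (or with QTAILQ poisoning of removed entries) could turn the delayed SIGSEGV into an immediate abort at the point where the list first goes bad.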
