Skip to content

Corrupt FT.INTERNAL_UPDATE AOF entry aborts load by default (SIGABRT); skip-corrupted flag is broken and still SIGSEGVs #1109

@madolson

Description

@madolson

Description

Two related failures in how FT.INTERNAL_UPDATE handles a corrupt entry during AOF load:

  1. Default config aborts the load. A malformed-protobuf FT.INTERNAL_UPDATE entry in the AOF hits CHECK(false) (SIGABRT) when search.skip-corrupted-internal-update-entries=no (the default). This crash-loops the node on every startup until the AOF is hand-edited.
  2. The skip flag does not actually skip. Setting search.skip-corrupted-internal-update-entries=yes (the intended mitigation) logs "SKIPPING" and then still SIGSEGVs, because the skip branch returns OkStatus() and execution falls through into the same CreateEntryOnReplica -> TriggerCallbacks crash with a default-constructed entry.

So a corrupt AOF crashes the node by default, and the documented escape hatch is broken.

Steps to reproduce

mkdir -p /tmp/poisonaof/appendonlydir
printf 'file appendonly.aof.1.incr.aof seq 1 type i\n' > /tmp/poisonaof/appendonlydir/appendonly.aof.manifest
# SELECT 0 + FT.INTERNAL_UPDATE poisonidx <0xFF*6> <0xFF*6> (argv[2]/argv[3] not valid protobuf)
printf '*2\r\n$6\r\nSELECT\r\n$1\r\n0\r\n*4\r\n$18\r\nFT.INTERNAL_UPDATE\r\n$9\r\npoisonidx\r\n$6\r\n\xff\xff\xff\xff\xff\xff\r\n$6\r\n\xff\xff\xff\xff\xff\xff\r\n' > /tmp/poisonaof/appendonlydir/appendonly.aof.1.incr.aof

# (1) default -> SIGABRT
docker run -d --name n1 -v /tmp/poisonaof:/data -e VALKEY_EXTRA_FLAGS="--appendonly yes" valkey/valkey-bundle:unstable
# (2) skip flag set -> still SIGSEGV
docker run -d --name n2 -v /tmp/poisonaof:/data -e VALKEY_EXTRA_FLAGS="--appendonly yes --search.skip-corrupted-internal-update-entries yes" valkey/valkey-bundle:unstable

Crash output

# (1) default:
F0000 ... ft_internal_update.cc:46] Check failed: false Internal update failure during AOF loading - cannot continue
valkey ... crashed by signal: 6, si_code: -6

# (2) skip=yes:
ft_internal_update.cc:40: SKIPPING corrupted AOF entry due to configuration   (x2)
valkey ... crashed by signal: 11, si_code: 1
  TriggerCallbacks at src/coordinator/metadata_manager.cc:168 <- CreateEntryOnReplica

Root cause

src/commands/ft_internal_update.cc, HandleInternalUpdateFailure:

  • When LOADING and skip flag is false (default), it runs CHECK(false) at ft_internal_update.cc:46 -> abort.
  • When skip flag is true, it logs "SKIPPING" and returns absl::OkStatus(). The caller wraps it in VMSDK_RETURN_IF_ERROR, which does NOT short-circuit on OK, so FTInternalUpdateCmd continues with the un-parsed default metadata_entry and (because LOADING is set) calls CreateEntryOnReplica -> TriggerCallbacks -> the same SIGSEGV as the companion issue. The skip path never actually skips the rest of the command body.

Suggested fix

The skip branch must short-circuit the command (return early / signal "entry skipped"), not return OkStatus and fall through. Separately, reconsider aborting via CHECK(false) on corrupt persisted data — a corrupt AOF entry should not be a fatal assertion.

Environment

  • valkey-search origin/main (8c260db); also reproduces on valkey/valkey-bundle:unstable
  • Confirmed live on current main: ft_internal_update.cc is unchanged.

Related: see the companion issue on FT.INTERNAL_UPDATE crashing on a valid AOF entry.


This issue was generated by AI but verified, with love, by a human.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions