Description
Two related failures in how FT.INTERNAL_UPDATE handles a corrupt entry during AOF load:
- Default config aborts the load. A malformed-protobuf
FT.INTERNAL_UPDATE entry in the AOF hits CHECK(false) (SIGABRT) when search.skip-corrupted-internal-update-entries=no (the default). This crash-loops the node on every startup until the AOF is hand-edited.
- The skip flag does not actually skip. Setting
search.skip-corrupted-internal-update-entries=yes (the intended mitigation) logs "SKIPPING" and then still SIGSEGVs, because the skip branch returns OkStatus() and execution falls through into the same CreateEntryOnReplica -> TriggerCallbacks crash with a default-constructed entry.
So a corrupt AOF crashes the node by default, and the documented escape hatch is broken.
Steps to reproduce
mkdir -p /tmp/poisonaof/appendonlydir
printf 'file appendonly.aof.1.incr.aof seq 1 type i\n' > /tmp/poisonaof/appendonlydir/appendonly.aof.manifest
# SELECT 0 + FT.INTERNAL_UPDATE poisonidx <0xFF*6> <0xFF*6> (argv[2]/argv[3] not valid protobuf)
printf '*2\r\n$6\r\nSELECT\r\n$1\r\n0\r\n*4\r\n$18\r\nFT.INTERNAL_UPDATE\r\n$9\r\npoisonidx\r\n$6\r\n\xff\xff\xff\xff\xff\xff\r\n$6\r\n\xff\xff\xff\xff\xff\xff\r\n' > /tmp/poisonaof/appendonlydir/appendonly.aof.1.incr.aof
# (1) default -> SIGABRT
docker run -d --name n1 -v /tmp/poisonaof:/data -e VALKEY_EXTRA_FLAGS="--appendonly yes" valkey/valkey-bundle:unstable
# (2) skip flag set -> still SIGSEGV
docker run -d --name n2 -v /tmp/poisonaof:/data -e VALKEY_EXTRA_FLAGS="--appendonly yes --search.skip-corrupted-internal-update-entries yes" valkey/valkey-bundle:unstable
Crash output
# (1) default:
F0000 ... ft_internal_update.cc:46] Check failed: false Internal update failure during AOF loading - cannot continue
valkey ... crashed by signal: 6, si_code: -6
# (2) skip=yes:
ft_internal_update.cc:40: SKIPPING corrupted AOF entry due to configuration (x2)
valkey ... crashed by signal: 11, si_code: 1
TriggerCallbacks at src/coordinator/metadata_manager.cc:168 <- CreateEntryOnReplica
Root cause
src/commands/ft_internal_update.cc, HandleInternalUpdateFailure:
- When
LOADING and skip flag is false (default), it runs CHECK(false) at ft_internal_update.cc:46 -> abort.
- When skip flag is true, it logs "SKIPPING" and returns
absl::OkStatus(). The caller wraps it in VMSDK_RETURN_IF_ERROR, which does NOT short-circuit on OK, so FTInternalUpdateCmd continues with the un-parsed default metadata_entry and (because LOADING is set) calls CreateEntryOnReplica -> TriggerCallbacks -> the same SIGSEGV as the companion issue. The skip path never actually skips the rest of the command body.
Suggested fix
The skip branch must short-circuit the command (return early / signal "entry skipped"), not return OkStatus and fall through. Separately, reconsider aborting via CHECK(false) on corrupt persisted data — a corrupt AOF entry should not be a fatal assertion.
Environment
- valkey-search
origin/main (8c260db); also reproduces on valkey/valkey-bundle:unstable
- Confirmed live on current
main: ft_internal_update.cc is unchanged.
Related: see the companion issue on FT.INTERNAL_UPDATE crashing on a valid AOF entry.
This issue was generated by AI but verified, with love, by a human.
Description
Two related failures in how
FT.INTERNAL_UPDATEhandles a corrupt entry during AOF load:FT.INTERNAL_UPDATEentry in the AOF hitsCHECK(false)(SIGABRT) whensearch.skip-corrupted-internal-update-entries=no(the default). This crash-loops the node on every startup until the AOF is hand-edited.search.skip-corrupted-internal-update-entries=yes(the intended mitigation) logs "SKIPPING" and then still SIGSEGVs, because the skip branch returnsOkStatus()and execution falls through into the sameCreateEntryOnReplica->TriggerCallbackscrash with a default-constructed entry.So a corrupt AOF crashes the node by default, and the documented escape hatch is broken.
Steps to reproduce
Crash output
Root cause
src/commands/ft_internal_update.cc,HandleInternalUpdateFailure:LOADINGand skip flag is false (default), it runsCHECK(false)at ft_internal_update.cc:46 -> abort.absl::OkStatus(). The caller wraps it inVMSDK_RETURN_IF_ERROR, which does NOT short-circuit on OK, soFTInternalUpdateCmdcontinues with the un-parsed defaultmetadata_entryand (becauseLOADINGis set) callsCreateEntryOnReplica->TriggerCallbacks-> the same SIGSEGV as the companion issue. The skip path never actually skips the rest of the command body.Suggested fix
The skip branch must short-circuit the command (return early / signal "entry skipped"), not return
OkStatusand fall through. Separately, reconsider aborting viaCHECK(false)on corrupt persisted data — a corrupt AOF entry should not be a fatal assertion.Environment
origin/main(8c260db); also reproduces onvalkey/valkey-bundle:unstablemain:ft_internal_update.ccis unchanged.Related: see the companion issue on
FT.INTERNAL_UPDATEcrashing on a valid AOF entry.This issue was generated by AI but verified, with love, by a human.