Add auto-repair options for broken AOF tail on startup#7
Conversation
| fmterr: /* Format error. */ | ||
| serverLog(LL_WARNING, "Bad file format reading the append only file %s: " | ||
| "make a backup of your AOF file, then use ./redis-check-aof --fix <filename.manifest>", filename); | ||
| /* fmterr may be caused by accidentally machine shutdown, so if the broken tail | ||
| * is less than a specified size, try to recover it automatically */ | ||
| if (server.aof_load_broken) { | ||
| if (valid_up_to == -1) { | ||
| serverLog(LL_WARNING,"Last valid command offset is invalid"); | ||
| } else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) { | ||
| if (truncate(aof_filepath,valid_up_to) == -1) { | ||
| serverLog(LL_WARNING,"Error truncating the AOF file: %s", | ||
| strerror(errno)); | ||
| } else { | ||
| /* Make sure the AOF file descriptor points to the end of the | ||
| * file after the truncate call. */ | ||
| if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) { | ||
| serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s", | ||
| strerror(errno)); | ||
| } else { | ||
| serverLog(LL_WARNING, | ||
| "AOF loaded anyway because aof-load-broken is enabled and " | ||
| "broken size '%lld' is less than aof-load-broken-max-size '%lld'", | ||
| (long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size)); | ||
| ret = AOF_BROKEN_RECOVERED; | ||
| goto loaded_ok; |
There was a problem hiding this comment.
🔴 The fmterr handler (line 1721) does not check fakeClient->flags & CLIENT_MULTI before truncating at valid_up_to, unlike the uxeof handler (line 1670-1674) which reverts valid_up_to to valid_before_multi for incomplete MULTI/EXEC blocks. This means corruption inside a MULTI block produces a recovered AOF with an incomplete transaction (MULTI without EXEC), which will fail to load on the next restart when aof-load-truncated=no. Add if (fakeClient->flags & CLIENT_MULTI) valid_up_to = valid_before_multi; at the top of the fmterr handler.
Extended reasoning...
Bug Description
The new fmterr handler at lines 1721-1758 handles format errors during AOF loading when aof-load-broken is enabled. It truncates the AOF file at valid_up_to (the offset after the last successfully executed command). However, it does not account for the case where the fake client is currently inside an incomplete MULTI/EXEC transaction block.
Comparison with uxeof Handler
The existing uxeof handler (unexpected EOF) at lines 1670-1674 explicitly handles this case:
if (fakeClient->flags & CLIENT_MULTI) {
serverLog(LL_WARNING,
"Revert incomplete MULTI/EXEC transaction in AOF file %s", filename);
valid_up_to = valid_before_multi;
goto uxeof;
}This ensures that if EOF is reached inside a MULTI block, valid_up_to is reverted to the position before the MULTI command, so the truncated file never contains an incomplete transaction. The fmterr handler lacks this check entirely.
Step-by-Step Proof
Consider an AOF file containing: SET key val, MULTI, SET a 1, <corruption>
- During loading with
aof-load-broken=yes:SET key valexecutes successfully,valid_up_toadvances.MULTIexecutes (setsCLIENT_MULTIflag,valid_before_multisaved).SET a 1is queued (not executed),valid_up_toadvances past it. Corruption is encountered, triggeringgoto fmterr. - At the
fmterrhandler (line 1721):valid_up_topoints to the offset afterSET a 1(inside the MULTI block). The handler truncates the file at this offset without checkingCLIENT_MULTI. - The recovered file now contains:
SET key val,MULTI,SET a 1— an incomplete MULTI/EXEC block with no EXEC. - On the next server restart, this file is loaded.
MULTIsetsCLIENT_MULTI,SET a 1is queued, then EOF is reached. - The
uxeofhandler at line 1670 detectsCLIENT_MULTIand revertsvalid_up_totovalid_before_multi, then falls through to theuxeoflabel. - If
aof-load-truncated=no(a valid user configuration), theuxeofhandler at line 1714-1718 logs "Unexpected end of file" and returnsAOF_FAILED. The server refuses to start.
Impact
The first recovery silently produces an AOF file that cannot be loaded with aof-load-truncated=no. Even with aof-load-truncated=yes (the default), the recovered file requires an unnecessary second round of truncation on every restart until the incomplete MULTI block is removed. This defeats the purpose of the recovery — the file should be clean after truncation.
Fix
Add the same CLIENT_MULTI check at the start of the fmterr handler, before the truncation logic:
fmterr:
if (fakeClient->flags & CLIENT_MULTI) {
valid_up_to = valid_before_multi;
}This is consistent with the uxeof handler and ensures that truncation never leaves an incomplete MULTI/EXEC block in the recovered file.
| } | ||
|
|
||
| # Remove all incr AOF files to make the base file being the last file | ||
| exec rm -f $aof_dirpath/appendonly.aof.* |
There was a problem hiding this comment.
🔴 Two test bugs at lines 724 and 828-835: (1) exec rm -f $aof_dirpath/appendonly.aof.* is a no-op because Tcl exec does not expand shell globs — the literal * is passed to rm, so no files are deleted. The correct pattern is catch {exec rm -f {*}[glob $aof_dirpath/appendonly.aof.*]} as used elsewhere in the test suite (e.g., test_helper.tcl:260). (2) The "Corrupted last AOF file" test at line 833 passes vacuously because the intermediate test's truncate() call (aof.c:1728) already cleaned the corruption from the mid file before the swap, so the last file is never actually corrupt when loaded.
Extended reasoning...
Bug 1: Glob not expanded by Tcl exec (line 724)
At line 724, the command exec rm -f $aof_dirpath/appendonly.aof.* is intended to remove all incremental AOF files so that the base file becomes the last (and only) file. However, Tcl's exec command does NOT perform shell glob expansion. The asterisk character is passed literally to the rm command as a filename argument. Since rm itself also does not expand globs (that's the shell's job), no files are actually deleted.
The existing Redis test codebase handles this correctly in two other locations:
test_helper.tcl:260:catch {exec rm -rf {*}[glob tests/tmp/redis.conf.*]}— uses Tcl's[glob ...]command for pattern expansion and{*}for list expansion into individual argumentsaofmanifest.tcl:130:catch {eval exec rm -rf [glob $aof_dirpath]}— uses[glob ...]withevalfor expansion
The correct fix would be: catch {exec rm -f {*}[glob $aof_dirpath/appendonly.aof.*]}
Step-by-step proof: After the first start_server_aof_ex test completes (line 710-720), the AOF directory contains both the base file (appendonly.aof.1.base.aof with corruption already truncated by the server's recovery) and the incr file (appendonly.aof.1.incr.aof). Line 724 attempts exec rm -f $aof_dirpath/appendonly.aof.*. Tcl passes the literal string <path>/appendonly.aof.* to rm. rm looks for a file literally named appendonly.aof.* — no such file exists. rm -f silently ignores this (the -f flag suppresses errors for nonexistent files). The incr file remains on disk. The subsequent test at line 725 ("Corrupted base AOF (last file): should recover") starts the server, which loads both files. Since the base file's corruption was already truncated by the previous server run's truncate() call at aof.c:1728, both files are now valid, and the server starts successfully. The test passes, but for the wrong reason — the base file is NOT the last file.
Bug 2: File-swap test does not exercise recovery (lines 828-835)
The test "Corrupted last AOF file: Server should still start and recover" at line 833 is supposed to verify that when the last AOF file has corruption, the server can recover using the aof-load-broken feature. However, the file it tests is already clean.
Step-by-step proof of the vacuous test:
- Lines 802-809 create three AOF files: base (clean), mid (contains "set fo mid" + "CORRUPTION"), and last (contains "set fo last")
- Line 817:
start_server_aof_exstarts a server that loads these files. When loading mid_aof_file, the server encounters "CORRUPTION" - The fmterr handler in aof.c:1724-1744 fires. Since
aof_load_brokenis enabled and the corruption size is small,truncate(aof_filepath, valid_up_to)is called at line 1728. This physically truncates mid_aof_file on disk, removing "CORRUPTION" and leaving only "set fo mid" - The function returns
AOF_BROKEN_RECOVERED. Back inloadAppendOnlyFiles(aof.c:1867-1870), since this is NOT the last file,retis set toAOF_FAILEDand the server exits with "Fatal error: the truncated file is not the last file" - After server shutdown, mid_aof_file on disk now contains only
set fo mid— it is already clean - Lines 828-830 swap mid and last files. The now-clean mid content goes to last_aof_file, and last's clean content goes to mid_aof_file
- Line 833: The final test loads all three files. All are clean.
foequals "mid" (from the swapped last file). The test passes without ever exercising theAOF_BROKEN_RECOVEREDrecovery path on the last file
To properly test this, a fresh corrupted file should be created after the intermediate test rather than reusing the file that was already truncated by the server's recovery logic.
| if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) { | ||
| goto cleanup; | ||
| } |
There was a problem hiding this comment.
🔴 Bug: AOF_BROKEN_RECOVERED is incorrectly treated as an error in multiple locations. At line 1841, the base-file error check includes AOF_BROKEN_RECOVERED, causing successful recovery of the last base file to skip server.aof_current_size and server.aof_rewrite_base_size initialization (contrast with the INCR section at line 1875 which correctly omits it). Additionally, stopLoading() at line 1894 does not recognize AOF_BROKEN_RECOVERED as success, so modules receive a spurious LOADING_FAILED event. Finally, DEBUG LOADAOF in debug.c:600 treats AOF_BROKEN_RECOVERED as an error.
Extended reasoning...
What the bug is
The PR introduces a new return value AOF_BROKEN_RECOVERED (value 6) from loadSingleAppendOnlyFile() to indicate that a broken AOF was successfully recovered via truncation. However, three locations in the codebase fail to treat this new status as a success, causing incorrect behavior after successful recovery.
Bug 1: Base-file error check at line 1841
In loadAppendOnlyFiles(), the error check for base AOF files at line 1841 is:
if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) {
goto cleanup;
}This unconditionally sends AOF_BROKEN_RECOVERED to cleanup, even when the base file is the last file and recovery was successful. Compare with the INCR file section at line 1875 which correctly omits AOF_BROKEN_RECOVERED:
if (ret == AOF_OPEN_ERR || ret == AOF_FAILED) {
goto cleanup;
}Also compare with how AOF_TRUNCATED is handled for base files: AOF_TRUNCATED does NOT appear in line 1841, so a truncated last base file correctly falls through to set server.aof_current_size and server.aof_rewrite_base_size. The asymmetry between AOF_TRUNCATED and AOF_BROKEN_RECOVERED handling for base files is clearly unintentional.
Step-by-step proof for Bug 1
Scenario: Single base AOF file (no incr files) with corruption that gets successfully recovered.
loadSingleAppendOnlyFile()returnsAOF_BROKEN_RECOVERED- Line 1830: Logs DB loaded -- correct, since
(ret == AOF_BROKEN_RECOVERED && last_file)is true - Line 1836: Does NOT set
ret = AOF_FAILEDbecauselast_fileis true -- correct - Line 1841: Matches
ret == AOF_BROKEN_RECOVEREDand jumps tocleanup-- BUG - Lines 1881 and 1891 are skipped:
server.aof_current_sizestays 0,server.aof_rewrite_base_sizestays 0 - This causes incorrect AOF rewrite triggering since the server thinks the AOF is empty
Bug 2: stopLoading at line 1894
At cleanup, line 1894 calls:
stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED);This does not include AOF_BROKEN_RECOVERED. When ret == AOF_BROKEN_RECOVERED, stopLoading(false) is called, which fires REDISMODULE_SUBEVENT_LOADING_FAILED instead of REDISMODULE_SUBEVENT_LOADING_ENDED. This incorrectly tells all loaded modules that AOF loading failed, even though recovery was successful. This affects both code paths: base AOF (via the goto at line 1841) and INCR AOF (which correctly falls through the while loop but still reaches line 1894). The fix: stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED).
Bug 3: DEBUG LOADAOF in debug.c:600
The DEBUG LOADAOF command checks:
if (ret != AOF_OK && ret != AOF_EMPTY) {
addReplyError(c, ...);
return;
}When loadAppendOnlyFiles() returns AOF_BROKEN_RECOVERED (from the INCR path which correctly passes it through), DEBUG LOADAOF reports an error to the client even though recovery succeeded. The fix should add && ret != AOF_BROKEN_RECOVERED to the condition.
Impact
server.aof_current_sizeandserver.aof_rewrite_base_sizeremain at 0 after successful base-file recovery, causing premature or incorrect AOF rewrites- Modules receive false loading-failure notifications, potentially triggering error-handling logic in module code
DEBUG LOADAOFreports failure to clients after successful recovery
Fix
- Remove
ret == AOF_BROKEN_RECOVEREDfrom line 1841, making it consistent with line 1875 - Add
|| ret == AOF_BROKEN_RECOVEREDto thestopLoading()call at line 1894 - Add
ret != AOF_BROKEN_RECOVEREDto the error check in debug.c:600
| if (server.aof_load_broken) { | ||
| if (valid_up_to == -1) { | ||
| serverLog(LL_WARNING,"Last valid command offset is invalid"); | ||
| } else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) { |
There was a problem hiding this comment.
🟡 Off-by-one in aof-load-broken-max-size comparison: line 1727 uses < instead of <=, so when the corrupted portion is exactly aof_load_broken_max_size bytes, recovery is skipped even though the config name "max-size" implies an inclusive upper bound. Additionally, the else-branch log at line 1750 incorrectly says "is bigger than" when the sizes are equal. Change < to <= to match the documented semantics.
Extended reasoning...
What the bug is
At src/aof.c:1727, the comparison that decides whether to auto-recover a corrupted AOF tail uses strict less-than (<):
} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {The config is named aof-load-broken-max-size, and the redis.conf documentation describes it as "the maximum size in bytes of corrupted data that can be automatically truncated." The word "maximum" implies an inclusive bound — i.e., corruption of up to and including that many bytes should be recoverable.
How it manifests
When the corrupted portion of the AOF is exactly equal to aof_load_broken_max_size bytes, the < comparison evaluates to false, so recovery is not attempted. Control falls through to the else branch at line 1746, which logs:
"broken size ... is bigger than aof-load-broken-max-size"
This log message is factually incorrect in the equal case — the sizes are the same, not one bigger than the other.
Step-by-step proof
- User sets
aof-load-broken yesandaof-load-broken-max-size 4096. - An AOF file gets corrupted such that the corrupted tail is exactly 4096 bytes:
sb.st_size - valid_up_to == 4096. - At line 1727:
(size_t)(4096) < (size_t)(4096)evaluates tofalse. - The recovery branch is skipped entirely.
- Execution falls to the else at line 1746, which logs "broken size 4096 is bigger than aof-load-broken-max-size 4096" — incorrect, they are equal.
retis set toAOF_FAILEDand the server refuses to start.- If
<=were used instead, step 3 would evaluate totrue, and recovery would proceed as the user intended.
Impact
This is a minor off-by-one that only triggers at the exact boundary value. In practice, the corrupted portion would need to be exactly the configured byte count, making this extremely unlikely. However, it represents a semantic inconsistency between the config name/documentation ("maximum size") and the implementation (strictly less than), and the misleading log message could confuse operators debugging a recovery failure.
Fix
Change < to <= at line 1727:
} else if ((size_t)(sb.st_size - valid_up_to) <= (size_t)server.aof_load_broken_max_size) {This makes the comparison match the "max-size" semantics and ensures the else-branch log message ("is bigger than") is always accurate when reached.
| createTimeTConfig("repl-backlog-ttl", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.repl_backlog_time_limit, 60*60, INTEGER_CONFIG, NULL, NULL), /* Default: 1 hour */ | ||
| createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL), | ||
| createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL), | ||
| createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL), |
There was a problem hiding this comment.
🟡 aof-load-broken-max-size uses INTEGER_CONFIG but represents a size in bytes. It should use MEMORY_CONFIG (like the adjacent auto-aof-rewrite-min-size on line 3256) to allow human-readable suffixes such as 4kb, 1mb, etc. Currently users must specify raw byte values.
Extended reasoning...
What the bug is
The new aof-load-broken-max-size config parameter is registered with INTEGER_CONFIG at line 3258, but it represents a size in bytes (controlling the maximum size of corrupted AOF data that can be automatically truncated). Redis has two config display types for numeric values: INTEGER_CONFIG (plain numbers) and MEMORY_CONFIG (supports human-readable suffixes like kb, mb, gb).
How it manifests
With INTEGER_CONFIG, users must specify raw byte values (e.g., 4096). They cannot use the friendlier notation like 4kb or 1mb that Redis supports for other size-in-bytes parameters. If a user writes aof-load-broken-max-size 4kb in their config file, it will fail to parse.
Inconsistency with adjacent configs
Looking at the surrounding code, auto-aof-rewrite-min-size (line 3256) is also a createOffTConfig representing a byte size, and it uses MEMORY_CONFIG. The redis.conf documentation for this PR also describes aof-load-broken-max-size as "the maximum size in bytes," which is exactly the kind of parameter that conventionally uses MEMORY_CONFIG in Redis. While loading-process-events-interval-bytes (line 3257) also uses INTEGER_CONFIG, that one is a HIDDEN_CONFIG not exposed to users, so the precedent is less relevant.
Step-by-step proof
- Line 3256:
createOffTConfig("auto-aof-rewrite-min-size", ..., MEMORY_CONFIG, ...)— a byte-size AOF config usingMEMORY_CONFIG. - Line 3258:
createOffTConfig("aof-load-broken-max-size", ..., INTEGER_CONFIG, ...)— a byte-size AOF config usingINTEGER_CONFIG. - Both are user-facing
off_tconfigs representing data sizes in bytes. The inconsistency meansaof-load-broken-max-sizelacks human-readable suffix support.
Impact and fix
This is a minor usability inconsistency. The config works correctly with raw byte values. The fix is simple: change INTEGER_CONFIG to MEMORY_CONFIG on line 3258. This would allow users to write aof-load-broken-max-size 4kb instead of aof-load-broken-max-size 4096, consistent with the majority of size-in-bytes configs in Redis.
Benchmark PR from agentic-review-benchmarks#7