Skip to content

Add auto-repair options for broken AOF tail on startup#7

Open
tomerqodo wants to merge 22 commits intoclaude_claude_vs_qodo_base_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7from
claude_claude_vs_qodo_head_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7
Open

Add auto-repair options for broken AOF tail on startup#7
tomerqodo wants to merge 22 commits intoclaude_claude_vs_qodo_base_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7from
claude_claude_vs_qodo_head_add_auto-repair_options_for_broken_aof_tail_on_startup_pr7

Conversation

@tomerqodo
Copy link
Copy Markdown

Benchmark PR from agentic-review-benchmarks#7

Comment thread src/aof.c
Comment on lines 1721 to +1743
fmterr: /* Format error. */
serverLog(LL_WARNING, "Bad file format reading the append only file %s: "
"make a backup of your AOF file, then use ./redis-check-aof --fix <filename.manifest>", filename);
/* fmterr may be caused by accidentally machine shutdown, so if the broken tail
* is less than a specified size, try to recover it automatically */
if (server.aof_load_broken) {
if (valid_up_to == -1) {
serverLog(LL_WARNING,"Last valid command offset is invalid");
} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {
if (truncate(aof_filepath,valid_up_to) == -1) {
serverLog(LL_WARNING,"Error truncating the AOF file: %s",
strerror(errno));
} else {
/* Make sure the AOF file descriptor points to the end of the
* file after the truncate call. */
if (server.aof_fd != -1 && lseek(server.aof_fd,0,SEEK_END) == -1) {
serverLog(LL_WARNING,"Can't seek the end of the AOF file: %s",
strerror(errno));
} else {
serverLog(LL_WARNING,
"AOF loaded anyway because aof-load-broken is enabled and "
"broken size '%lld' is less than aof-load-broken-max-size '%lld'",
(long long)(sb.st_size - valid_up_to), (long long)(server.aof_load_broken_max_size));
ret = AOF_BROKEN_RECOVERED;
goto loaded_ok;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The fmterr handler (line 1721) does not check fakeClient->flags & CLIENT_MULTI before truncating at valid_up_to, unlike the uxeof handler (line 1670-1674) which reverts valid_up_to to valid_before_multi for incomplete MULTI/EXEC blocks. This means corruption inside a MULTI block produces a recovered AOF with an incomplete transaction (MULTI without EXEC), which will fail to load on the next restart when aof-load-truncated=no. Add if (fakeClient->flags & CLIENT_MULTI) valid_up_to = valid_before_multi; at the top of the fmterr handler.

Extended reasoning...

Bug Description

The new fmterr handler at lines 1721-1758 handles format errors during AOF loading when aof-load-broken is enabled. It truncates the AOF file at valid_up_to (the offset after the last successfully executed command). However, it does not account for the case where the fake client is currently inside an incomplete MULTI/EXEC transaction block.

Comparison with uxeof Handler

The existing uxeof handler (unexpected EOF) at lines 1670-1674 explicitly handles this case:

if (fakeClient->flags & CLIENT_MULTI) {
    serverLog(LL_WARNING,
        "Revert incomplete MULTI/EXEC transaction in AOF file %s", filename);
    valid_up_to = valid_before_multi;
    goto uxeof;
}

This ensures that if EOF is reached inside a MULTI block, valid_up_to is reverted to the position before the MULTI command, so the truncated file never contains an incomplete transaction. The fmterr handler lacks this check entirely.

Step-by-Step Proof

Consider an AOF file containing: SET key val, MULTI, SET a 1, <corruption>

  1. During loading with aof-load-broken=yes: SET key val executes successfully, valid_up_to advances. MULTI executes (sets CLIENT_MULTI flag, valid_before_multi saved). SET a 1 is queued (not executed), valid_up_to advances past it. Corruption is encountered, triggering goto fmterr.
  2. At the fmterr handler (line 1721): valid_up_to points to the offset after SET a 1 (inside the MULTI block). The handler truncates the file at this offset without checking CLIENT_MULTI.
  3. The recovered file now contains: SET key val, MULTI, SET a 1 — an incomplete MULTI/EXEC block with no EXEC.
  4. On the next server restart, this file is loaded. MULTI sets CLIENT_MULTI, SET a 1 is queued, then EOF is reached.
  5. The uxeof handler at line 1670 detects CLIENT_MULTI and reverts valid_up_to to valid_before_multi, then falls through to the uxeof label.
  6. If aof-load-truncated=no (a valid user configuration), the uxeof handler at line 1714-1718 logs "Unexpected end of file" and returns AOF_FAILED. The server refuses to start.

Impact

The first recovery silently produces an AOF file that cannot be loaded with aof-load-truncated=no. Even with aof-load-truncated=yes (the default), the recovered file requires an unnecessary second round of truncation on every restart until the incomplete MULTI block is removed. This defeats the purpose of the recovery — the file should be clean after truncation.

Fix

Add the same CLIENT_MULTI check at the start of the fmterr handler, before the truncation logic:

fmterr:
    if (fakeClient->flags & CLIENT_MULTI) {
        valid_up_to = valid_before_multi;
    }

This is consistent with the uxeof handler and ensures that truncation never leaves an incomplete MULTI/EXEC block in the recovered file.

Comment thread tests/integration/aof.tcl
}

# Remove all incr AOF files to make the base file being the last file
exec rm -f $aof_dirpath/appendonly.aof.*
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Two test bugs at lines 724 and 828-835: (1) exec rm -f $aof_dirpath/appendonly.aof.* is a no-op because Tcl exec does not expand shell globs — the literal * is passed to rm, so no files are deleted. The correct pattern is catch {exec rm -f {*}[glob $aof_dirpath/appendonly.aof.*]} as used elsewhere in the test suite (e.g., test_helper.tcl:260). (2) The "Corrupted last AOF file" test at line 833 passes vacuously because the intermediate test's truncate() call (aof.c:1728) already cleaned the corruption from the mid file before the swap, so the last file is never actually corrupt when loaded.

Extended reasoning...

Bug 1: Glob not expanded by Tcl exec (line 724)

At line 724, the command exec rm -f $aof_dirpath/appendonly.aof.* is intended to remove all incremental AOF files so that the base file becomes the last (and only) file. However, Tcl's exec command does NOT perform shell glob expansion. The asterisk character is passed literally to the rm command as a filename argument. Since rm itself also does not expand globs (that's the shell's job), no files are actually deleted.

The existing Redis test codebase handles this correctly in two other locations:

  • test_helper.tcl:260: catch {exec rm -rf {*}[glob tests/tmp/redis.conf.*]} — uses Tcl's [glob ...] command for pattern expansion and {*} for list expansion into individual arguments
  • aofmanifest.tcl:130: catch {eval exec rm -rf [glob $aof_dirpath]} — uses [glob ...] with eval for expansion

The correct fix would be: catch {exec rm -f {*}[glob $aof_dirpath/appendonly.aof.*]}

Step-by-step proof: After the first start_server_aof_ex test completes (line 710-720), the AOF directory contains both the base file (appendonly.aof.1.base.aof with corruption already truncated by the server's recovery) and the incr file (appendonly.aof.1.incr.aof). Line 724 attempts exec rm -f $aof_dirpath/appendonly.aof.*. Tcl passes the literal string <path>/appendonly.aof.* to rm. rm looks for a file literally named appendonly.aof.* — no such file exists. rm -f silently ignores this (the -f flag suppresses errors for nonexistent files). The incr file remains on disk. The subsequent test at line 725 ("Corrupted base AOF (last file): should recover") starts the server, which loads both files. Since the base file's corruption was already truncated by the previous server run's truncate() call at aof.c:1728, both files are now valid, and the server starts successfully. The test passes, but for the wrong reason — the base file is NOT the last file.

Bug 2: File-swap test does not exercise recovery (lines 828-835)

The test "Corrupted last AOF file: Server should still start and recover" at line 833 is supposed to verify that when the last AOF file has corruption, the server can recover using the aof-load-broken feature. However, the file it tests is already clean.

Step-by-step proof of the vacuous test:

  1. Lines 802-809 create three AOF files: base (clean), mid (contains "set fo mid" + "CORRUPTION"), and last (contains "set fo last")
  2. Line 817: start_server_aof_ex starts a server that loads these files. When loading mid_aof_file, the server encounters "CORRUPTION"
  3. The fmterr handler in aof.c:1724-1744 fires. Since aof_load_broken is enabled and the corruption size is small, truncate(aof_filepath, valid_up_to) is called at line 1728. This physically truncates mid_aof_file on disk, removing "CORRUPTION" and leaving only "set fo mid"
  4. The function returns AOF_BROKEN_RECOVERED. Back in loadAppendOnlyFiles (aof.c:1867-1870), since this is NOT the last file, ret is set to AOF_FAILED and the server exits with "Fatal error: the truncated file is not the last file"
  5. After server shutdown, mid_aof_file on disk now contains only set fo mid — it is already clean
  6. Lines 828-830 swap mid and last files. The now-clean mid content goes to last_aof_file, and last's clean content goes to mid_aof_file
  7. Line 833: The final test loads all three files. All are clean. fo equals "mid" (from the swapped last file). The test passes without ever exercising the AOF_BROKEN_RECOVERED recovery path on the last file

To properly test this, a fresh corrupted file should be created after the intermediate test rather than reusing the file that was already truncated by the server's recovery logic.

Comment thread src/aof.c
Comment on lines +1841 to 1843
if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) {
goto cleanup;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: AOF_BROKEN_RECOVERED is incorrectly treated as an error in multiple locations. At line 1841, the base-file error check includes AOF_BROKEN_RECOVERED, causing successful recovery of the last base file to skip server.aof_current_size and server.aof_rewrite_base_size initialization (contrast with the INCR section at line 1875 which correctly omits it). Additionally, stopLoading() at line 1894 does not recognize AOF_BROKEN_RECOVERED as success, so modules receive a spurious LOADING_FAILED event. Finally, DEBUG LOADAOF in debug.c:600 treats AOF_BROKEN_RECOVERED as an error.

Extended reasoning...

What the bug is

The PR introduces a new return value AOF_BROKEN_RECOVERED (value 6) from loadSingleAppendOnlyFile() to indicate that a broken AOF was successfully recovered via truncation. However, three locations in the codebase fail to treat this new status as a success, causing incorrect behavior after successful recovery.

Bug 1: Base-file error check at line 1841

In loadAppendOnlyFiles(), the error check for base AOF files at line 1841 is:

if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) {
    goto cleanup;
}

This unconditionally sends AOF_BROKEN_RECOVERED to cleanup, even when the base file is the last file and recovery was successful. Compare with the INCR file section at line 1875 which correctly omits AOF_BROKEN_RECOVERED:

if (ret == AOF_OPEN_ERR || ret == AOF_FAILED) {
    goto cleanup;
}

Also compare with how AOF_TRUNCATED is handled for base files: AOF_TRUNCATED does NOT appear in line 1841, so a truncated last base file correctly falls through to set server.aof_current_size and server.aof_rewrite_base_size. The asymmetry between AOF_TRUNCATED and AOF_BROKEN_RECOVERED handling for base files is clearly unintentional.

Step-by-step proof for Bug 1

Scenario: Single base AOF file (no incr files) with corruption that gets successfully recovered.

  1. loadSingleAppendOnlyFile() returns AOF_BROKEN_RECOVERED
  2. Line 1830: Logs DB loaded -- correct, since (ret == AOF_BROKEN_RECOVERED && last_file) is true
  3. Line 1836: Does NOT set ret = AOF_FAILED because last_file is true -- correct
  4. Line 1841: Matches ret == AOF_BROKEN_RECOVERED and jumps to cleanup -- BUG
  5. Lines 1881 and 1891 are skipped: server.aof_current_size stays 0, server.aof_rewrite_base_size stays 0
  6. This causes incorrect AOF rewrite triggering since the server thinks the AOF is empty

Bug 2: stopLoading at line 1894

At cleanup, line 1894 calls:

stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED);

This does not include AOF_BROKEN_RECOVERED. When ret == AOF_BROKEN_RECOVERED, stopLoading(false) is called, which fires REDISMODULE_SUBEVENT_LOADING_FAILED instead of REDISMODULE_SUBEVENT_LOADING_ENDED. This incorrectly tells all loaded modules that AOF loading failed, even though recovery was successful. This affects both code paths: base AOF (via the goto at line 1841) and INCR AOF (which correctly falls through the while loop but still reaches line 1894). The fix: stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED).

Bug 3: DEBUG LOADAOF in debug.c:600

The DEBUG LOADAOF command checks:

if (ret != AOF_OK && ret != AOF_EMPTY) {
    addReplyError(c, ...);
    return;
}

When loadAppendOnlyFiles() returns AOF_BROKEN_RECOVERED (from the INCR path which correctly passes it through), DEBUG LOADAOF reports an error to the client even though recovery succeeded. The fix should add && ret != AOF_BROKEN_RECOVERED to the condition.

Impact

  1. server.aof_current_size and server.aof_rewrite_base_size remain at 0 after successful base-file recovery, causing premature or incorrect AOF rewrites
  2. Modules receive false loading-failure notifications, potentially triggering error-handling logic in module code
  3. DEBUG LOADAOF reports failure to clients after successful recovery

Fix

  1. Remove ret == AOF_BROKEN_RECOVERED from line 1841, making it consistent with line 1875
  2. Add || ret == AOF_BROKEN_RECOVERED to the stopLoading() call at line 1894
  3. Add ret != AOF_BROKEN_RECOVERED to the error check in debug.c:600

Comment thread src/aof.c
if (server.aof_load_broken) {
if (valid_up_to == -1) {
serverLog(LL_WARNING,"Last valid command offset is invalid");
} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Off-by-one in aof-load-broken-max-size comparison: line 1727 uses < instead of <=, so when the corrupted portion is exactly aof_load_broken_max_size bytes, recovery is skipped even though the config name "max-size" implies an inclusive upper bound. Additionally, the else-branch log at line 1750 incorrectly says "is bigger than" when the sizes are equal. Change < to <= to match the documented semantics.

Extended reasoning...

What the bug is

At src/aof.c:1727, the comparison that decides whether to auto-recover a corrupted AOF tail uses strict less-than (<):

} else if ((size_t)(sb.st_size - valid_up_to) < (size_t)server.aof_load_broken_max_size) {

The config is named aof-load-broken-max-size, and the redis.conf documentation describes it as "the maximum size in bytes of corrupted data that can be automatically truncated." The word "maximum" implies an inclusive bound — i.e., corruption of up to and including that many bytes should be recoverable.

How it manifests

When the corrupted portion of the AOF is exactly equal to aof_load_broken_max_size bytes, the < comparison evaluates to false, so recovery is not attempted. Control falls through to the else branch at line 1746, which logs:

"broken size ... is bigger than aof-load-broken-max-size"

This log message is factually incorrect in the equal case — the sizes are the same, not one bigger than the other.

Step-by-step proof

  1. User sets aof-load-broken yes and aof-load-broken-max-size 4096.
  2. An AOF file gets corrupted such that the corrupted tail is exactly 4096 bytes: sb.st_size - valid_up_to == 4096.
  3. At line 1727: (size_t)(4096) < (size_t)(4096) evaluates to false.
  4. The recovery branch is skipped entirely.
  5. Execution falls to the else at line 1746, which logs "broken size 4096 is bigger than aof-load-broken-max-size 4096" — incorrect, they are equal.
  6. ret is set to AOF_FAILED and the server refuses to start.
  7. If <= were used instead, step 3 would evaluate to true, and recovery would proceed as the user intended.

Impact

This is a minor off-by-one that only triggers at the exact boundary value. In practice, the corrupted portion would need to be exactly the configured byte count, making this extremely unlikely. However, it represents a semantic inconsistency between the config name/documentation ("maximum size") and the implementation (strictly less than), and the misleading log message could confuse operators debugging a recovery failure.

Fix

Change < to <= at line 1727:

} else if ((size_t)(sb.st_size - valid_up_to) <= (size_t)server.aof_load_broken_max_size) {

This makes the comparison match the "max-size" semantics and ensures the else-branch log message ("is bigger than") is always accurate when reached.

Comment thread src/config.c
createTimeTConfig("repl-backlog-ttl", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.repl_backlog_time_limit, 60*60, INTEGER_CONFIG, NULL, NULL), /* Default: 1 hour */
createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL),
createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL),
createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 aof-load-broken-max-size uses INTEGER_CONFIG but represents a size in bytes. It should use MEMORY_CONFIG (like the adjacent auto-aof-rewrite-min-size on line 3256) to allow human-readable suffixes such as 4kb, 1mb, etc. Currently users must specify raw byte values.

Extended reasoning...

What the bug is

The new aof-load-broken-max-size config parameter is registered with INTEGER_CONFIG at line 3258, but it represents a size in bytes (controlling the maximum size of corrupted AOF data that can be automatically truncated). Redis has two config display types for numeric values: INTEGER_CONFIG (plain numbers) and MEMORY_CONFIG (supports human-readable suffixes like kb, mb, gb).

How it manifests

With INTEGER_CONFIG, users must specify raw byte values (e.g., 4096). They cannot use the friendlier notation like 4kb or 1mb that Redis supports for other size-in-bytes parameters. If a user writes aof-load-broken-max-size 4kb in their config file, it will fail to parse.

Inconsistency with adjacent configs

Looking at the surrounding code, auto-aof-rewrite-min-size (line 3256) is also a createOffTConfig representing a byte size, and it uses MEMORY_CONFIG. The redis.conf documentation for this PR also describes aof-load-broken-max-size as "the maximum size in bytes," which is exactly the kind of parameter that conventionally uses MEMORY_CONFIG in Redis. While loading-process-events-interval-bytes (line 3257) also uses INTEGER_CONFIG, that one is a HIDDEN_CONFIG not exposed to users, so the precedent is less relevant.

Step-by-step proof

  1. Line 3256: createOffTConfig("auto-aof-rewrite-min-size", ..., MEMORY_CONFIG, ...) — a byte-size AOF config using MEMORY_CONFIG.
  2. Line 3258: createOffTConfig("aof-load-broken-max-size", ..., INTEGER_CONFIG, ...) — a byte-size AOF config using INTEGER_CONFIG.
  3. Both are user-facing off_t configs representing data sizes in bytes. The inconsistency means aof-load-broken-max-size lacks human-readable suffix support.

Impact and fix

This is a minor usability inconsistency. The config works correctly with raw byte values. The fix is simple: change INTEGER_CONFIG to MEMORY_CONFIG on line 3258. This would allow users to write aof-load-broken-max-size 4kb instead of aof-load-broken-max-size 4096, consistent with the majority of size-in-bytes configs in Redis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants