Skip to content

fix: recover from missing .active file during rotation (#386)#399

Merged
rcoh merged 2 commits into
mainfrom
fix-active-files
May 14, 2026
Merged

fix: recover from missing .active file during rotation (#386)#399
rcoh merged 2 commits into
mainfrom
fix-active-files

Conversation

@rcoh
Copy link
Copy Markdown
Contributor

@rcoh rcoh commented May 13, 2026

If the active trace file or its parent directory is removed externally (operator, log rotation, container teardown), RotatingWriter::rotate() fails at fs::rename(.active, .bin) or File::create(new_path). Before this fix the error propagated out of drained() without advancing next_drain_time / next_rotation_time, so should_drain() kept firing on every 5ms flush-loop tick — busy loop.

Recovery:

  • Advance both timers up front in rotate() so any failure below cannot cause busy-spinning.
  • Tolerate NotFound from the rename: log a rate-limited warning, skip pushing to closed_files, and start a fresh segment.
  • Tolerate NotFound from File::create by retrying create_dir_all once.
  • On any unrecoverable error, mark the writer Finished so it stops cleanly instead of retrying every cycle.
  • Apply the same NotFound tolerance to finalize().

Fixes #386

If the active trace file or its parent directory is removed externally
(operator, log rotation, container teardown), `RotatingWriter::rotate()`
fails at `fs::rename(.active, .bin)` or `File::create(new_path)`. Before
this fix the error propagated out of `drained()` without advancing
`next_drain_time` / `next_rotation_time`, so `should_drain()` kept
firing on every 5ms flush-loop tick — busy loop.

Recovery:
- Advance both timers up front in rotate() so any failure below cannot
  cause busy-spinning.
- Tolerate NotFound from the rename: log a rate-limited warning, skip
  pushing to closed_files, and start a fresh segment.
- Tolerate NotFound from File::create by retrying create_dir_all once.
- On any unrecoverable error, mark the writer Finished so it stops
  cleanly instead of retrying every cycle.
- Apply the same NotFound tolerance to finalize().
@rcoh rcoh requested review from jlizen and yulnr and removed request for yulnr May 13, 2026 14:49
};
let writer = BufWriter::new(file);
self.state = Self::prepare_segment(writer)?;
self.state = match Self::prepare_segment(writer) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with a fail on this we would be leaving an orphan file created @479 and an inconsistent state. No big deal, but probably worth to take into account

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it would be good to remove the file inside our error handling block.

};
let writer = BufWriter::new(file);
self.state = Self::prepare_segment(writer)?;
self.state = match Self::prepare_segment(writer) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it would be good to remove the file inside our error handling block.

@rcoh rcoh enabled auto-merge May 14, 2026 18:01
@rcoh rcoh added this pull request to the merge queue May 14, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 14, 2026
@rcoh rcoh added this pull request to the merge queue May 14, 2026
Merged via the queue into main with commit 02a49ef May 14, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

failure handling: drained failure

3 participants