Skip to content

db/state: fix spurious accessor rebuild for post-merge subset files#19965

Closed
anacrolix wants to merge 6 commits intomainfrom
anacrolix/removed-file-unexpectedly-exists
Closed

db/state: fix spurious accessor rebuild for post-merge subset files#19965
anacrolix wants to merge 6 commits intomainfrom
anacrolix/removed-file-unexpectedly-exists

Conversation

@anacrolix
Copy link
Copy Markdown
Contributor

Fixes #19797

Root cause

After a merge, old sub-range .ef files stay on disk while held open by active readers. OpenFolder/scanDirtyFiles re-adds them to dirtyFiles. BuildMissedAccessors then finds these re-added items have no .efi accessor (already deleted as part of merge cleanup) and rebuilds them, producing orphaned .efi files at paths that dir.trackRemovedFiles was told were deleted — triggering the repeated "Removed file unexpectedly exists" warnings.

Fix

In fileItemsWithMissedAccessors: skip items that are proper subsets of another item already present in dirtyFiles. Their accessor should not be rebuilt because the superset file supersedes them.

Observability

Added ERIGON_AGG_TRACE_FILE_LIFE logging to the skip path:

  • Not set → Debug (silent by default)
  • Set to empty (ERIGON_AGG_TRACE_FILE_LIFE=) → Warn for all files
  • Set to a substring → Warn only for matching filename bases

Also adds dbg.EnvStringLookup to distinguish "not set" from "set to empty string".

After a merge, the old sub-range .ef files stay on disk while held open
by active readers. OpenFolder/scanDirtyFiles re-adds them to dirtyFiles.
BuildMissedAccessors then found these re-added items had no .efi accessor
(already deleted as part of merge cleanup) and rebuilt them, producing
orphaned .efi files at paths that dir.trackRemovedFiles had been told were
deleted, triggering repeated "Removed file unexpectedly exists" warnings.

Fix: skip items in fileItemsWithMissedAccessors that are proper subsets
of another item already in dirtyFiles — their accessor should not be
rebuilt because the superset file supersedes them.

Add ERIGON_AGG_TRACE_FILE_LIFE logging to observe when the condition
fires: Debug by default; set the env var to empty to escalate all to
Warn, or to a substring to filter by filename base.

Also add dbg.EnvStringLookup to distinguish "not set" from "set to
empty string".

Fixes #19797
Add ERIGON_AGG_TRACE_FILE_LIFE="" to the two QA workflows that showed
the "Removed file unexpectedly exists" failures (qa-sync-from-scratch-
minimal-node and qa-sync-with-externalcl). This escalates all agg file
lifecycle debug logs to Warn level so the fix from #19797 is observable
in CI output.
@anacrolix anacrolix force-pushed the anacrolix/removed-file-unexpectedly-exists branch from 4b2876e to d8b2673 Compare March 18, 2026 02:57
Comment thread db/state/dirty_files.go
// "Removed file unexpectedly exists" warnings.
var coveringItem *FilesItem
for j, other := range dirtyFiles {
if i != j && item.isProperSubsetOf(other) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right reason!
But also @sudeepdino008 added DependencyIntegrityChecker - which has case when sub-set files win.

Example: if account.0-4.kv and account.0-8.kv exist - we still can decide to add account.0-4.kv to "visible files list" because commitment.0-8.kv merge is not completed yet. In this case "sub-set" files will win.

In my head:

  • maybe logic of "which in indices are missed can be implemented inside DependencyIntegrityChecker
  • or even worse: remove BuildMissedIndices at all in favor of go agg.MergeLoop (we have it in backend.go already.

cc: @JkLondon maybe it's side-effect from moving BuildMissedInidces to background

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it looks like a race where files aren't removed yet, then get rebuilt by BuildMissedIndices.

the other option is to nuke them from teh filesystem (and if we care/want to support windows, switch to opening these files on windows with the no exclusive flag to replicate unix behaviour (this is actually available on Windows, just not provided by default)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run the 5 CI jobs that warned about the issue, and it should now warn when it sidesteps the problem, so the logs will be interesting.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ef is held on disk, but accessors is deleted -- when does this happen?
"already deleted as part of merge cleanup" - closeWhatNotInList removes both together.

another solution can be to skip using FileItem.canDelete in fileItemsWithMissedAccessors, but it's not clear exactly when this is happening.

Copy link
Copy Markdown
Member

@sudeepdino008 sudeepdino008 Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS

  • BuildMissedIndices refcounts++ on dirty files; some subsetfile dirtyFile_A_instance1 is present.
  • mergeLoop marks this subset file for deletion, but doesn't closeAndRemove because refcount>0; but it does remove the dirtyfile from domain.dirtyFiles
  • OpenFolder is called from somewhere (onNewSnapshots or previous BMI call), it reloads same file as dirtyFile_A_instance2.
  • then new mergeLoop doescloseAndRemoves on dirtyFile_A_instance2
  • but BMI (still running), which has dirtyFiles_A_instance1; it has datafile, but finds kvi/kvei/bt is not there in disk; proceeds to rebuild

So i think there are couple of things going on here which are not ideal:

  1. BuildMissedIndices produces orphan files item which will prevent complete removal (inode count=0) from disk, because dirtyFile_A_instance1 is left hanging (dirtyFilesRoTx.Close() doesn't call closeAndRemove) -- so there's probably a file descriptor leak here, even after gc.
  2. BuildMissedIndices can start producing files for the closeAndRemove'd file; and once it does, we have no way to "stop the warning"
  3. OpenFolder can reload a new instance of dirtyFile even though a previous instance was marked for deletion.

I'm thinking about the solution -- in particular about splitting closeAndRemove - to close and remove.

  • and then call remove whenever dirtyFile.canDelete is set to true
  • close is called when refcnt becomes 0

this should solve 3. and maybe also 1. But i need to think more about item 2. in previous section.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, I prefer:

  • I like BG BMI -- and we can afford to have it; also even if we remove it, sync BMI and MergeLoop are not synchronized (it doesn't hit us in practice, but in principle we should solve it).
  • db/state: fix spurious accessor rebuild for post-merge subset files #19987 -- is good; it cleans up orphan filesItem, so those are not a problem anymore
  • BuildMissedIndicesInBackground can start producing files for a subset file (which can be then closeAndRemove by mergeLoop); and once it does, we have no way to "stop the warning" -- this can be prevented by merging current PR.

so i'm in favour of merging this PR atleast. We can even afford bg BMI then.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't #19989 make this PR unnecessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated to pull the merged changes. I wonder if we want to run with the logging change. I will watch teh workflows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anacrolix anacrolix marked this pull request as ready for review March 18, 2026 03:11
@anacrolix
Copy link
Copy Markdown
Contributor Author

Manually dispatched the affected workflows on this branch to verify the fix:

With ERIGON_AGG_TRACE_FILE_LIFE="" set, any remaining occurrences of the fix triggering will show as Warn-level [agg.dbg] fileItemsWithMissedAccessors: skipping subset item in the logs.

@anacrolix
Copy link
Copy Markdown
Contributor Author

Needed to merge new ci gate

@anacrolix
Copy link
Copy Markdown
Contributor Author

Checking if #19797 still occurs since #19989.

@anacrolix
Copy link
Copy Markdown
Contributor Author

Superseded by the cherry-pick in #20077, which landed on main 2026-03-23. See #19797 (comment) for confirmation that the issue no longer occurs as of that merge.

@anacrolix anacrolix closed this Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

removed file unexpectedly exists on workflows

3 participants