db/state: fix spurious accessor rebuild for post-merge subset files#19965
db/state: fix spurious accessor rebuild for post-merge subset files#19965
Conversation
After a merge, the old sub-range .ef files stay on disk while held open by active readers. OpenFolder/scanDirtyFiles re-adds them to dirtyFiles. BuildMissedAccessors then found these re-added items had no .efi accessor (already deleted as part of merge cleanup) and rebuilt them, producing orphaned .efi files at paths that dir.trackRemovedFiles had been told were deleted, triggering repeated "Removed file unexpectedly exists" warnings. Fix: skip items in fileItemsWithMissedAccessors that are proper subsets of another item already in dirtyFiles — their accessor should not be rebuilt because the superset file supersedes them. Add ERIGON_AGG_TRACE_FILE_LIFE logging to observe when the condition fires: Debug by default; set the env var to empty to escalate all to Warn, or to a substring to filter by filename base. Also add dbg.EnvStringLookup to distinguish "not set" from "set to empty string". Fixes #19797
Add ERIGON_AGG_TRACE_FILE_LIFE="" to the two QA workflows that showed the "Removed file unexpectedly exists" failures (qa-sync-from-scratch- minimal-node and qa-sync-with-externalcl). This escalates all agg file lifecycle debug logs to Warn level so the fix from #19797 is observable in CI output.
4b2876e to
d8b2673
Compare
| // "Removed file unexpectedly exists" warnings. | ||
| var coveringItem *FilesItem | ||
| for j, other := range dirtyFiles { | ||
| if i != j && item.isProperSubsetOf(other) { |
There was a problem hiding this comment.
right reason!
But also @sudeepdino008 added DependencyIntegrityChecker - which has case when sub-set files win.
Example: if account.0-4.kv and account.0-8.kv exist - we still can decide to add account.0-4.kv to "visible files list" because commitment.0-8.kv merge is not completed yet. In this case "sub-set" files will win.
In my head:
- maybe logic of "which in indices are missed can be implemented inside
DependencyIntegrityChecker - or even worse: remove
BuildMissedIndicesat all in favor ofgo agg.MergeLoop(we have it inbackend.goalready.
cc: @JkLondon maybe it's side-effect from moving BuildMissedInidces to background
There was a problem hiding this comment.
yeah it looks like a race where files aren't removed yet, then get rebuilt by BuildMissedIndices.
the other option is to nuke them from teh filesystem (and if we care/want to support windows, switch to opening these files on windows with the no exclusive flag to replicate unix behaviour (this is actually available on Windows, just not provided by default)
There was a problem hiding this comment.
I've run the 5 CI jobs that warned about the issue, and it should now warn when it sidesteps the problem, so the logs will be interesting.
There was a problem hiding this comment.
ef is held on disk, but accessors is deleted -- when does this happen?
"already deleted as part of merge cleanup" - closeWhatNotInList removes both together.
another solution can be to skip using FileItem.canDelete in fileItemsWithMissedAccessors, but it's not clear exactly when this is happening.
There was a problem hiding this comment.
AFAICS
BuildMissedIndicesrefcounts++ on dirty files; some subsetfiledirtyFile_A_instance1is present.- mergeLoop marks this subset file for deletion, but doesn't
closeAndRemovebecause refcount>0; but it does remove the dirtyfile from domain.dirtyFiles OpenFolderis called from somewhere (onNewSnapshotsor previous BMI call), it reloads same file asdirtyFile_A_instance2.- then new
mergeLoopdoescloseAndRemoves ondirtyFile_A_instance2 - but BMI (still running), which has
dirtyFiles_A_instance1; it has datafile, but finds kvi/kvei/bt is not there in disk; proceeds to rebuild
So i think there are couple of things going on here which are not ideal:
BuildMissedIndicesproduces orphan files item which will prevent complete removal (inode count=0) from disk, because dirtyFile_A_instance1 is left hanging (dirtyFilesRoTx.Close()doesn't callcloseAndRemove) -- so there's probably a file descriptor leak here, even after gc.- BuildMissedIndices can start producing files for the closeAndRemove'd file; and once it does, we have no way to "stop the warning"
OpenFoldercan reload a new instance of dirtyFile even though a previous instance was marked for deletion.
I'm thinking about the solution -- in particular about splitting closeAndRemove - to close and remove.
- and then call remove whenever dirtyFile.canDelete is set to true
- close is called when refcnt becomes 0
this should solve 3. and maybe also 1. But i need to think more about item 2. in previous section.
There was a problem hiding this comment.
IMO, I prefer:
- I like BG BMI -- and we can afford to have it; also even if we remove it, sync BMI and MergeLoop are not synchronized (it doesn't hit us in practice, but in principle we should solve it).
- db/state: fix spurious accessor rebuild for post-merge subset files #19987 -- is good; it cleans up orphan filesItem, so those are not a problem anymore
- BuildMissedIndicesInBackground can start producing files for a subset file (which can be then closeAndRemove by mergeLoop); and once it does, we have no way to "stop the warning" -- this can be prevented by merging current PR.
so i'm in favour of merging this PR atleast. We can even afford bg BMI then.
There was a problem hiding this comment.
I've updated to pull the merged changes. I wonder if we want to run with the logging change. I will watch teh workflows.
There was a problem hiding this comment.
|
Manually dispatched the affected workflows on this branch to verify the fix:
With |
|
Needed to merge new ci gate |
|
Superseded by the cherry-pick in #20077, which landed on main 2026-03-23. See #19797 (comment) for confirmation that the issue no longer occurs as of that merge. |
Fixes #19797
Root cause
After a merge, old sub-range
.effiles stay on disk while held open by active readers.OpenFolder/scanDirtyFilesre-adds them todirtyFiles.BuildMissedAccessorsthen finds these re-added items have no.efiaccessor (already deleted as part of merge cleanup) and rebuilds them, producing orphaned.efifiles at paths thatdir.trackRemovedFileswas told were deleted — triggering the repeated "Removed file unexpectedly exists" warnings.Fix
In
fileItemsWithMissedAccessors: skip items that are proper subsets of another item already present indirtyFiles. Their accessor should not be rebuilt because the superset file supersedes them.Observability
Added
ERIGON_AGG_TRACE_FILE_LIFElogging to the skip path:Debug(silent by default)ERIGON_AGG_TRACE_FILE_LIFE=) →Warnfor all filesWarnonly for matching filename basesAlso adds
dbg.EnvStringLookupto distinguish "not set" from "set to empty string".