main: fix incorrect directory entries due to unstable iteration order #448
Merged
giuseppe merged 1 commit into containers:main on Nov 5, 2025
When a directory is opened, closed, and reopened, which in turn causes nodes to be freed and recreated, the iteration order of entries can change. This causes the kernel FUSE layer's offset-based caching to skip or duplicate entries.

In https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/9408, we have been tracking a problem with GitLab failing to load a file on a Fedora host with `podman` and `fuse-overlayfs`. When the problem occurs, we observed:

- Files exist but don't appear in `ls` or directory listings.
- Some files appear duplicated in directory listings.
- `touch directory` temporarily fixes the issue.
- This affects directories with many entries (>100 files).
- Only happens when directory is accessed multiple times.

The environment:

- fuse-overlayfs 1.15
- Nested containers (Docker → Podman → fuse-overlayfs)
- Directories with 200+ entries

FUSE readdir uses offsets as opaque cookies to track position in directory listings:

1. Kernel calls `readdir(offset=0)` → gets entries 0-100
2. Kernel caches "offset 50 = file_x.rb"
3. Kernel calls `readdir(offset=100)` → continues from position 100

The kernel assumes offsets are stable identifiers across the lifetime of the directory handle.

fuse-overlayfs stores directory entries in a hash table. When entries are freed (on `closedir`) and recreated (on next `opendir`), the hash iteration order can change due to:

- **Free entry list recycling** - Hash entries freed in one order, reallocated in LIFO order
- **Bucket chain positions** - Entries may land in different positions within collision chains
- **Hash table resize** - If resize happens between loads, all positions change

The bug sequence:

```
Session 1:
  opendir("entities/")
  readdir(offset=0...)
    → package_version.rb at position 23
  Kernel caches: "I've read through offset 50, including package_version.rb at 23"
  closedir() → all nodes freed, hash cleared

Session 2 (same directory, moments later):
  opendir("entities/") → nodes recreated
  readdir(offset=0...)
    → package_version.rb now at position 22 (ORDER CHANGED!)
  Kernel: "I already have offsets 0-50 cached"
  Kernel: "Offset 22 < 50, so I already returned this, skip it"
  Result: package_version.rb skipped, missing from final ls output!
```

For GitLab, newer versions of Zeitwerk (Ruby autoloader) access directories more frequently during initialization, increasing the chance of hitting this scenario. The bug exists regardless, but timing affects visibility.

This is difficult to reproduce with simple shell commands because it requires:

1. Multiple opendir/closedir cycles on same directory
2. Nodes being freed and recreated between cycles
3. Application reading directory results that span these cycles

The fix:

**Ensure stable directory entry ordering by sorting entries alphabetically in `reload_tbl`.**

This guarantees:

- Same filename always at same offset
- Kernel cache assumptions remain valid
- No entries skipped or duplicated

Increasing initial hash size from 128 to 512 reduces the problem by making iteration order more stable (fewer collisions, less churn in bucket chains), but doesn't eliminate it. Sorting is the proper fix.

With this patch:

- Directory listings are stable across multiple opendir/closedir cycles
- No missing or duplicate entries
- Performance impact: O(n log n) sort on ~200-500 entries ≈ microseconds

Closes containers#447

Signed-off-by: Stan Hu <stanhu@gmail.com>
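For illustration only, the idea behind the fix can be sketched in a few lines of C. This is not the code from the patch: `struct dir_entry`, `compare_entries`, `sort_entries`, and `offset_of` are hypothetical names, and the real `reload_tbl` works on fuse-overlayfs's own node type. The sketch only shows that sorting a snapshot of entries by name with `qsort` gives each name the same index no matter what order the hash table produced them in, as long as the set of names is unchanged.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a directory entry; not the fuse-overlayfs node type. */
struct dir_entry {
    const char *name;
};

/* qsort comparator: byte-wise ordering of entry names via strcmp. */
int compare_entries(const void *a, const void *b)
{
    const struct dir_entry *ea = a;
    const struct dir_entry *eb = b;
    return strcmp(ea->name, eb->name);
}

/* Sort a snapshot of entries in place. After sorting, the index of a name
 * depends only on the set of names, not on the order they arrived in. */
void sort_entries(struct dir_entry *entries, size_t n)
{
    assert(n == 0 || entries != NULL);
    qsort(entries, n, sizeof(entries[0]), compare_entries);
}

/* Return the index ("offset") of name in the snapshot, or -1 if absent. */
long offset_of(const struct dir_entry *entries, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(entries[i].name, name) == 0)
            return (long) i;
    }
    return -1;
}
```

Two snapshots holding the same names in different arrival orders, for example `{package_version.rb, user.rb, issue.rb}` and `{issue.rb, package_version.rb, user.rb}`, sort to the same sequence, so `offset_of` returns the same index for `package_version.rb` in both.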
80b6c17 to 4266f68
Contributor
Author
@giuseppe Would you mind reviewing this?
mathstuf approved these changes on Nov 4, 2025 and left a comment:
I have pulled a scratch build of fuse-overlayfs-1.13 with this patch into our CI which has been very consistently running into this issue. It works!
https://gitlab.kitware.com/utils/ghostflow-director/-/jobs/11881759#L46
Fedora backport PR (I grabbed the build from the Koji scratch build for this): https://src.fedoraproject.org/rpms/fuse-overlayfs/pull-request/6