fix(modern_bpf): resolve dirfd path in kernel space to prevent race conditions#2817
fix(modern_bpf): resolve dirfd path in kernel space to prevent race conditions#2817irozzo-1A wants to merge 9 commits intofalcosecurity:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: irozzo-1A The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Please double check driver/SCHEMA_VERSION file. See versioning. /hold |
37359ae to
fd8b1a1
Compare
Perf diff from master - unit testsHeap diff from master - unit testsHeap diff from master - scap fileBenchmarks diff from master |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #2817 +/- ##
==========================================
- Coverage 74.53% 74.50% -0.03%
==========================================
Files 292 294 +2
Lines 29987 30623 +636
Branches 4660 4852 +192
==========================================
+ Hits 22350 22815 +465
- Misses 7637 7808 +171
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
66c0c92 to
217460d
Compare
cfee24e to
f0bcd94
Compare
…onditions This patch resolves the directory file descriptor path in kernel space at syscall time, preventing race conditions where the dirfd may point to a different directory by the time user space processes the event. The race condition occurs when: 1. openat(dirfd=3, name="hosts") is called with dirfd pointing to /etc 2. Event is captured in kernel space with dirfd=3 3. Process closes dirfd=3 and opens a new file, reusing FD number 3 for a different directory (e.g., /dev) 4. User space resolves dirfd=3 at processing time via /proc/<pid>/fd/3, which now points to /dev instead of /etc 5. Result: /dev/hosts instead of /etc/hosts By resolving the path in kernel space, we capture the actual directory at the moment of the syscall, eliminating the race condition. Changes: - Append new parameter 'dirfdpath' (PT_FSPATH) to existing event versions (PPME_SYSCALL_OPENAT_2_X and PPME_SYSCALL_OPENAT2_X) for backward compatibility with old scap files - Resolve dirfd path in BPF programs using extract__file_struct_from_fd() and auxmap__store_d_path_approx() - For AT_FDCWD, capture CWD from task_struct->fs->pwd in kernel space - Update user space to prefer kernel-resolved path over user-space resolution - Add comprehensive test coverage for both AT_FDCWD and real dirfd cases Related to: falcosecurity/falco#3789 Signed-off-by: irozzo-1A <iacopo@sysdig.com>
f0bcd94 to
d8eb01b
Compare
|
I was adding a comment related to why we don't take into account |
|
Me and @irozzo-1A had a discussion and agreed on trying to see if we could leverage the returned file descriptor to obtain the full resolved path. |
If I recall correctly, we do exactly the same in One thing that we should probably take into account is the possible performance overhead and ring buffer availability.
|
@Andreagit97 That's a good point, performance is definitely the most sensitive aspect here, on the other hand I don't see alternatives if we want to address this race condition. I'm open to suggestions of course 😄 |
…e conditions This commit adds a new 'fullpath' parameter to PPME_SYSCALL_OPENAT_2_X and PPME_SYSCALL_OPENAT2_X events that captures the kernel-resolved full path of the opened file directly from the returned file descriptor at syscall time. Changes: - Event schema: Added 'fullpath' parameter (PT_FSPATH) to openat/openat2 exit events - Modern BPF: Extract full path from returned FD using extract__file_struct_from_fd() and auxmap__store_d_path_approx(), capturing the actual resolved path including inode numbers for O_TMPFILE files (e.g., /tmp/#123456) - Legacy BPF/Kernel module: Push empty parameter (fallback to userspace resolution) - Userspace: Updated parsers to use kernel-resolved fullpath when available - Tests: Updated to construct expected paths from CWD + inode for O_TMPFILE, and CWD + filename for regular files - Converter: Added conversion rules for older event formats This prevents TOCTOU race conditions where the file descriptor table or process state might change between syscall capture and userspace processing, leading to incorrect path resolution (e.g., /dev/hosts instead of /etc/hosts). For O_TMPFILE files, the kernel captures the full path including the inode number (e.g., /tmp/#123456), which is more accurate than reconstructing from the directory path alone. Signed-off-by: irozzo-1A <iacopo@sysdig.com>
- Add explanatory comment about O_TMPFILE path format (inode as filename with '#' prefix) - Remove unused expected_fullpath variable in failure test cases Signed-off-by: irozzo-1A <iacopo@sysdig.com>
Thanks @ekoops for proposing the idea, it looks very promising and makes the resolution code even simpler. I'll try to assess the performance impact of the change as suggested by @Andreagit97 |
Update e2e test regex patterns to match the renamed parameter from dirfdpath to fullpath in openat/openat2 events. - test_file_writes.py: Update create_expected_arg() and create_expected_arg_for_dev() - test_read_sensitive_file.py: Update expected event regex pattern Signed-off-by: irozzo-1A <iacopo@sysdig.com>
If I recall correctly, DataDog implements an LRU cache in eBPF to store partial paths so that it is not necessary to recompute the whole path of the file. There was a GitHub repo with the benchmarks, but I cannot find it anymore :/ Btw, this implementation is pretty complex, it would make sense to use it if we decide to reconstruct the path kernel side for all the syscalls (mkdirat, renemat, ... all of them suffer from the same issues of openat)
Regarding this one, I would say that the only meaningful thing is probably kernel filtering on the path. But maybe Datadog already has something as well; it could be worth taking a look there. |
Performance Profiling Results: openatTest Setup:
Measurement Methodology: Results:
Raw data: bpf_stats.fullpath.txt Script: Drops due to increased event size: The expected effect on buffer drops is also visible, using the same workload we can observe drops when extracting the full path.
|
|
Since you have already set up an environment to test it, could we please compare the performance differences between the current approach and the one where we just send the full "trusted" |
@Andreagit97 What do you mean by kernel filtering on the path? Like pushing information from userspace about which paths are meaninfgul for rules evaluation and which are not? A proposal from @gnosek was to resolve the full path only when the |
@ekoops it is mechanical that if events increase in size on average, the event rate required to overflow the buffer decreases. I just observed that given this arbitrary workload I had no drops with the vanilla probe, while I had drops with the fullpath resolution, that sounds reasonable. Are you suggesting we should consider to introduce a breaking change and replace |
Update parse_open_openat_creat_exit() to prefer kernel-resolved fullpath parameter when available, falling back to dirfd + name concatenation when the kernel-resolved path is empty or <NA>. This prevents TOCTOU race conditions by using the path resolved by the kernel at syscall time, which includes symlink resolution and path normalization. Signed-off-by: irozzo-1A <iacopo@sysdig.com>
8e631b4 to
37ff94b
Compare
yep, something very similar to this https://www.datadoghq.com/blog/engineering/workload-protection-ebpf-fim/. It was also proposed in the past #1867. If we start to resolve the full path kernel side, a kernel filter could become pretty powerful and could filter out most of the events without sending them to userspace.
found https://github.com/Gui774ume/fsprobe
This seems a smart move, if pathname len is closer to full_path, there should not be much difference in terms of bytes sent, but yes, it really depends on where the dirfd points in the path.
It doesn't seem too bad |
Detect escaped path cases (e.g., nsfs nodes like /proc/self/ns/net) where dentry == d_parent but dentry != mnt_root_p on the first iteration. In these cases, path reconstruction fails, so store an empty parameter to allow userspace to fall back to the name parameter from syscall arguments. This fixes the issue where namespace filesystem entries were incorrectly resolved to '/' instead of the original path. Signed-off-by: irozzo-1A <iacopo@sysdig.com>
Detect nsfs entries (namespace filesystem) using superblock magic when path reconstruction fails (escaped path condition). For nsfs entries, the dentry name is "/" (len=1), which would result in an incorrect path. Store empty parameter to allow userspace fallback to the name parameter from syscall arguments. This fixes the issue where namespace filesystem entries like /proc/self/ns/net were incorrectly resolved to "/" instead of the original path. Signed-off-by: irozzo-1A <iacopo@sysdig.com>
Add zlib module include and dependency to fix compilation error where zlib.h was not found when building libsinsp_e2e_tests. Signed-off-by: irozzo-1A <iacopo@sysdig.com>
Remove linux-tools-generic from perf action dependencies as it causes dependency conflicts on systems with non-generic kernels (e.g., Oracle kernels). The linux-tools-`uname -r` package already provides the necessary kernel-specific tools, making linux-tools-generic redundant. Signed-off-by: irozzo-1A <iacopo@sysdig.com>




What type of PR is this?
/kind bug
Any specific area of the project related to this PR?
/area driver-modern-bpf
Does this PR require a change in the driver versions?
What this PR does / why we need it:
This patch resolves the directory file descriptor path in kernel space at syscall time, preventing race conditions where the dirfd may point to a different directory by the time user space processes the event.
The race condition occurs when:
By resolving the path in kernel space, we capture the actual directory at the moment of the syscall, eliminating the race condition.
Changes:
openat/openat2event versionsopenat/openat2Related to: falcosecurity/falco#3789
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: