
fix(modern_bpf): resolve dirfd path in kernel space to prevent race conditions#2817

Draft
irozzo-1A wants to merge 9 commits intofalcosecurity:masterfrom
irozzo-1A:fix/kernel-dirfd-path-resolution

Conversation

@irozzo-1A
Contributor

@irozzo-1A irozzo-1A commented Jan 29, 2026

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind test

/kind feature

/kind sync

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area API-version

/area build

/area CI

/area driver-kmod

/area driver-bpf

/area driver-modern-bpf

/area libscap-engine-bpf

/area libscap-engine-gvisor

/area libscap-engine-kmod

/area libscap-engine-modern-bpf

/area libscap-engine-nodriver

/area libscap-engine-noop

/area libscap-engine-source-plugin

/area libscap-engine-savefile

/area libscap

/area libpman

/area libsinsp

/area tests

/area proposals

Does this PR require a change in the driver versions?

/version driver-API-version-major

/version driver-API-version-minor

/version driver-API-version-patch

/version driver-SCHEMA-version-major

/version driver-SCHEMA-version-minor

/version driver-SCHEMA-version-patch

What this PR does / why we need it:

This patch resolves the directory file descriptor path in kernel space at syscall time, preventing race conditions where the dirfd may point to a different directory by the time user space processes the event.

The race condition occurs when:

  1. openat(dirfd=3, name="hosts") is called with dirfd pointing to /etc
  2. Event is captured in kernel space with dirfd=3
  3. The process execs, or its FD table changes, between capture and processing
  4. User space resolves dirfd=3 at processing time, now pointing to /dev
  5. Result: /dev/hosts instead of /etc/hosts

By resolving the path in kernel space, we capture the actual directory at the moment of the syscall, eliminating the race condition.
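The FD-reuse variant of this race can be reproduced from user space. The following is a minimal sketch, not code from this PR: it uses temporary directories named `etc` and `dev` as stand-ins for the real ones, and `os.dup2` to re-point the same FD number deterministically. The helper names are hypothetical.

```python
import os
import tempfile


def read_relative(dirfd: int, name: str) -> str:
    """Open `name` relative to `dirfd`, as openat(dirfd, name) would, and read it."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dirfd)
    try:
        return os.read(fd, 64).decode()
    finally:
        os.close(fd)


def demo_dirfd_reuse() -> tuple:
    """Return what 'hosts' resolves to before and after the FD number is reused."""
    with tempfile.TemporaryDirectory() as root:
        # Two stand-in directories, each with its own "hosts" file.
        for sub in ("etc", "dev"):
            os.mkdir(os.path.join(root, sub))
            with open(os.path.join(root, sub, "hosts"), "w") as f:
                f.write(sub)

        dirfd = os.open(os.path.join(root, "etc"), os.O_RDONLY)
        other = os.open(os.path.join(root, "dev"), os.O_RDONLY)
        try:
            # Resolution at syscall time: dirfd still refers to "etc".
            at_syscall_time = read_relative(dirfd, "hosts")
            # The process re-points the same FD number at another directory.
            os.dup2(other, dirfd)
            # Resolution at processing time: same number, different directory.
            at_processing_time = read_relative(dirfd, "hosts")
        finally:
            os.close(other)
            os.close(dirfd)
        return at_syscall_time, at_processing_time
```

Calling `demo_dirfd_reuse()` shows that the same `(dirfd, name)` pair resolves to two different files depending on when it is resolved, which is why a late user-space lookup cannot be trusted.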

Changes:

  • Add new parameter 'dirfdpath' (PT_FSPATH) to new openat/openat2 event versions
  • Resolve dirfd path in the modern BPF program
  • Update converter to append empty parameter to old openat/openat2
  • Add test coverage

Related to: falcosecurity/falco#3789

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@poiana
Contributor

poiana commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: irozzo-1A

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

Please double check driver/SCHEMA_VERSION file. See versioning.

/hold

@github-actions

github-actions bot commented Jan 29, 2026

Perf diff from master - unit tests

     9.00%     +1.04%  [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()
     5.08%     -0.56%  [.] thread_group_info::get_first_thread() const
     7.13%     -0.34%  [.] std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count(std::__weak_count<(__gnu_cxx::_Lock_policy)2> const&, std::nothrow_t)
    15.74%     +0.34%  [.] std::__shared_ptr<sinsp_threadinfo, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__weak_ptr<sinsp_threadinfo, (__gnu_cxx::_Lock_policy)2> const&, std::nothrow_t)
    13.67%     -0.33%  [.] std::__shared_count<(__gnu_cxx::_Lock_policy)2>::_M_get_use_count() const
     3.16%     -0.29%  [.] sinsp_thread_manager::create_thread_dependencies(std::shared_ptr<sinsp_threadinfo> const&)
     9.50%     +0.28%  [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_lock_nothrow()
     0.23%     +0.20%  [.] scap_event_encode_params_v
    11.86%     -0.17%  [.] sinsp_threadinfo::get_main_thread()
     9.18%     -0.13%  [.] sinsp_threadinfo::update_main_fdtable()

Heap diff from master - unit tests

peak heap memory consumption: -2.15K
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Heap diff from master - scap file

peak heap memory consumption: -160B
peak RSS (including heaptrack overhead): 0B
total memory leaked: 0B

Benchmarks diff from master

Comparing gbench_data.json to /root/actions-runner/_work/libs/libs/build/gbench_data.json
Benchmark                                                         Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------
BM_sinsp_split_mean                                            -0.0033         -0.0032           247           246           247           246
BM_sinsp_split_median                                          -0.0049         -0.0047           247           245           246           245
BM_sinsp_split_stddev                                          -0.4226         -0.4321             3             2             3             2
BM_sinsp_split_cv                                              -0.4207         -0.4303             0             0             0             0
BM_sinsp_concatenate_paths_relative_path_mean                  -0.0111         -0.0111            73            73            73            73
BM_sinsp_concatenate_paths_relative_path_median                -0.0240         -0.0241            74            72            73            72
BM_sinsp_concatenate_paths_relative_path_stddev                +2.3642         +2.3052             0             1             0             1
BM_sinsp_concatenate_paths_relative_path_cv                    +2.4019         +2.3423             0             0             0             0
BM_sinsp_concatenate_paths_empty_path_mean                     -0.0450         -0.0451            42            41            42            41
BM_sinsp_concatenate_paths_empty_path_median                   -0.0577         -0.0578            42            40            42            40
BM_sinsp_concatenate_paths_empty_path_stddev                   +5.7237         +5.7855             0             1             0             1
BM_sinsp_concatenate_paths_empty_path_cv                       +6.0404         +6.1061             0             0             0             0
BM_sinsp_concatenate_paths_absolute_path_mean                  +0.0023         +0.0024            73            74            73            74
BM_sinsp_concatenate_paths_absolute_path_median                +0.0015         +0.0017            73            73            73            73
BM_sinsp_concatenate_paths_absolute_path_stddev                -0.1890         -0.1958             1             1             1             1
BM_sinsp_concatenate_paths_absolute_path_cv                    -0.1908         -0.1976             0             0             0             0

@codecov

codecov bot commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.50%. Comparing base (0ea6d55) to head (c02601f).
⚠️ Report is 17 commits behind head on master.

Files with missing lines                          Patch %   Lines
userspace/libsinsp/test/events_fspath.ut.cpp      82.05%    7 Missing ⚠️
userspace/libsinsp/parsers.cpp                    85.71%    4 Missing ⚠️
userspace/libsinsp/sinsp_filtercheck_fspath.cpp   94.59%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2817      +/-   ##
==========================================
- Coverage   74.53%   74.50%   -0.03%     
==========================================
  Files         292      294       +2     
  Lines       29987    30623     +636     
  Branches     4660     4852     +192     
==========================================
+ Hits        22350    22815     +465     
- Misses       7637     7808     +171     
Flag Coverage Δ
libsinsp 74.50% <87.50%> (-0.03%) ⬇️


@irozzo-1A irozzo-1A force-pushed the fix/kernel-dirfd-path-resolution branch 3 times, most recently from 66c0c92 to 217460d Compare January 30, 2026 11:32
@poiana poiana added size/XXL and removed size/XL labels Jan 30, 2026
@irozzo-1A irozzo-1A force-pushed the fix/kernel-dirfd-path-resolution branch 2 times, most recently from cfee24e to f0bcd94 Compare January 30, 2026 13:31
…onditions

This patch resolves the directory file descriptor path in kernel space
at syscall time, preventing race conditions where the dirfd may point
to a different directory by the time user space processes the event.

The race condition occurs when:
1. openat(dirfd=3, name="hosts") is called with dirfd pointing to /etc
2. Event is captured in kernel space with dirfd=3
3. Process closes dirfd=3 and opens a new file, reusing FD number 3
   for a different directory (e.g., /dev)
4. User space resolves dirfd=3 at processing time via /proc/<pid>/fd/3,
   which now points to /dev instead of /etc
5. Result: /dev/hosts instead of /etc/hosts

By resolving the path in kernel space, we capture the actual directory
at the moment of the syscall, eliminating the race condition.

Changes:
- Append new parameter 'dirfdpath' (PT_FSPATH) to existing event versions
  (PPME_SYSCALL_OPENAT_2_X and PPME_SYSCALL_OPENAT2_X) for backward
  compatibility with old scap files
- Resolve dirfd path in BPF programs using extract__file_struct_from_fd()
  and auxmap__store_d_path_approx()
- For AT_FDCWD, capture CWD from task_struct->fs->pwd in kernel space
- Update user space to prefer kernel-resolved path over user-space resolution
- Add comprehensive test coverage for both AT_FDCWD and real dirfd cases

Related to: falcosecurity/falco#3789

Signed-off-by: irozzo-1A <iacopo@sysdig.com>
@irozzo-1A irozzo-1A force-pushed the fix/kernel-dirfd-path-resolution branch from f0bcd94 to d8eb01b Compare January 30, 2026 14:58
@ekoops ekoops added this to the 0.24.0 milestone Feb 2, 2026
@ekoops
Contributor

ekoops commented Feb 2, 2026

I was adding a comment related to why we don't take into account pathname while building the new parameter, but then I realized that here you want just to send the dirfd path. If we could find a way of sending the full resolved path, we could drop both TOCTOU mitigation tracepoint programs on sys_enter_openat and sys_enter_openat2...

@ekoops
Contributor

ekoops commented Feb 2, 2026

Me and @irozzo-1A had a discussion and agreed on trying to see if we could leverage the returned file descriptor to obtain the full resolved path.

@Andreagit97
Member

Me and @irozzo-1A had a discussion and agreed on trying to see if we could leverage the returned file descriptor to obtain the full resolved path.

If I recall correctly, we do exactly the same in open_by_handle_at syscall

/* We collect the file path from the file descriptor only if it is valid */

One thing that we should probably take into account is the possible performance overhead and ring buffer availability. openat is one of the noisiest syscalls:

  • Resolving the path for each call could slow down the system
  • Pushing too many bytes into the buffer could cause drops if the buffer becomes full

@irozzo-1A
Contributor Author

irozzo-1A commented Feb 2, 2026

One thing that we should probably take into account is the possible performance overhead and ring buffer availability. openat is one of the noisiest syscalls:

  • Resolving the path for each call could slow down the system
  • Pushing too many bytes into the buffer could cause drops if the buffer becomes full

@Andreagit97 That's a good point. Performance is definitely the most sensitive aspect here; on the other hand, I don't see alternatives if we want to address this race condition. I'm open to suggestions of course 😄

…e conditions

This commit adds a new 'fullpath' parameter to PPME_SYSCALL_OPENAT_2_X and
PPME_SYSCALL_OPENAT2_X events that captures the kernel-resolved full path
of the opened file directly from the returned file descriptor at syscall time.

Changes:
- Event schema: Added 'fullpath' parameter (PT_FSPATH) to openat/openat2 exit events
- Modern BPF: Extract full path from returned FD using extract__file_struct_from_fd()
  and auxmap__store_d_path_approx(), capturing the actual resolved path including
  inode numbers for O_TMPFILE files (e.g., /tmp/#123456)
- Legacy BPF/Kernel module: Push empty parameter (fallback to userspace resolution)
- Userspace: Updated parsers to use kernel-resolved fullpath when available
- Tests: Updated to construct expected paths from CWD + inode for O_TMPFILE,
  and CWD + filename for regular files
- Converter: Added conversion rules for older event formats

This prevents TOCTOU race conditions where the file descriptor table or process
state might change between syscall capture and userspace processing, leading to
incorrect path resolution (e.g., /dev/hosts instead of /etc/hosts).

For O_TMPFILE files, the kernel captures the full path including the inode
number (e.g., /tmp/#123456), which is more accurate than reconstructing from
the directory path alone.

Signed-off-by: irozzo-1A <iacopo@sysdig.com>
- Add explanatory comment about O_TMPFILE path format (inode as filename with '#' prefix)
- Remove unused expected_fullpath variable in failure test cases

Signed-off-by: irozzo-1A <iacopo@sysdig.com>
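As an illustration of the O_TMPFILE format described above (directory path, then `#` followed by the inode number), here is a minimal sketch of how a test could build the expected string. The helper names are hypothetical, and a regular temporary file stands in for a real O_TMPFILE descriptor, since only the inode number matters for the formatting.

```python
import os
import tempfile


def expected_tmpfile_path(dir_path: str, fd: int) -> str:
    """Build '<dir>/#<inode>', the format the commit message gives for O_TMPFILE."""
    inode = os.fstat(fd).st_ino
    return f"{dir_path.rstrip('/')}/#{inode}"


def demo() -> str:
    # Assumption: any open descriptor works for demonstrating the formatting;
    # the real tests would use the FD returned by openat(..., O_TMPFILE).
    with tempfile.TemporaryFile() as f:
        return expected_tmpfile_path("/tmp", f.fileno())
```

The result has the shape `/tmp/#<digits>`; the actual inode number differs on every run.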
@irozzo-1A
Contributor Author

Me and @irozzo-1A had a discussion and agreed on trying to see if we could leverage the returned file descriptor to obtain the full resolved path.

Thanks @ekoops for proposing the idea; it looks very promising and makes the resolution code even simpler. I'll try to assess the performance impact of the change as suggested by @Andreagit97

Update e2e test regex patterns to match the renamed parameter from
dirfdpath to fullpath in openat/openat2 events.

- test_file_writes.py: Update create_expected_arg() and create_expected_arg_for_dev()
- test_read_sensitive_file.py: Update expected event regex pattern

Signed-off-by: irozzo-1A <iacopo@sysdig.com>
@Andreagit97
Member

@Andreagit97 That's a good point. Performance is definitely the most sensitive aspect here; on the other hand, I don't see alternatives if we want to address this race condition. I'm open to suggestions of course 😄

If I recall correctly, DataDog implements an LRU cache in eBPF to store partial paths so that it is not necessary to recompute the whole path of the file. There was a GitHub repo with the benchmarks, but I cannot find it anymore :/ Btw, this implementation is pretty complex; it would make sense to use it if we decide to reconstruct the path kernel side for all the syscalls (mkdirat, renameat, ...), since all of them suffer from the same issues as openat.
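To make the caching idea concrete, here is a minimal user-space sketch of an LRU cache for already-resolved directory paths. It is not DataDog's implementation and not eBPF code; the class name, the integer key, and the `slow_resolve` callback are assumptions made only for illustration.

```python
from collections import OrderedDict


class PathCache:
    """LRU cache mapping a directory key to its already-resolved path."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0

    def resolve(self, key: int, slow_resolve) -> str:
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: pay for the full resolution once, then remember it.
        self.misses += 1
        path = slow_resolve(key)
        self._entries[key] = path
        if len(self._entries) > self.capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return path
```

With such a cache, repeated opens under the same directory only pay for the expensive walk on the first lookup, at the cost of having to keep cached entries valid across renames and unlinks.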

Pushing too many bytes into the buffer could cause drops if the buffer becomes full

Regarding this one, I would say that the only meaningful thing is probably kernel filtering on the path. But maybe Datadog already has something as well; it could be worth taking a look there.

@irozzo-1A
Contributor Author

Performance Profiling Results: openat

Test Setup:

  • Workload: stress-ng --open 12 --fork 2 --max-fd 400 --timeout 5m
  • Environment: Ubuntu 24.04.3, Kernel 6.14.0-1018-aws, 8 CPUs
  • Falco versions:
    • 0.43.0
    • Local build (Libs: e6d211d Falco: 429c14a9065b010883e7fe3c7414b5be20e0ad8e)

Measurement Methodology:

sudo perf record -a -g -F 99 -o "$OUTPUT_DATA" -- sleep 30
sudo perf report -i "$OUTPUT_DATA" --no-children --stdio --symbol-filter=bpf_prog

Results:

Mode       openat_x Handler   openat_e (Enter)   sys_exit Dispatcher   Total eBPF
Standard   0.33%              0.11%              0.37%                 ~1.5%
Fullpath   0.74%              0.17%              0.31%                 ~1.9%
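For reading the table, the per-program differences can be worked out as follows. This is just arithmetic on the figures reported above; the dictionary keys are shorthand introduced here, and the approximate "Total eBPF" column is left out because its values are rounded.

```python
# CPU share (percent of samples) per eBPF program, copied from the table above.
standard = {"openat_x": 0.33, "openat_e": 0.11, "sys_exit": 0.37}
fullpath = {"openat_x": 0.74, "openat_e": 0.17, "sys_exit": 0.31}


def deltas(before: dict, after: dict) -> dict:
    """Difference in percentage points for each program."""
    return {name: round(after[name] - before[name], 2) for name in before}


def ratio(before: dict, after: dict, name: str) -> float:
    """How many times more CPU share a program takes after the change."""
    return round(after[name] / before[name], 2)
```

For example, `deltas(standard, fullpath)["openat_x"]` gives the increase of the exit handler in percentage points, and `ratio(standard, fullpath, "openat_x")` gives the same change as a multiplier.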

Raw data:

bpf_stats.fullpath.txt
bpf_stats.0.43.0.txt
falco_perf_data.zip

Script:
profile-ebpf.sh

Drops due to increased event size:

The expected effect on buffer drops is also visible: using the same workload, we can observe drops when extracting the full path.

[Four screenshots, taken 2026-02-03 at 23:19:03, 23:20:39, 23:21:52 and 23:22:18]

@ekoops
Contributor

ekoops commented Feb 4, 2026

Since you have already set up an environment to test it, could we please compare the performance differences between the current approach and the one where we just send the full "trusted" full_path in place of the pathname parameter? Just to confirm that drops are given by "too many bytes" on the ring buffer.

@irozzo-1A
Contributor Author

Regarding this one, I would say that the only meaningful thing is probably kernel filtering on the path. But maybe Datadog already has something as well; it could be worth taking a look there.

@Andreagit97 What do you mean by kernel filtering on the path? Like pushing information from userspace about which paths are meaningful for rules evaluation and which are not?

A proposal from @gnosek was to resolve the full path only when the dirfd is not relative to the cwd. To be honest, I don't know how effective that would be, and it probably would not fully solve the TOCTOU problem.

@irozzo-1A
Contributor Author

Since you have already set up an environment to test it, could we please compare the performance differences between the current approach and the one where we just send the full "trusted" full_path in place of the pathname parameter? Just to confirm that drops are given by "too many bytes" on the ring buffer.

@ekoops it follows mechanically that if events get larger on average, the event rate required to overflow the buffer decreases. I just observed that, with this arbitrary workload, I had no drops with the vanilla probe but did have drops with the fullpath resolution, which sounds reasonable.
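A back-of-the-envelope version of that argument, as a sketch: if the consumer drains the ring buffer at a fixed byte rate, the highest event rate it can sustain without drops is that byte rate divided by the average event size. The numbers below are hypothetical and were not measured in this PR.

```python
def max_sustainable_event_rate(drain_bytes_per_sec: float, avg_event_bytes: float) -> float:
    """Events per second the consumer can keep up with at a given event size."""
    return drain_bytes_per_sec / avg_event_bytes


# Hypothetical figures: a consumer draining 64 MiB/s, and an average event
# that doubles in size once the full path is appended.
DRAIN = 64 * 1024 * 1024
rate_small = max_sustainable_event_rate(DRAIN, 128)
rate_large = max_sustainable_event_rate(DRAIN, 256)
```

Doubling the average event size halves the sustainable rate, so a workload that sat just below the limit with the vanilla probe can start dropping events once the path is added.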

Are you suggesting we should consider introducing a breaking change and replacing pathname and dirfd with full_path when we're able to resolve it? I agree this would definitely mitigate the size issue, at the cost of breaking the event schema's backward compatibility. We should probably consider this option as well.

Update parse_open_openat_creat_exit() to prefer kernel-resolved fullpath
parameter when available, falling back to dirfd + name concatenation
when the kernel-resolved path is empty or <NA>.

This prevents TOCTOU race conditions by using the path resolved by the
kernel at syscall time, which includes symlink resolution and path
normalization.

Signed-off-by: irozzo-1A <iacopo@sysdig.com>
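The preference order described in this commit can be sketched as follows. This is an illustrative model in Python, not the actual `parse_open_openat_creat_exit()` code; the function name and parameters are hypothetical, and the handling of absolute names follows openat(2) semantics.

```python
import posixpath

NA = "<NA>"  # placeholder for an unavailable value, as named in the commit message


def pick_open_path(kernel_fullpath: str, dirfd_path: str, name: str) -> str:
    """Prefer the kernel-resolved path; otherwise rebuild it from dirfd + name."""
    if kernel_fullpath and kernel_fullpath != NA:
        # Resolved by the kernel at syscall time: trust it as-is.
        return kernel_fullpath
    if posixpath.isabs(name):
        # An absolute name ignores dirfd, as openat(2) does.
        return posixpath.normpath(name)
    # Fallback: user-space concatenation, which is subject to the race.
    return posixpath.normpath(posixpath.join(dirfd_path, name))
```

For instance, `pick_open_path("/etc/hosts", "/dev", "hosts")` keeps the kernel value even though the stale dirfd path says `/dev`, while an empty or `<NA>` kernel value falls back to joining `dirfd_path` and `name`.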
@irozzo-1A irozzo-1A force-pushed the fix/kernel-dirfd-path-resolution branch from 8e631b4 to 37ff94b Compare February 4, 2026 21:07
@Andreagit97
Member

@Andreagit97 What do you mean by kernel filtering on the path? Like pushing information from userspace about which paths are meaningful for rules evaluation and which are not?

yep, something very similar to this https://www.datadoghq.com/blog/engineering/workload-protection-ebpf-fim/. It was also proposed in the past #1867. If we start to resolve the full path kernel side, a kernel filter could become pretty powerful and could filter out most of the events without sending them to userspace.
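As a rough sketch of what such a filter decides, assuming userspace pushes a list of path prefixes that the loaded rules care about: the function below is a hypothetical Python model, not the Datadog design and not code from #1867. An in-kernel version would need a map-based lookup instead of a loop over strings.

```python
def make_prefix_filter(prefixes):
    """Return a predicate telling whether a resolved path is worth sending."""
    # Normalize "/root/.ssh/" to "/root/.ssh"; keep "/" meaning "everything".
    normalized = tuple(p.rstrip("/") or "/" for p in prefixes)

    def is_interesting(path: str) -> bool:
        for prefix in normalized:
            # Match the prefix itself or anything below it, but not
            # sibling names that merely share leading characters.
            if prefix == "/" or path == prefix or path.startswith(prefix + "/"):
                return True
        return False

    return is_interesting
```

With `keep = make_prefix_filter(["/etc", "/root/.ssh/"])`, an event for `/etc/hosts` would be sent while one for `/tmp/x` or `/etcetera/x` would be discarded before reaching the ring buffer.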

If I recall correctly, DataDog implements an LRU cache in eBPF to store partial paths so that it is not necessary to recompute the whole path of the file. There was a GitHub repo with the benchmarks, but I cannot find it anymore

found https://github.com/Gui774ume/fsprobe

Are you suggesting we should consider introducing a breaking change and replacing pathname and dirfd with full_path when we're able to resolve it?

This seems like a smart move. If the pathname length is close to that of full_path, there should not be much difference in terms of bytes sent, but yes, it really depends on where the dirfd points in the path.

Mode       openat_x Handler   openat_e (Enter)   sys_exit Dispatcher   Total eBPF
Standard   0.33%              0.11%              0.37%                 ~1.5%
Fullpath   0.74%              0.17%              0.31%                 ~1.9%

It doesn't seem too bad

Detect escaped path cases (e.g., nsfs nodes like /proc/self/ns/net) where
dentry == d_parent but dentry != mnt_root_p on the first iteration. In
these cases, path reconstruction fails, so store an empty parameter to
allow userspace to fall back to the name parameter from syscall arguments.

This fixes the issue where namespace filesystem entries were incorrectly
resolved to '/' instead of the original path.

Signed-off-by: irozzo-1A <iacopo@sysdig.com>
Detect nsfs entries (namespace filesystem) using superblock magic when
path reconstruction fails (escaped path condition). For nsfs entries,
the dentry name is "/" (len=1), which would result in an incorrect
path. Store empty parameter to allow userspace fallback to the name
parameter from syscall arguments.

This fixes the issue where namespace filesystem entries like
/proc/self/ns/net were incorrectly resolved to "/" instead of the
original path.

Signed-off-by: irozzo-1A <iacopo@sysdig.com>
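The escaped-path check from the two commits above can be modeled in user space as below. This is a simplified sketch, not the BPF code: the `Dentry` class and `approx_path` function are hypothetical, mount crossing is ignored, and the nsfs superblock-magic check is not modeled. It only shows why a dentry that is its own parent, but is not the mount root, yields an empty result instead of "/".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Dentry:
    name: str
    parent: Optional["Dentry"] = None  # None models "d_parent points to itself"


def approx_path(dentry: Dentry, mnt_root: Dentry, max_depth: int = 32) -> str:
    """Walk towards the mount root, collecting names; "" means 'escaped path'."""
    parts = []
    for depth in range(max_depth):
        if dentry is mnt_root:
            break
        parent = dentry.parent or dentry
        if parent is dentry:
            # Self-parented but not the mount root. On the first iteration
            # nothing can be reconstructed, so report an empty path and let
            # the caller fall back to the name taken from the syscall arguments.
            if depth == 0:
                return ""
            break
        parts.append(dentry.name)
        dentry = parent
    return "/" + "/".join(reversed(parts))
```

A regular chain such as `hosts -> etc -> root` reconstructs to `/etc/hosts`, while a standalone node (as an nsfs entry would be in this model) returns the empty string rather than a misleading `/`.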
Add zlib module include and dependency to fix compilation error where
zlib.h was not found when building libsinsp_e2e_tests.

Signed-off-by: irozzo-1A <iacopo@sysdig.com>
Remove linux-tools-generic from perf action dependencies as it causes
dependency conflicts on systems with non-generic kernels (e.g., Oracle
kernels). The linux-tools-`uname -r` package already provides the
necessary kernel-specific tools, making linux-tools-generic redundant.

Signed-off-by: irozzo-1A <iacopo@sysdig.com>


4 participants