Handle processes whose main thread has exited #376
Thanks for looking into this. This looks OK overall and should solve the issue from the user's perspective.
One downside I see is that while we do not unload the old mappings, we're also not loading new mappings, which may degrade profiling of such processes (I am still not sure whether there are legitimate applications with a dead main thread, or whether this is a highly infrequent corner case).
I personally would prefer if the processmanager "re-elected" a main thread by looking into the process threads, although I realize it may require more work and we may do this later.
Another thing to consider is to hook a kprobe on
It would be nice to have a unit test for this case regardless of the solution we choose.
I'm currently working on this, will push new commits (implementing part 2 of the proposed solution in #365) today.
I think we can switch to EDIT: EDIT2 Went back to |
093a15f to 38f6e51
38f6e51 to 62d82c6
sched_process_free is called when the task is freed by the kernel, which allows for simpler cleanup of processes whose main thread has exited.
Making TID available to processmanager allows the agent to keep profiling a process whose main thread calls pthread_exit while other threads continue to run.
This allows the agent to continue profiling a process whose main thread has exited while other threads continue to run. Mapping changes triggered by one of the remaining threads are also tracked.
57c0ebc to e11a0dc
e11a0dc to 87e351e
	} else if path != "" {
		// Ignore [vsyscall] and similar executable kernel
		// pages we don't care about
	} else {
No semantic change, I just inlined the logic from GetMappings here as this is the more appropriate place.
@@ -538,7 +538,7 @@ func (pm *ProcessManager) synchronizeMappings(pr process.Process,
 // fast enough and this particular pid is reused again by the system.
 func (pm *ProcessManager) processPIDExit(pid libpf.PID) {
 	exitKTime := times.GetKTime()
-	log.Debugf("- PID: %v", pid)
+	log.Warnf("- PID: %v", pid)
I'll remove these newly added warnings before merging. They should help with reviewing the PR, as you don't need to run the agent with debug logs enabled and sort through a lot of irrelevant noise.
@@ -626,22 +633,7 @@ func (pm *ProcessManager) SynchronizeProcess(pr process.Process) {
			// return ESRCH. Handle it as if the process did not exist.
			pm.mappingStats.errProcESRCH.Add(1)
		}
		return
	}
	if len(mappings) == 0 {
These comments are no longer relevant.
I added some more information and notes on how to review/test to the description. @korniltsev please take another look and review/test.
87e351e to 48698d5
Great job. Thank you for looking into this.
94a86eb to 5905d6a
tracer/tracer.go
Outdated
 	// It needs to be buffered to avoid locking the writers and stacking up resources when we
 	// read new PIDs at startup or notified via eBPF.
-	pidEvents chan libpf.PID
+	pidEvents chan uint64
Should a new type, maybe libpf.PidTid, be used here to make clear that these are not ordinary uint64 numbers?
Added libpf.PIDTID
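For readers unfamiliar with the idea, a dedicated type for a packed PID/TID value could look roughly like the sketch below. This is an illustration only, not the actual `libpf.PIDTID` definition: the local `PID` type, the constructor, and the accessor names are hypothetical, and the layout (PID in the upper 32 bits, TID in the lower 32 bits) is assumed to follow the return value of the `bpf_get_current_pid_tgid()` helper.

```go
package main

import "fmt"

// PID stands in for libpf.PID in this sketch (assumed to be 32 bits wide).
type PID uint32

// PIDTID packs a process ID and a thread ID into one 64-bit value.
// Assumed layout: PID in the upper 32 bits, TID in the lower 32 bits.
type PIDTID uint64

// NewPIDTID builds a packed value from its two halves (hypothetical helper).
func NewPIDTID(pid, tid PID) PIDTID {
	return PIDTID(uint64(pid)<<32 | uint64(tid))
}

// PID returns the process (thread group) ID.
func (pt PIDTID) PID() PID { return PID(pt >> 32) }

// TID returns the thread ID.
func (pt PIDTID) TID() PID { return PID(pt & 0xFFFFFFFF) }

// IsMainThread reports whether the value refers to the thread group leader.
func (pt PIDTID) IsMainThread() bool { return pt.PID() == pt.TID() }

func main() {
	// A buffered channel of typed events instead of bare uint64 values.
	events := make(chan PIDTID, 2)
	events <- NewPIDTID(1234, 1234)
	events <- NewPIDTID(1234, 1240)
	close(events)
	for ev := range events {
		fmt.Printf("pid=%d tid=%d main=%v\n", ev.PID(), ev.TID(), ev.IsMainThread())
	}
}
```

With a named type, a channel declared as `chan PIDTID` documents what travels over it, and the compiler rejects an accidental plain `uint64` variable.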
@@ -53,9 +58,10 @@ func init() {
 }

 // New returns an object with Process interface accessing it
-func New(pid libpf.PID) Process {
+func New(pid, tid libpf.PID) Process {
I didn't switch Process to accept libpf.PIDTID as the latter is only used with PID events, and I'd rather not couple it here too.
Summary
This PR implements both steps described in #365 (comment).
Thanks to @korniltsev for suggesting disassociate_ctty. I ended up using the sched_process_free tracepoint instead, as it makes fewer assumptions and is more stable (see this comment for more context). It also allows us to simplify the cleanup logic (no need for the extra periodic cleanups I had in the first prototype solution), as userspace will get a final PID notification when the process gets freed by the kernel.

Essentially, whenever the main thread exits, we do not unload process information, which allows profiling of the remaining threads to continue. Processmanager can also track mapping changes triggered by one of the remaining threads.
I added some debug warning statements to ease review; I will remove the commit that introduced them before merging. I also added a C program that you can compile and run as a testing workload while the profiling agent is running. It should exercise all the corner cases that this PR addresses. Looking at the warning logs I added and the generated flamegraph in devfiler should make the timeline of processmanager operations very clear.
It's probably easier to review this commit-by-commit.
TODO:
Add test program