This repository was archived by the owner on Apr 17, 2025. It is now read-only.

defunct processes, the possible explanation #74

Open
@sterchelen

introduction

The following explanation focuses on csi-s3 with goofys as the backend. All components are at their latest version.
The issue I stumbled upon is that goofys zombie processes accumulate.
The exact number of zombies doesn't matter for understanding the problem.

explanation

I looked at the csi-s3 code, specifically at the FuseUnmount function and then at waitForProcess:

func waitForProcess(p *os.Process, backoff int) error {
    if backoff == 20 {
        return fmt.Errorf("Timeout waiting for PID %v to end", p.Pid)
    }
    cmdLine, err := getCmdLine(p.Pid)
    if err != nil {
        glog.Warningf("Error checking cmdline of PID %v, assuming it is dead: %s", p.Pid, err)
        return nil
    }
    if cmdLine == "" {
        // ignore defunct processes
        // TODO: debug why this happens in the first place
        // seems to only happen on k8s, not on local docker
        glog.Warning("Fuse process seems dead, returning")
        return nil
    }
    if err := p.Signal(syscall.Signal(0)); err != nil {
        glog.Warningf("Fuse process does not seem active or we are unprivileged: %s", err)
        return nil
    }
    glog.Infof("Fuse process with PID %v still active, waiting...", p.Pid)
    time.Sleep(time.Duration(backoff*100) * time.Millisecond)
    return waitForProcess(p, backoff+1)
}

Given the name of the function, I expected to see a wait4 syscall that reaps the child process (goofys, in our case).
If we look at the outputs below:

  • we have a goofys Zombie process with pid=32767
$ ps aux | grep goofys
root     32767  0.0  0.0      0     0 ?        Zs   Jun14   0:00 [goofys] <defunct>
  • its parent process is s3driver
$ pstree -s 32767
systemd───containerd-shim───s3driver───goofys

Since s3driver launches the goofys backend (I guess the same holds for the other backends 🤷🏼‍♂️), s3driver is its parent process. So, as a good parent 😃, it should wait4 its child to collect its exit status.

In other words, terminated children are never reaped, so they leak as zombies. The fix should be trivial: in waitForProcess, when cmdLine is empty, we have to call syscall.Wait4 on the given pid, here:

    if cmdLine == "" {
        // ignore defunct processes
        // TODO: debug why this happens in the first place
        // seems to only happen on k8s, not on local docker
        glog.Warning("Fuse process seems dead, returning")
        return nil
    }

wdyt @ctrox?
