pidns: Fix wait for zombies to be reaped in zap_pid_ns_processes v2

Fix zap_pid_ns_processes so that it successfully waits for all of
the tasks in the pid namespace to be reaped, even if called for a
non-leader task of the init process.  This guarantees that no task
can escape the pid namespace, and that it is safe for proc_flush_task
to put the proc_mnt of the dead pid_namespace when pid 1 of the
pid namespace calls proc_flush_task.

Before zap_pid_ns_processes was fixed to wait for all zombies
in the pid namespace to be reaped, the easiest way to generate
a zombie that would escape the pid namespace was to attach
a debugger to a process inside a pid namespace from outside
the pid namespace and then exit the pid namespace.

In the process of trying to fix this bug I have looked at a lot
of different options and a lot of different directions we can
go.  There are several limiting factors.

- We need to guarantee that the init process of a pid namespace
  is not reaped before every other task in the pid namespace is
  reaped.  A successful wait on the init process of a pid namespace
  then guarantees that all processes in the pid namespace
  are dead and gone.  Or more succinctly, it is not possible to
  escape from a pid namespace.

  The previous behaviour where some zombies could escape the pid
  namespace violates the assumption made by some reapers of a pid
  namespace init that all of the pid namespace cleanup has completed
  by the time that init is reaped.

- proc_flush_task needs to be called after each task is reaped.
  Tasks are volatile and applications like top and ps frequently
  touch every thread group directory in /proc which triggers dcache
  entries to be created.  If we don't remove those dcache entries
  when tasks are reaped we can get a large build up of useless
  inodes and dentries.  shrink_slab is designed to flush out useful
  cache entries, not useless ones, so while in the big picture it
  doesn't hurt if we leak a few dentries, leaking a lot of them puts
  unnatural pressure on the kernel's memory management.

  I sat down and attempted to measure the cost of calling
  proc_flush_task with lat_tcp (from lmbench) and I get the same
  range of latency readings whether or not proc_flush_task is
  called, which tells me the runtime cost of the existing
  proc_flush_task is in the noise.

  By running find /proc/ > /dev/null with proc_flush_task
  disabled and then examining the counts in the slab caches
  I managed to see us growing about 84 proc_inodes per
  iteration, which is pretty horrific.  With proc_flush_task
  enabled I don't see steady growth in any of the slab caches.

- Mounts of /proc need a reference to the pid namespace
  that doesn't go away until /proc is unmounted.  Without
  holding a reference to the pid namespace that lasts until
  /proc is unmounted it just isn't safe to look up and display
  pids in a particular pid_namespace.

- The pid_namespace needs to be able to access proc_mnt until
  at least the last call of proc_flush_task.

  Currently there is a circular reference between proc_mnt
  and the pid_namespace that we break very carefully through
  an interaction of zap_pid_ns_processes, and proc_flush_task.
  That clever interaction going wrong is what caused oopses
  that led us to realize we have a problem.

  Even if we fix the pid_namespace and the proc_mnt to have
  conjoined lifetimes and the oopses are fixed we still have
  the problem that zombie processes can escape the pid namespace.
  Which appears to be a problem for people using pid_namespaces
  as inescapable process containers.

- fork/exec/waitpid is a common kernel path and as such we need
  to keep the runtime costs down.  Which means as much as possible
  we want to keep from adding code (especially expensive code)
  into the fork/exec/waitpid code path.

  Changing zap_pid_ns_processes to fix the problem instead of
  changing the code elsewhere is one of the few solutions I have
  seen that does not increase the cost of the lat_proc test from
  lmbench.
v2: Trivial fixes
   Add a found variable so we only look at the task inside rcu_read_lock.

Acked-by: Louis Rilling <>
Acked-by: Serge E. Hallyn <>
Acked-by: Sukadev Bhattiprolu <>
Reported-by: Louis Rilling <>
Signed-off-by: Eric W. Biederman <>
1 file changed