| .\" Copyright (c) 2013 by Michael Kerrisk <mtk.manpages@gmail.com> |
| .\" and Copyright (c) 2012 by Eric W. Biederman <ebiederm@xmission.com> |
| .\" |
| .\" %%%LICENSE_START(VERBATIM) |
| .\" Permission is granted to make and distribute verbatim copies of this |
| .\" manual provided the copyright notice and this permission notice are |
| .\" preserved on all copies. |
| .\" |
| .\" Permission is granted to copy and distribute modified versions of this |
| .\" manual under the conditions for verbatim copying, provided that the |
| .\" entire resulting derived work is distributed under the terms of a |
| .\" permission notice identical to this one. |
| .\" |
| .\" Since the Linux kernel and libraries are constantly changing, this |
| .\" manual page may be incorrect or out-of-date. The author(s) assume no |
| .\" responsibility for errors or omissions, or for damages resulting from |
| .\" the use of the information contained herein. The author(s) may not |
| .\" have taken the same level of care in the production of this manual, |
| .\" which is licensed free of charge, as they might when working |
| .\" professionally. |
| .\" |
| .\" Formatted or processed versions of this manual, if unaccompanied by |
| .\" the source, must acknowledge the copyright and authors of this work. |
| .\" %%%LICENSE_END |
| .\" |
| .\" |
| .TH PID_NAMESPACES 7 2020-11-01 "Linux" "Linux Programmer's Manual" |
| .SH NAME |
| pid_namespaces \- overview of Linux PID namespaces |
| .SH DESCRIPTION |
| For an overview of namespaces, see |
| .BR namespaces (7). |
| .PP |
| PID namespaces isolate the process ID number space, |
| meaning that processes in different PID namespaces can have the same PID. |
| PID namespaces allow containers to provide functionality |
| such as suspending/resuming the set of processes in the container and |
| migrating the container to a new host |
| while the processes inside the container maintain the same PIDs. |
| .PP |
| PIDs in a new PID namespace start at 1, |
| somewhat like a standalone system, and calls to |
| .BR fork (2), |
| .BR vfork (2), |
| or |
| .BR clone (2) |
| will produce processes with PIDs that are unique within the namespace. |
| .PP |
| Use of PID namespaces requires a kernel that is configured with the |
| .B CONFIG_PID_NS |
| option. |
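| .PP |
| As a minimal sketch of the numbering described above |
| (an illustration under stated assumptions, not code from the kernel or from any library), |
| the following C fragment checks that the first child created after |
| .BR unshare (2) |
| with |
| .B CLONE_NEWPID |
| sees itself as PID 1. |
| The function name |
| .I first_child_has_pid_1 |
| is hypothetical. |
| The caller needs |
| .BR CAP_SYS_ADMIN ; |
| without it the function returns \-1. |

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the first process created in a new
   PID namespace sees itself as PID 1, 0 if it does not, and -1 if the
   namespace could not be created (for example, missing CAP_SYS_ADMIN
   or a kernel built without CONFIG_PID_NS). */
int first_child_has_pid_1(void)
{
    int status;
    pid_t helper = fork();      /* keep the caller's own state untouched */

    if (helper == -1)
        return -1;
    if (helper == 0) {
        pid_t child;

        if (unshare(CLONE_NEWPID) == -1)
            _exit(2);
        child = fork();         /* first process in the new namespace */
        if (child == -1)
            _exit(2);
        if (child == 0)
            _exit(getpid() == 1 ? 0 : 1);
        if (waitpid(child, &status, 0) == -1 || !WIFEXITED(status))
            _exit(2);
        _exit(WEXITSTATUS(status));
    }
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

| The extra helper process is a design choice of this sketch: |
| it confines the effect of |
| .BR unshare (2) |
| to a short-lived process instead of the caller. |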
| .\" |
| .\" ============================================================ |
| .\" |
| .SS The namespace "init" process |
| The first process created in a new namespace |
| (i.e., the process created using |
| .BR clone (2) |
| with the |
| .BR CLONE_NEWPID |
| flag, or the first child created by a process after a call to |
| .BR unshare (2) |
| using the |
| .BR CLONE_NEWPID |
| flag) has the PID 1, and is the "init" process for the namespace (see |
| .BR init (1)). |
| This process becomes the parent of any child processes that are orphaned |
| because a process that resides in this PID namespace terminated |
| (see below for further details). |
| .PP |
| If the "init" process of a PID namespace terminates, |
| the kernel terminates all of the processes in the namespace via a |
| .BR SIGKILL |
| signal. |
| This behavior reflects the fact that the "init" process |
| is essential for the correct operation of a PID namespace. |
| In this case, a subsequent |
| .BR fork (2) |
| into this PID namespace fails with the error |
| .BR ENOMEM ; |
| it is not possible to create a new process in a PID namespace whose "init" |
| process has terminated. |
| Such scenarios can occur when, for example, |
| a process holds an open file descriptor for the |
| .I /proc/[pid]/ns/pid |
| file of a process that was in the namespace, and uses that descriptor to |
| .BR setns (2) |
| into the namespace after the "init" process has terminated. |
| Another possible scenario can occur after a call to |
| .BR unshare (2): |
| if the first child subsequently created by a |
| .BR fork (2) |
| terminates, then subsequent calls to |
| .BR fork (2) |
| fail with |
| .BR ENOMEM . |
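| .PP |
| The |
| .BR unshare (2) |
| scenario can be reproduced with a sketch along the following lines. |
| The function name |
| .I fork_errno_after_init_exit |
| is hypothetical, and the fragment assumes a caller with |
| .BR CAP_SYS_ADMIN ; |
| it reports \-1 if the namespace cannot be set up. |

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno value of a fork(2) attempted
   after the "init" process of a freshly unshared PID namespace has
   terminated (expected: ENOMEM), 0 if that fork unexpectedly
   succeeded, or -1 if the namespace could not be set up. */
int fork_errno_after_init_exit(void)
{
    int status, code;
    pid_t helper = fork();

    if (helper == -1)
        return -1;
    if (helper == 0) {
        pid_t init, late;

        if (unshare(CLONE_NEWPID) == -1)
            _exit(255);
        init = fork();              /* becomes PID 1 in the namespace */
        if (init == -1)
            _exit(255);
        if (init == 0)
            _exit(0);               /* "init" terminates immediately */
        if (waitpid(init, NULL, 0) == -1)
            _exit(255);

        late = fork();              /* the namespace has lost its "init" */
        if (late == -1)
            _exit(errno);           /* relay the error to the caller */
        if (late == 0)
            _exit(0);
        waitpid(late, NULL, 0);
        _exit(0);
    }
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```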
| .PP |
| Only signals for which the "init" process has established a signal handler |
| can be sent to the "init" process by other members of the PID namespace. |
| This restriction applies even to privileged processes, |
| and prevents other members of the PID namespace from |
| accidentally killing the "init" process. |
| .PP |
| Likewise, a process in an ancestor namespace |
| can\(emsubject to the usual permission checks described in |
| .BR kill (2)\(emsend |
| signals to the "init" process of a child PID namespace only |
| if the "init" process has established a handler for that signal. |
| (Within the handler, the |
| .I siginfo_t |
| .I si_pid |
| field described in |
| .BR sigaction (2) |
| will be zero.) |
| .B SIGKILL |
| and |
| .B SIGSTOP |
| are treated exceptionally: |
| these signals are forcibly delivered when sent from an ancestor PID namespace. |
| Neither of these signals can be caught by the "init" process, |
| and so will result in the usual actions associated with those signals |
| (respectively, terminating and stopping the process). |
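| .PP |
| The protection against signals sent from inside the namespace |
| can be observed with a sketch like the following, |
| in which a child sends |
| .B SIGTERM |
| to an "init" process that has no handler for it. |
| The names |
| .I run_as_init |
| and |
| .I init_survives_sigterm |
| are hypothetical, and the fragment assumes a caller with |
| .BR CAP_SYS_ADMIN . |

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs as PID 1 of the new namespace. SIGTERM keeps its default
   disposition, so the signal sent by the child is expected to be
   dropped instead of terminating this process. */
static void run_as_init(void)
{
    pid_t sender;

    signal(SIGTERM, SIG_DFL);
    sender = fork();
    if (sender == -1)
        _exit(2);
    if (sender == 0) {
        kill(1, SIGTERM);       /* PID 1 here is the namespace "init" */
        _exit(0);
    }
    if (waitpid(sender, NULL, 0) == -1)
        _exit(2);
    _exit(0);                   /* reached only if SIGTERM was not fatal */
}

/* Hypothetical helper: returns 1 if "init" survived the signal,
   0 if it was killed, and -1 if the namespace could not be set up. */
int init_survives_sigterm(void)
{
    int status;
    pid_t helper = fork();

    if (helper == -1)
        return -1;
    if (helper == 0) {
        pid_t init;

        if (unshare(CLONE_NEWPID) == -1)
            _exit(2);
        init = fork();
        if (init == -1)
            _exit(2);
        if (init == 0)
            run_as_init();
        if (waitpid(init, &status, 0) == -1)
            _exit(2);
        if (WIFSIGNALED(status))
            _exit(1);           /* "init" was killed by the signal */
        _exit(WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : 2);
    }
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```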
| .PP |
| Starting with Linux 3.4, the |
| .BR reboot (2) |
| system call causes a signal to be sent to the namespace "init" process. |
| See |
| .BR reboot (2) |
| for more details. |
| .\" |
| .\" ============================================================ |
| .\" |
| .SS Nesting PID namespaces |
| PID namespaces can be nested: |
| each PID namespace has a parent, |
| except for the initial ("root") PID namespace. |
| The parent of a PID namespace is the PID namespace of the process that |
| created the namespace using |
| .BR clone (2) |
| or |
| .BR unshare (2). |
| PID namespaces thus form a tree, |
| with all namespaces ultimately tracing their ancestry to the root namespace. |
| Since Linux 3.7, |
| .\" commit f2302505775fd13ba93f034206f1e2a587017929 |
| .\" The kernel constant MAX_PID_NS_LEVEL |
| the kernel limits the maximum nesting depth for PID namespaces to 32. |
| .PP |
| A process is visible to other processes in its PID namespace, |
| and to the processes in each direct ancestor PID namespace |
| going back to the root PID namespace. |
| In this context, "visible" means that one process |
| can be the target of operations by another process using |
| system calls that specify a process ID. |
| Conversely, the processes in a child PID namespace can't see |
| processes in the parent and further removed ancestor namespaces. |
| More succinctly: a process can see (e.g., send signals with |
| .BR kill (2), |
| set nice values with |
| .BR setpriority (2), |
| etc.) only processes contained in its own PID namespace |
| and in descendants of that namespace. |
| .PP |
| A process has one process ID in each of the layers of the PID |
| namespace hierarchy in which it is visible: |
| one in its own PID namespace and one in each direct ancestor namespace, |
| back to the root PID namespace. |
| System calls that operate on process IDs always |
| operate using the process ID that is visible in the |
| PID namespace of the caller. |
| A call to |
| .BR getpid (2) |
| always returns the PID associated with the namespace in which |
| the process was created. |
| .PP |
| Some processes in a PID namespace may have parents |
| that are outside of the namespace. |
| For example, the parent of the initial process in the namespace |
| (i.e., the |
| .BR init (1) |
| process with PID 1) is necessarily in another namespace. |
| Likewise, the direct children of a process that uses |
| .BR setns (2) |
| to cause its children to join a PID namespace are in a different |
| PID namespace from the caller of |
| .BR setns (2). |
| Calls to |
| .BR getppid (2) |
| for such processes return 0. |
| .PP |
| While processes may freely descend into child PID namespaces |
| (e.g., using |
| .BR setns (2) |
| with a PID namespace file descriptor), |
| they may not move in the other direction. |
| That is to say, processes may not enter any ancestor namespaces |
| (parent, grandparent, etc.). |
| Changing PID namespaces is a one-way operation. |
| .PP |
| The |
| .BR NS_GET_PARENT |
| .BR ioctl (2) |
| operation can be used to discover the parental relationship |
| between PID namespaces; see |
| .BR ioctl_ns (2). |
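| .PP |
| A minimal sketch of such a query follows. |
| The function name |
| .I query_parent_pid_ns |
| is hypothetical, and the fallback definitions of |
| .B NSIO |
| and |
| .B NS_GET_PARENT |
| are assumed to mirror the values in |
| .IR <linux/nsfs.h> ; |
| they are spelled out only so that the fragment compiles without that header. |

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumed to match <linux/nsfs.h>; see ioctl_ns(2). */
#ifndef NS_GET_PARENT
#define NSIO          0xb7
#define NS_GET_PARENT _IO(NSIO, 0x2)
#endif

/* Hypothetical helper: returns 0 if a file descriptor for the parent of
   the caller's PID namespace could be obtained, or the errno value of
   the failing call otherwise. */
int query_parent_pid_ns(void)
{
    int fd, parent, err;

    fd = open("/proc/self/ns/pid", O_RDONLY);
    if (fd == -1)
        return errno;
    parent = ioctl(fd, NS_GET_PARENT);
    err = (parent == -1) ? errno : 0;
    if (parent != -1)
        close(parent);          /* a real caller would keep and use it */
    close(fd);
    return err;
}
```

| According to |
| .BR ioctl_ns (2), |
| the operation fails with |
| .B EPERM |
| when the parent namespace is outside the caller's namespace scope, |
| which includes asking for the parent of the initial PID namespace. |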
| .\" |
| .\" ============================================================ |
| .\" |
| .SS setns(2) and unshare(2) semantics |
| Calls to |
| .BR setns (2) |
| that specify a PID namespace file descriptor |
| and calls to |
| .BR unshare (2) |
| with the |
| .BR CLONE_NEWPID |
| flag cause children subsequently created |
| by the caller to be placed in a different PID namespace from the caller. |
| (Since Linux 4.12, that PID namespace is shown via the |
| .IR /proc/[pid]/ns/pid_for_children |
| file, as described in |
| .BR namespaces (7).) |
| These calls do not, however, |
| change the PID namespace of the calling process, |
| because doing so would change the caller's idea of its own PID |
| (as reported by |
| .BR getpid (2)), |
| which would break many applications and libraries. |
| .PP |
| To put things another way: |
| a process's PID namespace membership is determined when the process is created |
| and cannot be changed thereafter. |
| Among other things, this means that the parental relationship |
| between processes mirrors the parental relationship between PID namespaces: |
| the parent of a process is either in the same namespace |
| or resides in the immediate parent PID namespace. |
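| .PP |
| The fact that |
| .BR unshare (2) |
| leaves the caller's own PID untouched can be checked with a sketch such as the following. |
| The function name |
| .I unshare_keeps_caller_pid |
| is hypothetical, and the fragment assumes a caller with |
| .BR CAP_SYS_ADMIN . |

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if getpid(2) reports the same value
   before and after unshare(CLONE_NEWPID), 0 if the value changed, and
   -1 if the namespace could not be created. */
int unshare_keeps_caller_pid(void)
{
    int status;
    pid_t helper = fork();      /* unshare in a throwaway process */

    if (helper == -1)
        return -1;
    if (helper == 0) {
        pid_t before = getpid();

        if (unshare(CLONE_NEWPID) == -1)
            _exit(2);
        /* Only future children are placed in the new namespace. */
        _exit(getpid() == before ? 0 : 1);
    }
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```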
| .PP |
| A process may call |
| .BR unshare (2) |
| with the |
| .B CLONE_NEWPID |
| flag only once. |
| After it has performed this operation, its |
| .IR /proc/[pid]/ns/pid_for_children |
| symbolic link will be empty until the first child is created in the namespace. |
| .\" |
| .\" ============================================================ |
| .\" |
| .SS Adoption of orphaned children |
| When a child process becomes orphaned, it is reparented to the "init" |
| process in the PID namespace of its parent |
| (unless one of the nearer ancestors of the parent employed the |
| .BR prctl (2) |
| .B PR_SET_CHILD_SUBREAPER |
| command to mark itself as the reaper of orphaned descendant processes). |
| Note that because of the |
| .BR setns (2) |
| and |
| .BR unshare (2) |
| semantics described above, this may be the "init" process in the PID |
| namespace that is the |
| .I parent |
| of the child's PID namespace, |
| rather than the "init" process in the child's own PID namespace. |
| .\" Furthermore, by definition, the parent of the "init" process |
| .\" of a PID namespace resides in the parent PID namespace. |
| .\" |
| .\" ============================================================ |
| .\" |
| .SS Compatibility of CLONE_NEWPID with other CLONE_* flags |
| In current versions of Linux, |
| .BR CLONE_NEWPID |
| can't be combined with |
| .BR CLONE_THREAD . |
| Threads are required to be in the same PID namespace so that |
| the threads in a process can send signals to each other. |
| Similarly, it must be possible to see all of the threads |
| of a process in the |
| .BR proc (5) |
| filesystem. |
| Additionally, if two threads were in different PID |
| namespaces, the process ID of the process sending a signal |
| could not be meaningfully encoded when a signal is sent |
| (see the description of the |
| .I siginfo_t |
| type in |
| .BR sigaction (2)). |
| Since the sender's PID is translated into the receiver's PID namespace |
| when a signal is enqueued, |
| a signal queue shared by processes in multiple PID namespaces |
| would make that translation ambiguous. |
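| .PP |
| The rejected flag combination can be probed with a sketch like the following. |
| The names |
| .I thread_fn |
| and |
| .I clone_newpid_thread_errno |
| are hypothetical. |
| The fragment assumes that the kernel validates the flags before |
| checking privileges, so that |
| .B EINVAL |
| is reported even to an unprivileged caller; |
| a seccomp filter may report |
| .B EPERM |
| instead. |

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

/* Entry point for the thread; not expected to run at all. */
static int thread_fn(void *arg)
{
    (void) arg;
    return 0;
}

/* Hypothetical helper: returns the errno value of a clone(2) call that
   combines CLONE_NEWPID with CLONE_THREAD (expected: EINVAL), 0 if the
   call unexpectedly succeeded, or -1 if no stack could be allocated. */
int clone_newpid_thread_errno(void)
{
    const size_t stack_size = 64 * 1024;
    /* CLONE_THREAD requires CLONE_SIGHAND, which requires CLONE_VM. */
    int flags = CLONE_NEWPID | CLONE_THREAD | CLONE_SIGHAND | CLONE_VM;
    char *stack = malloc(stack_size);
    int tid, err;

    if (stack == NULL)
        return -1;
    /* The stack grows downward, so pass the top of the allocation. */
    tid = clone(thread_fn, stack + stack_size, flags, NULL);
    err = (tid == -1) ? errno : 0;
    if (tid == -1)
        free(stack);    /* on success the new thread may still use it */
    return err;
}
```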
| .PP |
| .\" Note these restrictions were all introduced in |
| .\" 8382fcac1b813ad0a4e68a838fc7ae93fa39eda0 |
| .\" when CLONE_NEWPID|CLONE_VM was disallowed |
| In earlier versions of Linux, |
| .BR CLONE_NEWPID |
| was additionally disallowed (failing with the error |
| .BR EINVAL ) |
| in combination with |
| .BR CLONE_SIGHAND |
| .\" (restriction lifted in faf00da544045fdc1454f3b9e6d7f65c841de302) |
| (before Linux 4.3) as well as |
| .\" (restriction lifted in e79f525e99b04390ca4d2366309545a836c03bf1) |
| .BR CLONE_VM |
| (before Linux 3.12). |
| The changes that lifted these restrictions have also been ported to |
| earlier stable kernels. |
| .\" |
| .\" ============================================================ |
| .\" |
| .SS /proc and PID namespaces |
| A |
| .I /proc |
| filesystem shows (in the |
| .I /proc/[pid] |
| directories) only processes visible in the PID namespace |
| of the process that performed the mount, even if the |
| .I /proc |
| filesystem is viewed from processes in other namespaces. |
| .PP |
| After creating a new PID namespace, |
| it is useful for the child to change its root directory |
| and mount a new procfs instance at |
| .I /proc |
| so that tools such as |
| .BR ps (1) |
| work correctly. |
| If a new mount namespace is simultaneously created by including |
| .BR CLONE_NEWNS |
| in the |
| .IR flags |
| argument of |
| .BR clone (2) |
| or |
| .BR unshare (2), |
| then it isn't necessary to change the root directory: |
| a new procfs instance can be mounted directly over |
| .IR /proc . |
| .PP |
| From a shell, the command to mount |
| .I /proc |
| is: |
| .PP |
| .in +4n |
| .EX |
| $ mount \-t proc proc /proc |
| .EE |
| .in |
| .PP |
| Calling |
| .BR readlink (2) |
| on the path |
| .I /proc/self |
| yields the process ID of the caller in the PID namespace of the procfs mount |
| (i.e., the PID namespace of the process that mounted the procfs). |
| This can be useful for introspection purposes, |
| when a process wants to discover its PID in other namespaces. |
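| .PP |
| A minimal sketch of this technique follows; |
| the function name |
| .I pid_from_proc_self |
| is hypothetical, and the procfs instance is assumed to be mounted at |
| .IR /proc . |

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the caller's PID as seen in the PID
   namespace of the procfs instance mounted at /proc, or -1 if the
   link cannot be read or does not hold a decimal PID. */
long pid_from_proc_self(void)
{
    char buf[32];
    char *end;
    long pid;
    ssize_t n = readlink("/proc/self", buf, sizeof(buf) - 1);

    if (n <= 0)
        return -1;
    buf[n] = '\0';              /* readlink(2) does not terminate */
    pid = strtol(buf, &end, 10);
    if (*end != '\0' || pid <= 0)
        return -1;
    return pid;
}
```

| If the result differs from the value returned by |
| .BR getpid (2), |
| the procfs instance was mounted from a PID namespace other than the caller's own. |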
| .\" |
| .\" ============================================================ |
| .\" |
| .SS /proc files |
| .TP |
| .BR /proc/sys/kernel/ns_last_pid " (since Linux 3.3)" |
| .\" commit b8f566b04d3cddd192cfd2418ae6d54ac6353792 |
| This file |
| (which is virtualized per PID namespace) |
| displays the last PID that was allocated in this PID namespace. |
| When the next PID is allocated, |
| the kernel will search for the lowest unallocated PID |
| that is greater than this value, |
| and when this file is subsequently read it will show that PID. |
| .IP |
| This file is writable by a process that has the |
| .B CAP_SYS_ADMIN |
| or (since Linux 5.9) |
| .B CAP_CHECKPOINT_RESTORE |
| capability inside the user namespace that owns the PID namespace. |
| .\" This ability is necessary to support checkpoint restore in user-space |
| This makes it possible to control the PID that will be allocated |
| to the next process that is created inside this PID namespace. |
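| .IP |
| As a sketch of how a checkpoint/restore tool might use this file |
| (the names |
| .I read_ns_last_pid |
| and |
| .I request_next_pid |
| are hypothetical), |
| the fragment below reads the current value and writes one less than a desired PID. |
| The request is honored only if that PID is free and |
| no other process is created in the namespace in the meantime. |

```c
#include <errno.h>
#include <stdio.h>

#define NS_LAST_PID "/proc/sys/kernel/ns_last_pid"

/* Hypothetical helper: returns the last PID allocated in the caller's
   PID namespace, or -1 if the file cannot be read. */
long read_ns_last_pid(void)
{
    long last = -1;
    FILE *f = fopen(NS_LAST_PID, "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &last) != 1)
        last = -1;
    fclose(f);
    return last;
}

/* Hypothetical helper: asks the kernel to start its search for the
   next PID just below next_pid by writing next_pid - 1.
   Returns 0 on success or an errno value on failure. */
int request_next_pid(long next_pid)
{
    int err = 0;
    FILE *f;

    if (next_pid < 1)
        return EINVAL;
    f = fopen(NS_LAST_PID, "w");
    if (f == NULL)
        return errno;           /* typically EACCES without privilege */
    if (fprintf(f, "%ld", next_pid - 1) < 0)
        err = EIO;
    if (fclose(f) == EOF && err == 0)
        err = errno;            /* buffered write errors surface here */
    return err;
}
```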
| .\" |
| .\" ============================================================ |
| .\" |
| .SS Miscellaneous |
| When a process ID is passed over a UNIX domain socket to a |
| process in a different PID namespace (see the description of |
| .B SCM_CREDENTIALS |
| in |
| .BR unix (7)), |
| it is translated into the corresponding PID value in |
| the receiving process's PID namespace. |
| .SH CONFORMING TO |
| Namespaces are a Linux-specific feature. |
| .SH EXAMPLES |
| See |
| .BR user_namespaces (7). |
| .SH SEE ALSO |
| .BR clone (2), |
| .BR reboot (2), |
| .BR setns (2), |
| .BR unshare (2), |
| .BR proc (5), |
| .BR capabilities (7), |
| .BR credentials (7), |
| .BR mount_namespaces (7), |
| .BR namespaces (7), |
| .BR user_namespaces (7), |
| .BR switch_root (8) |