| .\" Copyright, the authors of the Linux man-pages project |
| .\" |
| .\" SPDX-License-Identifier: Linux-man-pages-copyleft-var |
| .\" |
| .\" FIXME |
| .\" Linux 3.8 added MPOL_MF_LAZY, which needs to be documented. |
| .\" Does it also apply for move_pages()? |
| .\" |
| .\" commit b24f53a0bea38b266d219ee651b22dba727c44ae |
| .\" Author: Lee Schermerhorn <lee.schermerhorn@hp.com> |
| .\" Date: Thu Oct 25 14:16:32 2012 +0200 |
| .\" |
| .TH mbind 2 (date) "Linux man-pages (unreleased)" |
| .SH NAME |
| mbind \- set memory policy for a memory range |
| .SH LIBRARY |
| NUMA (Non-Uniform Memory Access) policy library |
| .RI ( libnuma ,\~ \-lnuma ) |
| .SH SYNOPSIS |
| .nf |
| .B "#include <numaif.h>" |
| .P |
| .BR "long mbind(" "unsigned long size, unsigned long maxnode;" |
| .BI " void " addr [ size "], unsigned long " size ", int " mode , |
| .BI " const unsigned long " nodemask [( maxnode " + ULONG_WIDTH \- 1)" |
| .B " / ULONG_WIDTH]," |
| .BI " unsigned long " maxnode ", unsigned int " flags ); |
| .fi |
| .SH DESCRIPTION |
| .BR mbind () |
| sets the NUMA memory policy, |
| which consists of a policy mode and zero or more nodes, |
| for the memory range starting with |
| .I addr |
| and continuing for |
| .I size |
| bytes. |
| The memory policy defines from which node memory is allocated. |
| .P |
| If the memory range specified by the |
| .IR addr " and " size |
| arguments includes an "anonymous" region of memory\[em]that is |
| a region of memory created using the |
| .BR mmap (2) |
| system call with the |
| .B MAP_ANONYMOUS |
| flag\[em]or |
| a memory-mapped file, mapped using the |
| .BR mmap (2) |
| system call with the |
| .B MAP_PRIVATE |
| flag, pages will be allocated according to the specified |
| policy only when the application writes (stores) to the page. |
| For anonymous regions, an initial read access will use a shared |
| page in the kernel containing all zeros. |
| For a file mapped with |
| .BR MAP_PRIVATE , |
| an initial read access will allocate pages according to the |
| memory policy of the thread that causes the page to be allocated. |
| This may not be the thread that called |
| .BR mbind (). |
| .P |
| The specified policy will be ignored for any |
| .B MAP_SHARED |
| mappings in the specified memory range. |
| Rather the pages will be allocated according to the memory policy |
| of the thread that caused the page to be allocated. |
| Again, this may not be the thread that called |
| .BR mbind (). |
| .P |
| If the specified memory range includes a shared memory region |
| created using the |
| .BR shmget (2) |
| system call and attached using the |
| .BR shmat (2) |
| system call, |
| pages allocated for the anonymous or shared memory region will |
| be allocated according to the policy specified, regardless of which |
| process attached to the shared memory segment causes the allocation. |
| If, however, the shared memory region was created with the |
| .B SHM_HUGETLB |
| flag, |
| the huge pages will be allocated according to the policy specified |
| only if the page allocation is caused by the process that calls |
| .BR mbind () |
| for that region. |
| .P |
| By default, |
| .BR mbind () |
| has an effect only for new allocations; if the pages inside |
| the range have already been touched before setting the policy, |
| then the policy has no effect. |
| This default behavior may be overridden by the |
| .B MPOL_MF_MOVE |
| and |
| .B MPOL_MF_MOVE_ALL |
| flags described below. |
| .P |
| The |
| .I mode |
| argument must specify one of |
| .BR MPOL_DEFAULT , |
| .BR MPOL_BIND , |
| .BR MPOL_INTERLEAVE , |
| .BR MPOL_WEIGHTED_INTERLEAVE , |
| .BR MPOL_PREFERRED , |
| .BR MPOL_PREFERRED_MANY , |
| or |
| .B MPOL_LOCAL |
| (which are described in detail below). |
| All policy modes except |
| .B MPOL_DEFAULT |
| require the caller to specify the node or nodes to which the mode applies, |
| via the |
| .I nodemask |
| argument. |
| .P |
| The |
| .I mode |
| argument may also include an optional |
| .IR "mode flag" . |
| The supported |
| .I "mode flags" |
| are: |
| .TP |
| .BR MPOL_F_NUMA_BALANCING " (since Linux 5.15)" |
| .\" commit bda420b985054a3badafef23807c4b4fa38a3dff |
| .\" commit 6d2aec9e123bb9c49cb5c7fc654f25f81e688e8c |
| When |
| .I mode |
| is |
| .BR MPOL_BIND , |
| enable the kernel NUMA balancing for the task if it is supported by the kernel. |
| If the flag isn't supported by the kernel, or is used with a |
| .I mode |
| other than |
| .BR MPOL_BIND , |
| \-1 is returned and |
| .I errno |
| is set to |
| .BR EINVAL . |
| .TP |
| .BR MPOL_F_STATIC_NODES " (since Linux 2.6.26)" |
| A nonempty |
| .I nodemask |
| specifies physical node IDs. |
| Linux does not remap the |
| .I nodemask |
| when the thread moves to a different cpuset context, |
| nor when the set of nodes allowed by the thread's |
| current cpuset context changes. |
| .TP |
| .BR MPOL_F_RELATIVE_NODES " (since Linux 2.6.26)" |
| A nonempty |
| .I nodemask |
| specifies node IDs that are relative to the set of |
| node IDs allowed by the thread's current cpuset. |
| .P |
| .I nodemask |
| points to a bit mask of nodes containing up to |
| .I maxnode |
| bits. |
| The bit mask size is rounded to the next multiple of |
| .IR "sizeof(unsigned long)" , |
| but the kernel will use bits only up to |
| .IR maxnode . |
| A NULL value of |
| .I nodemask |
| or a |
| .I maxnode |
| value of zero specifies the empty set of nodes. |
| If the value of |
| .I maxnode |
| is zero, |
| the |
| .I nodemask |
| argument is ignored. |
| Where a |
| .I nodemask |
| is required, it must contain at least one node that is on-line, |
| allowed by the thread's current cpuset context |
| (unless the |
| .B MPOL_F_STATIC_NODES |
| mode flag is specified), |
| and contains memory. |
| .P |
| The |
| .I mode |
| argument must include one of the following values: |
| .TP |
| .B MPOL_DEFAULT |
| This mode requests that any nondefault policy be removed, |
| restoring default behavior. |
| When applied to a range of memory via |
| .BR mbind (), |
| this means to use the thread memory policy, |
| which may have been set with |
| .BR set_mempolicy (2). |
| If the mode of the thread memory policy is also |
| .BR MPOL_DEFAULT , |
| the system-wide default policy will be used. |
| The system-wide default policy allocates |
| pages on the node of the CPU that triggers the allocation. |
| For |
| .BR MPOL_DEFAULT , |
| the |
| .I nodemask |
| and |
| .I maxnode |
| arguments must specify the empty set of nodes. |
| .TP |
| .B MPOL_BIND |
| This mode specifies a strict policy that restricts memory allocation to |
| the nodes specified in |
| .IR nodemask . |
| If |
| .I nodemask |
| specifies more than one node, page allocations will come from |
| the node with sufficient free memory that is closest to |
| the node where the allocation takes place. |
| Pages will not be allocated from any node not specified in the |
| .IR nodemask . |
| (Before Linux 2.6.26, |
| .\" commit 19770b32609b6bf97a3dece2529089494cbfc549 |
| page allocations came from |
| the node with the lowest numeric node ID first, until that node |
| contained no free memory. |
| Allocations then came from the node with the next highest |
| node ID specified in |
| .I nodemask |
| and so forth, until none of the specified nodes contained free memory.) |
| .TP |
| .B MPOL_INTERLEAVE |
| This mode specifies that page allocations be interleaved across the |
| set of nodes specified in |
| .IR nodemask . |
| This optimizes for bandwidth instead of latency |
| by spreading out pages and memory accesses to those pages across |
| multiple nodes. |
| To be effective, the memory area should be fairly large, |
| at least 1\ MB, with a fairly uniform access pattern. |
| Accesses to a single page of the area will still be limited to |
| the memory bandwidth of a single node. |
| .TP |
| .BR MPOL_WEIGHTED_INTERLEAVE " (since Linux 6.9)" |
| .\" commit fa3bea4e1f8202d787709b7e3654eb0a99aed758 |
| This mode interleaves page allocations across the nodes specified in |
| .I nodemask |
| according to the weights in |
| .IR /sys/kernel/mm/mempolicy/weighted_interleave . |
| For example, if bits 0, 2, and 5 are set in |
| .IR nodemask , |
| and the contents of |
| .IR /sys/kernel/mm/mempolicy/weighted_interleave/node0 , |
| .IR /sys/ .\|.\|. /node2 , |
| and |
| .IR /sys/ .\|.\|. /node5 |
| are 4, 7, and 9, respectively, |
| then pages in this region will be allocated on nodes 0, 2, and 5 |
| in a 4:7:9 ratio. |
| .TP |
| .B MPOL_PREFERRED |
| This mode sets the preferred node for allocation. |
| The kernel will try to allocate pages from this |
| node first and fall back to other nodes if the |
| preferred node is low on free memory. |
| If |
| .I nodemask |
| specifies more than one node ID, the first node in the |
| mask will be selected as the preferred node. |
| If the |
| .I nodemask |
| and |
| .I maxnode |
| arguments specify the empty set, then the memory is allocated on |
| the node of the CPU that triggered the allocation. |
| .TP |
| .BR MPOL_PREFERRED_MANY " (since Linux 5.15)" |
| .\" commit b27abaccf8e8b012f126da0c2a1ab32723ec8b9f |
| Specifies a set of nodes for allocation; see |
| .BR set_mempolicy (2). |
| .TP |
| .BR MPOL_LOCAL " (since Linux 3.8)" |
| .\" commit 479e2802d09f1e18a97262c4c6f8f17ae5884bd8 |
| .\" commit f2a07f40dbc603c15f8b06e6ec7f768af67b424f |
| This mode specifies "local allocation"; the memory is allocated on |
| the node of the CPU that triggered the allocation (the "local node"). |
| The |
| .I nodemask |
| and |
| .I maxnode |
| arguments must specify the empty set. |
| If the "local node" is low on free memory, |
| the kernel will try to allocate memory from other nodes. |
| The kernel will allocate memory from the "local node" |
| whenever memory for this node is available. |
| If the "local node" is not allowed by the thread's current cpuset context, |
| the kernel will try to allocate memory from other nodes. |
| The kernel will allocate memory from the "local node" whenever |
| it becomes allowed by the thread's current cpuset context. |
| By contrast, |
| .B MPOL_DEFAULT |
| reverts to the memory policy of the thread (which may be set via |
| .BR set_mempolicy (2)); |
| that policy may be something other than "local allocation". |
| .P |
| If |
| .B MPOL_MF_STRICT |
| is passed in |
| .I flags |
| and |
| .I mode |
| is not |
| .BR MPOL_DEFAULT , |
| then the call fails with the error |
| .B EIO |
| if the existing pages in the memory range don't follow the policy. |
| .\" According to the kernel code, the following is not true |
| .\" --Lee Schermerhorn |
| .\" In Linux 2.6.16 or later the kernel will also try to move pages |
| .\" to the requested node with this flag. |
| .P |
| If |
| .B MPOL_MF_MOVE |
| is specified in |
| .IR flags , |
| then the kernel will attempt to move all the existing pages |
| in the memory range so that they follow the policy. |
| Pages that are shared with other processes will not be moved. |
| If |
| .B MPOL_MF_STRICT |
| is also specified, then the call fails with the error |
| .B EIO |
| if some pages could not be moved. |
| If the |
| .B MPOL_INTERLEAVE |
| policy was specified, |
| pages already residing on the specified nodes |
| will not be moved such that they are interleaved. |
| .P |
| If |
| .B MPOL_MF_MOVE_ALL |
| is passed in |
| .IR flags , |
| then the kernel will attempt to move all existing pages in the memory range |
| regardless of whether other processes use the pages. |
| The calling thread must be privileged |
| .RB ( CAP_SYS_NICE ) |
| to use this flag. |
| If |
| .B MPOL_MF_STRICT |
| is also specified, then the call fails with the error |
| .B EIO |
| if some pages could not be moved. |
| If the |
| .B MPOL_INTERLEAVE |
| policy was specified, |
| pages already residing on the specified nodes |
| will not be moved such that they are interleaved. |
| .\" --------------------------------------------------------------- |
| .SH RETURN VALUE |
| On success, |
| .BR mbind () |
| returns 0; |
| on error, \-1 is returned and |
| .I errno |
| is set to indicate the error. |
| .\" --------------------------------------------------------------- |
| .SH ERRORS |
| .\" I think I got all of the error returns. --Lee Schermerhorn |
| .TP |
| .B EFAULT |
| Part or all of the memory range specified by |
| .I nodemask |
| and |
| .I maxnode |
| points outside your accessible address space. |
| Or, there was an unmapped hole in the memory range specified by |
| .I addr |
| and |
| .IR size . |
| .TP |
| .B EINVAL |
| An invalid value was specified for |
| .I flags |
| or |
| .IR mode ; |
| or |
| .I addr + size |
| was less than |
| .IR addr ; |
| or |
| .I addr |
| is not a multiple of the system page size. |
| Or, |
| .I mode |
| is |
| .B MPOL_DEFAULT |
| and |
| .I nodemask |
| specified a nonempty set; |
| or |
| .I mode |
| is |
| .B MPOL_BIND |
| or |
| .B MPOL_INTERLEAVE |
| and |
| .I nodemask |
| is empty. |
| Or, |
| .I maxnode |
| exceeds a kernel-imposed limit. |
| .\" As at 2.6.23, this limit is "a page worth of bits", e.g., |
| .\" 8 * 4096 bits, assuming a 4kB page size. |
| Or, |
| .I nodemask |
| specifies one or more node IDs that are |
| greater than the maximum supported node ID. |
| Or, none of the node IDs specified by |
| .I nodemask |
| are on-line and allowed by the thread's current cpuset context, |
| or none of the specified nodes contain memory. |
| Or, the |
| .I mode |
| argument specified both |
| .B MPOL_F_STATIC_NODES |
| and |
| .BR MPOL_F_RELATIVE_NODES . |
| .TP |
| .B EIO |
| .B MPOL_MF_STRICT |
| was specified and an existing page was already on a node |
| that does not follow the policy; |
| or |
| .B MPOL_MF_MOVE |
| or |
| .B MPOL_MF_MOVE_ALL |
| was specified and the kernel was unable to move all existing |
| pages in the range. |
| .TP |
| .B ENOMEM |
| Insufficient kernel memory was available. |
| .TP |
| .B EPERM |
| The |
| .I flags |
| argument included the |
| .B MPOL_MF_MOVE_ALL |
| flag and the caller does not have the |
| .B CAP_SYS_NICE |
| privilege. |
| .\" --------------------------------------------------------------- |
| .SH STANDARDS |
| Linux. |
| .SH HISTORY |
| Linux 2.6.7. |
| .P |
| Support for huge page policy was added with Linux 2.6.16. |
| For interleave policy to be effective on huge page mappings the |
| policied memory needs to be tens of megabytes or larger. |
| .P |
| Before Linux 5.7, |
| .\" commit dcf1763546d76c372f3136c8d6b2b6e77f140cf0 |
| .B MPOL_MF_STRICT |
| was ignored on huge page mappings. |
| .P |
| .B MPOL_MF_MOVE |
| and |
| .B MPOL_MF_MOVE_ALL |
| are available only on Linux 2.6.16 and later. |
| .SH NOTES |
| For information on library support, see |
| .BR numa (7). |
| .P |
| NUMA policy is not supported on a memory-mapped file range |
| that was mapped with the |
| .B MAP_SHARED |
| flag. |
| .P |
| The |
| .B MPOL_DEFAULT |
| mode can have different effects for |
| .BR mbind () |
| and |
| .BR set_mempolicy (2). |
| When |
| .B MPOL_DEFAULT |
| is specified for |
| .BR set_mempolicy (2), |
| the thread's memory policy reverts to the system default policy |
| or local allocation. |
| When |
| .B MPOL_DEFAULT |
| is specified for a range of memory using |
| .BR mbind (), |
| any pages subsequently allocated for that range will use |
| the thread's memory policy, as set by |
| .BR set_mempolicy (2). |
| This effectively removes the explicit policy from the |
| specified range, "falling back" to a possibly nondefault |
| policy. |
| To select explicit "local allocation" for a memory range, |
| specify a |
| .I mode |
| of |
| .B MPOL_LOCAL |
| or |
| .B MPOL_PREFERRED |
| with an empty set of nodes. |
| This method will work for |
| .BR set_mempolicy (2), |
| as well. |
| .SH SEE ALSO |
| .BR get_mempolicy (2), |
| .BR getcpu (2), |
| .BR mmap (2), |
| .BR set_mempolicy (2), |
| .BR shmat (2), |
| .BR shmget (2), |
| .BR numa (3), |
| .BR cpuset (7), |
| .BR numa (7), |
| .BR numactl (8) |