| .\" Copyright 2003,2004 Andi Kleen, SuSE Labs. |
| .\" and Copyright 2007 Lee Schermerhorn, Hewlett Packard |
| .\" |
| .\" %%%LICENSE_START(VERBATIM_PROF) |
| .\" Permission is granted to make and distribute verbatim copies of this |
| .\" manual provided the copyright notice and this permission notice are |
| .\" preserved on all copies. |
| .\" |
| .\" Permission is granted to copy and distribute modified versions of this |
| .\" manual under the conditions for verbatim copying, provided that the |
| .\" entire resulting derived work is distributed under the terms of a |
| .\" permission notice identical to this one. |
| .\" |
| .\" Since the Linux kernel and libraries are constantly changing, this |
| .\" manual page may be incorrect or out-of-date. The author(s) assume no |
| .\" responsibility for errors or omissions, or for damages resulting from |
| .\" the use of the information contained herein. |
| .\" |
| .\" Formatted or processed versions of this manual, if unaccompanied by |
| .\" the source, must acknowledge the copyright and authors of this work. |
| .\" %%%LICENSE_END |
| .\" |
| .\" 2006-02-03, mtk, substantial wording changes and other improvements |
| .\" 2007-08-27, Lee Schermerhorn <Lee.Schermerhorn@hp.com> |
| .\" more precise specification of behavior. |
| .\" |
| .\" FIXME |
| .\" Linux 3.8 added MPOL_MF_LAZY, which needs to be documented. |
| .\" Does it also apply for move_pages()? |
| .\" |
| .\" commit b24f53a0bea38b266d219ee651b22dba727c44ae |
| .\" Author: Lee Schermerhorn <lee.schermerhorn@hp.com> |
| .\" Date: Thu Oct 25 14:16:32 2012 +0200 |
| .\" |
| .TH MBIND 2 2017-09-15 Linux "Linux Programmer's Manual" |
| .SH NAME |
| mbind \- set memory policy for a memory range |
| .SH SYNOPSIS |
| .nf |
| .B "#include <numaif.h>" |
| .PP |
| .BI "long mbind(void *" addr ", unsigned long " len ", int " mode , |
| .BI " const unsigned long *" nodemask ", unsigned long " maxnode , |
| .BI " unsigned " flags ); |
| .PP |
| Link with \fI\-lnuma\fP. |
| .fi |
| .SH DESCRIPTION |
| .BR mbind () |
| sets the NUMA memory policy, |
| which consists of a policy mode and zero or more nodes, |
| for the memory range starting with |
| .I addr |
| and continuing for |
| .I len |
| bytes. |
| The memory policy defines from which node memory is allocated. |
| .PP |
| If the memory range specified by the |
| .IR addr " and " len |
| arguments includes an "anonymous" region of memory\(emthat is |
| a region of memory created using the |
| .BR mmap (2) |
| system call with the |
| .B MAP_ANONYMOUS |
| flag\(emor |
| a memory-mapped file, mapped using the |
| .BR mmap (2) |
| system call with the |
| .B MAP_PRIVATE |
| flag, pages will be allocated according to the specified |
| policy only when the application writes (stores) to the page. |
| For anonymous regions, an initial read access will use a shared |
| page in the kernel containing all zeros. |
| For a file mapped with |
| .BR MAP_PRIVATE , |
| an initial read access will allocate pages according to the |
| memory policy of the thread that causes the page to be allocated. |
| This may not be the thread that called |
| .BR mbind (). |
| .PP |
| The specified policy will be ignored for any |
| .B MAP_SHARED |
| mappings in the specified memory range. |
| Rather the pages will be allocated according to the memory policy |
| of the thread that caused the page to be allocated. |
| Again, this may not be the thread that called |
| .BR mbind (). |
| .PP |
| If the specified memory range includes a shared memory region |
| created using the |
| .BR shmget (2) |
| system call and attached using the |
| .BR shmat (2) |
| system call, |
| pages allocated for the anonymous or shared memory region will |
| be allocated according to the policy specified, regardless of which |
| process attached to the shared memory segment causes the allocation. |
| If, however, the shared memory region was created with the |
| .B SHM_HUGETLB |
| flag, |
| the huge pages will be allocated according to the policy specified |
| only if the page allocation is caused by the process that calls |
| .BR mbind () |
| for that region. |
| .PP |
| By default, |
| .BR mbind () |
| has an effect only for new allocations; if the pages inside |
| the range have already been touched before the policy is set, |
| then the policy has no effect. |
| This default behavior may be overridden by the |
| .B MPOL_MF_MOVE |
| and |
| .B MPOL_MF_MOVE_ALL |
| flags described below. |
| .PP |
| The |
| .I mode |
| argument must specify one of |
| .BR MPOL_DEFAULT , |
| .BR MPOL_BIND , |
| .BR MPOL_INTERLEAVE , |
| .BR MPOL_PREFERRED , |
| or |
| .BR MPOL_LOCAL |
| (which are described in detail below). |
| All policy modes except |
| .B MPOL_DEFAULT |
| require the caller to specify the node or nodes to which the mode applies, |
| via the |
| .I nodemask |
| argument. |
| .PP |
| The |
| .I mode |
| argument may also include an optional |
| .IR "mode flag" . |
| The supported |
| .I "mode flags" |
| are: |
| .TP |
| .BR MPOL_F_STATIC_NODES " (since Linux 2.6.26)" |
| A nonempty |
| .I nodemask |
| specifies physical node IDs. |
| Linux does not remap the |
| .I nodemask |
| when the thread moves to a different cpuset context, |
| nor when the set of nodes allowed by the thread's |
| current cpuset context changes. |
| .TP |
| .BR MPOL_F_RELATIVE_NODES " (since Linux 2.6.26)" |
| A nonempty |
| .I nodemask |
| specifies node IDs that are relative to the set of |
| node IDs allowed by the thread's current cpuset. |
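| .PP |
| As a minimal sketch, a mode flag can be ORed into the mode value before the call. |
| The helper name compose_mode() is hypothetical, |
| and the constant values are assumed to match the Linux UAPI header; |
| they are repeated only so that the fragment compiles without <numaif.h>. |

```c
#include <assert.h>

/* Assumed values, as found in the Linux UAPI header <linux/mempolicy.h>. */
#ifndef MPOL_BIND
#define MPOL_BIND 2
#endif
#ifndef MPOL_F_STATIC_NODES
#define MPOL_F_STATIC_NODES (1 << 15)
#endif
#ifndef MPOL_F_RELATIVE_NODES
#define MPOL_F_RELATIVE_NODES (1 << 14)
#endif

/*
 * Hypothetical helper: OR an optional mode flag into a base mode.
 * Returns -1 when both mode flags are requested, since that
 * combination is rejected with EINVAL.
 */
int compose_mode(int base_mode, int mode_flags)
{
    const int both = MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES;

    /* Only the two mode flags described above are accepted here. */
    assert((mode_flags & ~both) == 0);

    if ((mode_flags & both) == both)
        return -1;
    return base_mode | mode_flags;
}
```

| The result would be passed as the mode argument, |
| for example compose_mode(MPOL_BIND, MPOL_F_STATIC_NODES). |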
| .PP |
| .I nodemask |
| points to a bit mask of nodes containing up to |
| .I maxnode |
| bits. |
| The bit mask size is rounded to the next multiple of |
| .IR "sizeof(unsigned long)" , |
| but the kernel will use bits only up to |
| .IR maxnode . |
| A NULL value of |
| .I nodemask |
| or a |
| .I maxnode |
| value of zero specifies the empty set of nodes. |
| If the value of |
| .I maxnode |
| is zero, |
| the |
| .I nodemask |
| argument is ignored. |
| Where a |
| .I nodemask |
| is required, it must contain at least one node that is on-line, |
| allowed by the thread's current cpuset context |
| (unless the |
| .B MPOL_F_STATIC_NODES |
| mode flag is specified), |
| and contains memory. |
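| .PP |
| The bit mask layout described above can be sketched with a few helpers. |
| The names nodemask_words(), nodemask_clear(), nodemask_set(), and nodemask_is_empty() |
| are hypothetical and not part of any library; node N is assumed to be bit N of the mask. |
| Some callers pass a maxnode one larger than the highest bit they need; |
| whether that is necessary depends on the kernel, so treat the sizing here as an assumption. |

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold maxnode bits. */
size_t nodemask_words(unsigned long maxnode)
{
    return (maxnode + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
}

/* Clear every word of a mask sized for maxnode bits. */
void nodemask_clear(unsigned long *mask, unsigned long maxnode)
{
    memset(mask, 0, nodemask_words(maxnode) * sizeof(unsigned long));
}

/* Set the bit for one node ID; the caller must keep node < maxnode. */
void nodemask_set(unsigned long *mask, unsigned long maxnode,
                  unsigned long node)
{
    assert(node < maxnode);
    mask[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
}

/* Return 1 if mask is NULL or has no bit set below maxnode, else 0. */
int nodemask_is_empty(const unsigned long *mask, unsigned long maxnode)
{
    unsigned long node;

    if (mask == NULL)
        return 1;
    for (node = 0; node < maxnode; node++)
        if (mask[node / BITS_PER_ULONG] & (1UL << (node % BITS_PER_ULONG)))
            return 0;
    return 1;
}
```

| A NULL mask or a maxnode of zero is reported as empty, matching the description above. |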
| .PP |
| The |
| .I mode |
| argument must include one of the following values: |
| .TP |
| .B MPOL_DEFAULT |
| This mode requests that any nondefault policy be removed, |
| restoring default behavior. |
| When applied to a range of memory via |
| .BR mbind (), |
| this means to use the thread memory policy, |
| which may have been set with |
| .BR set_mempolicy (2). |
| If the mode of the thread memory policy is also |
| .BR MPOL_DEFAULT , |
| the system-wide default policy will be used. |
| The system-wide default policy allocates |
| pages on the node of the CPU that triggers the allocation. |
| For |
| .BR MPOL_DEFAULT , |
| the |
| .I nodemask |
| and |
| .I maxnode |
| arguments must specify the empty set of nodes. |
| .TP |
| .B MPOL_BIND |
| This mode specifies a strict policy that restricts memory allocation to |
| the nodes specified in |
| .IR nodemask . |
| If |
| .I nodemask |
| specifies more than one node, page allocations will come from |
| the node with sufficient free memory that is closest to |
| the node where the allocation takes place. |
| Pages will not be allocated from any node not specified in the |
| .IR nodemask . |
| (Before Linux 2.6.26, |
| .\" commit 19770b32609b6bf97a3dece2529089494cbfc549 |
| page allocations came from |
| the node with the lowest numeric node ID first, until that node |
| contained no free memory. |
| Allocations then came from the node with the next highest |
| node ID specified in |
| .I nodemask |
| and so forth, until none of the specified nodes contained free memory.) |
| .TP |
| .B MPOL_INTERLEAVE |
| This mode specifies that page allocations be interleaved across the |
| set of nodes specified in |
| .IR nodemask . |
| This optimizes for bandwidth instead of latency |
| by spreading out pages and memory accesses to those pages across |
| multiple nodes. |
| To be effective the memory area should be fairly large, |
| at least 1\ MB or bigger with a fairly uniform access pattern. |
| Accesses to a single page of the area will still be limited to |
| the memory bandwidth of a single node. |
| .TP |
| .B MPOL_PREFERRED |
| This mode sets the preferred node for allocation. |
| The kernel will try to allocate pages from this |
| node first and fall back to other nodes if the |
| preferred node is low on free memory. |
| If |
| .I nodemask |
| specifies more than one node ID, the first node in the |
| mask will be selected as the preferred node. |
| If the |
| .I nodemask |
| and |
| .I maxnode |
| arguments specify the empty set, then the memory is allocated on |
| the node of the CPU that triggered the allocation. |
| .TP |
| .BR MPOL_LOCAL " (since Linux 3.8)" |
| .\" commit 479e2802d09f1e18a97262c4c6f8f17ae5884bd8 |
| .\" commit f2a07f40dbc603c15f8b06e6ec7f768af67b424f |
| This mode specifies "local allocation"; the memory is allocated on |
| the node of the CPU that triggered the allocation (the "local node"). |
| The |
| .I nodemask |
| and |
| .I maxnode |
| arguments must specify the empty set. |
| If the "local node" is low on free memory, |
| the kernel will try to allocate memory from other nodes. |
| The kernel will allocate memory from the "local node" |
| whenever memory for this node is available. |
| If the "local node" is not allowed by the thread's current cpuset context, |
| the kernel will try to allocate memory from other nodes. |
| The kernel will allocate memory from the "local node" whenever |
| it becomes allowed by the thread's current cpuset context. |
| By contrast, |
| .B MPOL_DEFAULT |
| reverts to the memory policy of the thread (which may be set via |
| .BR set_mempolicy (2)); |
| that policy may be something other than "local allocation". |
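| .PP |
| The node set requirements of the modes above can be summarized in a small pre-check. |
| This is only a sketch: nodeset_matches_mode() is a hypothetical helper, |
| the mode values are assumed to match the Linux UAPI header, |
| and the kernel remains the authority on what it accepts. |

```c
/* Assumed values, as found in the Linux UAPI header <linux/mempolicy.h>. */
#ifndef MPOL_DEFAULT
#define MPOL_DEFAULT    0
#define MPOL_PREFERRED  1
#define MPOL_BIND       2
#define MPOL_INTERLEAVE 3
#endif
#ifndef MPOL_LOCAL
#define MPOL_LOCAL      4
#endif

/*
 * Hypothetical pre-check. mode is the base mode without mode flags;
 * mask_empty is nonzero when nodemask and maxnode specify the empty set.
 * Returns 1 if that combination fits the descriptions above, else 0.
 */
int nodeset_matches_mode(int mode, int mask_empty)
{
    switch (mode) {
    case MPOL_DEFAULT:
    case MPOL_LOCAL:
        return mask_empty ? 1 : 0;   /* must be the empty set */
    case MPOL_BIND:
    case MPOL_INTERLEAVE:
        return mask_empty ? 0 : 1;   /* must name at least one node */
    case MPOL_PREFERRED:
        return 1;                    /* empty set means local allocation */
    default:
        return 0;                    /* not a mode described here */
    }
}
```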
| .PP |
| If |
| .B MPOL_MF_STRICT |
| is passed in |
| .I flags |
| and |
| .I mode |
| is not |
| .BR MPOL_DEFAULT , |
| then the call fails with the error |
| .B EIO |
| if the existing pages in the memory range don't follow the policy. |
| .\" According to the kernel code, the following is not true |
| .\" --Lee Schermerhorn |
| .\" In 2.6.16 or later the kernel will also try to move pages |
| .\" to the requested node with this flag. |
| .PP |
| If |
| .B MPOL_MF_MOVE |
| is specified in |
| .IR flags , |
| then the kernel will attempt to move all the existing pages |
| in the memory range so that they follow the policy. |
| Pages that are shared with other processes will not be moved. |
| If |
| .B MPOL_MF_STRICT |
| is also specified, then the call fails with the error |
| .B EIO |
| if some pages could not be moved. |
| .PP |
| If |
| .B MPOL_MF_MOVE_ALL |
| is passed in |
| .IR flags , |
| then the kernel will attempt to move all existing pages in the memory range |
| regardless of whether other processes use the pages. |
| The calling thread must be privileged |
| .RB ( CAP_SYS_NICE ) |
| to use this flag. |
| If |
| .B MPOL_MF_STRICT |
| is also specified, then the call fails with the error |
| .B EIO |
| if some pages could not be moved. |
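| .PP |
| The following sketch shows one way the move flags might be used on a range that has already been touched. |
| It invokes the system call through syscall(2) so that it builds without libnuma; |
| the function names are hypothetical, the constant values are assumed to match the Linux UAPI header, |
| and node 0 is assumed to exist and to contain memory. |

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values, as found in the Linux UAPI header <linux/mempolicy.h>. */
#ifndef MPOL_BIND
#define MPOL_BIND 2
#endif
#ifndef MPOL_MF_STRICT
#define MPOL_MF_STRICT   (1 << 0)
#define MPOL_MF_MOVE     (1 << 1)
#define MPOL_MF_MOVE_ALL (1 << 2)
#endif

/* Returns 1 if flags include MPOL_MF_MOVE_ALL, which needs CAP_SYS_NICE. */
int flags_need_privilege(unsigned flags)
{
    return (flags & MPOL_MF_MOVE_ALL) != 0;
}

/*
 * Map len bytes, touch the first page, then ask the kernel to bind the
 * range to node 0 and to move pages that were already allocated.
 * Returns 0 on success or an errno value on failure.
 */
int bind_touched_region_to_node0(size_t len)
{
    unsigned long mask = 1UL;                  /* bit 0: node 0 (assumed) */
    unsigned long maxnode = sizeof(mask) * 8;  /* bits available in mask */
    unsigned long flags = MPOL_MF_MOVE | MPOL_MF_STRICT;
    int err = 0;
    unsigned char *p;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;

    p[0] = 1;  /* write access: this page is allocated before the policy is set */

    if (syscall(SYS_mbind, p, len, (long)MPOL_BIND, &mask, maxnode, flags) == -1)
        err = errno;

    munmap(p, len);
    return err;
}
```

| A caller might pass a multiple of sysconf(_SC_PAGESIZE) as len and treat a returned EIO |
| as "some pages could not be moved". |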
| .\" --------------------------------------------------------------- |
| .SH RETURN VALUE |
| On success, |
| .BR mbind () |
| returns 0; |
| on error, \-1 is returned and |
| .I errno |
| is set to indicate the error. |
| .\" --------------------------------------------------------------- |
| .SH ERRORS |
| .\" I think I got all of the error returns. --Lee Schermerhorn |
| .TP |
| .B EFAULT |
| Part or all of the memory range specified by |
| .I nodemask |
| and |
| .I maxnode |
| points outside your accessible address space. |
| Or, there was an unmapped hole in the memory range specified by |
| .I addr |
| and |
| .IR len . |
| .TP |
| .B EINVAL |
| An invalid value was specified for |
| .I flags |
| or |
| .IR mode ; |
| or |
| .I addr + len |
| was less than |
| .IR addr ; |
| or |
| .I addr |
| is not a multiple of the system page size. |
| Or, |
| .I mode |
| is |
| .B MPOL_DEFAULT |
| and |
| .I nodemask |
| specified a nonempty set; |
| or |
| .I mode |
| is |
| .B MPOL_BIND |
| or |
| .B MPOL_INTERLEAVE |
| and |
| .I nodemask |
| is empty. |
| Or, |
| .I maxnode |
| exceeds a kernel-imposed limit. |
| .\" As at 2.6.23, this limit is "a page worth of bits", e.g., |
| .\" 8 * 4096 bits, assuming a 4kB page size. |
| Or, |
| .I nodemask |
| specifies one or more node IDs that are |
| greater than the maximum supported node ID. |
| Or, none of the node IDs specified by |
| .I nodemask |
| are on-line and allowed by the thread's current cpuset context, |
| or none of the specified nodes contain memory. |
| Or, the |
| .I mode |
| argument specified both |
| .B MPOL_F_STATIC_NODES |
| and |
| .BR MPOL_F_RELATIVE_NODES . |
| .TP |
| .B EIO |
| .B MPOL_MF_STRICT |
| was specified and an existing page was already on a node |
| that does not follow the policy; |
| or |
| .B MPOL_MF_MOVE |
| or |
| .B MPOL_MF_MOVE_ALL |
| was specified and the kernel was unable to move all existing |
| pages in the range. |
| .TP |
| .B ENOMEM |
| Insufficient kernel memory was available. |
| .TP |
| .B EPERM |
| The |
| .I flags |
| argument included the |
| .B MPOL_MF_MOVE_ALL |
| flag and the caller does not have the |
| .B CAP_SYS_NICE |
| privilege. |
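| .PP |
| As an illustration only, the errors listed above could be turned into short diagnostic hints after a failed call. |
| The function mbind_error_hint() is hypothetical, and its messages merely paraphrase this section. |

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: map an errno value from a failed mbind() call
 * to a short hint paraphrasing the ERRORS section. Values not listed
 * there fall back to strerror().
 */
const char *mbind_error_hint(int err)
{
    switch (err) {
    case EFAULT:
        return "nodemask not accessible, or unmapped hole in the range";
    case EINVAL:
        return "invalid flags, mode, addr alignment, nodemask, or maxnode";
    case EIO:
        return "existing page violates the policy, or pages could not be moved";
    case ENOMEM:
        return "insufficient kernel memory";
    case EPERM:
        return "MPOL_MF_MOVE_ALL requires CAP_SYS_NICE";
    default:
        return strerror(err);
    }
}
```

| A typical use would be to print mbind_error_hint(errno) immediately after the call returns \-1. |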
| .\" --------------------------------------------------------------- |
| .SH VERSIONS |
| The |
| .BR mbind () |
| system call was added to the Linux kernel in version 2.6.7. |
| .SH CONFORMING TO |
| This system call is Linux-specific. |
| .SH NOTES |
| For information on library support, see |
| .BR numa (7). |
| .PP |
| NUMA policy is not supported on a memory-mapped file range |
| that was mapped with the |
| .B MAP_SHARED |
| flag. |
| .PP |
| The |
| .B MPOL_DEFAULT |
| mode can have different effects for |
| .BR mbind () |
| and |
| .BR set_mempolicy (2). |
| When |
| .B MPOL_DEFAULT |
| is specified for |
| .BR set_mempolicy (2), |
| the thread's memory policy reverts to the system default policy |
| or local allocation. |
| When |
| .B MPOL_DEFAULT |
| is specified for a range of memory using |
| .BR mbind (), |
| any pages subsequently allocated for that range will use |
| the thread's memory policy, as set by |
| .BR set_mempolicy (2). |
| This effectively removes the explicit policy from the |
| specified range, "falling back" to a possibly nondefault |
| policy. |
| To select explicit "local allocation" for a memory range, |
| specify a |
| .I mode |
| of |
| .B MPOL_LOCAL |
| or |
| .B MPOL_PREFERRED |
| with an empty set of nodes. |
| This method will work for |
| .BR set_mempolicy (2), |
| as well. |
| .PP |
| Support for huge page policy was added with Linux 2.6.16. |
| For interleave policy to be effective on huge page mappings, the |
| memory range covered by the policy needs to be tens of megabytes or larger. |
| .PP |
| .B MPOL_MF_STRICT |
| is ignored on huge page mappings. |
| .PP |
| .B MPOL_MF_MOVE |
| and |
| .B MPOL_MF_MOVE_ALL |
| are available only on Linux 2.6.16 and later. |
| .SH SEE ALSO |
| .BR get_mempolicy (2), |
| .BR getcpu (2), |
| .BR mmap (2), |
| .BR set_mempolicy (2), |
| .BR shmat (2), |
| .BR shmget (2), |
| .BR numa (3), |
| .BR cpuset (7), |
| .BR numa (7), |
| .BR numactl (8) |