| .\" Copyright (C) 2001 David Gómez <davidge@jazzfree.com> |
| .\" |
| .\" %%%LICENSE_START(VERBATIM) |
| .\" Permission is granted to make and distribute verbatim copies of this |
| .\" manual provided the copyright notice and this permission notice are |
| .\" preserved on all copies. |
| .\" |
| .\" Permission is granted to copy and distribute modified versions of this |
| .\" manual under the conditions for verbatim copying, provided that the |
| .\" entire resulting derived work is distributed under the terms of a |
| .\" permission notice identical to this one. |
| .\" |
| .\" Since the Linux kernel and libraries are constantly changing, this |
| .\" manual page may be incorrect or out-of-date. The author(s) assume no |
| .\" responsibility for errors or omissions, or for damages resulting from |
| .\" the use of the information contained herein. The author(s) may not |
| .\" have taken the same level of care in the production of this manual, |
| .\" which is licensed free of charge, as they might when working |
| .\" professionally. |
| .\" |
| .\" Formatted or processed versions of this manual, if unaccompanied by |
| .\" the source, must acknowledge the copyright and authors of this work. |
| .\" %%%LICENSE_END |
| .\" |
| .\" Based on comments from mm/filemap.c. Last modified on 10-06-2001 |
| .\" Modified, 25 Feb 2002, Michael Kerrisk, <mtk.manpages@gmail.com> |
| .\" Added notes on MADV_DONTNEED |
| .\" 2010-06-19, mtk, Added documentation of MADV_MERGEABLE and |
| .\" MADV_UNMERGEABLE |
| .\" 2010-06-15, Andi Kleen, Add documentation of MADV_HWPOISON. |
| .\" 2010-06-19, Andi Kleen, Add documentation of MADV_SOFT_OFFLINE. |
| .\" 2011-09-18, Doug Goldstein <cardoe@cardoe.com> |
| .\" Document MADV_HUGEPAGE and MADV_NOHUGEPAGE |
| .\" |
| .TH MADVISE 2 2021-03-22 "Linux" "Linux Programmer's Manual" |
| .SH NAME |
| madvise \- give advice about use of memory |
| .SH SYNOPSIS |
| .nf |
| .B #include <sys/mman.h> |
| .PP |
| .BI "int madvise(void *" addr ", size_t " length ", int " advice ); |
| .fi |
| .PP |
| .RS -4 |
| Feature Test Macro Requirements for glibc (see |
| .BR feature_test_macros (7)): |
| .RE |
| .PP |
| .BR madvise (): |
| .nf |
| Since glibc 2.19: |
| _DEFAULT_SOURCE |
| Up to and including glibc 2.19: |
| _BSD_SOURCE |
| .fi |
| .SH DESCRIPTION |
| The |
| .BR madvise () |
| system call is used to give advice or directions to the kernel |
| about the address range beginning at address |
| .I addr |
| and with size |
| .I length |
| bytes |
| In most cases, |
| the goal of such advice is to improve system or application performance. |
| .PP |
| Initially, the system call supported a set of "conventional" |
| .I advice |
| values, which are also available on several other implementations. |
| (Note, though, that |
| .BR madvise () |
| is not specified in POSIX.) |
| Subsequently, a number of Linux-specific |
| .IR advice |
| values have been added. |
| .\" |
| .\" ====================================================================== |
| .\" |
| .SS Conventional advice values |
| The |
| .I advice |
| values listed below |
| allow an application to tell the kernel how it expects to use |
| some mapped or shared memory areas, so that the kernel can choose |
| appropriate read-ahead and caching techniques. |
| These |
| .I advice |
| values do not influence the semantics of the application |
| (except in the case of |
| .BR MADV_DONTNEED ), |
| but may influence its performance. |
| All of the |
| .I advice |
| values listed here have analogs in the POSIX-specified |
| .BR posix_madvise (3) |
| function, and the values have the same meanings, with the exception of |
| .BR MADV_DONTNEED . |
| .PP |
| The advice is indicated in the |
| .I advice |
| argument, which is one of the following: |
| .TP |
| .B MADV_NORMAL |
| No special treatment. |
| This is the default. |
| .TP |
| .B MADV_RANDOM |
| Expect page references in random order. |
| (Hence, read ahead may be less useful than normally.) |
| .TP |
| .B MADV_SEQUENTIAL |
| Expect page references in sequential order. |
| (Hence, pages in the given range can be aggressively read ahead, |
| and may be freed soon after they are accessed.) |
| .TP |
| .B MADV_WILLNEED |
| Expect access in the near future. |
| (Hence, it might be a good idea to read some pages ahead.) |
| .TP |
| .B MADV_DONTNEED |
| Do not expect access in the near future. |
| (For the time being, the application is finished with the given range, |
| so the kernel can free resources associated with it.) |
| .IP |
| After a successful |
| .B MADV_DONTNEED |
| operation, |
| the semantics of memory access in the specified region are changed: |
| subsequent accesses of pages in the range will succeed, but will result |
| in either repopulating the memory contents from the |
| up-to-date contents of the underlying mapped file |
| (for shared file mappings, shared anonymous mappings, |
| and shmem-based techniques such as System V shared memory segments) |
| or zero-fill-on-demand pages for anonymous private mappings. |
| .IP |
| Note that, when applied to shared mappings, |
| .BR MADV_DONTNEED |
| might not lead to immediate freeing of the pages in the range. |
| The kernel is free to delay freeing the pages until an appropriate moment. |
| The resident set size (RSS) of the calling process will be immediately |
| reduced however. |
| .IP |
| .B MADV_DONTNEED |
| cannot be applied to locked pages, Huge TLB pages, or |
| .BR VM_PFNMAP |
| pages. |
| (Pages marked with the kernel-internal |
| .B VM_PFNMAP |
| .\" http://lwn.net/Articles/162860/ |
| flag are special memory areas that are not managed |
| by the virtual memory subsystem. |
| Such pages are typically created by device drivers that |
| map the pages into user space.) |
| .\" |
| .\" ====================================================================== |
| .\" |
| .SS Linux-specific advice values |
| The following Linux-specific |
| .I advice |
| values have no counterparts in the POSIX-specified |
| .BR posix_madvise (3), |
| and may or may not have counterparts in the |
| .BR madvise () |
| interface available on other implementations. |
| Note that some of these operations change the semantics of memory accesses. |
| .TP |
| .BR MADV_REMOVE " (since Linux 2.6.16)" |
| .\" commit f6b3ec238d12c8cc6cc71490c6e3127988460349 |
| Free up a given range of pages |
| and its associated backing store. |
| This is equivalent to punching a hole in the corresponding byte |
| range of the backing store (see |
| .BR fallocate (2)). |
| Subsequent accesses in the specified address range will see |
| bytes containing zero. |
| .\" Databases want to use this feature to drop a section of their |
| .\" bufferpool (shared memory segments) - without writing back to |
| .\" disk/swap space. This feature is also useful for supporting |
| .\" hot-plug memory on UML. |
| .IP |
| The specified address range must be mapped shared and writable. |
| This flag cannot be applied to locked pages, Huge TLB pages, or |
| .BR VM_PFNMAP |
| pages. |
| .IP |
| In the initial implementation, only |
| .BR tmpfs (5) |
| was supported |
| .BR MADV_REMOVE ; |
| but since Linux 3.5, |
| .\" commit 3f31d07571eeea18a7d34db9af21d2285b807a17 |
| any filesystem which supports the |
| .BR fallocate (2) |
| .BR FALLOC_FL_PUNCH_HOLE |
| mode also supports |
| .BR MADV_REMOVE . |
| Hugetlbfs fails with the error |
| .BR EINVAL |
| and other filesystems fail with the error |
| .BR EOPNOTSUPP . |
| .TP |
| .BR MADV_DONTFORK " (since Linux 2.6.16)" |
| .\" commit f822566165dd46ff5de9bf895cfa6c51f53bb0c4 |
| .\" See http://lwn.net/Articles/171941/ |
| Do not make the pages in this range available to the child after a |
| .BR fork (2). |
| This is useful to prevent copy-on-write semantics from changing |
| the physical location of a page if the parent writes to it after a |
| .BR fork (2). |
| (Such page relocations cause problems for hardware that |
| DMAs into the page.) |
| .\" [PATCH] madvise MADV_DONTFORK/MADV_DOFORK |
| .\" Currently, copy-on-write may change the physical address of |
| .\" a page even if the user requested that the page is pinned in |
| .\" memory (either by mlock or by get_user_pages). This happens |
| .\" if the process forks meanwhile, and the parent writes to that |
| .\" page. As a result, the page is orphaned: in case of |
| .\" get_user_pages, the application will never see any data hardware |
| .\" DMA's into this page after the COW. In case of mlock'd memory, |
| .\" the parent is not getting the realtime/security benefits of mlock. |
| .\" |
| .\" In particular, this affects the Infiniband modules which do DMA from |
| .\" and into user pages all the time. |
| .\" |
| .\" This patch adds madvise options to control whether memory range is |
| .\" inherited across fork. Useful e.g. for when hardware is doing DMA |
| .\" from/into these pages. Could also be useful to an application |
| .\" wanting to speed up its forks by cutting large areas out of |
| .\" consideration. |
| .\" |
| .\" SEE ALSO: http://lwn.net/Articles/171941/ |
| .\" "Tweaks to madvise() and posix_fadvise()", 14 Feb 2006 |
| .TP |
| .BR MADV_DOFORK " (since Linux 2.6.16)" |
| Undo the effect of |
| .BR MADV_DONTFORK , |
| restoring the default behavior, whereby a mapping is inherited across |
| .BR fork (2). |
| .TP |
| .BR MADV_HWPOISON " (since Linux 2.6.32)" |
| .\" commit 9893e49d64a4874ea67849ee2cfbf3f3d6817573 |
| Poison the pages in the range specified by |
| .I addr |
| and |
| .IR length |
| and handle subsequent references to those pages |
| like a hardware memory corruption. |
| This operation is available only for privileged |
| .RB ( CAP_SYS_ADMIN ) |
| processes. |
| This operation may result in the calling process receiving a |
| .B SIGBUS |
| and the page being unmapped. |
| .IP |
| This feature is intended for testing of memory error-handling code; |
| it is available only if the kernel was configured with |
| .BR CONFIG_MEMORY_FAILURE . |
| .TP |
| .BR MADV_MERGEABLE " (since Linux 2.6.32)" |
| .\" commit f8af4da3b4c14e7267c4ffb952079af3912c51c5 |
| Enable Kernel Samepage Merging (KSM) for the pages in the range specified by |
| .I addr |
| and |
| .IR length . |
| The kernel regularly scans those areas of user memory that have |
| been marked as mergeable, |
| looking for pages with identical content. |
| These are replaced by a single write-protected page (which is automatically |
| copied if a process later wants to update the content of the page). |
| KSM merges only private anonymous pages (see |
| .BR mmap (2)). |
| .IP |
| The KSM feature is intended for applications that generate many |
| instances of the same data (e.g., virtualization systems such as KVM). |
| It can consume a lot of processing power; use with care. |
| See the Linux kernel source file |
| .I Documentation/admin\-guide/mm/ksm.rst |
| for more details. |
| .IP |
| The |
| .BR MADV_MERGEABLE |
| and |
| .BR MADV_UNMERGEABLE |
| operations are available only if the kernel was configured with |
| .BR CONFIG_KSM . |
| .TP |
| .BR MADV_UNMERGEABLE " (since Linux 2.6.32)" |
| Undo the effect of an earlier |
| .BR MADV_MERGEABLE |
| operation on the specified address range; |
| KSM unmerges whatever pages it had merged in the address range specified by |
| .IR addr |
| and |
| .IR length . |
| .TP |
| .BR MADV_SOFT_OFFLINE " (since Linux 2.6.33)" |
| .\" commit afcf938ee0aac4ef95b1a23bac704c6fbeb26de6 |
| Soft offline the pages in the range specified by |
| .I addr |
| and |
| .IR length . |
| The memory of each page in the specified range is preserved |
| (i.e., when next accessed, the same content will be visible, |
| but in a new physical page frame), |
| and the original page is offlined |
| (i.e., no longer used, and taken out of normal memory management). |
| The effect of the |
| .B MADV_SOFT_OFFLINE |
| operation is invisible to (i.e., does not change the semantics of) |
| the calling process. |
| .IP |
| This feature is intended for testing of memory error-handling code; |
| it is available only if the kernel was configured with |
| .BR CONFIG_MEMORY_FAILURE . |
| .TP |
| .BR MADV_HUGEPAGE " (since Linux 2.6.38)" |
| .\" commit 0af4e98b6b095c74588af04872f83d333c958c32 |
| .\" http://lwn.net/Articles/358904/ |
| .\" https://lwn.net/Articles/423584/ |
| Enable Transparent Huge Pages (THP) for pages in the range specified by |
| .I addr |
| and |
| .IR length . |
| Currently, Transparent Huge Pages work only with private anonymous pages (see |
| .BR mmap (2)). |
| The kernel will regularly scan the areas marked as huge page candidates |
| to replace them with huge pages. |
| The kernel will also allocate huge pages directly when the region is |
| naturally aligned to the huge page size (see |
| .BR posix_memalign (2)). |
| .IP |
| This feature is primarily aimed at applications that use large mappings of |
| data and access large regions of that memory at a time (e.g., virtualization |
| systems such as QEMU). |
| It can very easily waste memory (e.g., a 2\ MB mapping that only ever accesses |
| 1 byte will result in 2\ MB of wired memory instead of one 4\ KB page). |
| See the Linux kernel source file |
| .I Documentation/admin\-guide/mm/transhuge.rst |
| for more details. |
| .IP |
| Most common kernels configurations provide |
| .BR MADV_HUGEPAGE -style |
| behavior by default, and thus |
| .BR MADV_HUGEPAGE |
| is normally not necessary. |
| It is mostly intended for embedded systems, where |
| .BR MADV_HUGEPAGE -style |
| behavior may not be enabled by default in the kernel. |
| On such systems, |
| this flag can be used in order to selectively enable THP. |
| Whenever |
| .BR MADV_HUGEPAGE |
| is used, it should always be in regions of memory with |
| an access pattern that the developer knows in advance won't risk |
| to increase the memory footprint of the application when transparent |
| hugepages are enabled. |
| .IP |
| The |
| .BR MADV_HUGEPAGE |
| and |
| .BR MADV_NOHUGEPAGE |
| operations are available only if the kernel was configured with |
| .BR CONFIG_TRANSPARENT_HUGEPAGE . |
| .TP |
| .BR MADV_NOHUGEPAGE " (since Linux 2.6.38)" |
| Ensures that memory in the address range specified by |
| .IR addr |
| and |
| .IR length |
| will not be backed by transparent hugepages. |
| .TP |
| .BR MADV_DONTDUMP " (since Linux 3.4)" |
| .\" commit 909af768e88867016f427264ae39d27a57b6a8ed |
| .\" commit accb61fe7bb0f5c2a4102239e4981650f9048519 |
| Exclude from a core dump those pages in the range specified by |
| .I addr |
| and |
| .IR length . |
| This is useful in applications that have large areas of memory |
| that are known not to be useful in a core dump. |
| The effect of |
| .BR MADV_DONTDUMP |
| takes precedence over the bit mask that is set via the |
| .I /proc/[pid]/coredump_filter |
| file (see |
| .BR core (5)). |
| .TP |
| .BR MADV_DODUMP " (since Linux 3.4)" |
| Undo the effect of an earlier |
| .BR MADV_DONTDUMP . |
| .TP |
| .BR MADV_FREE " (since Linux 4.5)" |
| The application no longer requires the pages in the range specified by |
| .IR addr |
| and |
| .IR len . |
| The kernel can thus free these pages, |
| but the freeing could be delayed until memory pressure occurs. |
| For each of the pages that has been marked to be freed |
| but has not yet been freed, |
| the free operation will be canceled if the caller writes into the page. |
| After a successful |
| .B MADV_FREE |
| operation, any stale data (i.e., dirty, unwritten pages) will be lost |
| when the kernel frees the pages. |
| However, subsequent writes to pages in the range will succeed |
| and then kernel cannot free those dirtied pages, |
| so that the caller can always see just written data. |
| If there is no subsequent write, |
| the kernel can free the pages at any time. |
| Once pages in the range have been freed, the caller will |
| see zero-fill-on-demand pages upon subsequent page references. |
| .IP |
| The |
| .B MADV_FREE |
| operation |
| can be applied only to private anonymous pages (see |
| .BR mmap (2)). |
| In Linux before version 4.12, |
| .\" commit 93e06c7a645343d222c9a838834a51042eebbbf7 |
| when freeing pages on a swapless system, |
| the pages in the given range are freed instantly, |
| regardless of memory pressure. |
| .TP |
| .BR MADV_WIPEONFORK " (since Linux 4.14)" |
| .\" commit d2cd9ede6e193dd7d88b6d27399e96229a551b19 |
| Present the child process with zero-filled memory in this range after a |
| .BR fork (2). |
| This is useful in forking servers in order to ensure |
| that sensitive per-process data |
| (for example, PRNG seeds, cryptographic secrets, and so on) |
| is not handed to child processes. |
| .IP |
| The |
| .B MADV_WIPEONFORK |
| operation can be applied only to private anonymous pages (see |
| .BR mmap (2)). |
| .IP |
| Within the child created by |
| .BR fork (2), |
| the |
| .B MADV_WIPEONFORK |
| setting remains in place on the specified address range. |
| This setting is cleared during |
| .BR execve (2). |
| .TP |
| .BR MADV_KEEPONFORK " (since Linux 4.14)" |
| .\" commit d2cd9ede6e193dd7d88b6d27399e96229a551b19 |
| Undo the effect of an earlier |
| .BR MADV_WIPEONFORK . |
| .TP |
| .BR MADV_COLD " (since Linux 5.4)" |
| .\" commit 9c276cc65a58faf98be8e56962745ec99ab87636 |
| Deactivate a given range of pages. |
| This will make the pages a more probable |
| reclaim target should there be a memory pressure. |
| This is a nondestructive operation. |
| The advice might be ignored for some pages in the range when it is not |
| applicable. |
| .TP |
| .BR MADV_PAGEOUT " (since Linux 5.4)" |
| .\" commit 1a4e58cce84ee88129d5d49c064bd2852b481357 |
| Reclaim a given range of pages. |
| This is done to free up memory occupied by these pages. |
| If a page is anonymous, it will be swapped out. |
| If a page is file-backed and dirty, it will be written back to the backing |
| storage. |
| The advice might be ignored for some pages in the range when it is not |
| applicable. |
| .SH RETURN VALUE |
| On success, |
| .BR madvise () |
| returns zero. |
| On error, it returns \-1 and |
| .I errno |
| is set to indicate the error. |
| .SH ERRORS |
| .TP |
| .B EACCES |
| .I advice |
| is |
| .BR MADV_REMOVE , |
| but the specified address range is not a shared writable mapping. |
| .TP |
| .B EAGAIN |
| A kernel resource was temporarily unavailable. |
| .TP |
| .B EBADF |
| The map exists, but the area maps something that isn't a file. |
| .TP |
| .B EINVAL |
| .I addr |
| is not page-aligned or |
| .I length |
| is negative. |
| .\" .I length |
| .\" is zero, |
| .TP |
| .B EINVAL |
| .I advice |
| is not a valid. |
| .TP |
| .B EINVAL |
| .I advice |
| is |
| .B MADV_DONTNEED |
| or |
| .BR MADV_REMOVE |
| and the specified address range includes locked, Huge TLB pages, or |
| .B VM_PFNMAP |
| pages. |
| .TP |
| .B EINVAL |
| .I advice |
| is |
| .BR MADV_MERGEABLE |
| or |
| .BR MADV_UNMERGEABLE , |
| but the kernel was not configured with |
| .BR CONFIG_KSM . |
| .TP |
| .B EINVAL |
| .I advice |
| is |
| .BR MADV_FREE |
| or |
| .BR MADV_WIPEONFORK |
| but the specified address range includes file, Huge TLB, |
| .BR MAP_SHARED , |
| or |
| .BR VM_PFNMAP |
| ranges. |
| .TP |
| .B EIO |
| (for |
| .BR MADV_WILLNEED ) |
| Paging in this area would exceed the process's |
| maximum resident set size. |
| .TP |
| .B ENOMEM |
| (for |
| .BR MADV_WILLNEED ) |
| Not enough memory: paging in failed. |
| .TP |
| .B ENOMEM |
| Addresses in the specified range are not currently |
| mapped, or are outside the address space of the process. |
| .TP |
| .B EPERM |
| .I advice |
| is |
| .BR MADV_HWPOISON , |
| but the caller does not have the |
| .B CAP_SYS_ADMIN |
| capability. |
| .SH VERSIONS |
| Since Linux 3.18, |
| .\" commit d3ac21cacc24790eb45d735769f35753f5b56ceb |
| support for this system call is optional, |
| depending on the setting of the |
| .B CONFIG_ADVISE_SYSCALLS |
| configuration option. |
| .SH CONFORMING TO |
| .BR madvise () |
| is not specified by any standards. |
| Versions of this system call, implementing a wide variety of |
| .I advice |
| values, exist on many other implementations. |
| Other implementations typically implement at least the flags listed |
| above under |
| .IR "Conventional advice flags" , |
| albeit with some variation in semantics. |
| .PP |
| POSIX.1-2001 describes |
| .BR posix_madvise (3) |
| with constants |
| .BR POSIX_MADV_NORMAL , |
| .BR POSIX_MADV_RANDOM , |
| .BR POSIX_MADV_SEQUENTIAL , |
| .BR POSIX_MADV_WILLNEED , |
| and |
| .BR POSIX_MADV_DONTNEED , |
| and so on, with behavior close to the similarly named flags listed above. |
| .SH NOTES |
| .SS Linux notes |
| The Linux implementation requires that the address |
| .I addr |
| be page-aligned, and allows |
| .I length |
| to be zero. |
| If there are some parts of the specified address range |
| that are not mapped, the Linux version of |
| .BR madvise () |
| ignores them and applies the call to the rest (but returns |
| .B ENOMEM |
| from the system call, as it should). |
| .\" .SH HISTORY |
| .\" The |
| .\" .BR madvise () |
| .\" function first appeared in 4.4BSD. |
| .SH SEE ALSO |
| .BR getrlimit (2), |
| .BR mincore (2), |
| .BR mmap (2), |
| .BR mprotect (2), |
| .BR msync (2), |
| .BR munmap (2), |
| .BR prctl (2), |
| .BR process_madvise (2), |
| .BR posix_madvise (3), |
| .BR core (5) |