| |
| Problem: |
| When Linux needs to allocate memory it may find that there is |
| insufficient free memory so it needs to reclaim space that is in |
| use but not needed at the moment. There are several options: |
| |
| 1/ Shrink a kernel cache such as the inode or dentry cache. This |
| is fairly easy but provides limited returns. |
| 2/ Discard 'clean' pages from the page cache. This is easy, and |
| works well as long as there are clean pages in the page cache. |
| Similarly clean 'anonymous' pages can be discarded - if there |
| are any. |
| 3/ Write out some dirty page-cache pages so that they become clean. |
| The VM limits the number of dirty page-cache pages to e.g. 40% |
| of available memory so that (among other reasons) a "sync" will |
| not take excessively long. So there should never be excessive |
| amounts of dirty pagecache. |
| Writing out dirty page-cache pages involves work by the |
| filesystem which may need to allocate memory itself. To avoid |
| deadlock, filesystems use GFP_NOFS when allocating memory on the |
| write-out path. When this is used, cleaning dirty page-cache |
| pages is not an option so if the filesystem finds that memory |
| is tight, another option must be found. |
| 4/ Write out dirty anonymous pages to the "Swap" partition/file. |
| This is the most interesting for a couple of reasons. |
| a/ Unlike dirty page-cache pages, there is no need to write anon |
| pages out unless we are actually short of memory. Thus they |
| tend to be left to last. |
| b/ Anon pages tend to be updated randomly and unpredictably, and |
| flushing them out of memory can have a very significant |
| performance impact on the process using them. This contrasts |
| with page-cache pages which are often written sequentially |
| and often treated as "write-once, read-many". |
| So anon pages tend to be left until last to be cleaned, and may |
| be the only cleanable pages while there are still some dirty |
| page-cache pages (which are waiting on a GFP_NOFS allocation). |
| |
| [I don't find the above wholly satisfying. There seems to be too much |
| hand-waving. If someone can provide better text explaining why |
| swapout is a special case, that would be great.] |
| |
| So we need to be able to write to the swap file/partition without |
| needing to allocate any memory ... or only a small well controlled |
| amount. |
| |
| The VM reserves a small amount of memory that can only be allocated |
| for use as part of the swap-out procedure. It is only available to |
| processes with the PF_MEMALLOC flag set, which is typically just the |
| memory cleaner. |
| |
| Traditionally swap-out is performed directly to block devices (swap |
| files on block-device filesystems are supported by examining the |
| mapping from file offset to device offset in advance, and then using |
| the device offsets to write directly to the device). Block devices |
| are (required to be) written to pre-allocate any memory that might be |
| needed during write-out, and to block when the pre-allocated memory is |
| exhausted and no other memory is available. They can be sure not to |
| block forever as the pre-allocated memory will be returned as soon as |
| the data it is being used for has been written out. The primary |
| mechanism for pre-allocating memory is called "mempools". |
| |
| This approach does not work for writing anonymous pages |
| (i.e. swapping) over a network, using e.g NFS or NBD or iSCSI. |
| |
| |
| The main reason that it does not work is that when data from an anon |
| page is written to the network, we must wait for a reply to confirm |
| the data is safe. Receiving that reply will consume memory and, |
| significantly, we need to allocate memory to an incoming packet before |
| we can tell if it is the reply we are waiting for or not. |
| |
| The secondary reason is that the network code is not written to use |
| mempools and in most cases does not need to use them. Changing all |
| allocations in the networking layer to use mempools would be quite |
| intrusive, and would waste memory, and probably cause a slow-down in |
| the common case of not swapping over the network. |
| |
| These problems are addressed by enhancing the system of memory |
| reserves used by PF_MEMALLOC and requiring any in-kernel networking |
| client that is used for swap-out to indicate which sockets are used |
| for swapout so they can be handled specially in low memory situations. |
| |
| There are several major parts to this enhancement: |
| |
| 1/ page->reserve, GFP_MEMALLOC |
| |
| To handle low memory conditions we need to know when those |
| conditions exist. Having a global "low on memory" flag seems easy, |
| but its implementation is problematic. Instead we make it possible |
| to tell if a recent memory allocation required use of the emergency |
| memory pool. |
| For pages returned by alloc_page, the new page->reserve flag |
| can be tested. If this is set, then a low memory condition was |
| current when the page was allocated, so the memory should be used |
| carefully. (Because low memory conditions are transient, this |
| state is kept in an overloaded member instead of in page flags, which |
| would suggest a more permanent state.) |
| |
| For memory allocated using slab/slub: If a page that is added to a |
| kmem_cache is found to have page->reserve set, then a s->reserve |
| flag is set for the whole kmem_cache. Further allocations will only |
| be returned from that page (or any other page in the cache) if they |
| are emergency allocation (i.e. PF_MEMALLOC or GFP_MEMALLOC is set). |
| Non-emergency allocations will block in alloc_page until a |
| non-reserve page is available. Once a non-reserve page has been |
| added to the cache, the s->reserve flag on the cache is removed. |
| |
| Because slab objects have no individual state its hard to pass |
| reserve state along, the current code relies on a regular alloc |
| failing. There are various allocation wrappers help here. |
| |
| This allows us to |
| a/ request use of the emergency pool when allocating memory |
| (GFP_MEMALLOC), and |
| b/ to find out if the emergency pool was used. |
| |
| 2/ SK_MEMALLOC, sk_buff->emergency. |
| |
| When memory from the reserve is used to store incoming network |
| packets, the memory must be freed (and the packet dropped) as soon |
| as we find out that the packet is not for a socket that is used for |
| swap-out. |
| To achieve this we have an ->emergency flag for skbs, and an |
| SK_MEMALLOC flag for sockets. |
| When memory is allocated for an skb, it is allocated with |
| GFP_MEMALLOC (if we are currently swapping over the network at |
| all). If a subsequent test shows that the emergency pool was used, |
| ->emergency is set. |
| When the skb is finally attached to its destination socket, the |
| SK_MEMALLOC flag on the socket is tested. If the skb has |
| ->emergency set, but the socket does not have SK_MEMALLOC set, then |
| the skb is immediately freed and the packet is dropped. |
| This ensures that reserve memory is never queued on a socket that is |
| not used for swapout. |
| |
| Similarly, if an skb is ever queued for delivery to user-space for |
| example by netfilter, the ->emergency flag is tested and the skb is |
| released if ->emergency is set. (so obviously the storage route may |
| not pass through a userspace helper, otherwise the packets will never |
| arrive and we'll deadlock) |
| |
| This ensures that memory from the emergency reserve can be used to |
| allow swapout to proceed, but will not get caught up in any other |
| network queue. |
| |
| |
| 3/ pages_emergency |
| |
| The above would be sufficient if the total memory below the lowest |
| memory watermark (i.e the size of the emergency reserve) were known |
| to be enough to hold all transient allocations needed for writeout. |
| I'm a little blurry on how big the current emergency pool is, but it |
| isn't big and certainly hasn't been sized to allow network traffic |
| to consume any. |
| |
| We could simply make the size of the reserve bigger. However in the |
| common case that we are not swapping over the network, that would be |
| a waste of memory. |
| |
| So a new "watermark" is defined: pages_emergency. This is |
| effectively added to the current low water marks, so that pages from |
| this emergency pool can only be allocated if one of PF_MEMALLOC or |
| GFP_MEMALLOC are set. |
| |
| pages_emergency can be changed dynamically based on need. When |
| swapout over the network is required, pages_emergency is increased |
| to cover the maximum expected load. When network swapout is |
| disabled, pages_emergency is decreased. |
| |
| To determine how much to increase it by, we introduce reservation |
| groups.... |
| |
| 3a/ reservation groups |
| |
| The memory used transiently for swapout can be in a number of |
| different places. e.g. the network route cache, the network |
| fragment cache, in transit between network card and socket, or (in |
| the case of NFS) in sunrpc data structures awaiting a reply. |
| We need to ensure each of these is limited in the amount of memory |
| they use, and that the maximum is included in the reserve. |
| |
| The memory required by the network layer only needs to be reserved |
| once, even if there are multiple swapout paths using the network |
| (e.g. NFS and NDB and iSCSI, though using all three for swapout at |
| the same time would be unusual). |
| |
| So we create a tree of reservation groups. The network might |
| register a collection of reservations, but not mark them as being in |
| use. NFS and sunrpc might similarly register a collection of |
| reservations, and attach it to the network reservations as it |
| depends on them. |
| When swapout over NFS is requested, the NFS/sunrpc reservations are |
| activated which implicitly activates the network reservations. |
| |
| The total new reservation is added to pages_emergency. |
| |
| Provided each memory usage stays beneath the registered limit (at |
| least when allocating memory from reserves), the system will never |
| run out of emergency memory, and swapout will not deadlock. |
| |
| It is worth noting here that it is not critical that each usage |
| stays beneath the limit 100% of the time. Occasional excess is |
| acceptable provided that the memory will be freed again within a |
| short amount of time that does *not* require waiting for any event |
| that itself might require memory. |
| This is because, at all stages of transmit and receive, it is |
| acceptable to discard all transient memory associated with a |
| particular writeout and try again later. On transmit, the page can |
| be re-queued for later transmission. On receive, the packet can be |
| dropped assuming that the peer will resend after a timeout. |
| |
| Thus allocations that are truly transient and will be freed without |
| blocking do not strictly need to be reserved for. Doing so might |
| still be a good idea to ensure forward progress doesn't take too |
| long. |
| |
| 4/ low-mem accounting |
| |
| Most places that might hold on to emergency memory (e.g. route |
| cache, fragment cache etc) already place a limit on the amount of |
| memory that they can use. This limit can simply be reserved using |
| the above mechanism and no more needs to be done. |
| |
| However some memory usage might not be accounted with sufficient |
| firmness to allow an appropriate emergency reservation. The |
| in-flight skbs for incoming packets is one such example. |
| |
| To support this, a low-overhead mechanism for accounting memory |
| usage against the reserves is provided. This mechanism uses the |
| same data structure that is used to store the emergency memory |
| reservations through the addition of a 'usage' field. |
| |
| Before we attempt allocation from the memory reserves, we much check |
| if the resulting 'usage' is below the reservation. If so, we increase |
| the usage and attempt the allocation (which should succeed). If |
| the projected 'usage' exceeds the reservation we'll either fail the |
| allocation, or wait for 'usage' to decrease enough so that it would |
| succeed, depending on __GFP_WAIT. |
| |
| When memory that was allocated for that purpose is freed, the |
| 'usage' field is checked again. If it is non-zero, then the size of |
| the freed memory is subtracted from the usage, making sure the usage |
| never becomes less than zero. |
| |
| This provides adequate accounting with minimal overheads when not in |
| a low memory condition. When a low memory condition is encountered |
| it does add the cost of a spin lock necessary to serialise updates |
| to 'usage'. |
| |
| |
| |
| 5/ swapon/swapoff/swap_out/swap_in |
| |
| So that a filesystem (e.g. NFS) can know when to set SK_MEMALLOC on |
| any network socket that it uses, and can know when to account |
| reserve memory carefully, new address_space_operations are |
| available. |
| "swapon" requests that an address space (i.e a file) be make ready |
| for swapout. swap_out and swap_in request the actual IO. They |
| together must ensure that each swap_out request can succeed without |
| allocating more emergency memory that was reserved by swapon. swapoff |
| is used to reverse the state changes caused by swapon when we disable |
| the swap file. |
| |
| |
| Thanks for reading this far. I hope it made sense :-) |
| |
| Neil Brown (with updates from Peter Zijlstra) |
| |
| |