| From 7810e6781e0fcbca78b91cf65053f895bf59e85f Mon Sep 17 00:00:00 2001 |
| From: Vlastimil Babka <vbabka@suse.cz> |
| Date: Thu, 7 Jun 2018 17:09:29 -0700 |
| Subject: mm, page_alloc: do not break __GFP_THISNODE by zonelist reset |
| |
| From: Vlastimil Babka <vbabka@suse.cz> |
| |
| commit 7810e6781e0fcbca78b91cf65053f895bf59e85f upstream. |
| |
| In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for |
| allocations that can ignore memory policies. The zonelist is obtained |
| from current CPU's node. This is a problem for __GFP_THISNODE |
| allocations that want to allocate on a different node, e.g. because the |
| allocating thread has been migrated to a different CPU. |
| |
| This has been observed to break SLAB in our 4.4-based kernel, because |
| there it relies on __GFP_THISNODE working as intended. If a slab page |
| is put on wrong node's list, then further list manipulations may corrupt |
| the list because page_to_nid() is used to determine which node's |
| list_lock should be locked and thus we may take a wrong lock and race. |
| |
| Current SLAB implementation seems to be immune by luck thanks to commit |
| 511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on |
| arbitrary node") but there may be others assuming that __GFP_THISNODE |
| works as promised. |
| |
| We can fix it by simply removing the zonelist reset completely. There |
| is actually no reason to reset it, because memory policies and cpusets |
| don't affect the zonelist choice in the first place. This was different |
| when commit 183f6371aac2 ("mm: ignore mempolicies when using |
| ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their |
| own restricted zonelists. |
| |
| We might consider this for 4.17 although I don't know if there's |
| anything currently broken. |
| |
| SLAB is currently not affected, but in kernels older than 4.7 that don't |
| yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page |
| allocated on arbitrary node") it is. That's at least 4.4 LTS. Older |
| ones I'll have to check. |
| |
| So stable backports should be more important, but will have to be |
| reviewed carefully, as the code went through many changes. BTW I think |
| that also the ac->preferred_zoneref reset is currently useless if we |
| don't also reset ac->nodemask from a mempolicy to NULL first (which we |
| probably should for the OOM victims etc?), but I would leave that for a |
| separate patch. |
| |
| Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz |
| Signed-off-by: Vlastimil Babka <vbabka@suse.cz> |
| Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK") |
| Acked-by: Mel Gorman <mgorman@techsingularity.net> |
| Cc: Michal Hocko <mhocko@kernel.org> |
| Cc: David Rientjes <rientjes@google.com> |
| Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> |
| Cc: Vlastimil Babka <vbabka@suse.cz> |
| Cc: <stable@vger.kernel.org> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| |
| --- |
| mm/page_alloc.c | 2 -- |
| 1 file changed, 2 deletions(-) |
| |
| --- a/mm/page_alloc.c |
| +++ b/mm/page_alloc.c |
| @@ -3109,8 +3109,6 @@ retry: |
| * the allocation is high priority and these type of |
| * allocations are system rather than user orientated |
| */ |
| - ac->zonelist = node_zonelist(numa_node_id(), gfp_mask); |
| - |
| page = __alloc_pages_high_priority(gfp_mask, order, ac); |
| |
| if (page) { |