From 710471ac9447ddc698ed9466b7aa7b60ad4805d2 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 24 Jan 2017 15:18:32 -0800
Subject: [PATCH] mm, page_alloc: fix check for NULL preferred_zone

commit ea57485af8f4221312a5a95d63c382b45e7840dc upstream.
Patch series "fix premature OOM regression in 4.7+ due to cpuset races".

This is v2 of my attempt to fix the recent report based on the LTP cpuset
stress test [1]. The intention is to go to stable 4.9 LTSS with this, as
triggering repeated OOMs is not nice. That's why the patches try not to
be too intrusive.

Unfortunately, while investigating I found that modifying the testcase to
use per-VMA policies instead of per-task policies brings the OOMs back,
but that seems to be a much older and harder-to-fix problem. I have
posted an RFC [2], but I believe that fixing the recent regressions has
a higher priority.

Longer-term we might try to think about how to fix the cpuset mess in a
better and less error-prone way. I was, for example, very surprised to
learn that cpuset updates change not only task->mems_allowed, but also
the nodemask of mempolicies. Until now I expected the nodemask parameter
to alloc_pages_nodemask() to be stable. I wonder why we then treat
cpusets specially in get_page_from_freelist() and distinguish HARDWALL
etc., when there's an unconditional intersection between mempolicy and
cpuset. I would expect the nodemask adjustment to save overhead in
g_p_f(), but that clearly doesn't happen in the current form. So we have
both crazy complexity and overhead, AFAICS.

[1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
[2] https://lkml.kernel.org/r/7c459f26-13a6-a817-e508-b65b903a8378@suse.cz

This patch (of 4):

Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first
zone in a zonelist twice") we have had a wrong check for NULL
preferred_zone, which can theoretically happen due to concurrent cpuset
modification. We check the zoneref pointer, which is never NULL, when we
should check the zone pointer. Also document this in the
first_zones_zonelist() comment, per Michal Hocko.

Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20170120103843.24587-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7f2ae99e5daf..3a73b3b73dcc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -998,12 +998,16 @@ static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
  * @zonelist - The zonelist to search for a suitable zone
  * @highest_zoneidx - The zone index of the highest zone to return
  * @nodes - An optional nodemask to filter the zonelist with
- * @zone - The first suitable zone found is returned via this parameter
+ * @return - Zoneref pointer for the first suitable zone found (see below)
  *
  * This function returns the first zone at or below a given zone index that is
  * within the allowed nodemask. The zoneref returned is a cursor that can be
  * used to iterate the zonelist with next_zones_zonelist by advancing it by
  * one before calling.
+ *
+ * When no eligible zone is found, zoneref->zone is NULL (zoneref itself is
+ * never NULL). This may happen either genuinely, or due to concurrent nodemask
+ * update due to cpuset modification.
  */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 212a017a2f02..6fa120e8e9a1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3687,7 +3687,7 @@ retry_cpuset:
 	 */
 	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
 					ac.high_zoneidx, ac.nodemask);
-	if (!ac.preferred_zoneref) {
+	if (!ac.preferred_zoneref->zone) {
 		page = NULL;
 		goto no_zone;
 	}
--
2.10.1