mm/vmscan.c: prevent useless kswapd loops

In production we have noticed hard lockups on large machines running
large jobs due to kswaps hoarding lru lock within isolate_lru_pages when
sc->reclaim_idx is 0 which is a small zone.  The lru was couple hundred
GiBs and the condition (page_zonenum(page) > sc->reclaim_idx) in
isolate_lru_pages() was basically skipping GiBs of pages while holding
the LRU spinlock with interrupt disabled.

On further inspection, it seems like there are two issues:

(1) If kswapd on the return from balance_pgdat() could not sleep (i.e.
    node is still unbalanced), the classzone_idx is unintentionally set
    to 0 and the whole reclaim cycle of kswapd will try to reclaim only
    the lowest and smallest zone while traversing the whole memory.

(2) Fundamentally isolate_lru_pages() is really bad when the
    allocation has woken kswapd for a smaller zone on a very large machine
    running very large jobs.  It can hoard the LRU spinlock while skipping
    over 100s of GiBs of pages.

This patch only fixes (1).  (2) needs a more fundamental solution.  To
fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
invalid use the classzone_idx of the previous kswapd loop otherwise use
the one the waker has requested.

Link: http://lkml.kernel.org/r/20190701201847.251028-1-shakeelb@google.com
Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 file changed