mm/vmscan.c: prevent useless kswapd loops
In production we have noticed hard lockups on large machines running
large jobs, caused by kswapd hoarding the LRU lock within
isolate_lru_pages() when sc->reclaim_idx is 0, i.e. a small zone. The
LRU was a couple hundred GiBs and the condition
(page_zonenum(page) > sc->reclaim_idx) in isolate_lru_pages() was
skipping GiBs of pages while holding the LRU spinlock with interrupts
disabled.
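For reference, a simplified sketch of that scan (not the exact kernel
code; variable and helper names follow mm/vmscan.c conventions):

    /*
     * Sketch of the scan in isolate_lru_pages(): pages belonging to
     * zones above sc->reclaim_idx are skipped one at a time, all while
     * the LRU spinlock is held with IRQs disabled.
     */
    for (scan = 0; scan < nr_to_scan && !list_empty(src); ) {
            struct page *page = lru_to_page(src);

            if (page_zonenum(page) > sc->reclaim_idx) {
                    /*
                     * Skipped pages are just moved aside; with
                     * reclaim_idx == 0 and a huge LRU this can walk
                     * GiBs of pages without releasing the lock.
                     */
                    list_move(&page->lru, &pages_skipped);
                    continue;
            }

            scan++;
            /* ... try to isolate the page ... */
    }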
On further inspection, there seem to be two issues:
(1) If kswapd could not sleep on return from balance_pgdat() (i.e. the
node is still unbalanced), classzone_idx is unintentionally set to 0
and the whole next reclaim cycle of kswapd will try to reclaim only
the lowest and smallest zone while traversing the node's entire memory.
(2) Fundamentally, isolate_lru_pages() behaves very badly when an
allocation has woken kswapd for a small zone on a very large machine
running very large jobs: it can hoard the LRU spinlock while skipping
over hundreds of GiBs of pages.
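To illustrate (1), a simplified sketch of the pre-fix logic, assuming
the max()-based helper added by commit e716f2eb24de (details may
differ from the actual code):

    /* Pre-fix helper (sketch): */
    static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
                                               enum zone_type classzone_idx)
    {
            return max(pgdat->kswapd_classzone_idx, classzone_idx);
    }

    /*
     * In kswapd(), after a failed attempt to sleep the index is
     * re-read with a default of 0; once pgdat->kswapd_classzone_idx
     * has been reset, max() yields 0 and the next balance_pgdat()
     * cycle reclaims only the lowest zone.
     */
    classzone_idx = kswapd_classzone_idx(pgdat, 0);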
This patch only fixes (1); (2) needs a more fundamental solution. To
fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
invalid, use the classzone_idx of the previous kswapd loop; otherwise
use the one the waker has requested.
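A minimal sketch of the intended fix, assuming MAX_NR_ZONES serves as
the "invalid" sentinel for pgdat->kswapd_classzone_idx (the actual
patch may differ in detail):

    static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
                                               enum zone_type prev_classzone_idx)
    {
            /*
             * Invalid means no waker has requested a zone since the
             * last wakeup: fall back to the previous loop's classzone
             * index instead of silently defaulting to 0.
             */
            if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
                    return prev_classzone_idx;

            return pgdat->kswapd_classzone_idx;
    }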
Fixes: e716f2eb24de ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx")
Signed-off-by: Shakeel Butt <firstname.lastname@example.org>
Reviewed-by: Yang Shi <email@example.com>
Acked-by: Mel Gorman <firstname.lastname@example.org>
Cc: Johannes Weiner <email@example.com>
Cc: Michal Hocko <firstname.lastname@example.org>
Cc: Vlastimil Babka <email@example.com>
Cc: Hillf Danton <firstname.lastname@example.org>
Cc: Roman Gushchin <email@example.com>
Signed-off-by: Andrew Morton <firstname.lastname@example.org>
Signed-off-by: Linus Torvalds <email@example.com>