mm: add knob to tune lazyfreeing

MADV_FREEed page's hotness is very arguble.
Someone think it's hot while others are it's cold.

Quote from Shaohua
"
My main concern is the policy how we should treat the FREE pages. Moving it to
inactive lru is definitionly a good start, I'm wondering if it's enough. The
MADV_FREE increases memory pressure and cause unnecessary reclaim because of
the lazy memory free. While MADV_FREE is intended to be a better replacement of
MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
memory immediately. So I hope the MADV_FREE doesn't have impact on memory
pressure too. I'm thinking of adding an extra lru list and wartermark for this
to make sure FREE pages can be freed before system wide page reclaim. As you
said, this is arguable, but I hope we can discuss about this issue more.
"

Quote from me
"
It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
but it's not true because the page would be dirty state when VM want to reclaim.

I'm also against with your's suggestion which let's discard FREEed page before
system wide page reclaim because system would have lots of clean cold page
caches or anonymous pages. In such case, reclaiming of them would be better.
Yeb, it's really workload-dependent so we might need some heuristic which is
normally what we want to avoid.

Having said that, I agree with you we could do better than the deactivation
and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
"ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
When the MADV_FREE is called, we could move hinted pages from anon-LRU to
ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
it could promote it to acive-anon-LRU which would be very natural aging
concept because it mean someone touches the page recenlty.
With that, I don't want to bias one side and don't want to add some knob for
tuning the heuristic but let's rely on common fair aging scheme of VM.
"

Quote from Johannes
"
thread 1:
Even if we're wrong about the aging of those MADV_FREE pages, their
contents are invalidated; they can be discarded freely, and restoring
them is a mere GFP_ZERO allocation. All other anonymous pages have to
be written to disk, and potentially be read back.

[ Arguably, MADV_FREE pages should even be reclaimed before inactive
  page cache. It's the same cost to discard both types of pages, but
  restoring page cache involves IO. ]

It probably makes sense to stop thinking about them as anonymous pages
entirely at this point when it comes to aging. They're really not. The
LRU lists are split to differentiate access patterns and cost of page
stealing (and restoring). From that angle, MADV_FREE pages really have
nothing in common with in-use anonymous pages, and so they shouldn't
be on the same LRU list.

thread:2
What about them is hot? They contain garbage, you have to write to
them before you can use them. Granted, you might have to refetch
cachelines if you don't do cacheline-aligned populating writes, but
you can do a lot of them before it's more expensive than doing IO.

"

Quote from Daniel
"
thread:1
Keep in mind that this is memory the kernel wouldn't be getting back at
all if the allocator wasn't going out of the way to purge it, and they
aren't going to go out of their way to purge it if it means the kernel
is going to steal the pages when there isn't actually memory pressure.

An allocator would be using MADV_DONTNEED if it didn't expect that the
pages were going to be used against shortly. MADV_FREE indicates that it
has time to inform the kernel that they're unused but they could still
be very hot.

thread:2
It's hot because applications churn through memory via the allocator.

Drop the pages and the application is now churning through page faults
and zeroing rather than simply reusing memory. It's not something that
may happen, it *will* happen. A page in the page cache *may* be reused,
but often won't be, especially when the I/O patterns don't line up well
with the way it works.

The whole point of the feature is not requiring the allocator to have
elaborate mechanisms for aging pages and throttling purging. That ends
up resulting in lots of memory held by userspace where the kernel can't
reclaim it under memory pressure. If it's dropped before page cache, it
isn't going to be able to replace any of that logic in allocators.

The page cache is speculative. Page caching by allocators is not really
speculative. Using MADV_FREE on the pages at all is speculative. The
memory is probably going to be reused fairly soon (unless the process
exits, and then it doesn't matter), but purging will end up reducing
memory usage for the portions that aren't.

It would be a different story for a full unpinning/pinning feature since
that would have other use cases (speculative caches), but this is really
only useful in allocators.
"
You could read all thread from https://lkml.org/lkml/2015/11/4/51

Yeah, with arguble issue and there is no one decision, I think it
means we should provide the knob "lazyfreeness"(I hope someone
give better naming).

It's similar to swapppiness so higher values will discard MADV_FREE
pages agreessively.
In this implementation, lazyfreeness starts from 100 which is max
of swappiness(ie, if you see 40 of /proc/sys/vm/lazyfreeness,
it means 100 + 40). Therefore, lazyfree LRU list always has more
reclaiming pressure than anonymous LRU but same with filecache
pressure when we consider default values(ie, 60 swappiness,
40 lazyfreeness). If user want to reclaim lazyfree LRU first to
prevent other LRU reclaiming, he could set lazyfreeness to above 80
which is double of default lazyfreeness.

There is one exception case. If system has low free memory and
file cache, it start to discard MADV_FREEed pages unconditionally
even though user set lazyfreeness to 0. It's same logic with
swappiness to prevent thrashing.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Minchan Kim <minchan@kernel.org>
10 files changed