releases/5.16.20/mm-fix-race-between-madv_free-reclaim-and-blkdev-dir.patch - pub/scm/linux/kernel/git/stable/stable-queue - Git at Google

 From 45d9f326d09c629b841938f200d2b77cf5948a23 Mon Sep 17 00:00:00 2001
 From: Sasha Levin <sashal@kernel.org>
 Date: Thu, 7 Apr 2022 16:14:26 -0300
 Subject: mm: fix race between MADV_FREE reclaim and blkdev direct IO read

 From: Mauricio Faria de Oliveira <mfo@canonical.com>

 commit 6c8e2a256915a223f6289f651d6b926cd7135c9e upstream.

 Problem:
 =======

 Userspace might read the zero-page instead of actual data from a direct IO
 read on a block device if the buffers have been called madvise(MADV_FREE)
 on earlier (this is discussed below) due to a race between page reclaim on
 MADV_FREE and blkdev direct IO read.

 - Race condition:
   ==============

 During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
 if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
 if the page is dirty).

 However, after try_to_unmap_one() returns to shrink_page_list(), it might
 keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
 _one_ page reference, from the isolation for page reclaim).

 Well, blkdev_direct_IO() gets references for all pages, and on READ
 operations it only sets them dirty _later_.

 So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
 IO read from block devices, and page reclaim happens during
 __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
 returns, but BEFORE the pages are set dirty, the situation happens.

 The direct IO read eventually completes.  Now, when userspace reads the
 buffers, the PTE is no longer there and the page fault handler
 do_anonymous_page() services that with the zero-page, NOT the data!

 A synthetic reproducer is provided.

 - Page faults:
   ===========

 If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
 happen, because that faults-in all pages as writeable, so
 do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
 direct IO.  The userspace reads don't fault as the PTE is there (thus
 zero-page is not used/setup).

 But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
 is no longer there; the subsequent page faults can't help:

 The data-read from the block device probably won't generate faults due to
 DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
 different virtual addresses (not user-mapped addresses) because `struct
 bio_vec` stores `struct page` to figure addresses out (which are different
 from user-mapped addresses) for the read.

 Thus userspace reads (to user-mapped addresses) still fault, then
 do_anonymous_page() gets another `struct page` that would address/ map to
 other memory than the `struct page` used by `struct bio_vec` for the read.
 (The original `struct page` is not available, since it wasn't freed, as
 page_ref_freeze() failed due to more page refs.  And even if it were
 available, its data cannot be trusted anymore.)

 Solution:
 ========

 One solution is to check for the expected page reference count in
 try_to_unmap_one().

 There should be one reference from the isolation (that is also checked in
 shrink_page_list() with page_ref_freeze()) plus one or more references
 from page mapping(s) (put in discard: label).  Further references mean
 that rmap/PTE cannot be unmapped/nuked.

 (Note: there might be more than one reference from mapping due to
 fork()/clone() without CLONE_VM, which use the same `struct page` for
 references, until the copy-on-write page gets copied.)

 So, additional page references (e.g., from direct IO read) now prevent the
 rmap/PTE from being unmapped/dropped; similarly to the page is not freed
 per shrink_page_list()/page_ref_freeze()).

 - Races and Barriers:
   ==================

 The new check in try_to_unmap_one() should be safe in races with
 bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
 done under the PTE lock.

 The fast path doesn't take the lock, but it checks if the PTE has changed
 and if so, it drops the reference and leaves the page for the slow path
 (which does take that lock).

 The fast path requires synchronization w/ full memory barrier: it writes
 the page reference count first then it reads the PTE later, while
 try_to_unmap() writes PTE first then it reads page refcount.

 And a second barrier is needed, as the page dirty flag should not be read
 before the page reference count (as in __remove_mapping()).  (This can be
 a load memory barrier only; no writes are involved.)

 Call stack/comments:

 - try_to_unmap_one()
   - page_vma_mapped_walk()
     - map_pte()			# see pte_offset_map_lock():
         pte_offset_map()
         spin_lock()

   - ptep_get_and_clear()	# write PTE
   - smp_mb()			# (new barrier) GUP fast path
   - page_ref_count()		# (new check) read refcount

   - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
       pte_unmap()
       spin_unlock()

 - bio_iov_iter_get_pages()
   - __bio_iov_iter_get_pages()
     - iov_iter_get_pages()
       - get_user_pages_fast()
         - internal_get_user_pages_fast()

           # fast path
           - lockless_pages_from_mm()
             - gup_{pgd,p4d,pud,pmd,pte}_range()
                 ptep = pte_offset_map()		# not _lock()
                 pte = ptep_get_lockless(ptep)

                 page = pte_page(pte)
                 try_grab_compound_head(page)	# inc refcount
                                             	# (RMW/barrier
                                              	#  on success)

                 if (pte_val(pte) != pte_val(*ptep)) # read PTE
                         put_compound_head(page) # dec refcount
                         			# go slow path

           # slow path
           - __gup_longterm_unlocked()
             - get_user_pages_unlocked()
               - __get_user_pages_locked()
                 - __get_user_pages()
                   - follow_{page,p4d,pud,pmd}_mask()
                     - follow_page_pte()
                         ptep = pte_offset_map_lock()
                         pte = *ptep
                         page = vm_normal_page(pte)
                         try_grab_page(page)	# inc refcount
                         pte_unmap_unlock()

 - Huge Pages:
   ==========

 Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
 (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
 (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
 thus should reach shrink_page_list() -> split_huge_page_to_list() before
 try_to_unmap[_one](), so it deals with normal pages only.

 (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
 which should not or be rare, the page refcount should be greater than
 mapcount: the head page is referenced by tail pages.  That also prevents
 checking the head `page` then incorrectly call page_remove_rmap(subpage)
 for a tail page, that isn't even in the shrink_page_list()'s page_list (an
 effect of split huge pmd/pmvw), as it might happen today in this unlikely
 scenario.)

 MADV_FREE'd buffers:
 ===================

 So, back to the "if MADV_FREE pages are used as buffers" note.  The case
 is arguable, and subject to multiple interpretations.

 The madvise(2) manual page on the MADV_FREE advice value says:

 1) 'After a successful MADV_FREE ... data will be lost when
    the kernel frees the pages.'
 2) 'the free operation will be canceled if the caller writes
    into the page' / 'subsequent writes ... will succeed and
    then [the] kernel cannot free those dirtied pages'
 3) 'If there is no subsequent write, the kernel can free the
    pages at any time.'

 Thoughts, questions, considerations... respectively:

 1) Since the kernel didn't actually free the page (page_ref_freeze()
    failed), should the data not have been lost? (on userspace read.)
 2) Should writes performed by the direct IO read be able to cancel
    the free operation?
    - Should the direct IO read be considered as 'the caller' too,
      as it's been requested by 'the caller'?
    - Should the bio technique to dirty pages on return to userspace
      (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
      be considered in another/special way here?
 3) Should an upcoming write from a previously requested direct IO
    read be considered as a subsequent write, so the kernel should
    not free the pages? (as it's known at the time of page reclaim.)

 And lastly:

 Technically, the last point would seem a reasonable consideration and
 balance, as the madvise(2) manual page apparently (and fairly) seem to
 assume that 'writes' are memory access from the userspace process (not
 explicitly considering writes from the kernel or its corner cases; again,
 fairly)..  plus the kernel fix implementation for the corner case of the
 largely 'non-atomic write' encompassed by a direct IO read operation, is
 relatively simple; and it helps.

 Reproducer:
 ==========

 @ test.c (simplified, but works)

 	#define _GNU_SOURCE
 	#include <fcntl.h>
 	#include <stdio.h>
 	#include <unistd.h>
 	#include <sys/mman.h>

 	int main() {
 		int fd, i;
 		char *buf;

 		fd = open(DEV, O_RDONLY | O_DIRECT);

 		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                 	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

 		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
 			buf[i] = 1; // init to non-zero

 		madvise(buf, BUF_SIZE, MADV_FREE);

 		read(fd, buf, BUF_SIZE);

 		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
 			printf("%p: 0x%x\n", &buf[i], buf[i]);

 		return 0;
 	}

 @ block/fops.c (formerly fs/block_dev.c)

 	+#include <linux/swap.h>
 	...
 	... __blkdev_direct_IO[_simple](...)
 	{
 	...
 	+	if (!strcmp(current->comm, "good"))
 	+		shrink_all_memory(ULONG_MAX);
 	+
          	ret = bio_iov_iter_get_pages(...);
 	+
 	+	if (!strcmp(current->comm, "bad"))
 	+		shrink_all_memory(ULONG_MAX);
 	...
 	}

 @ shell

         # NUM_PAGES=4
         # PAGE_SIZE=$(getconf PAGE_SIZE)

         # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
         # DEV=$(losetup -f --show test.img)

         # gcc -DDEV=\"$DEV\" \
               -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
               -DPAGE_SIZE=${PAGE_SIZE} \
                test.c -o test

         # od -tx1 $DEV
         0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
         *
         0040000

         # mv test good
         # ./good
         0x7f7c10418000: 0x79
         0x7f7c10419000: 0x79
         0x7f7c1041a000: 0x79
         0x7f7c1041b000: 0x79

         # mv good bad
         # ./bad
         0x7fa1b8050000: 0x0
         0x7fa1b8051000: 0x0
         0x7fa1b8052000: 0x0
         0x7fa1b8053000: 0x0

 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
 support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
 do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].

 - v5.17-rc3:

         # for i in {1..1000}; do ./good; done \
             | cut -d: -f2 | sort | uniq -c
            4000  0x79

         # mv good bad
         # for i in {1..1000}; do ./bad; done \
             | cut -d: -f2 | sort | uniq -c
            4000  0x0

         # free | grep Swap
         Swap:             0           0           0

 - v4.5:

         # for i in {1..1000}; do ./good; done \
             | cut -d: -f2 | sort | uniq -c
            4000  0x79

         # mv good bad
         # for i in {1..1000}; do ./bad; done \
             | cut -d: -f2 | sort | uniq -c
            2702  0x0
            1298  0x79

         # swapoff -av
         swapoff /swap

         # for i in {1..1000}; do ./bad; done \
             | cut -d: -f2 | sort | uniq -c
            4000  0x79

 Ceph/TCMalloc:
 =============

 For documentation purposes, the use case driving the analysis/fix is Ceph
 on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
 release unused memory to the system from the mmap'ed page heap (might be
 committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
 -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
 TCMalloc_SystemCommit() -> do nothing.

 Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
 release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
 just 'disappeared' on Ceph on later Ubuntu releases but is still present
 in the kernel, and can be hit by other use cases.

 The observed issue seems to be the old Ceph bug #22464 [1], where checksum
 mismatches are observed (and instrumentation with buffer dumps shows
 zero-pages read from mmap'ed/MADV_FREE'd page ranges).

 The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
 mostly worked around with a retry mechanism, but other parts of Ceph could
 still hit that (rocksdb).  Anyway, it's less likely to be hit again as
 TCMalloc switched out of MADV_FREE by default.

 (Some kernel versions/reports from the Ceph bug, and relation with
 the MADV_FREE introduction/changes; TCMalloc versions not checked.)
 - 4.4 good
 - 4.5 (madv_free: introduction)
 - 4.9 bad
 - 4.10 good? maybe a swapless system
 - 4.12 (madv_free: no longer free instantly on swapless systems)
 - 4.13 bad

 [1] https://tracker.ceph.com/issues/22464

 Thanks:
 ======

 Several people contributed to analysis/discussions/tests/reproducers in
 the first stages when drilling down on ceph/tcmalloc/linux kernel:

 - Dan Hill
 - Dan Streetman
 - Dongdong Tao
 - Gavin Guo
 - Gerald Yang
 - Heitor Alves de Siqueira
 - Ioanna Alifieraki
 - Jay Vosburgh
 - Matthew Ruffell
 - Ponnuvel Palaniyappan

 Reviews, suggestions, corrections, comments:

 - Minchan Kim
 - Yu Zhao
 - Huang, Ying
 - John Hubbard
 - Christoph Hellwig

 [mfo@canonical.com: v4]
   Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com

 Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
 Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
 Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
 Cc: Minchan Kim <minchan@kernel.org>
 Cc: Yu Zhao <yuzhao@google.com>
 Cc: Yang Shi <shy828301@gmail.com>
 Cc: Miaohe Lin <linmiaohe@huawei.com>
 Cc: Dan Hill <daniel.hill@canonical.com>
 Cc: Dan Streetman <dan.streetman@canonical.com>
 Cc: Dongdong Tao <dongdong.tao@canonical.com>
 Cc: Gavin Guo <gavin.guo@canonical.com>
 Cc: Gerald Yang <gerald.yang@canonical.com>
 Cc: Heitor Alves de Siqueira <halves@canonical.com>
 Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
 Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
 Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
 Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
 Cc: <stable@vger.kernel.org>
 Cc: Christoph Hellwig <hch@infradead.org>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 [mfo: backport: replace folio/test_flag with page/flag equivalents;
  real Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)") in v4.]
 Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
 Signed-off-by: Sasha Levin <sashal@kernel.org>
 ---
  mm/rmap.c | 25 ++++++++++++++++++++++++-
  1 file changed, 24 insertions(+), 1 deletion(-)

 diff --git a/mm/rmap.c b/mm/rmap.c
 index 163ac4e6bcee..26d3d4824c3a 100644
 --- a/mm/rmap.c
 +++ b/mm/rmap.c
 @@ -1570,7 +1570,30 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,

  			/* MADV_FREE page check */
  			if (!PageSwapBacked(page)) {
 -				if (!PageDirty(page)) {
 +				int ref_count, map_count;
 +
 +				/*
 +				 * Synchronize with gup_pte_range():
 +				 * - clear PTE; barrier; read refcount
 +				 * - inc refcount; barrier; read PTE
 +				 */
 +				smp_mb();
 +
 +				ref_count = page_ref_count(page);
 +				map_count = page_mapcount(page);
 +
 +				/*
 +				 * Order reads for page refcount and dirty flag
 +				 * (see comments in __remove_mapping()).
 +				 */
 +				smp_rmb();
 +
 +				/*
 +				 * The only page refs must be one from isolation
 +				 * plus the rmap(s) (dropped by discard:).
 +				 */
 +				if (ref_count == 1 + map_count &&
 +				    !PageDirty(page)) {
  					/* Invalidate as we cleared the pte */
  					mmu_notifier_invalidate_range(mm,
  						address, address + PAGE_SIZE);
 --
 2.35.1
	From 45d9f326d09c629b841938f200d2b77cf5948a23 Mon Sep 17 00:00:00 2001
	From: Sasha Levin <sashal@kernel.org>
	Date: Thu, 7 Apr 2022 16:14:26 -0300
	Subject: mm: fix race between MADV_FREE reclaim and blkdev direct IO read

	From: Mauricio Faria de Oliveira <mfo@canonical.com>

	commit 6c8e2a256915a223f6289f651d6b926cd7135c9e upstream.

	Problem:
	=======

	Userspace might read the zero-page instead of actual data from a direct IO
	read on a block device if the buffers have been called madvise(MADV_FREE)
	on earlier (this is discussed below) due to a race between page reclaim on
	MADV_FREE and blkdev direct IO read.

	- Race condition:
	==============

	During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
	if the page is not dirty, then discards its rmap PTE(s) (vs. remap back
	if the page is dirty).

	However, after try_to_unmap_one() returns to shrink_page_list(), it might
	keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
	_one_ page reference, from the isolation for page reclaim).

	Well, blkdev_direct_IO() gets references for all pages, and on READ
	operations it only sets them dirty _later_.

	So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
	IO read from block devices, and page reclaim happens during
	__blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
	returns, but BEFORE the pages are set dirty, the situation happens.

	The direct IO read eventually completes. Now, when userspace reads the
	buffers, the PTE is no longer there and the page fault handler
	do_anonymous_page() services that with the zero-page, NOT the data!

	A synthetic reproducer is provided.

	- Page faults:
	===========

	If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
	happen, because that faults-in all pages as writeable, so
	do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
	direct IO. The userspace reads don't fault as the PTE is there (thus
	zero-page is not used/setup).

	But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
	is no longer there; the subsequent page faults can't help:

	The data-read from the block device probably won't generate faults due to
	DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
	different virtual addresses (not user-mapped addresses) because `struct
	bio_vec` stores `struct page` to figure addresses out (which are different
	from user-mapped addresses) for the read.

	Thus userspace reads (to user-mapped addresses) still fault, then
	do_anonymous_page() gets another `struct page` that would address/ map to
	other memory than the `struct page` used by `struct bio_vec` for the read.
	(The original `struct page` is not available, since it wasn't freed, as
	page_ref_freeze() failed due to more page refs. And even if it were
	available, its data cannot be trusted anymore.)

	Solution:
	========

	One solution is to check for the expected page reference count in
	try_to_unmap_one().

	There should be one reference from the isolation (that is also checked in
	shrink_page_list() with page_ref_freeze()) plus one or more references
	from page mapping(s) (put in discard: label). Further references mean
	that rmap/PTE cannot be unmapped/nuked.

	(Note: there might be more than one reference from mapping due to
	fork()/clone() without CLONE_VM, which use the same `struct page` for
	references, until the copy-on-write page gets copied.)

	So, additional page references (e.g., from direct IO read) now prevent the
	rmap/PTE from being unmapped/dropped; similarly to the page is not freed
	per shrink_page_list()/page_ref_freeze()).

	- Races and Barriers:
	==================

	The new check in try_to_unmap_one() should be safe in races with
	bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
	done under the PTE lock.

	The fast path doesn't take the lock, but it checks if the PTE has changed
	and if so, it drops the reference and leaves the page for the slow path
	(which does take that lock).

	The fast path requires synchronization w/ full memory barrier: it writes
	the page reference count first then it reads the PTE later, while
	try_to_unmap() writes PTE first then it reads page refcount.

	And a second barrier is needed, as the page dirty flag should not be read
	before the page reference count (as in __remove_mapping()). (This can be
	a load memory barrier only; no writes are involved.)

	Call stack/comments:

	- try_to_unmap_one()
	- page_vma_mapped_walk()
	- map_pte() # see pte_offset_map_lock():
	pte_offset_map()
	spin_lock()

	- ptep_get_and_clear() # write PTE
	- smp_mb() # (new barrier) GUP fast path
	- page_ref_count() # (new check) read refcount

	- page_vma_mapped_walk_done() # see pte_unmap_unlock():
	pte_unmap()
	spin_unlock()

	- bio_iov_iter_get_pages()
	- __bio_iov_iter_get_pages()
	- iov_iter_get_pages()
	- get_user_pages_fast()
	- internal_get_user_pages_fast()

	# fast path
	- lockless_pages_from_mm()
	- gup_{pgd,p4d,pud,pmd,pte}_range()
	ptep = pte_offset_map() # not _lock()
	pte = ptep_get_lockless(ptep)

	page = pte_page(pte)
	try_grab_compound_head(page) # inc refcount
	# (RMW/barrier
	# on success)

	if (pte_val(pte) != pte_val(*ptep)) # read PTE
	put_compound_head(page) # dec refcount
	# go slow path

	# slow path
	- __gup_longterm_unlocked()
	- get_user_pages_unlocked()
	- __get_user_pages_locked()
	- __get_user_pages()
	- follow_{page,p4d,pud,pmd}_mask()
	- follow_page_pte()
	ptep = pte_offset_map_lock()
	pte = *ptep
	page = vm_normal_page(pte)
	try_grab_page(page) # inc refcount
	pte_unmap_unlock()

	- Huge Pages:
	==========

	Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
	(aka lazyfree) pages are PageAnon() && !PageSwapBacked()
	(madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
	thus should reach shrink_page_list() -> split_huge_page_to_list() before
	try_to_unmap[_one](), so it deals with normal pages only.

	(And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
	which should not or be rare, the page refcount should be greater than
	mapcount: the head page is referenced by tail pages. That also prevents
	checking the head `page` then incorrectly call page_remove_rmap(subpage)
	for a tail page, that isn't even in the shrink_page_list()'s page_list (an
	effect of split huge pmd/pmvw), as it might happen today in this unlikely
	scenario.)

	MADV_FREE'd buffers:
	===================

	So, back to the "if MADV_FREE pages are used as buffers" note. The case
	is arguable, and subject to multiple interpretations.

	The madvise(2) manual page on the MADV_FREE advice value says:

	1) 'After a successful MADV_FREE ... data will be lost when
	the kernel frees the pages.'
	2) 'the free operation will be canceled if the caller writes
	into the page' / 'subsequent writes ... will succeed and
	then [the] kernel cannot free those dirtied pages'
	3) 'If there is no subsequent write, the kernel can free the
	pages at any time.'

	Thoughts, questions, considerations... respectively:

	1) Since the kernel didn't actually free the page (page_ref_freeze()
	failed), should the data not have been lost? (on userspace read.)
	2) Should writes performed by the direct IO read be able to cancel
	the free operation?
	- Should the direct IO read be considered as 'the caller' too,
	as it's been requested by 'the caller'?
	- Should the bio technique to dirty pages on return to userspace
	(bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
	be considered in another/special way here?
	3) Should an upcoming write from a previously requested direct IO
	read be considered as a subsequent write, so the kernel should
	not free the pages? (as it's known at the time of page reclaim.)

	And lastly:

	Technically, the last point would seem a reasonable consideration and
	balance, as the madvise(2) manual page apparently (and fairly) seem to
	assume that 'writes' are memory access from the userspace process (not
	explicitly considering writes from the kernel or its corner cases; again,
	fairly).. plus the kernel fix implementation for the corner case of the
	largely 'non-atomic write' encompassed by a direct IO read operation, is
	relatively simple; and it helps.

	Reproducer:
	==========

	@ test.c (simplified, but works)

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main() {
	int fd, i;
	char *buf;

	fd = open(DEV, O_RDONLY \| O_DIRECT);

	buf = mmap(NULL, BUF_SIZE, PROT_READ \| PROT_WRITE,
	MAP_PRIVATE \| MAP_ANONYMOUS, -1, 0);

	for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
	buf[i] = 1; // init to non-zero

	madvise(buf, BUF_SIZE, MADV_FREE);

	read(fd, buf, BUF_SIZE);

	for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
	printf("%p: 0x%x\n", &buf[i], buf[i]);

	return 0;
	}

	@ block/fops.c (formerly fs/block_dev.c)

	+#include <linux/swap.h>
	...
	... __blkdev_direct_IO[_simple](...)
	{
	...
	+ if (!strcmp(current->comm, "good"))
	+ shrink_all_memory(ULONG_MAX);
	+
	ret = bio_iov_iter_get_pages(...);
	+
	+ if (!strcmp(current->comm, "bad"))
	+ shrink_all_memory(ULONG_MAX);
	...
	}

	@ shell

	# NUM_PAGES=4
	# PAGE_SIZE=$(getconf PAGE_SIZE)

	# yes \| dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
	# DEV=$(losetup -f --show test.img)

	# gcc -DDEV=\"$DEV\" \
	-DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
	-DPAGE_SIZE=${PAGE_SIZE} \
	test.c -o test

	# od -tx1 $DEV
	0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
	*
	0040000

	# mv test good
	# ./good
	0x7f7c10418000: 0x79
	0x7f7c10419000: 0x79
	0x7f7c1041a000: 0x79
	0x7f7c1041b000: 0x79

	# mv good bad
	# ./bad
	0x7fa1b8050000: 0x0
	0x7fa1b8051000: 0x0
	0x7fa1b8052000: 0x0
	0x7fa1b8053000: 0x0

	Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
	support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap
	do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].

	- v5.17-rc3:

	# for i in {1..1000}; do ./good; done \
	\| cut -d: -f2 \| sort \| uniq -c
	4000 0x79

	# mv good bad
	# for i in {1..1000}; do ./bad; done \
	\| cut -d: -f2 \| sort \| uniq -c
	4000 0x0

	# free \| grep Swap
	Swap: 0 0 0

	- v4.5:

	# for i in {1..1000}; do ./good; done \
	\| cut -d: -f2 \| sort \| uniq -c
	4000 0x79

	# mv good bad
	# for i in {1..1000}; do ./bad; done \
	\| cut -d: -f2 \| sort \| uniq -c
	2702 0x0
	1298 0x79

	# swapoff -av
	swapoff /swap

	# for i in {1..1000}; do ./bad; done \
	\| cut -d: -f2 \| sort \| uniq -c
	4000 0x79

	Ceph/TCMalloc:
	=============

	For documentation purposes, the use case driving the analysis/fix is Ceph
	on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
	release unused memory to the system from the mmap'ed page heap (might be
	committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
	-> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
	TCMalloc_SystemCommit() -> do nothing.

	Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
	release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
	just 'disappeared' on Ceph on later Ubuntu releases but is still present
	in the kernel, and can be hit by other use cases.

	The observed issue seems to be the old Ceph bug #22464 [1], where checksum
	mismatches are observed (and instrumentation with buffer dumps shows
	zero-pages read from mmap'ed/MADV_FREE'd page ranges).

	The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
	mostly worked around with a retry mechanism, but other parts of Ceph could
	still hit that (rocksdb). Anyway, it's less likely to be hit again as
	TCMalloc switched out of MADV_FREE by default.

	(Some kernel versions/reports from the Ceph bug, and relation with
	the MADV_FREE introduction/changes; TCMalloc versions not checked.)
	- 4.4 good
	- 4.5 (madv_free: introduction)
	- 4.9 bad
	- 4.10 good? maybe a swapless system
	- 4.12 (madv_free: no longer free instantly on swapless systems)
	- 4.13 bad

	[1] https://tracker.ceph.com/issues/22464

	Thanks:
	======

	Several people contributed to analysis/discussions/tests/reproducers in
	the first stages when drilling down on ceph/tcmalloc/linux kernel:

	- Dan Hill
	- Dan Streetman
	- Dongdong Tao
	- Gavin Guo
	- Gerald Yang
	- Heitor Alves de Siqueira
	- Ioanna Alifieraki
	- Jay Vosburgh
	- Matthew Ruffell
	- Ponnuvel Palaniyappan

	Reviews, suggestions, corrections, comments:

	- Minchan Kim
	- Yu Zhao
	- Huang, Ying
	- John Hubbard
	- Christoph Hellwig

	[mfo@canonical.com: v4]
	Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com

	Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
	Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
	Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
	Cc: Minchan Kim <minchan@kernel.org>
	Cc: Yu Zhao <yuzhao@google.com>
	Cc: Yang Shi <shy828301@gmail.com>
	Cc: Miaohe Lin <linmiaohe@huawei.com>
	Cc: Dan Hill <daniel.hill@canonical.com>
	Cc: Dan Streetman <dan.streetman@canonical.com>
	Cc: Dongdong Tao <dongdong.tao@canonical.com>
	Cc: Gavin Guo <gavin.guo@canonical.com>
	Cc: Gerald Yang <gerald.yang@canonical.com>
	Cc: Heitor Alves de Siqueira <halves@canonical.com>
	Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
	Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
	Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
	Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
	Cc: <stable@vger.kernel.org>
	Cc: Christoph Hellwig <hch@infradead.org>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
	[mfo: backport: replace folio/test_flag with page/flag equivalents;
	real Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)") in v4.]
	Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
	Signed-off-by: Sasha Levin <sashal@kernel.org>
	---
	mm/rmap.c \| 25 ++++++++++++++++++++++++-
	1 file changed, 24 insertions(+), 1 deletion(-)

	diff --git a/mm/rmap.c b/mm/rmap.c
	index 163ac4e6bcee..26d3d4824c3a 100644
	--- a/mm/rmap.c
	+++ b/mm/rmap.c
	@@ -1570,7 +1570,30 @@ static bool try_to_unmap_one(struct page page, struct vm_area_struct vma,

	/* MADV_FREE page check */
	if (!PageSwapBacked(page)) {
	- if (!PageDirty(page)) {
	+ int ref_count, map_count;
	+
	+ /*
	+ * Synchronize with gup_pte_range():
	+ * - clear PTE; barrier; read refcount
	+ * - inc refcount; barrier; read PTE
	+ */
	+ smp_mb();
	+
	+ ref_count = page_ref_count(page);
	+ map_count = page_mapcount(page);
	+
	+ /*
	+ * Order reads for page refcount and dirty flag
	+ * (see comments in __remove_mapping()).
	+ */
	+ smp_rmb();
	+
	+ /*
	+ * The only page refs must be one from isolation
	+ * plus the rmap(s) (dropped by discard:).
	+ */
	+ if (ref_count == 1 + map_count &&
	+ !PageDirty(page)) {
	/* Invalidate as we cleared the pte */
	mmu_notifier_invalidate_range(mm,
	address, address + PAGE_SIZE);
	--
	2.35.1