mm: COW: restore full accuracy in page reuse

This reverts commit 1a0cf26323c80e2f1c58fc04f15686de61bfab0c.
This reverts commit be068f29034fb00530a053d18b8cf140c32b12b3.
This reverts commit 09854ba94c6aad7886996bfbee2530b3d8a7f4f4.
This reverts commit 29a951dfb3c3263c3a0f3bd9f7f2c2cfde4baedb.
This reverts commit 9348b73c2e1bfea74ccd4a44fb4ccc7276ab9623.
This reverts commit 6ce64428d62026a10cb5d80138ff2f90cc21d367.
This reverts commit f4c4a3f48480730214c4f02ffa480f6bf5b0718f.
This reverts commit feb889fb40fafc6933339cf1cca8f770126819fb.

After a GUP pin is taken, changing the physical address of the page
mapped in the pagetable causes the holder of the GUP pin to lose
coherency with the CPU. It is by design that a GUP pinned page acts as
an anchor for the physical address mapped in the pagetable, preventing
the page from being replaced by a copy through a spurious COW fault.

The page_count check in do_wp_page achieves the exact opposite: it
guarantees a GUP pinned page is copied during a wrprotect fault, even
if it is exclusive and not shared with other processes.

Checking page_count instead of page_mapcount in do_wp_page makes it
impossible to wrprotect any page under a readonly long term GUP pin.
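
For illustration, here is a minimal sketch of the two reuse policies
(simplified pseudo-kernel code: the helper names are hypothetical and
the real do_wp_page() logic has more steps):

	/* 09854ba94c6a policy: any extra reference forces a copy */
	static bool can_reuse_by_refcount(struct page *page)
	{
		/* a transient GUP pin makes this false => wp_copy_page */
		return page_count(page) == 1;
	}

	/* policy restored by this revert: reuse if mapped exclusively */
	static bool can_reuse_by_mapcount(struct page *page)
	{
		/* a readonly GUP pin doesn't alter the mapcount => reuse */
		return page_mapcount(page) == 1;
	}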

For example it becomes impossible to use clear_refs to track CPU
writes on a virtual region mapped by a readonly RDMA FOLL_LONGTERM GUP
pin. This commit resolves that ABI break, which led to silent mm
corruption for certain workloads that previously worked reliably.
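
To make the broken workflow concrete, here is a hedged userspace
sketch of soft-dirty tracking through clear_refs (assumes 4k pages and
the bit layout documented in Documentation/admin-guide/mm/soft-dirty.rst;
error handling omitted):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	/* read the soft-dirty bit (bit 55) of the pagemap entry of vaddr */
	static int soft_dirty(void *vaddr)
	{
		uint64_t ent;
		int fd = open("/proc/self/pagemap", O_RDONLY);

		pread(fd, &ent, 8, ((uintptr_t)vaddr / 4096) * 8);
		close(fd);
		return (ent >> 55) & 1;
	}

	int main(void)
	{
		static char buf[4096] __attribute__((aligned(4096)));
		int fd = open("/proc/self/clear_refs", O_WRONLY);

		buf[0] = 1;		/* dirty the page */
		write(fd, "4", 1);	/* wrprotect ptes, clear soft-dirty */
		printf("%d\n", soft_dirty(buf));	/* 0 */
		buf[0] = 2;		/* CPU write => wp fault */
		printf("%d\n", soft_dirty(buf));	/* 1: write tracked */
		close(fd);
		return 0;
	}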

The inaccuracy added to the COW fault doesn't seem to provide any
benefit to the MM, but it breaks long term readonly GUP pins and it
introduces further collateral performance regressions, described
below.

This applies to FOLL_PIN and FOLL_GET alike, because from the MM
standpoint there is no difference.
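
As a concrete illustration (kernel-side sketch using the ~5.12-era GUP
signatures, which vary across kernel versions; to be called with the
mmap_lock held), both flavors elevate the refcount that the page_count
check observes:

	struct page *page;

	/* FOLL_GET: page refcount += 1 */
	if (get_user_pages(addr, 1, 0, &page, NULL) == 1)
		put_page(page);

	/* FOLL_PIN: page refcount += GUP_PIN_COUNTING_BIAS (1024) */
	if (pin_user_pages(addr, 1, 0, &page, NULL) == 1)
		unpin_user_page(page);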

However, as shown in the example below, even short term GUP pins start
exhibiting unexpected results with the page_count check. The example
uses O_DIRECT on a process under clear_refs. There's no long term GUP
pin here: all GUP pins are transient. fork() is never called, and even
when clear_refs is written by an external program, fork() is not
involved.

thread0				thread1			other process
							(or thread2)
=============			=============		===========
read syscall
O_DIRECT
read DMA to vaddr+0
len = 512 bytes
GUP(FOLL_WRITE)
DMA writing to RAM started
							clear_refs
							pte_wrprotect
				write vaddr+512
				page_count == 2
				wp_copy_page
read syscall returns
read lost
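
A hedged userspace reproducer sketch of the race above (build with
-pthread; the file name is a placeholder, and hitting the window while
the DMA is in flight is timing dependent, so this is illustrative, not
a reliable reproducer):

	#define _GNU_SOURCE	/* O_DIRECT */
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	static char *vaddr;	/* page-aligned exclusive anon memory */

	static void *dio_reader(void *arg)
	{
		int fd = open("datafile", O_RDONLY | O_DIRECT);

		/* GUP(FOLL_WRITE) pins the page, DMA lands at vaddr+0 */
		read(fd, vaddr, 512);
		close(fd);
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		posix_memalign((void **)&vaddr, 4096, 4096);
		memset(vaddr, 0, 4096);	/* fault in the anon page */
		pthread_create(&t, NULL, dio_reader, NULL);
		/*
		 * If an external program now writes "4" to this
		 * process's /proc/$pid/clear_refs, the pte gets
		 * wrprotected and the store below wp faults: with the
		 * page_count check the exclusive page is copied
		 * (wp_copy_page) while the DMA still targets the old
		 * physical page, and the O_DIRECT read data is lost.
		 */
		vaddr[512] = 1;
		pthread_join(t, NULL);
		return 0;
	}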

The above O_DIRECT read data corruption is also reproducible with only
swap enabled without requiring clear_refs. It is enough that the page
pinned by O_DIRECT gets swapped out while the DMA is in flight.

Notwithstanding the fact that failing O_DIRECT at sub-PAGE_SIZE
granularity is an ABI break in itself, recvmsg kernel TLS and plenty
of other GUP FOLL_WRITE iov_iter_get_pages users also write to the
memory at sub-PAGE_SIZE granularity. Those recvmsg writes are then
silently lost, just as the above O_DIRECT read is lost.

In addition, architectures with a PAGE_SIZE much bigger than 4k are
more likely to do I/O at sub-PAGE_SIZE granularity.

That false positive COW copies must not happen is documented in commit
6d0a07edd17cfc12fdc1f36de8072fa17cc3666f:

==
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
==

And in the current comment above the THP mapcount:

==
 * [..] we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]
==

This revert causes no theoretical security regression because THP is
not yet using page_count in do_huge_pmd_wp_page.

The alternative to this patch would be to replace the mapcount with
the page_count in do_huge_pmd_wp_page too, in order to really close
the vmsplice long term GUP attack from child to parent that way.

However, a single transient GUP pin on a tail page elevates the
page_count seen from all other tail pages (unlike the mapcount, which
is single page granular). So if the COW page reuse inaccuracy were
applied to THP too, it could cross different vmas and its effect could
happen at a distance in vmas of different processes: after a fork(), a
single GUP pin taken on a subpage mapped in a different process could
trigger 511 false positive COW copies in the local process in the
worst case. That would be yet another inefficiency on top of the
previous cons.
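
The 511 figure above is simply a 2MB THP divided in 512 subpages of
4k, with one subpage pinned and the other 511 spuriously copied. The
reason the pin is visible from every subpage is that page_count()
always reads the compound head (from the 5.12-era include/linux/mm.h):

	static inline int page_count(struct page *page)
	{
		return atomic_read(&compound_head(page)->_refcount);
	}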

Another purely performance related regression caused by the spurious
COW copies of exclusive anonymous memory is that we cannot assume
anymore that deferred TLB flushes after wrprotection are noops with
respect to the COW fault. The stale writable TLBs can be seen as the
equivalent of a driver holding a GUP pin on exclusive (non shared)
anonymous memory. If spurious COW copies can happen, the physical
address of the page will change after the copy. So before the COW
fault starts copying the memory to the new physical location, the
stale writable TLBs have to be invalidated to avoid losing writes on
the page. This is not a functional regression because it is always
possible to "flush the TLB" at any time (or alternatively to wait for
the deferred TLB flushes to be executed). However, what is not
possible is to "flush the GUP pin" (nor to wait for it to be
released).

The kernel test robot (lkp@lists.01.org, lkp@intel.com,
https://github.com/intel/lkp-tests.git) reports:

--------
FYI, we noticed a 1.5% improvement of vm-scalability.median due to commit:

commit: bcb0df12bc47f2c2bb42b66eb77fe34e509de0b8 ("mm: COW: restore full accuracy in page reuse")

in testcase: vm-scalability
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 256G memory
with following parameters:

        runtime: 300s
        size: 8T
        test: anon-cow-seq-hugetlb
        cpufreq_governor: performance
        ucode: 0x5003006

                                vm-scalability.median

  366000 +------------------------------------------------------------------+
         |                                                 O                |
  364000 |-+               O O    O                             O      O O  |
         |      O     O   O    O         O         O      O    O  O OO      |
         |  O O    O    O       O      O   O  O  O   O  O    O            O |
  362000 |-O                                O         O                     |
         |       O   O              O+          O          +                |
  360000 |-+                         ::                    :+   +           |
         |               .+          ::                   +  +  ::   +.+    |
  358000 |-+           .+ :   .+    :  +   +    +    ++   :   : ::   :      |
         |.+   .+ .+. +    :.+  +.+.:   + +:   +:   +  : :    ::  +  :      |
         |  :.+  +   +     +        +    +  :.+  :.+   : :     +   +:       |
  356000 |-++                               +    +      :           +       |
         |                                              +                   |
  354000 +------------------------------------------------------------------+

[*] bisect-good sample
[O] bisect-bad  sample
--------
FYI, we noticed a 24.9% improvement of pmbench.throughput.aps due to commit:

commit: c4e16aa6ded51f55740e7f24525e5818cb35806c ("mm: COW: restore full accuracy in page reuse")

in testcase: pmbench
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Platinum 8260L CPU @ 2.40GHz with 128G memory
with following parameters:

        runtime: 600s
        disk: 1SSD
        priority: 1
        cpunodebind: 1
        membind: 1
        thp_enabled: never
        thp_defrag: never
        nr_processes: 2
        nr_threads: 4
        pattern: normal_ih
        ratio: 90
        cold: 1
        initialize: 1
        total_setsize: 128G
        cpu_node_bind: even
        cpufreq_governor: performance
        debug-setup: yhnvme-mth
        sc_numa_balancing: 0
        sc_swappiness: 100
        ucode: 0x400002c

test-description: pmbench - paging and virtual memory benchmark

                            pmbench.latency.ns.average

  2600 +--------------------------------------------------------------------+
  2500 |-+             :                                                    |
       |      +.   +  : :                        +                          |
  2400 |-+   +  +  :+ : :                        :                          |
  2300 |-+  +    +:  +  :                        ::                         |
  2200 |++.+      +      :     .+. +   +.+      : :  +          +           |
  2100 |-+               +   .+   + : :  :      +  :+ :        : :         :|
       |                  +.+       : :   :.+. +   +  :    .+. : +.+. .++. :|
  2000 |-+                           +    +   +        +.++   +      +    + |
  1900 |-+                                                                  |
  1800 |-+                                                                  |
  1700 |-+      O    O                                                      |
       |    O     O                                                         |
  1600 |-O O  O    O   O                                                    |
  1500 +--------------------------------------------------------------------+

[*] bisect-good sample
[O] bisect-bad  sample
--------

pmbench shows a performance improvement higher than with the patch
below applied on top of 09854ba94c6aad7886996bfbee2530b3d8a7f4f4. That
is because the fix below still cannot avoid the extra allocation and
copies when the COW fault erroneously fails to reuse the
swapcache. The patch below and its extra code can then be reverted as
well on top of this commit, because it becomes a noop.

https://lkml.kernel.org/r/20210519013313.1274454-1-ying.huang@intel.com

Commit feb889fb40fafc6933339cf1cca8f770126819fb says: "(c) we could
just make do_wp_page() not COW the pinned page (which was what we
historically did before that "mm: do_wp_page() simplification"
commit)". This commit implements precisely option (c), so
feb889fb40fafc6933339cf1cca8f770126819fb becomes unnecessary and can
be reverted too.

Last but not least: the check for page_maybe_dma_pinned() added by
both feb889fb40fafc6933339cf1cca8f770126819fb and
9348b73c2e1bfea74ccd4a44fb4ccc7276ab9623, in the attempt to avoid
silent MM corruption when using FOLL_LONGTERM with swap enabled or
FOLL_LONGTERM in combination with clear_refs tracking, can SMP race
with pin-fast. So those two commits were only effective at rendering
the silent MM corruption less reproducible; they didn't prevent it. To
close the race, the check for page_maybe_dma_pinned() would need to be
protected by the write_protect_seq like in fork(). So this commit,
which reverts 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, eliminates the
risk of silent MM corruption in the two aforementioned scenarios too.
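
For reference, a simplified sketch of the write_protect_seq bracketing
fork() relies on (abridged from the 5.12-era copy_page_range() and GUP
pin-fast path; not the exact code):

	/* fork() side, with mmap_lock held for write: */
	raw_write_seqcount_begin(&src_mm->write_protect_seq);
	/* pte walk: page_maybe_dma_pinned() is stable vs pin-fast here */
	raw_write_seqcount_end(&src_mm->write_protect_seq);

	/* pin-fast side (FOLL_PIN): */
	seq = raw_read_seqcount(&current->mm->write_protect_seq);
	if (seq & 1)
		return -EAGAIN;	/* fork in progress: use slow GUP */
	/* ... pin pages with irqs disabled ... */
	if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
		/* raced with fork: unpin everything, fall back */
	}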

Link: https://lkml.kernel.org/r/20210107200402.31095-1-aarcange@redhat.com
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>