Merge remote-tracking branch 'gitlab/main-5.15.y' into main-5.15.y

d2f3ce7f42140d4830379ae2131148251b1654d6

- revert one more page_count check in the NUMA hinting fault path

- add the bpf_prefault writable tracepoint, which is a dependency for
  running BPFML in non-simulation mode

b6580318481399405e1368d38d13ec5c58071352

- remove a PageAnonGup() BUG_ON() from unuse_pte() because it
  triggered a false positive during swapoff().

51d469787f3e79c2f4540267dd204ed75a532613

- further optimize the universal fix for the synchronicity of all GUP
  pins with the PageAnonGup filter

ac721fb4219c46772f4dd5861e83335eb3718214

- add synchronicity to all GUP pins, including thread+gup+fork at
  hardblocksize subpage granularity without requiring FOLL_PIN

38ecec4f7a55bb85c8affa97c546991a5351f326

- defer the experimental opt-in KSM working set estimation (wse) feature

- deliver full accuracy to the THP mapcount with thp_idx to avoid GUP
  pin synchronicity loss after a THP virtual split followed by fork

- fix a FOLL_LONGTERM synchronicity loss in the presence of swapping
  over raid5/blk-integrity that would enable SWP_STABLE_WRITES

- fix a FOLL_LONGTERM synchronicity loss if multiple threads take a
  long term GUP pin at the same time while a child exists

8f74a7d45e5a7d33084bd0958f7a491bbd8f3fc5

- fix kvm_mmu_notifier_change_pte()

608641cdbe7a69741088f8bd0072cab4e1943ff6

- tentative working set estimation for KSM
- fix missing young bit after NUMA migrate-on-fault and THP splits

47b23851febb91384f497da51c77f23b47376eba

- v5 fix for the KSM swapin anon_vma use after free.

e4dbaa0db656bc7e3fce23695c00dc8fc15d96ea

- v4 fix for the KSM swapin anon_vma use after free.

ce4bc19086d1e96ca9ed1491a39aa5291d14b08e

- v3 fix for the KSM swapin anon_vma use after free.

- Clean up the FOLL_UNSHARE definition in code and commit headers, plus
  other no-op cleanups. This is a further sync-up with the cleanups from
  the v1 "mm: COW fixes part 1: fix the COW security issue for THP and
  hugetlb" submission.

a8e5bf4916fecaacbe49f60cedcf20c658b54707

- Tentative fix for the false positive BUILD_BUG_ON build error on some
  arches reported by the kernel test robot.

- Fixed KSM checksum initialization reported by Dan Carpenter and the
  kernel test robot with smatch.

- Worked around a coding style warning from the kernel test robot.

8a4fc2ffa29df05a65a5d662e0db910dcb93a176

- The mprotect optimization that was proposed upstream to skip
  spurious COW faults did not check the swapcount, so with swap
  enabled it could erroneously skip a required COW fault. This
  implementation inherited the same bug from the patch originally
  posted upstream. The bug was found by source review and has been
  fixed: the swapcount is now taken into account, as required for
  safety.
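
  As a rough illustration of the swapcount fix above, here is a toy
  userspace model. It is not the kernel code: struct toy_page and both
  can_skip_cow*() helpers are hypothetical names invented for this
  sketch, and the exclusivity rule (one mapping and no swap references)
  is a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a page; not a kernel structure. */
struct toy_page {
	int mapcount;	/* page tables currently mapping the page */
	int swapcount;	/* swap entries still referencing the page */
};

/* Buggy variant: only looks at the mapcount, ignoring swap references. */
bool can_skip_cow_buggy(const struct toy_page *p)
{
	return p->mapcount == 1;
}

/* Fixed variant: the page is exclusive only if no swap entry refers to it. */
bool can_skip_cow(const struct toy_page *p)
{
	return p->mapcount == 1 && p->swapcount == 0;
}

/* Small self-check showing where the two variants disagree. */
void toy_selfcheck(void)
{
	/* Mapped once, but another process still holds a swap entry. */
	struct toy_page shared_via_swap = { .mapcount = 1, .swapcount = 1 };

	assert(can_skip_cow_buggy(&shared_via_swap));	/* wrongly skips COW */
	assert(!can_skip_cow(&shared_via_swap));	/* COW is still required */
}
```

  In this model the two variants differ only for a page that is mapped
  once while a swap entry still points to it, which is the case that
  matters with swap enabled.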

e60b432637711574fba6507c2dbc26043f2f7e9e

- optimized wp_page_unshare() with can_read_pin_swap_page(); this
  change is also a dependency for the PageKsm FOLL_MM_SYNC rework.

- reworked PageKsm FOLL_MM_SYNC from scratch using
  can_read_pin_swap_page(). Enforcing that no FOLL_LONGTERM read pin
  can ever be taken on any PageKsm feels simpler than enforcing that
  no PageAnon can be converted to PageKsm if there's any outstanding
  pin and that no wrprotected PageAnon can be replaced by an equal
  PageKsm if the PageAnon had any outstanding FOLL_LONGTERM pins.
  Both guarantees are required for FOLL_MM_SYNC to deliver full
  synchronicity to FOLL_LONGTERM pins on VM_MERGEABLE vmas too.

e8a5fe3acb45be705bde7d167d4d89ea6151bec9

- gup_must_unshare() optimized with can_read_pin_swap_page().

- added the page lock in the hugetlbfs gup_must_unshare() path to
  protect against page migration. It'd be ideal if page migration could
  be improved to count how many migration entries it installed and then
  drop the mapcount accordingly only after the refcount freezing.

- Improved FOLL_MM_SYNC for PageKsm: KSM code should cooperate with
  GUP and make sure to never de-dup pages with GUP pins. GUP already does
  its part in unsharing PageKsm pages with the COR fault before taking
  readonly FOLL_LONGTERM pins (with FOLL_MM_SYNC implicitly set).

- Minor: added more consistency to the SWAP=n version of
  reuse_swap_page(), just in case.

129b654f78e4e2386d823d616201b0775d69b382

- More no-op cleanups.

- Added a missing update_mmu_tlb() which is also a no-op for all arches
  except mips.

c1e6044c5bd1ed2592f7196e7ad99b8c47f7787c

- A solution based on the FOLL_UNSHARE+COR solution that originated in
  this tree has been proposed upstream, and the review showed that
  gup_must_unshare() didn't properly take the swapcount into account.

  The lack of swapcount calculation reported upstream is a minor
  implementation issue and requires no change in design to fix. In
  fact it has been fixed in less than 48 hours as demonstrated by this
  quick hotfix update.

  It's worth pointing out that the lack of swapcount calculation in
  the previous version caused zero regressions compared to upstream
  v5.7 and in fact the previous version was preferable to v5.7.

  By contrast, upstream still randomly corrupts memory when swap is
  enabled with O_DIRECT if using a 64k PAGE_SIZE on aarch64 and a 4k
  db blocksize, with io_uring and with all FOLL_LONGTERM pins, and it
  causes various horizontal regressions (for example, all swapcache is
  COWed unconditionally even if it's exclusive).

  At the time of this writing, this is the only known solution that
  resolves all known security issues, introduces zero user ABI
  regressions compared to v5.7 and retains the full power of the MM.

  In fact this goes beyond what v5.7 could do: with FOLL_MM_SYNC for
  the first time this solution provides full POSIX semantics to all
  FOLL_LONGTERM and short term pins by leveraging the COR (Copy On
  Read) fault.

170df1aaab8e5dc923479b75d91500d6cf366796

- Peter Xu discovered that the THP path of __page_mapcount was reading
  the first tail page instead of the right tail page in a doublemap.
  This has been corrected.

- David Hildenbrand reported that __page_mapcount and gup_must_unshare
  shared some code paths between THP and hugetlbfs, but the mapcount
  seqcount wasn't initialized in hugetlbfs which could result in a
  softlockup. This has been corrected and the hugetlbfs paths in
  __page_mapcount and gup_must_unshare don't share the same code paths
  anymore.

- Merged a permutation from David Hildenbrand that simplifies
  __split_huge_pmd_locked() and reduces the
  page_trans_huge_mapcount_lock() hold time as well.

- Merged FOLL_NOUNSHARE from David Hildenbrand to "deactivate" the COR
  fault in follow_page(). follow_page() is special because the kernel
  is the "user" and the kernel intends to work on the real thing, not
  on the post-COR copy. Obtaining a (post-COR) copy of the page is
  functionally harmless from the userland point of view, but it'd
  defeat various kernel MM optimizations.

- Added a tentative fix for a use after free in KSM rmap reported
  upstream.

- Added a tentative fix to eliminate the KVM COW side channel.

08afe7e6dd05f64c64af20ea11825067d497004b

- added the COR fault and the FAULT_FLAG_UNSHARE support to hugetlbfs.

806134b9aae1c4f00d92bf942869adb0b0e257e4

- added "mm/userfaultfd: provide unmasked address on page-fault".

828efbc74a232a39869c1612af02bd75b98bb497

- Improved the 3771dc26618494d2fca1f8489cc1581a63a51ce8 commit header.

d22a27cc8aa072e6fb002ec127b5891253658055

- clean up gup_must_unshare(): added is_fast_only_in_irq() to document
  and deduplicate the irq_count() check.

6d6837a51fe0e71e3dc9c10deefbd616aeaec1fa

- Added feb889fb40fafc6933339cf1cca8f770126819fb to the list of
  reverts since it's unnecessary after reverting
  09854ba94c6aad7886996bfbee2530b3d8a7f4f4.

- Documented more details on the SMP race against pin-fast of
  feb889fb40fafc6933339cf1cca8f770126819fb and
  9348b73c2e1bfea74ccd4a44fb4ccc7276ab9623 at the end of the commit
  header of 5f3f91f23e41359338a41991fe19e4735d7e56e4 ("mm: COW:
  restore full accuracy in page reuse").

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>