xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

XPFO flushes kernel space TLB entries for pages that are now mapped
in userspace, not only on the current CPU but on all other CPUs as
well. If the number of TLB entries to flush exceeds
tlb_single_page_flush_ceiling, this results in the entire TLB being
flushed on all CPUs. A malicious userspace app can exploit the dual
mapping of a physical page caused by the physmap only on the CPU it
is running on. There is no good reason to incur the very high cost
of a TLB flush on CPUs that may never run the malicious app or do
not have any TLB entries for it. The cost of a full TLB flush goes
up dramatically on machines with high core counts.
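
For reference, the path whose cost is at issue is the x86
kernel-range flush, which broadcasts the flush to every CPU. The
snippet below is a rough paraphrase of arch/x86/mm/tlb.c around
v4.20, not an exact quote:

	void flush_tlb_kernel_range(unsigned long start, unsigned long end)
	{
		/* Small ranges are flushed page by page; anything larger
		 * than tlb_single_page_flush_ceiling pages becomes a
		 * full flush. */
		if (end == TLB_FLUSH_ALL ||
		    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
			/* IPIs every CPU for a full TLB flush */
			on_each_cpu(do_flush_tlb_all, NULL, 1);
		} else {
			struct flush_tlb_info info;

			info.start = start;
			info.end = end;
			/* IPIs every CPU for a ranged flush */
			on_each_cpu(do_kernel_range_flush, &info, 1);
		}
	}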

This patch flushes either the relevant TLB entries for the current
process or the entire TLB, depending on the number of entries to
flush on the current CPU, and posts a pending TLB flush on all other
CPUs when a page is unmapped from kernel space and mapped in
userspace. The pending TLB flush is posted for each task separately,
and the TLB is flushed on a CPU when a task that has a pending flush
posted for that CPU is scheduled onto it (a sketch of the scheme
appears after the numbers below). This patch does two things - (1)
it potentially aggregates multiple TLB flushes into one, and (2) it
avoids TLB flushes on CPUs that never run the task that caused the
flush. This has a very significant impact, especially on machines
with large core counts. To illustrate, the kernel was compiled with
"make -j" on two classes of machines - a server with a high core
count and a large amount of memory, and a desktop-class machine with
more modest specs. System times for the vanilla 4.20 kernel, for
4.20 with the XPFO patches before this patch, and for 4.20 with the
XPFO patches after this patch are below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20				915.183s
4.20+XPFO			24129.354s	26.366x
4.20+XPFO+Deferred flush	1216.987s	 1.330x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8 GB RAM
make -j4 all

4.20				607.671s
4.20+XPFO			1588.646s	2.614x
4.20+XPFO+Deferred flush	794.473s	1.307x
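
The core of the deferred scheme can be sketched as follows. This is
an illustrative sketch, not the literal patch: the pending_xpfo_flush
cpumask in task_struct and the xpfo_*() helper names are assumptions
made here, and tlb_single_page_flush_ceiling is file-local to
arch/x86/mm/tlb.c in mainline:

	/*
	 * Illustrative sketch only. The task_struct member
	 * pending_xpfo_flush and the xpfo_*() helpers are hypothetical
	 * names, not the literal patch.
	 */
	#include <linux/sched.h>
	#include <linux/cpumask.h>
	#include <linux/smp.h>
	#include <asm/tlbflush.h>

	/* Called when a page is unmapped from the kernel physmap
	 * because it is now mapped in userspace. */
	void xpfo_flush_tlb_kernel_range(unsigned long start,
					 unsigned long end)
	{
		unsigned long addr;

		/* Flush the current CPU right away: page by page if the
		 * range is small, otherwise a full flush is cheaper. */
		if ((end - start) >> PAGE_SHIFT >
		    tlb_single_page_flush_ceiling) {
			__flush_tlb_all();
		} else {
			for (addr = start; addr < end; addr += PAGE_SIZE)
				__flush_tlb_one_kernel(addr);
		}

		/* Post a pending flush for every other CPU. It is
		 * carried out only if and when this task is scheduled
		 * there. */
		cpumask_setall(&current->pending_xpfo_flush);
		cpumask_clear_cpu(smp_processor_id(),
				  &current->pending_xpfo_flush);
	}

	/* Hooked into the context switch path (e.g.
	 * switch_mm_irqs_off() on x86): flush if the incoming task
	 * posted a flush for this CPU. */
	static void xpfo_check_pending_flush(struct task_struct *next,
					     int cpu)
	{
		if (cpumask_test_cpu(cpu, &next->pending_xpfo_flush)) {
			cpumask_clear_cpu(cpu, &next->pending_xpfo_flush);
			__flush_tlb_all();
		}
	}

Flushing at schedule-in time instead of eagerly via IPI is what both
aggregates multiple flushes into one and confines them to CPUs the
task actually runs on.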

This patch could use more optimization. For instance, it posts a
pending full TLB flush for all other CPUs even when the number of
TLB entries being flushed does not exceed
tlb_single_page_flush_ceiling. Batching more TLB entry flushes, as
was suggested for an earlier version of these patches, can help
reduce such cases; one possible shape for that is sketched below.
Once finalized, this same scheme should be implemented for other
architectures as well.
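
One possible shape for that batching, purely hypothetical and
building on the sketch above (the xpfo_deferred member of
task_struct is an assumption, not part of this patch):

	/* Hypothetical refinement, not implemented by this patch:
	 * remember the deferred range per task so a remote CPU can use
	 * a cheaper ranged flush when the accumulated range is small
	 * enough. */
	struct xpfo_deferred_range {
		unsigned long start;
		unsigned long end;
	};

	/* Widen the task's pending range instead of always implying a
	 * full flush (assumed task_struct member: xpfo_deferred). */
	static void xpfo_post_deferred_flush(struct task_struct *tsk,
					     unsigned long start,
					     unsigned long end)
	{
		tsk->xpfo_deferred.start = min(tsk->xpfo_deferred.start,
					       start);
		tsk->xpfo_deferred.end = max(tsk->xpfo_deferred.end, end);
		cpumask_setall(&tsk->pending_xpfo_flush);
		cpumask_clear_cpu(smp_processor_id(),
				  &tsk->pending_xpfo_flush);
	}

At schedule-in, the remote CPU could then compare the accumulated
range against tlb_single_page_flush_ceiling and choose a ranged or a
full flush accordingly.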

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Kees Cook <keescook@chromium.org>