arm64: tlb: skip tlbi broadcast

With multiple NUMA nodes and multiple sockets, the tlbi broadcast has
to be delivered through the interconnects, which in turn increases the
CPU interconnect traffic and the latency of the tlbi broadcast
instruction. To avoid the synchronous delivery of the tlbi broadcast
before the tlbi instruction can be retired, the hardware would need to
implement a replicated mm_cpumask bitflag for each ASID, and every CPU
would need to tell every other CPU which ASID is being loaded. That is
exactly what x86 does with mm_cpumask in software.

Even within a single NUMA node the latency of the tlbi broadcast
instruction increases almost linearly with the number of CPUs trying
to send tlbi broadcasts at the same time.

If a single thread of the process is running and it is running on
the CPU issuing the TLB flush, or if no threads of the process are
running, we can achieve full SMP scalability in the arm64 TLB flushing
by skipping the tlbi broadcasting.
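
The condition above can be sketched in plain C. This is only an
illustrative model, not the code in this patch: the cpu_mask_t type,
the tlb_flush_kind enum and choose_tlb_flush() are hypothetical names,
and the model assumes at most 64 CPUs tracked in a single word.

```c
#include <stdint.h>

/* Hypothetical model, not kernel code: one bit per CPU, up to 64 CPUs. */
typedef uint64_t cpu_mask_t;

enum tlb_flush_kind {
	TLB_FLUSH_LOCAL,	/* local tlbi only, broadcast skipped */
	TLB_FLUSH_BROADCAST,	/* tlbi broadcast to every CPU */
};

/*
 * running: bit n is set if a thread of the process is currently
 * running on CPU n. this_cpu is the CPU issuing the TLB flush.
 */
static enum tlb_flush_kind choose_tlb_flush(cpu_mask_t running,
					    unsigned int this_cpu)
{
	cpu_mask_t self = (cpu_mask_t)1 << this_cpu;

	/* No thread of the process is running anywhere. */
	if (running == 0)
		return TLB_FLUSH_LOCAL;

	/* The only running thread is on the CPU issuing the flush. */
	if (running == self)
		return TLB_FLUSH_LOCAL;

	/* Threads are running on other CPUs: the broadcast is required. */
	return TLB_FLUSH_BROADCAST;
}
```

In this model a process with threads on CPU 3 and CPU 5 that flushes
from CPU 3 still needs the broadcast, while a single-threaded process
flushing from its own CPU does not.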

This means that after the local TLB flush the ASID context is out of
sync on all CPUs except the local one. This can be tracked in the
per-mm cpumask: if the bit is set, the ASID context is stale for that
CPU. This results in an extra local ASID TLB flush only when threads
start running on new CPUs after a TLB flush.
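
The stale tracking described above could look roughly like the
following model. It is a sketch under stated assumptions, not the
implementation in this patch: struct mm_model and both helper names
are hypothetical, and a single 64-bit word stands in for the per-mm
cpumask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an mm, not the kernel's struct mm_struct. */
struct mm_model {
	uint64_t stale_mask;	/* bit n set: ASID context stale on CPU n */
};

/* After a local-only flush on this_cpu, every other CPU is stale. */
static void mark_stale_after_local_flush(struct mm_model *mm,
					 unsigned int this_cpu)
{
	mm->stale_mask = ~((uint64_t)1 << this_cpu);
}

/*
 * Called when a thread of the mm is about to run on cpu. Returns true
 * if an extra local ASID TLB flush is needed first, and clears the
 * stale bit so the flush happens only once per CPU.
 */
static bool switch_in_needs_local_flush(struct mm_model *mm,
					unsigned int cpu)
{
	uint64_t bit = (uint64_t)1 << cpu;

	if (!(mm->stale_mask & bit))
		return false;

	mm->stale_mask &= ~bit;
	return true;
}
```

The extra cost is paid lazily: a CPU that never runs a thread of the
process again never performs the deferred flush.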

Skipping the tlbi instruction broadcasting is already implemented in
local_flush_tlb_all(); this patch only extends it to flush_tlb_mm(),
flush_tlb_range() and flush_tlb_page().

The benchmarks below were measured on a non-NUMA 32-CPU system (ARMv8
Ampere), so this should be far from a worst-case scenario: the
enterprise kernel config allows multiple NUMA nodes with NR_CPUS set
by default to 4096.
=== stock ===
# cat for-each-cpu.sh
#!/bin/bash
for i in $(seq `nproc`); do
	"$@" &>/dev/null &
done
wait
# perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000
[..]
2.1696 +- 0.0122 seconds time elapsed ( +- 0.56% )
# perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest
[..]
0.99018 +- 0.00360 seconds time elapsed ( +- 0.36% )
# cat sort-compute
#!/bin/bash
for x in `seq 256`; do
	for i in `seq 32`; do /usr/bin/sort </usr/bin/sort >/dev/null; done &
done
wait
# perf stat -r 10 -e dummy ./sort-compute
[..]
1.8094 +- 0.0139 seconds time elapsed ( +- 0.77% )
[..]
=== patch applied ===
# perf stat -r 10 -e dummy ./for-each-cpu.sh ./mprotect-threaded 10000
[..]
0.13941 +- 0.00449 seconds time elapsed ( +- 3.22% )
# perf stat -r 10 -e dummy ./for-each-cpu.sh ./gperftools/tcmalloc_large_heap_fragmentation_unittest
[..]
0.90510 +- 0.00262 seconds time elapsed ( +- 0.29% )
# perf stat -r 10 -e dummy ./sort-compute
[..]
1.64025 +- 0.00618 seconds time elapsed ( +- 0.38% )
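
For reference, the relative improvement implied by the elapsed times
above can be computed with a small helper. The speedup() function and
the constant names are hypothetical and only restate the measurements
reported in this message.

```c
/* Mean elapsed seconds copied from the perf stat output above. */
static const double MPROTECT_STOCK = 2.1696,  MPROTECT_PATCHED = 0.13941;
static const double TCMALLOC_STOCK = 0.99018, TCMALLOC_PATCHED = 0.90510;
static const double SORT_STOCK     = 1.8094,  SORT_PATCHED     = 1.64025;

/* Ratio of stock time to patched time: values above 1.0 are faster. */
static double speedup(double stock_seconds, double patched_seconds)
{
	return stock_seconds / patched_seconds;
}
```

Calling speedup(MPROTECT_STOCK, MPROTECT_PATCHED) gives the ratio for
the mprotect-threaded run, and likewise for the other two workloads.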
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
5 files changed