percpu: changes for v6.6

percpu
* A couple of cleanups by Baoquan He and Bibo Mao. The only behavior change
  is to start printing messages if we're under the warn limit for failed
  atomic allocations.

percpu_counter
* Shakeel introduced percpu counters into mm_struct, which caused percpu
  allocations to be on the hot path [1]. Originally I spent some time
  trying to improve the percpu allocator, but instead preferred what
  Mateusz Guzik proposed: grouping at the allocation site with
  percpu_counter_init_many(). This allows a single percpu allocation to
  be shared by the counters. I like this approach because it creates a
  shared lifetime for the allocations. Additionally, I believe many inits
  have higher level synchronization requirements, like percpu_counter
  does against HOTPLUG_CPU. Therefore we can group these optimizations
  together.
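
The grouping idea can be illustrated with a minimal userspace sketch.
This is an analogy only, not the kernel implementation: struct counter,
counter_init_many(), counter_destroy_many(), counter_add(),
counter_sum() and NR_CPUS_SKETCH are all hypothetical names, and a
plain calloc() stands in for the percpu allocator. The point it shows
is that N counters carve their per-CPU slots out of one backing
allocation, so they are allocated and freed together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH 4	/* hypothetical fixed CPU count */

struct counter {
	long count;	/* global part of the counter */
	long *percpu;	/* this counter's slice of the shared block */
};

/* Back nr counters with a single allocation. Returns 0 on success. */
static int counter_init_many(struct counter *c, long amount, size_t nr)
{
	long *block = calloc(nr * NR_CPUS_SKETCH, sizeof(*block));
	size_t i;

	if (!block)
		return -1;

	for (i = 0; i < nr; i++) {
		c[i].count = amount;
		c[i].percpu = block + i * NR_CPUS_SKETCH;
	}
	return 0;
}

/* Free the shared block once; c[0] holds its base address. */
static void counter_destroy_many(struct counter *c, size_t nr)
{
	size_t i;

	if (!nr || !c[0].percpu)
		return;

	free(c[0].percpu);
	for (i = 0; i < nr; i++)
		c[i].percpu = NULL;
}

/* Add v to the slot belonging to the given (simulated) CPU. */
static void counter_add(struct counter *c, int cpu, long v)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	c->percpu[cpu] += v;
}

/* Fold the global part and all per-CPU slots into one value. */
static long counter_sum(const struct counter *c)
{
	long sum = c->count;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += c->percpu[cpu];
	return sum;
}
```

In this sketch the counters must be destroyed as a group, since only the
first one holds the base of the block. That is the shared lifetime
mentioned above, and the cost of the allocator call is paid once
instead of once per counter.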

[1] https://lore.kernel.org/linux-mm/20221024052841.3291983-1-shakeelb@google.com/

kernel/fork: group allocation/free of per-cpu counters for mm struct

A trivial execve scalability test which tries to be very friendly
(statically linked binaries, all separate) is predominantly bottlenecked
by back-to-back per-cpu counter allocations which serialize on global
locks.

Ease the pain by allocating and freeing them in one go.

Bench can be found here:
http://apollo.backplane.com/DFlyMisc/doexec.c

$ cc -static -O2 -o static-doexec doexec.c
$ ./static-doexec $(nproc)

Even at a very modest scale of 26 cores (ops/s):
before:	133543.63
after:	186061.81 (+39%)

While with the patch these allocations remain a significant problem,
the primary bottleneck shifts to page release handling.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20230823050609.2228718-3-mjguzik@gmail.com
[Dennis: reflowed 1 line]
Signed-off-by: Dennis Zhou <dennis@kernel.org>
1 file changed