percpu_counter: add a cmpxchg-based _add_batch variant

This was "percpu_counter: reimplement _add_batch with __this_cpu_cmpxchg".

I chatted with vbabka a little bit and he pointed me at mod_zone_state,
which does the same thing I needed except it dodges preemption -- turns
out cmpxchg with a gs-prefixed argument is safe here.

================ cut here ================

Interrupt disable/enable trips are quite expensive on x86-64 compared
to a mere cmpxchg (note: no lock prefix!), and percpu counters are used
quite often.

With this change I see a ~1% bump in ops/s for negative path lookups,
measured with the following testcase plugged into will-it-scale:

#include <assert.h>
#include <fcntl.h>

/* nr is the will-it-scale task number; unused here */
void testcase(unsigned long long *iterations, unsigned long nr)
{
        while (1) {
                int fd = open("/tmp/nonexistent", O_RDONLY);
                assert(fd == -1);

                (*iterations)++;
        }
}

The win would be higher if it were not for other slowdowns, but one has
to start somewhere.

v2:
- dodge preemption
- use this_cpu_try_cmpxchg
- keep the old variant depending on CONFIG_HAVE_CMPXCHG_LOCAL

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
1 file changed