sched_ext: Reduce DSQ lock contention in consume_dispatch_q()

Replace raw_spin_lock() with raw_spin_trylock() when taking the DSQ lock
in consume_dispatch_q(). If the lock is contended, kick the current CPU
to retry on the next balance instead of spinning.

Under high load, multiple CPUs can contend on the same DSQ lock. With
raw_spin_lock(), waiters spin on the same cache line, wasting cycles and
increasing cache coherency traffic, which can slow down the lock holder.
With raw_spin_trylock(), waiters back off and retry later, so the holder
can finish sooner and the backing-off CPUs get a chance to consume other
DSQs or run tasks.

In bypass mode, scx_kick_cpu() is suppressed, so fall back to
raw_spin_lock() to guarantee forward progress.
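
The locking pattern described above can be illustrated with a small
userspace sketch. This is not the kernel code: struct toy_dsq, the
toy_*() helpers and the "retry" flag are hypothetical stand-ins, with a
C11 atomic_flag playing the role of the DSQ raw spinlock and the flag
playing the role of the CPU kick that schedules a retry.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a DSQ: a lock plus a count of queued tasks. */
struct toy_dsq {
	atomic_flag lock;
	int nr_tasks;
};

/* Try once to take the lock; returns true on success, never spins. */
static bool toy_trylock(struct toy_dsq *dsq)
{
	return !atomic_flag_test_and_set_explicit(&dsq->lock,
						  memory_order_acquire);
}

/* Spin until the lock is acquired. */
static void toy_lock(struct toy_dsq *dsq)
{
	while (atomic_flag_test_and_set_explicit(&dsq->lock,
						 memory_order_acquire))
		;
}

static void toy_unlock(struct toy_dsq *dsq)
{
	atomic_flag_clear_explicit(&dsq->lock, memory_order_release);
}

/*
 * Returns true if one task was consumed from @dsq.
 *
 * Outside bypass mode, a contended lock makes the function set *@retry
 * (standing in for kicking the CPU) and return false without spinning.
 * In bypass mode there is no retry mechanism in this model, so it spins
 * on the lock to guarantee forward progress.
 */
static bool toy_consume(struct toy_dsq *dsq, bool bypass, bool *retry)
{
	bool consumed = false;

	*retry = false;

	if (bypass) {
		toy_lock(dsq);
	} else if (!toy_trylock(dsq)) {
		*retry = true;
		return false;
	}

	if (dsq->nr_tasks > 0) {
		dsq->nr_tasks--;
		consumed = true;
	}

	toy_unlock(dsq);
	return consumed;
}
```

In this model, a false return with *retry set means the queue may still
hold tasks, so the caller cannot treat false as "empty".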

Since this slightly changes the behavior of scx_bpf_dsq_move_to_local(),
update the documentation to clarify that a false return value means no
eligible task could be consumed from the DSQ. This covers both the case
of an empty DSQ and any other condition that prevented task consumption.

Benchmarks that generate many enqueue/dispatch events (e.g., schbench)
show around 2-3x higher throughput on most of the scx schedulers with
this change applied.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
1 file changed