arm64/sve: Improve performance when handling SVE access traps

This patch series aims to improve the performance of handling SVE access
traps.  Earlier versions were originally written by Julien Grall but,
based on discussions on previous versions, the patches have been
substantially reworked to use a different approach.  The patches are now
different enough that I have set myself as the author; hopefully that's
OK with Julien.

Per the syscall ABI, SVE registers will be unknown after a syscall.  In
practice, the kernel will disable SVE and the registers will be zeroed
(except the first 128 bits of each vector) on the next SVE instruction.
Currently we do this by saving the FPSIMD state to memory, converting it
to the matching SVE state and then reloading the registers on return to
userspace.  This requires a lot of memory accesses that we shouldn't
need.  Improve this by reworking the SVE state tracking so that whether
we should trap on executing SVE instructions is tracked separately from
whether we need to save the full register state.  This allows us to
defer tracking the full SVE state until we return to userspace and, in
the common case where the FPSIMD state is still in registers at that
point, to do the conversion directly in registers, reducing overhead.
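
A minimal user-space sketch of the split tracking, for illustration only.
The struct and function names (struct task_flags, model_syscall() and so
on) are hypothetical stand-ins for the real TIF_SVE_EXEC and
TIF_SVE_FPSIMD_REGS handling, and having a syscall clear both flags is a
simplification made for this model rather than a statement about the
kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the two per-task flags.  The fields mirror
 * TIF_SVE_EXEC and TIF_SVE_FPSIMD_REGS, but this is user-space C,
 * not kernel code.
 */
struct task_flags {
	bool sve_exec;        /* SVE instructions execute without trapping */
	bool sve_fpsimd_regs; /* only the FPSIMD subset of the state is valid */
};

/* Syscall: the ABI allows SVE state to be discarded, so trap SVE again. */
static void model_syscall(struct task_flags *t)
{
	t->sve_exec = false;
	t->sve_fpsimd_regs = false;	/* only meaningful with sve_exec set */
}

/* SVE access trap: allow SVE but defer the FPSIMD to SVE conversion. */
static void model_sve_trap(struct task_flags *t)
{
	assert(!t->sve_exec);		/* the trap only fires with SVE disabled */
	t->sve_exec = true;
	t->sve_fpsimd_regs = true;
}

/* Would saving this task's state require the full SVE register file? */
static bool model_needs_full_sve_save(const struct task_flags *t)
{
	return t->sve_exec && !t->sve_fpsimd_regs;
}
```

The last function is the point of the split: after a trap the task may
execute SVE instructions, yet a context switch before the return to
userspace would still only need to save the FPSIMD subset.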

As with current mainline, we disable SVE on every syscall.  This may not
be ideal for applications that mix SVE and syscall usage; strategies
such as SH's fpu_counter may perform better, but we need to assess the
performance on a wider range of systems than are currently available
before implementing anything.  This rework will make that easier.
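
To make the fpu_counter comparison concrete, here is a hypothetical
heuristic loosely inspired by that idea; it is not SH's code and nothing
like it is part of this series.  SVE is left enabled across syscalls once
the task has taken an SVE trap after each of the last few syscalls, and an
8-bit counter that wraps around periodically forces SVE off again so that
usage is re-sampled.  The threshold of 5 and all names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: keep SVE enabled once it was used after this many syscalls. */
#define SVE_KEEP_THRESHOLD 5

struct sve_usage {
	unsigned char counter;   /* 8 bits: wraps to 0, forcing a re-check */
	bool used_since_syscall; /* set when the SVE access trap fires */
};

/* Called from the (modelled) SVE access trap. */
static void usage_sve_trap(struct sve_usage *u)
{
	/* SVE stays enabled after a trap, so it cannot fire twice per syscall. */
	assert(!u->used_since_syscall);
	u->used_since_syscall = true;
}

/* Called on syscall entry; returns true if SVE should be disabled. */
static bool usage_should_disable_on_syscall(struct sve_usage *u)
{
	if (u->used_since_syscall || u->counter >= SVE_KEEP_THRESHOLD)
		u->counter++;	/* SVE is hot, or was left enabled: keep aging */
	else
		u->counter = 0;	/* a syscall with no SVE use in between */

	u->used_since_syscall = false;
	return u->counter < SVE_KEEP_THRESHOLD;
}
```

The wraparound matters because, while SVE is left enabled, no trap fires
and the kernel has no other way to observe whether the task still uses
SVE.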

It is also possible to optimize the case when the SVE vector length
is 128 bits (i.e. the same size as the FPSIMD vectors).  This could be
explored in the future; it becomes a lot easier to do with this
implementation.
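
One way to see why the 128-bit case is special: the vector length (VL)
determines how much state lies outside the shared FPSIMD subset.  The
sketch below, with hypothetical helper names and VL given in bytes,
computes that amount.  At a VL of 16 bytes no Z register bits lie beyond
V0-V31, although the predicate registers P0-P15 and FFR (VL/8 bits each)
still exist, so this assumes they would still need clearing.

```c
#include <assert.h>
#include <stddef.h>

#define FPSIMD_VREG_BYTES 16	/* V0-V31 are 128 bits wide */
#define SVE_NUM_ZREGS 32
#define SVE_NUM_PREGS 16

/* Bytes of Z0-Z31 outside the shared FPSIMD subset, for a VL in bytes. */
static size_t z_bytes_beyond_fpsimd(size_t vl)
{
	/* VL is a non-zero multiple of 128 bits */
	assert(vl >= FPSIMD_VREG_BYTES && vl % FPSIMD_VREG_BYTES == 0);
	return SVE_NUM_ZREGS * (vl - FPSIMD_VREG_BYTES);
}

/* Bytes of P0-P15 plus FFR; each is VL/8 bits, i.e. vl / 8 bytes. */
static size_t p_bytes(size_t vl)
{
	return (SVE_NUM_PREGS + 1) * (vl / 8);
}
```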

I still need to confirm whether KVM needs an update to handle
TIF_SVE_FPSIMD_REGS properly.  I'll do that as part of redoing the KVM
testing, but that will take a little while and I felt it was important
to get this out for review now.

v8:
 - Replace TIF_SVE_FULL_REGS with TIF_SVE_FPSIMD_REGS, inverting the
   sense of the flag.  This is more in line with a convention mentioned
   by Dave and fixes some issues that I turned up in testing after doing
   some of the other updates.
 - Clarify that we only need to do anything with TIF_SVE_FPSIMD_REGS on
   entry to the kernel if TIF_SVE_EXEC is set and that the flag is
   always set on exit to userspace if TIF_SVE_EXEC is set.
 - Use a local pointer for fpsimd_state in task_fpsimd_load().
 - Restructure task_fpsimd_load() for readability.
 - Explicitly ensure that TIF_SVE_EXEC is set in
   sve_set_vector_length(), fpsimd_signal_preserve_current_state(),
   sve_init_header_from_task().
 - Drop several more hopefully redundant system_supports_sve() checks,
   splitting that out into a separate patch.
 - More use of vq_minus_1.
v7:
 - A few minor cosmetic updates and one bugfix for
   fpsimd_update_current_state().
v6:
 - Substantially rework the patch so that TIF_SVE is now replaced by
   two flags TIF_SVE_EXEC and TIF_SVE_FULL_REGS.
 - Return to disabling SVE after every syscall, as current mainline
   does, rather than leaving it enabled unless reset via ptrace.
v5:
 - Rebase onto v5.10-rc2.
 - Explicitly support the case where TIF_SVE and TIF_SVE_NEEDS_FLUSH are
   set simultaneously, though this is not currently expected to happen.
 - Extensively revised the documentation for TIF_SVE and
   TIF_SVE_NEEDS_FLUSH to hopefully make things clearer together with
   the above.  I hope this addresses the comments on the prior version,
   but it really needs fresh eyes to tell if that's actually the case.
 - Make comments in ptrace.c more precise.
 - Remove some redundant checks for system_has_sve().
v4:
 - Rebase onto v5.9-rc2.
 - Address review comments from Dave Martin, mostly documentation but
   also some refactorings to ensure we don't check capabilities multiple
   times, and the addition of some WARN_ONs to verify our assumptions
   about which TIF_ flags can be set at which times.
v3:
 - Rebased to current kernels.
 - Addressed review comments from v2, mostly around tweaks in the
   documentation.

arm64/sve: Rework SVE trap access to minimise memory access

When we take an SVE access trap, only the subset of the SVE Z0-Z31
registers shared with the FPSIMD V0-V31 registers is valid; the rest
of the bits in the SVE registers must be cleared before returning to
userspace.  Currently we do this by saving the current FPSIMD register
state to the task struct and then using that to initialize the copy of
the SVE registers in the task struct so they can be loaded from there
into the registers.  This requires far more memory access than we
need, especially in the case where we return to userspace without
otherwise needing to save the register state to memory.
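
The memory-based conversion described above can be modelled roughly as
follows.  This is a simplified, hypothetical analogue
(model_fpsimd_to_sve() is not a kernel function): it lays out 32 Z
registers of vl bytes each, copies the saved 128-bit V registers into the
low bytes and zeroes the remainder, ignoring the endianness handling and
predicate registers that a real implementation has to deal with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_VREGS 32
#define VREG_BYTES 16	/* 128-bit FPSIMD V register */

/*
 * Hypothetical, simplified analogue of the memory-based conversion:
 * vregs holds V0-V31 as saved to the task struct (NUM_VREGS * VREG_BYTES
 * bytes), zregs receives Z0-Z31 at vl bytes each.  Endianness and the
 * predicate registers are ignored.
 */
static void model_fpsimd_to_sve(uint8_t *zregs, const uint8_t *vregs, size_t vl)
{
	assert(vl >= VREG_BYTES && vl % VREG_BYTES == 0);

	for (size_t i = 0; i < NUM_VREGS; i++) {
		uint8_t *z = zregs + i * vl;

		memcpy(z, vregs + i * VREG_BYTES, VREG_BYTES);	/* shared low 128 bits */
		memset(z + VREG_BYTES, 0, vl - VREG_BYTES);	/* the rest is cleared */
	}
}
```

Every byte of the zregs buffer is written here and then has to be read
back when the registers are reloaded, which is the memory traffic the
patch avoids in the common case.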

The newly added TIF_SVE_FPSIMD_REGS can be used to reduce this overhead:
instead of doing the conversion immediately, we can set that flag as
well as TIF_SVE_EXEC.  This means that until we return to userspace we
only need to store the FPSIMD registers.  If, as should be the common
case, the hardware still holds the task state and nothing needs to be
reloaded from the task struct, we can initialize the SVE state entirely
in registers.  If we do need to reload the registers from the task
struct, only the FPSIMD subset needs to be loaded from memory.
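
The decision made on return to userspace can be summarised as a small
function.  The enum and function names are hypothetical, and the model
ignores everything except the two flags and whether the task's state is
still live in the CPU registers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the work needed before returning to userspace. */
enum ret_work {
	RET_NOTHING,			/* state already live and complete */
	RET_LOAD_FPSIMD,		/* plain FPSIMD task: reload V0-V31 */
	RET_CONVERT_IN_REGS,		/* clear non-FPSIMD SVE state in place */
	RET_LOAD_FPSIMD_THEN_CONVERT,	/* load only V0-V31, then clear the rest */
	RET_LOAD_FULL_SVE,		/* reload the whole SVE register file */
};

static enum ret_work model_return_work(bool sve_exec, bool sve_fpsimd_regs,
				       bool state_in_regs)
{
	if (!sve_exec)
		return state_in_regs ? RET_NOTHING : RET_LOAD_FPSIMD;
	if (sve_fpsimd_regs)
		return state_in_regs ? RET_CONVERT_IN_REGS
				     : RET_LOAD_FPSIMD_THEN_CONVERT;
	return state_in_regs ? RET_NOTHING : RET_LOAD_FULL_SVE;
}
```

In this model only RET_LOAD_FULL_SVE reads the whole SVE register file
from memory, while the case the patch targets (a trap with the state
still live) touches no memory at all.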

If the FPSIMD state is loaded in the registers then we also need to set
the vector length.  This is because the vector length is otherwise only
set when loading the state from memory, whereas the expectation is that
the vector length is configured whenever TIF_SVE_EXEC is set.  We also
need to rebind the task to the CPU so that the newly allocated SVE state
is used when the task's state is saved.
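
The extra work in the trap handler can be sketched as below, assuming a
VL in bytes and a hypothetical model of the CPU and task state.  The
encoding used is the architectural one for the ZCR_ELx.LEN field, which
holds the number of 128-bit quadwords minus one (the vq_minus_1 value
mentioned in the changelog); the "binding" is simply a pointer standing
in for the per-CPU record of whose state is live.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SVE_VQ_BYTES 16	/* one 128-bit quadword */

/* Hypothetical models of the CPU and task state touched by the trap. */
struct model_cpu {
	unsigned int zcr_len;		/* ZCR_ELx.LEN: quadwords minus one */
	const void *bound_sve_state;	/* where the live state gets saved */
};

struct model_task {
	unsigned int vl;	/* vector length in bytes */
	void *sve_state;	/* newly allocated SVE save buffer */
	bool fpsimd_live;	/* FPSIMD state still loaded in this CPU */
};

static void model_sve_trap_fixup(const struct model_task *t, struct model_cpu *cpu)
{
	assert(t->vl >= SVE_VQ_BYTES && t->vl % SVE_VQ_BYTES == 0);

	if (!t->fpsimd_live)
		return;	/* VL is programmed when the state is loaded from memory */

	cpu->zcr_len = t->vl / SVE_VQ_BYTES - 1;	/* vq_minus_1 */
	cpu->bound_sve_state = t->sve_state;	/* rebind so a save uses it */
}
```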

This is based on earlier work by Julien Grall implementing a similar idea.

Signed-off-by: Mark Brown <broonie@kernel.org>
3 files changed