arm64: do_csum: implement accelerated scalar version

It turns out that the IP checksumming code is still exercised often,
even though one might expect that modern NICs with checksum offload
have no use for it. However, as Lingyan points out, there are
combinations of features where the network stack may still fall back
to software checksumming, and so it makes sense to provide an
optimized implementation in software as well.

So provide an implementation of do_csum() in scalar assembler, which,
unlike C, gives direct access to that carry flag, making the code run
substantially faster.

On Cortex-A57, this implementation is on par with Lingyan's NEON
implementation, and roughly 7x as fast as the generic C code.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
3 files changed