| From: Arnd Bergmann <arnd@arndb.de> |
| Date: Fri, 3 Feb 2017 23:33:23 +0100 |
| Subject: crypto: improve gcc optimization flags for serpent and wp512 |
| |
| commit 7d6e9105026788c497f0ab32fa16c82f4ab5ff61 upstream. |
| |
| An ancient gcc bug (first reported in 2003) has apparently resurfaced |
| on MIPS, where kernelci.org reports an overly large stack frame in the |
| whirlpool hash algorithm: |
| |
| crypto/wp512.c:987:1: warning: the frame size of 1112 bytes is larger than 1024 bytes [-Wframe-larger-than=] |
| |
| With some testing in different configurations, I'm seeing large |
| variations in stack frames size up to 1500 bytes for what should have |
| around 300 bytes at most. I also checked the reference implementation, |
| which is essentially the same code but also comes with some test and |
| benchmarking infrastructure. |
| |
| It seems that recent compiler versions on at least arm, arm64 and powerpc |
| have a partial fix for this problem, but enabling "-fsched-pressure", but |
| even with that fix they suffer from the issue to a certain degree. Some |
| testing on arm64 shows that the time needed to hash a given amount of |
| data is roughly proportional to the stack frame size here, which makes |
| sense given that the wp512 implementation is doing lots of loads for |
| table lookups, and the problem with the overly large stack is a result |
| of doing a lot more loads and stores for spilled registers (as seen from |
| inspecting the object code). |
| |
| Disabling -fschedule-insns consistently fixes the problem for wp512, |
| in my collection of cross-compilers, the results are consistently better |
| or identical when comparing the stack sizes in this function, though |
| some architectures (notable x86) have schedule-insns disabled by |
| default. |
| |
| The four columns are: |
| default: -O2 |
| press: -O2 -fsched-pressure |
| nopress: -O2 -fschedule-insns -fno-sched-pressure |
| nosched: -O2 -no-schedule-insns (disables sched-pressure) |
| |
| default press nopress nosched |
| alpha-linux-gcc-4.9.3 1136 848 1136 176 |
| am33_2.0-linux-gcc-4.9.3 2100 2076 2100 2104 |
| arm-linux-gnueabi-gcc-4.9.3 848 848 1048 352 |
| cris-linux-gcc-4.9.3 272 272 272 272 |
| frv-linux-gcc-4.9.3 1128 1000 1128 280 |
| hppa64-linux-gcc-4.9.3 1128 336 1128 184 |
| hppa-linux-gcc-4.9.3 644 308 644 276 |
| i386-linux-gcc-4.9.3 352 352 352 352 |
| m32r-linux-gcc-4.9.3 720 656 720 268 |
| microblaze-linux-gcc-4.9.3 1108 604 1108 256 |
| mips64-linux-gcc-4.9.3 1328 592 1328 208 |
| mips-linux-gcc-4.9.3 1096 624 1096 240 |
| powerpc64-linux-gcc-4.9.3 1088 432 1088 160 |
| powerpc-linux-gcc-4.9.3 1080 584 1080 224 |
| s390-linux-gcc-4.9.3 456 456 624 360 |
| sh3-linux-gcc-4.9.3 292 292 292 292 |
| sparc64-linux-gcc-4.9.3 992 240 992 208 |
| sparc-linux-gcc-4.9.3 680 592 680 312 |
| x86_64-linux-gcc-4.9.3 224 240 272 224 |
| xtensa-linux-gcc-4.9.3 1152 704 1152 304 |
| |
| aarch64-linux-gcc-7.0.0 224 224 1104 208 |
| arm-linux-gnueabi-gcc-7.0.1 824 824 1048 352 |
| mips-linux-gcc-7.0.0 1120 648 1120 272 |
| x86_64-linux-gcc-7.0.1 240 240 304 240 |
| |
| arm-linux-gnueabi-gcc-4.4.7 840 392 |
| arm-linux-gnueabi-gcc-4.5.4 784 728 784 320 |
| arm-linux-gnueabi-gcc-4.6.4 736 728 736 304 |
| arm-linux-gnueabi-gcc-4.7.4 944 784 944 352 |
| arm-linux-gnueabi-gcc-4.8.5 464 464 760 352 |
| arm-linux-gnueabi-gcc-4.9.3 848 848 1048 352 |
| arm-linux-gnueabi-gcc-5.3.1 824 824 1064 336 |
| arm-linux-gnueabi-gcc-6.1.1 808 808 1056 344 |
| arm-linux-gnueabi-gcc-7.0.1 824 824 1048 352 |
| |
| Trying the same test for serpent-generic, the picture is a bit different, |
| and while -fno-schedule-insns is generally better here than the default, |
| -fsched-pressure wins overall, so I picked that instead. |
| |
| default press nopress nosched |
| alpha-linux-gcc-4.9.3 1392 864 1392 960 |
| am33_2.0-linux-gcc-4.9.3 536 524 536 528 |
| arm-linux-gnueabi-gcc-4.9.3 552 552 776 536 |
| cris-linux-gcc-4.9.3 528 528 528 528 |
| frv-linux-gcc-4.9.3 536 400 536 504 |
| hppa64-linux-gcc-4.9.3 524 208 524 480 |
| hppa-linux-gcc-4.9.3 768 472 768 508 |
| i386-linux-gcc-4.9.3 564 564 564 564 |
| m32r-linux-gcc-4.9.3 712 576 712 532 |
| microblaze-linux-gcc-4.9.3 724 392 724 512 |
| mips64-linux-gcc-4.9.3 720 384 720 496 |
| mips-linux-gcc-4.9.3 728 384 728 496 |
| powerpc64-linux-gcc-4.9.3 704 304 704 480 |
| powerpc-linux-gcc-4.9.3 704 296 704 480 |
| s390-linux-gcc-4.9.3 560 560 592 536 |
| sh3-linux-gcc-4.9.3 540 540 540 540 |
| sparc64-linux-gcc-4.9.3 544 352 544 496 |
| sparc-linux-gcc-4.9.3 544 344 544 496 |
| x86_64-linux-gcc-4.9.3 528 536 576 528 |
| xtensa-linux-gcc-4.9.3 752 544 752 544 |
| |
| aarch64-linux-gcc-7.0.0 432 432 656 480 |
| arm-linux-gnueabi-gcc-7.0.1 616 616 808 536 |
| mips-linux-gcc-7.0.0 720 464 720 488 |
| x86_64-linux-gcc-7.0.1 536 528 600 536 |
| |
| arm-linux-gnueabi-gcc-4.4.7 592 440 |
| arm-linux-gnueabi-gcc-4.5.4 776 448 776 544 |
| arm-linux-gnueabi-gcc-4.6.4 776 448 776 544 |
| arm-linux-gnueabi-gcc-4.7.4 768 448 768 544 |
| arm-linux-gnueabi-gcc-4.8.5 488 488 776 544 |
| arm-linux-gnueabi-gcc-4.9.3 552 552 776 536 |
| arm-linux-gnueabi-gcc-5.3.1 552 552 776 536 |
| arm-linux-gnueabi-gcc-6.1.1 560 560 776 536 |
| arm-linux-gnueabi-gcc-7.0.1 616 616 808 536 |
| |
| I did not do any runtime tests with serpent, so it is possible that stack |
| frame size does not directly correlate with runtime performance here and |
| it actually makes things worse, but it's more likely to help here, and |
| the reduced stack frame size is probably enough reason to apply the patch, |
| especially given that the crypto code is often used in deep call chains. |
| |
| Link: https://kernelci.org/build/id/58797d7559b5149efdf6c3a9/logs/ |
| Link: http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html |
| Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11488 |
| Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149 |
| Cc: Ralf Baechle <ralf@linux-mips.org> |
| Signed-off-by: Arnd Bergmann <arnd@arndb.de> |
| Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> |
| Signed-off-by: Ben Hutchings <ben@decadent.org.uk> |
| --- |
| crypto/Makefile | 2 ++ |
| 1 file changed, 2 insertions(+) |
| |
| --- a/crypto/Makefile |
| +++ b/crypto/Makefile |
| @@ -47,6 +47,7 @@ obj-$(CONFIG_CRYPTO_SHA1) += sha1_generi |
| obj-$(CONFIG_CRYPTO_SHA256) += sha256_generic.o |
| obj-$(CONFIG_CRYPTO_SHA512) += sha512_generic.o |
| obj-$(CONFIG_CRYPTO_WP512) += wp512.o |
| +CFLAGS_wp512.o := $(call cc-option,-fno-schedule-insns) # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149 |
| obj-$(CONFIG_CRYPTO_TGR192) += tgr192.o |
| obj-$(CONFIG_CRYPTO_GF128MUL) += gf128mul.o |
| obj-$(CONFIG_CRYPTO_ECB) += ecb.o |
| @@ -67,6 +68,7 @@ obj-$(CONFIG_CRYPTO_BLOWFISH_COMMON) += |
| obj-$(CONFIG_CRYPTO_TWOFISH) += twofish_generic.o |
| obj-$(CONFIG_CRYPTO_TWOFISH_COMMON) += twofish_common.o |
| obj-$(CONFIG_CRYPTO_SERPENT) += serpent_generic.o |
| +CFLAGS_serpent_generic.o := $(call cc-option,-fsched-pressure) # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149 |
| obj-$(CONFIG_CRYPTO_AES) += aes_generic.o |
| obj-$(CONFIG_CRYPTO_CAMELLIA) += camellia_generic.o |
| obj-$(CONFIG_CRYPTO_CAST_COMMON) += cast_common.o |