path: root/tests/validation/CL/MatMulLowpNativeKernel.cpp
author    Ethan Doe <yidoe@amazon.com>  2023-04-14 17:24:33 +0000
committer Pablo Marquez Tello <pablo.tello@arm.com>  2023-04-19 14:26:25 +0000
commit    a07c01b6cad1fa37f98a05f08019b40bd4303a92 (patch)
tree      e543988368a7526877a434a27f4dccf6562a6e6f /tests/validation/CL/MatMulLowpNativeKernel.cpp
parent    9c7c2d2d23693877867bb3284c577b33cfbff471 (diff)
download  ComputeLibrary-a07c01b6cad1fa37f98a05f08019b40bd4303a92.tar.gz
NETranspose 8x8 kernel for 32-bit elements
The existing 4x4 tiling for the 32-bit transpose is not efficient on aarch64, given that many more Neon registers are available there. Enlarging the tile size to 8x8 therefore greatly improves NETranspose latency. For example, on AWS Graviton3 processors, this change improves the latency of transposing a 768x768 matrix from 0.32ms down to 0.19ms; improvements are also seen across other matrix sizes. Further enlarging the tile to 8x16 or 16x16 does not perform as well as 8x8 due to register pressure.

This change mitigates the issue reported at:
https://github.com/ARM-software/ComputeLibrary/issues/1045

Signed-off-by: Ethan Doe <yidoe@amazon.com>
Change-Id: Ia09859cdf2f6d312e67219a9d95a3a3bf1db1999
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/9448
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
Diffstat (limited to 'tests/validation/CL/MatMulLowpNativeKernel.cpp')
0 files changed, 0 insertions, 0 deletions