path: root/tests/validation/CL/MatMulLowpNativeKernel.cpp
author    Ethan Doe <yidoe@amazon.com>  2023-04-14 17:24:33 +0000
committer Pablo Marquez Tello <pablo.tello@arm.com>  2023-04-19 14:26:25 +0000
commit    a07c01b6cad1fa37f98a05f08019b40bd4303a92 (patch)
tree      e543988368a7526877a434a27f4dccf6562a6e6f /tests/validation/CL/MatMulLowpNativeKernel.cpp
parent    9c7c2d2d23693877867bb3284c577b33cfbff471 (diff)
download  ComputeLibrary-a07c01b6cad1fa37f98a05f08019b40bd4303a92.tar.gz
NETranspose 8x8 kernel for 32-bit elements
The existing 4x4 tiling for the 32-bit transpose is not efficient on aarch64, given that many more Neon registers are available there. Enlarging the tile size to 8x8 therefore greatly improves NETranspose latency. For example, on AWS Graviton3 processors, this change improves the latency of transposing a 768x768 matrix from 0.32ms down to 0.19ms; improvements are also seen across other matrix sizes. Further enlarging the tile to 8x16 or 16x16 does not perform as well as 8x8 due to register pressure.

This change mitigates the issue reported at:
https://github.com/ARM-software/ComputeLibrary/issues/1045

Signed-off-by: Ethan Doe <yidoe@amazon.com>
Change-Id: Ia09859cdf2f6d312e67219a9d95a3a3bf1db1999
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/9448
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
Diffstat (limited to 'tests/validation/CL/MatMulLowpNativeKernel.cpp')
0 files changed, 0 insertions, 0 deletions