author | Ethan Doe <yidoe@amazon.com> | 2023-04-14 17:24:33 +0000 |
---|---|---|
committer | Pablo Marquez Tello <pablo.tello@arm.com> | 2023-04-19 14:26:25 +0000 |
commit | a07c01b6cad1fa37f98a05f08019b40bd4303a92 (patch) | |
tree | e543988368a7526877a434a27f4dccf6562a6e6f /tests/validation/CL | |
parent | 9c7c2d2d23693877867bb3284c577b33cfbff471 (diff) | |
download | ComputeLibrary-a07c01b6cad1fa37f98a05f08019b40bd4303a92.tar.gz |
NETranspose 8x8 kernel for 32-bit elements
The existing 4x4 tiling for 32-bit transpose is inefficient on aarch64, which has many more Neon registers available. Increasing the tile size to 8x8 substantially improves NETranspose latency.
For example, on AWS Graviton3 processors, with this change I observed the latency of transposing a 768x768 matrix drop from 0.32ms to 0.19ms. Improvements can also be seen across other matrix sizes.
Enlarging the tile size further to 8x16 or 16x16 does not perform as well as 8x8 because of register pressure.
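The tiling idea and the register-pressure argument can be illustrated with a minimal portable sketch. This is not the kernel added by this change, which is written with Neon intrinsics; it is a scalar C++ stand-in under stated assumptions. The names `transpose_tiled` and `q_registers_for_tile` are hypothetical and do not exist in Compute Library. The register count covers only the registers needed to hold the tile itself (128-bit registers, four 32-bit elements each); a real kernel may need additional temporaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of 128-bit registers needed just to hold one
// tile of 32-bit elements (4 elements per register). aarch64 provides 32
// such registers (V0-V31).
constexpr std::size_t q_registers_for_tile(std::size_t tile_rows, std::size_t tile_cols)
{
    return (tile_rows * tile_cols * sizeof(std::uint32_t)) / 16;
}

// 4x4 -> 4, 8x8 -> 16, 8x16 -> 32, 16x16 -> 64 registers.
static_assert(q_registers_for_tile(4, 4) == 4, "4x4 tile");
static_assert(q_registers_for_tile(8, 8) == 16, "8x8 tile");
static_assert(q_registers_for_tile(8, 16) == 32, "8x16 tile");
static_assert(q_registers_for_tile(16, 16) == 64, "16x16 tile");

// Hypothetical scalar sketch: transpose a row-major rows x cols matrix of
// 32-bit elements one tile at a time. Partial tiles at the right and bottom
// edges are handled by clamping the tile bounds.
std::vector<std::uint32_t> transpose_tiled(const std::vector<std::uint32_t> &src,
                                           std::size_t rows, std::size_t cols,
                                           std::size_t tile = 8)
{
    assert(src.size() == rows * cols);
    assert(tile > 0);

    std::vector<std::uint32_t> dst(rows * cols);
    for (std::size_t r0 = 0; r0 < rows; r0 += tile)
    {
        const std::size_t r1 = std::min(r0 + tile, rows);
        for (std::size_t c0 = 0; c0 < cols; c0 += tile)
        {
            const std::size_t c1 = std::min(c0 + tile, cols);
            // Element (r, c) of the source becomes element (c, r) of the
            // destination, which has `rows` columns.
            for (std::size_t r = r0; r < r1; ++r)
            {
                for (std::size_t c = c0; c < c1; ++c)
                {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
    return dst;
}
```

Under these assumptions an 8x8 tile occupies 16 of the 32 available registers and leaves room for intermediates, whereas an 8x16 tile already needs all 32 and a 16x16 tile cannot fit, which is consistent with the register pressure described above.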
This change mitigates the issue reported at:
https://github.com/ARM-software/ComputeLibrary/issues/1045
Signed-off-by: Ethan Doe <yidoe@amazon.com>
Change-Id: Ia09859cdf2f6d312e67219a9d95a3a3bf1db1999
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/9448
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
Diffstat (limited to 'tests/validation/CL')
0 files changed, 0 insertions, 0 deletions