From c5ab4df0c11dc66db47f2070edc719923af3367e Mon Sep 17 00:00:00 2001
From: SiCong Li <sicong.li@arm.com>
Date: Tue, 17 Oct 2023 17:38:57 +0100
Subject: Optimize CpuGemmConv2d start-up time

When the weight tensor has no holes, we can replace CpuWeightsReshapeKernel with:
 - Collapse by reinterpreting weight's 3 spatial dimensions
 - Perform CpuTranspose
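
The two steps above can be illustrated with a minimal, library-independent
sketch. It assumes a dense weight buffer in which x is the fastest-moving
dimension; WeightShape, reshape_direct and collapse_then_transpose are
hypothetical names that are not part of the library, and the real layout
handling lives in src/cpu/operators/CpuGemmConv2d.cpp. The sketch only shows
why an element-by-element reshape and a "collapse, then 2D transpose" can give
the same matrix under that assumption:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense weight layout (an assumption of this sketch):
// index = ((o * ifm + c) * kh + y) * kw + x, with no padding between elements.
struct WeightShape
{
    std::size_t kw, kh, ifm, ofm;
    std::size_t k() const { return kw * kh * ifm; } // length of one linearized 3D kernel
};

// Reference: element-by-element reshape into a K x OFM row-major matrix,
// where column o holds the linearized 3D kernel of output feature map o.
std::vector<float> reshape_direct(const std::vector<float> &w, const WeightShape &s)
{
    assert(w.size() == s.k() * s.ofm);
    std::vector<float> out(w.size());
    for (std::size_t o = 0; o < s.ofm; ++o)
        for (std::size_t c = 0; c < s.ifm; ++c)
            for (std::size_t y = 0; y < s.kh; ++y)
                for (std::size_t x = 0; x < s.kw; ++x)
                {
                    const std::size_t src = ((o * s.ifm + c) * s.kh + y) * s.kw + x;
                    const std::size_t row = (c * s.kh + y) * s.kw + x;
                    out[row * s.ofm + o] = w[src];
                }
    return out;
}

// Alternative: view the same buffer as an OFM x K row-major matrix (the
// "collapse" is only a reinterpretation, no data moves), then transpose it.
std::vector<float> collapse_then_transpose(const std::vector<float> &w, const WeightShape &s)
{
    assert(w.size() == s.k() * s.ofm);
    const std::size_t k = s.k();
    std::vector<float> out(w.size());
    for (std::size_t o = 0; o < s.ofm; ++o)
        for (std::size_t i = 0; i < k; ++i)
            out[i * s.ofm + o] = w[o * k + i];
    return out;
}
```

Because the buffer is dense, the source offset in reshape_direct is exactly
o * K + row, so both functions write the same values to the same positions.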

For more details see the documentation in
src/cpu/operators/CpuGemmConv2d.cpp

This is the first optimization: CpuTranspose performs better than
CpuWeightsReshapeKernel.

A second optimization is to fuse this transpose with other weight
transformations (e.g. pretranspose_B_array in CpuGemmAssemblyDispatch).

However, this second optimization depends on how the underlying gemm
methods (the fallback path: CpuGemmMatrixMultiplyKernel, or the assembly
path: CpuGemmAssemblyDispatch) choose to fuse the transpose.

Therefore, this patch moves the transpose down from CpuGemmConv2d to
the individual gemm operators, where the fusion decision needs to be
made, by passing an extra "transpose_b" flag to CpuGemm.
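
As a rough illustration of why the decision belongs in the gemm operator, the
sketch below shows two ways a gemm could honour such a flag: materialize the
transposed B in an auxiliary buffer, or fold the transpose into the access
pattern. Matrix, gemm_explicit and gemm_fused are hypothetical names, and this
is not how CpuGemm or CpuGemmAssemblyDispatch are implemented; it only
assumes row-major storage and a flag meaning "use B transposed":

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix; all names in this sketch are hypothetical.
struct Matrix
{
    std::size_t rows, cols;
    std::vector<float> data;
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

Matrix transpose(const Matrix &m)
{
    Matrix t{m.cols, m.rows, std::vector<float>(m.data.size())};
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            t.data[c * t.cols + r] = m.at(r, c);
    return t;
}

// Option 1: materialize the transposed B in an auxiliary matrix, then multiply.
Matrix gemm_explicit(const Matrix &a, const Matrix &b, bool transpose_b)
{
    const Matrix bt = transpose_b ? transpose(b) : b; // auxiliary copy
    assert(a.cols == bt.rows);
    Matrix out{a.rows, bt.cols, std::vector<float>(a.rows * bt.cols, 0.0f)};
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < bt.cols; ++j)
            for (std::size_t k = 0; k < a.cols; ++k)
                out.data[i * out.cols + j] += a.at(i, k) * bt.at(k, j);
    return out;
}

// Option 2: fuse the transpose into the read of B, with no auxiliary matrix.
Matrix gemm_fused(const Matrix &a, const Matrix &b, bool transpose_b)
{
    const std::size_t inner = transpose_b ? b.cols : b.rows;
    const std::size_t n     = transpose_b ? b.rows : b.cols;
    assert(a.cols == inner);
    Matrix out{a.rows, n, std::vector<float>(a.rows * n, 0.0f)};
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < inner; ++k)
                out.data[i * n + j] += a.at(i, k) * (transpose_b ? b.at(j, k) : b.at(k, j));
    return out;
}
```

Both options produce the same result; which one is cheaper depends on what
other transformations the gemm already applies to B, which the caller cannot
know.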

New transpose_b flag in different scopes (they are all the same, but
with different names because pretranspose_b has a different meaning in
GemmAssemblyDispatch):
GEMMInfo::pretranspose_B -> AsmGemmInfo::transpose_b

New auxiliary tensors holding the transposed b result:
- CpuGemm optimized path: CpuGemmAssemblyDispatch::PrePretransposedB
- CpuGemm fallback path:  CpuGemm::PreTransposedRHS

Note that this patch does not yet have the second optimization
(COMPMID-6595), but it prepares for it.

Relates to COMPMID-6595
Resolves COMPMID-6499

Change-Id: I999a2da9da4b2b15369a3cc06d7872c86e0190ea
Signed-off-by: SiCong Li <sicong.li@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10526
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Anitha Raj <Anitha.Raj@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
---
 docs/user_guide/release_version_and_change_log.dox | 1 +
 1 file changed, 1 insertion(+)

(limited to 'docs/user_guide')

diff --git a/docs/user_guide/release_version_and_change_log.dox b/docs/user_guide/release_version_and_change_log.dox
index c07cf88d80..b6627a9701 100644
--- a/docs/user_guide/release_version_and_change_log.dox
+++ b/docs/user_guide/release_version_and_change_log.dox
@@ -59,6 +59,7 @@ v23.11 Public major release
    - Optimize @ref NEStackLayer
    - Optimize @ref CLReductionOperation.
    - Optimize @ref CLSoftmaxLayer.
+   - Optimize start-up time of @ref NEConvolutionLayer for some input configurations where GeMM is selected as the convolution algorithm
  - Add new OpenCL™ kernels:
    - @ref opencl::kernels::ClMatMulLowpNativeMMULKernel support for QASYMM8 and QASYMM8_SIGNED, with batch support
  - Deprecate support for Bfloat16 in @ref cpu::CpuCast.
-- 
cgit v1.2.1