author | SiCong Li <sicong.li@arm.com> | 2023-10-17 17:38:57 +0100 |
---|---|---|
committer | SiCong Li <sicong.li@arm.com> | 2023-11-08 09:49:56 +0000 |
commit | c5ab4df0c11dc66db47f2070edc719923af3367e (patch) | |
tree | c04bdac32528e628b2a9b9a1c1653e300328fc1b /src/cpu/operators/CpuGemm.h | |
parent | 4a9dbedfbfa66c2612c7461e60cd867b8aea825b (diff) | |
download | ComputeLibrary-c5ab4df0c11dc66db47f2070edc719923af3367e.tar.gz |
Optimize CpuGemmConv2d start-up time
When weight has no holes, we can replace CpuWeightsReshapeKernel with:
- Collapse by reinterpreting weight's 3 spatial dimensions
- Perform CpuTranspose
For more details see the documentation in
src/cpu/operators/CpuGemmConv2d.cpp
This is the first optimization, since CpuTranspose performs better
than CpuWeightsReshapeKernel.
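The equivalence behind this replacement can be checked with a minimal, library-independent sketch. It assumes a dense (hole-free) weight buffer with kernel width as the innermost dimension and output channels as the outermost; the function names `reshape_reference` and `collapse_then_transpose` are hypothetical and are not part of Compute Library, and the real kernels also handle details (e.g. bias, data layouts) that are ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference: linearise each output channel's 3D kernel into one
// column of a row-major (K x M) matrix, where K = kw * kh * c.
std::vector<float> reshape_reference(const std::vector<float> &w,
                                     std::size_t kw, std::size_t kh,
                                     std::size_t c, std::size_t m)
{
    const std::size_t k = kw * kh * c;
    std::vector<float> out(k * m);
    for (std::size_t o = 0; o < m; ++o)
        for (std::size_t z = 0; z < c; ++z)
            for (std::size_t y = 0; y < kh; ++y)
                for (std::size_t x = 0; x < kw; ++x)
                {
                    const std::size_t src = ((o * c + z) * kh + y) * kw + x; // dense 4D offset
                    const std::size_t row = (z * kh + y) * kw + x;           // linearised 3D kernel index
                    out[row * m + o] = w[src];
                }
    return out;
}

// Collapse + transpose: because the buffer is dense, the same memory can be
// viewed as a row-major (M x K) matrix without copying; a plain 2D transpose
// then yields the (K x M) matrix.
std::vector<float> collapse_then_transpose(const std::vector<float> &w,
                                           std::size_t kw, std::size_t kh,
                                           std::size_t c, std::size_t m)
{
    const std::size_t k = kw * kh * c; // the three collapsed dimensions
    std::vector<float> out(k * m);
    for (std::size_t r = 0; r < m; ++r)
        for (std::size_t col = 0; col < k; ++col)
            out[col * m + r] = w[r * k + col];
    return out;
}
```

Both functions produce the same matrix because the dense 4D offset equals `o * K + row`. If the weight had holes (padding between rows or planes), the collapsed 2D view would no longer be valid, which is why the commit restricts this path to hole-free weights.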
A second optimization is to fuse this transpose with other weight
transformations (e.g. pretranspose_B_array in CpuGemmAssemblyDispatch)
However, this second optimization depends on how the underlying gemm
methods (the fallback path: CpuGemmMatrixMultiplyKernel, or the assembly
path: CpuGemmAssemblyDispatch) choose to fuse the transpose.
Therefore, this patch moves the transpose down from CpuGemmConv2d, to
the individual gemm operators where the fusion decision needs to be
made, by passing an extra "transpose_b" flag to CpuGemm
New transpose_b flag in different scopes (they are all the same, but
with different names because pretranspose_b has a different meaning in
GemmAssemblyDispatch):
GEMMInfo::pretranspose_B -> AsmGemmInfo::transpose_b
New auxiliary tensors holding the transposed b result:
- CpuGemm optimized path: CpuGemmAssemblyDispatch::PrePretransposedB
- CpuGemm fallback path: CpuGemm::PreTransposedRHS
Note that this patch does not yet have the second optimization
(COMPMID-6595), but it prepares for it.
Relates to COMPMID-6595
Resolves COMPMID-6499
Change-Id: I999a2da9da4b2b15369a3cc06d7872c86e0190ea
Signed-off-by: SiCong Li <sicong.li@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10526
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Anitha Raj <Anitha.Raj@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Diffstat (limited to 'src/cpu/operators/CpuGemm.h')
-rw-r--r-- | src/cpu/operators/CpuGemm.h | 21 |
1 files changed, 13 insertions, 8 deletions
diff --git a/src/cpu/operators/CpuGemm.h b/src/cpu/operators/CpuGemm.h
index 6b30d134fa..a05258d206 100644
--- a/src/cpu/operators/CpuGemm.h
+++ b/src/cpu/operators/CpuGemm.h
@@ -21,8 +21,8 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  */
-#ifndef ARM_COMPUTE_CPU_GEMM_H
-#define ARM_COMPUTE_CPU_GEMM_H
+#ifndef ACL_SRC_CPU_OPERATORS_CPUGEMM_H
+#define ACL_SRC_CPU_OPERATORS_CPUGEMM_H
 
 #include "arm_compute/core/ITensorPack.h"
 #include "arm_compute/core/TensorInfo.h"
@@ -36,6 +36,7 @@
 #include "src/cpu/kernels/CpuGemmTranspose1xWKernel.h"
 #include "src/cpu/operators/CpuActivation.h"
 #include "src/cpu/operators/CpuAdd.h"
+#include "src/cpu/operators/CpuTranspose.h"
 #include "src/cpu/operators/internal/CpuGemmAssemblyDispatch.h"
 
 #include <memory>
@@ -144,16 +145,17 @@ public:
 private:
     enum AuxTensorIdx
     {
-        AsmGemmWorkspace = 0,
-        Pretraspose,
-        InterleavedLHS,
-        TransposedRHS,
+        /* Slots 0 - 2 reserved for CpuGemmAssemblyDispatch */
+        InterleavedLHS = 3,
+        PreTransposedRHS,
+        Transposed1xWRHS,
         TempResult,
         Count
     };
 
     std::unique_ptr<kernels::CpuGemmInterleave4x4Kernel> _interleave_kernel{nullptr};
-    std::unique_ptr<kernels::CpuGemmTranspose1xWKernel> _transpose_kernel{nullptr};
+    std::unique_ptr<CpuTranspose> _pretranspose_b_func{nullptr};
+    std::unique_ptr<kernels::CpuGemmTranspose1xWKernel> _transpose1xW_b_kernel{nullptr};
     std::unique_ptr<kernels::CpuGemmMatrixMultiplyKernel> _mm_kernel{nullptr};
     std::unique_ptr<CpuGemmAssemblyDispatch> _asm_glue{nullptr};
     std::unique_ptr<kernels::CpuGemmMatrixAdditionKernel> _ma_kernel{nullptr};
@@ -162,10 +164,13 @@ private:
     std::unique_ptr<CpuActivation> _activation_func{nullptr};
 
     TensorInfo _tmp_a{};
+    TensorInfo _pretransposed_b{};
     TensorInfo _tmp_b{};
     TensorInfo _tmp_d{};
 
     bool _run_vector_matrix_multiplication{false};
+    bool _run_interleave_transpose{
+        true}; /**< If we run CpuGemmInterleave4x4Kernel on lhs and CpuGemmTranspose1xWKernel on rhs */
     bool _run_alpha_scale{false};
     bool _run_addition{false};
     bool _run_bias_addition{false};
@@ -177,4 +182,4 @@ private:
 };
 } // namespace cpu
 } // namespace arm_compute
-#endif /*ARM_COMPUTE_CPU_GEMM_H */
+#endif // ACL_SRC_CPU_OPERATORS_CPUGEMM_H
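The `AuxTensorIdx` change in this diff starts CpuGemm's slot numbering at 3 so that slots 0 to 2 stay free for CpuGemmAssemblyDispatch. The sketch below illustrates that idea in isolation, assuming both operators index into one shared list of auxiliary tensors. `DispatchAuxIdx` and its enumerators are hypothetical placeholders for the dispatch-side slots, and `slots_overlap` is an illustrative helper; none of them are taken from the library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the three slots owned by the assembly dispatch.
// The real enumerator names live in CpuGemmAssemblyDispatch and are not shown
// in this diff.
enum DispatchAuxIdx : std::size_t
{
    DispatchSlot0 = 0,
    DispatchSlot1,
    DispatchSlot2,
    DispatchCount
};

// Mirrors the new CpuGemm enum from the diff: numbering starts after the
// reserved range.
enum GemmAuxIdx : std::size_t
{
    InterleavedLHS = 3,
    PreTransposedRHS,
    Transposed1xWRHS,
    TempResult,
    Count
};

// Illustrative helper: true if the first CpuGemm slot falls inside the range
// reserved for the dispatch.
constexpr bool slots_overlap()
{
    return static_cast<std::size_t>(InterleavedLHS) < static_cast<std::size_t>(DispatchCount);
}

// Compile-time guard: fails to build if the two ranges ever collide.
static_assert(!slots_overlap(), "CpuGemm slots must start after the reserved dispatch slots");
```

A guard of this kind turns the "slots 0 - 2 reserved" comment into something the compiler checks, at the cost of coupling the two enums; whether the library enforces the reservation this way is not stated in the commit.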