///
/// Copyright (c) 2023 Arm Limited.
///
/// SPDX-License-Identifier: MIT
///
/// Permission is hereby granted, free of charge, to any person obtaining a copy
/// of this software and associated documentation files (the "Software"), to
/// deal in the Software without restriction, including without limitation the
/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
/// sell copies of the Software, and to permit persons to whom the Software is
/// furnished to do so, subject to the following conditions:
///
/// The above copyright notice and this permission notice shall be included in all
/// copies or substantial portions of the Software.
///
/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
/// SOFTWARE.
///

namespace arm_compute
{
/**
@page conv2d_heuristic Convolution 2D heuristic

@section conv2d_heuristic_algorithms_used Convolution 2D heuristic: algorithm selection

Convolution 2D (conv2D for short) is one of the most compute-intensive and performance-critical operators in ML workloads.
This operator can be implemented with different algorithms, which differ in accuracy, kernel size support, and the additional memory they require.
Unfortunately, no single algorithm achieves the best performance in all scenarios.
Therefore, the Arm Compute Library integrates a heuristic within the conv2d operators to select the most efficient algorithm, depending on the input and kernel shapes and the desired level of accuracy.
The heuristic depends on the target backend (either NEON™ for Arm® CPUs or OpenCL for Arm® GPUs), and the following subsections describe the main details behind the algorithm selection.

@attention The heuristics presented in the following subsections refer only to the NHWC data layout, which is the optimal and recommended layout for the Arm Compute Library.

@subsection conv2d_heuristic_on_cpu Convolution 2D heuristic: Arm® Cortex®-based CPUs

The conv2d heuristic for Arm® Cortex®-based CPUs is implemented in the get_convolution_method() method of the CpuConv2d function.
The algorithms considered by get_convolution_method() are the following:
- Direct-Conv2D
- Im2Col+GeMM-based
- Indirect-GeMM (a.k.a. GEMMCONV2D)
- GeMM
- Winograd

@attention Winograd only works with floating-point data types (F32, F16).

The heuristic first checks the less frequent cases that can occur in ML workloads for edge devices. These cases are the following:
-# Non-unit dilation: We call Im2Col+GeMM.
-# Large input and kernel shapes: We call Direct-Conv2D because it is the only algorithm that does not require additional temporary memory.
-# Small Input-Feature-Maps (IFM): In this scenario, we have found that the GeMM implementation is generally more efficient than Winograd and Indirect-GeMM.
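
To give a feeling for why temporary memory matters with large shapes, the following minimal sketch estimates the number of elements that an Im2Col transformation would have to store for one NHWC input. The struct and function names are hypothetical and are not part of the Arm Compute Library API. The workspace actually allocated by the library may differ.

```cpp
#include <cstddef>

// Hypothetical description of a conv2d problem (NHWC, unit dilation, one batch).
struct Conv2dShape
{
    std::size_t in_h, in_w, ifm; // input height, input width, input feature maps
    std::size_t k_h, k_w;        // kernel height and width
    std::size_t stride;          // assumption: same stride in both spatial directions
    std::size_t pad;             // assumption: same padding on every border
};

// Output size along one spatial dimension.
inline std::size_t out_dim(std::size_t in, std::size_t k, std::size_t stride, std::size_t pad)
{
    return (in + 2 * pad - k) / stride + 1;
}

// Elements of the Im2Col matrix: one row per output pixel,
// one column per (kernel element, input feature map) pair.
inline std::size_t im2col_elements(const Conv2dShape &s)
{
    const std::size_t out_h = out_dim(s.in_h, s.k_h, s.stride, s.pad);
    const std::size_t out_w = out_dim(s.in_w, s.k_w, s.stride, s.pad);
    return out_h * out_w * s.k_h * s.k_w * s.ifm;
}
```

For a 224x224x64 input with a 3x3 kernel, unit stride, and padding 1, this gives 224 * 224 * 9 * 64 = 28,901,376 elements, which is nine times the size of the input tensor. A direct convolution reads the input in place and avoids this buffer.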

In the more frequent cases, such as unit dilation or a larger IFM, we evaluate the following conditions instead:
-# Unit kernel size (1x1): In this scenario, the conv2d operation corresponds to a matrix multiplication and we call GeMM.
-# Winograd: It only works with unit strides and supports a limited number of kernel sizes, such as 3x3, 3x1, 1x3, 5x1, 1x5 and 5x5.
-# Indirect-GeMM: It should be used in all cases except when the kernel size is 1x1 or when the IFM is small.

If the preceding cases are not met, we fall back to the Im2Col+GeMM-based algorithm.
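
The order of the checks described above can be summarized with the following minimal sketch. It is not the implementation of get_convolution_method(): the enum, the struct, and the function names are hypothetical, and the boolean fields large_shapes, small_ifm, and indirect_gemm_available are assumptions that stand in for the size thresholds and validation checks of the library, which this page does not specify.

```cpp
// Hypothetical names for the algorithms listed in this section; not the library's enum.
enum class Conv2dAlgo { Direct, Im2ColGemm, IndirectGemm, Gemm, Winograd };

// Hypothetical, simplified description of a conv2d call.
struct Conv2dQuery
{
    unsigned kernel_w, kernel_h;
    unsigned stride_x, stride_y;
    unsigned dilation_x, dilation_y;
    bool     is_float;                // F32 or F16
    bool     large_shapes;            // assumption: result of the "large input and kernel" check
    bool     small_ifm;               // assumption: result of the "small IFM" check
    bool     indirect_gemm_available; // assumption: Indirect-GeMM accepts this configuration
};

// Kernel sizes named in the text as examples supported by Winograd (width x height).
// The text says "such as", so the real list may be longer.
inline bool winograd_kernel_supported(unsigned w, unsigned h)
{
    return (w == 3 && h == 3) || (w == 3 && h == 1) || (w == 1 && h == 3) ||
           (w == 5 && h == 1) || (w == 1 && h == 5) || (w == 5 && h == 5);
}

inline Conv2dAlgo select_cpu_algo(const Conv2dQuery &q)
{
    // Less frequent cases are checked first.
    if (q.dilation_x != 1 || q.dilation_y != 1) return Conv2dAlgo::Im2ColGemm;
    if (q.large_shapes)                         return Conv2dAlgo::Direct;
    if (q.small_ifm)                            return Conv2dAlgo::Gemm;

    // More frequent cases: unit dilation and a larger IFM.
    if (q.kernel_w == 1 && q.kernel_h == 1)     return Conv2dAlgo::Gemm;
    if (q.is_float && q.stride_x == 1 && q.stride_y == 1 &&
        winograd_kernel_supported(q.kernel_w, q.kernel_h))
    {
        return Conv2dAlgo::Winograd;
    }
    if (q.indirect_gemm_available)              return Conv2dAlgo::IndirectGemm;

    // None of the preceding cases matched.
    return Conv2dAlgo::Im2ColGemm;
}
```

In this sketch, a 3x3 floating-point convolution with unit strides and unit dilation selects Winograd, while the same convolution with stride 2 selects Indirect-GeMM when it is available and Im2Col+GeMM otherwise.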

@subsection conv2d_heuristic_on_gpu Convolution 2D heuristic: Arm® Mali™-based GPUs

The conv2d heuristic for Arm® Mali™-based GPUs is implemented in the get_convolution_method() method of the ClConv2d function.

The algorithms considered by get_convolution_method() are the following:
- Direct-Conv2D
- Im2Col+GeMM-based
- Indirect-GeMM
- GeMM
- Winograd

@attention Winograd only works with floating-point data types (F32, F16).

The heuristic first checks the less frequent cases that can occur in ML workloads for edge devices. These cases are the following:
-# Non-unit dilation: We call Im2Col+GeMM.
-# Large input and kernel shapes: We call Direct-Conv2D because it is the only algorithm that does not require additional temporary memory.

In all other cases, the GPU heuristic evaluates the suitability of Winograd and Direct-Conv2D/Indirect-Conv2D.
In particular, Winograd is adopted when the convolution parameters (kernel size and strides) are supported by the algorithm and when the IFM is not small (for example, greater than 8).
Direct-Conv2D is selected under several conditions, and we recommend consulting the heuristic directly for the details.
In general, the Direct-Conv2D operator is used in almost all cases where the kernel size is not 1x1.
The Indirect-GeMM algorithm is used as an alternative to Direct-Conv2D only on the Arm® Mali™-G77 GPU.
If neither Winograd nor Direct-Conv2D can be used, we fall back to either GeMM (when the kernel size is 1x1) or the Im2Col+GeMM-based algorithm.
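
The GPU selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of the ClConv2d function. All names are hypothetical, the flags large_shapes, winograd_supported, and direct_suitable stand in for checks that this page does not detail, the IFM threshold of 8 is the example value quoted above, and evaluating Winograd before Direct-Conv2D is an assumption about the ordering.

```cpp
// Hypothetical names for the algorithms listed in this section; not the library's enum.
enum class ClConv2dAlgo { Direct, Im2ColGemm, IndirectGemm, Gemm, Winograd };

// Hypothetical, simplified description of a conv2d call on the GPU backend.
struct ClConv2dQuery
{
    unsigned kernel_w, kernel_h;
    unsigned dilation_x, dilation_y;
    unsigned ifm;                // number of input feature maps
    bool     large_shapes;       // assumption: result of the "large input and kernel" check
    bool     winograd_supported; // assumption: kernel size, strides, and data type fit Winograd
    bool     direct_suitable;    // assumption: the several Direct-Conv2D conditions hold
    bool     is_mali_g77;        // the target GPU is Arm Mali-G77
};

inline ClConv2dAlgo select_gpu_algo(const ClConv2dQuery &q)
{
    // Less frequent cases are checked first.
    if (q.dilation_x != 1 || q.dilation_y != 1) return ClConv2dAlgo::Im2ColGemm;
    if (q.large_shapes)                         return ClConv2dAlgo::Direct;

    const bool is_1x1 = (q.kernel_w == 1 && q.kernel_h == 1);

    // Winograd: supported parameters and an IFM that is not small
    // (8 is the example threshold given in the text).
    if (q.winograd_supported && q.ifm > 8)      return ClConv2dAlgo::Winograd;

    // Direct-Conv2D, replaced by Indirect-GeMM only on Mali-G77.
    if (!is_1x1 && q.direct_suitable)
    {
        return q.is_mali_g77 ? ClConv2dAlgo::IndirectGemm : ClConv2dAlgo::Direct;
    }

    // Neither Winograd nor Direct-Conv2D can be used.
    return is_1x1 ? ClConv2dAlgo::Gemm : ClConv2dAlgo::Im2ColGemm;
}
```

With these assumptions, a 3x3 convolution with 64 input feature maps selects Winograd when it is supported, and the same convolution with only 4 input feature maps selects Direct-Conv2D, or Indirect-GeMM on Mali-G77.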

*/
} // namespace