author SiCong Li <sicong.li@arm.com> 2019-11-11 15:12:43 +0000
committer SiCong Li <sicong.li@arm.com> 2019-11-12 11:54:27 +0000
commit 94e0cf960ea6116eb57fa88d9b951f859b52c602
tree 72288b3b302a6ace3df71201e3dfb4aca5b5f718
parent 5dea19e58a5521b05e95375c8618a37072697bc0
COMPMID-2690 Extend Doxygen documents to include GEMM Tuner
Signed-off-by: SiCong Li <sicong.li@arm.com>
Change-Id: I998210c3c454c091cfe124f1151f0e052c83a0ef
Reviewed-on: https://review.mlplatform.org/c/2264
Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
-rw-r--r-- examples/gemm_tuner/README.md 92
1 files changed, 76 insertions, 16 deletions
diff --git a/examples/gemm_tuner/README.md b/examples/gemm_tuner/README.md
index e8cb7e5193..3238a9dbda 100644
--- a/examples/gemm_tuner/README.md
+++ b/examples/gemm_tuner/README.md
@@ -1,29 +1,89 @@
# Gemm Tuner
+## Introduction
+
+This is a set of 2 script tools for tuning the performance of OpenCL GEMM
+kernels (limited to Convolution layer functions only for now). Specifically, we
+tune 3 GEMM kernels, each with a different implementation strategy for the GEMM
+operation: native, reshaped, reshaped only rhs. The details of these strategies
+can be found in the documentation of the corresponding kernels:
+CLGEMMMatrixMultiplyNativeKernel, CLGEMMMatrixMultiplyReshapedKernel and
+CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
+
+The output of the tuning process is 1 optimal configuration (called GEMM
+Configuration or GEMMConfig) for each of the 3 strategies.
+
+## Approach
+
+This section gives a brief description and rationale of the approach adopted by
+the current version of GEMM Tuner.
+
+As explained in the Introduction section, the output of the tuner is 1 optimal
+GEMMConfig for each strategy. This is because we can only integrate 1 GEMMConfig
+for each strategy in ACL at compile time. In theory, however, the optimal
+GEMMConfig also depends on different parameters of GEMM (called GEMM Parameter
+or GEMMParam, e.g. the shape of the operation); thus ideally, for each
+strategy, the optimal configuration should be a mapping from GEMMParam to
+GEMMConfig instead of a single GEMMConfig.
+
+To address this issue, we ensure that the single optimal GEMMConfig can
+generalise well to all potential GEMMParams (or at least the ones that we care
+about). The approach we adopt involves a preliminary stage where a collection of
+common GEMMParams (GEMM shapes from popular networks) is compiled. Then, to
+reduce the final tuning time, somewhat counter-intuitively, we first spend a lot
+of time searching for near-optimal GEMMConfigs for each GEMMParam, and then
+discard redundant GEMMParams which share similar optimal GEMMConfigs with
+others. The resultant list of GEMMParams is called a __GEMMParam archetype
+list__, because these GEMMParams are typical enough to capture the space of
+GEMMParams that we care about.
+
+During this preliminary stage we also produce a list of good GEMMConfigs that
+can be used to search for the optimal one in the actual tuning stage. This,
+again, is to reduce the tuning time, and the resultant list is called a
+__GEMMConfig search list__.
+
+The GEMMParam archetype list and the GEMMConfig search list are investigated and
+prepared by the developers; the users of GEMM tuner need not worry about
+producing them, but they need to obtain them prior to running the tuner.
+
+Once these two lists (2 for each strategy, so 6 in total) are obtained, they can
+be fed to the tuner, to produce the optimal GEMMConfig(s).
+
## Pre-requisite
-(Preferably) bash shell
-Built benchmark examples
-python >= 3.6
+* A target device (Android phones, Linux boards, etc.), on which to tune the
+ GEMM kernels, plus these on the device:
+ * (Preferably) Bash shell
+ * Built ACL with benchmark examples
+ * GEMMParam archetype list
+ * GEMMConfig search list
+* A host machine, plus these on the machine:
+ * Python >= 3.6
## Usage
-The tuning consists of 2 steps:
+
+The tuning stage consists of 2 steps:
1. Run benchmarks: Run the runner shell script (benchmark_gemm_examples.sh) on
your target device. Note that all the built benchmark examples have to be
present on your target device prior to running. The script will run the selected
-strategy, over all pre-defined tunable configurations, on a set of gemm shapes
-provided by the user, and then save the benchmark results to json files in an
-output directory.
-
+strategy, over all configs defined in the GEMMConfig search list, on all GEMMParams
+inside the GEMMParam archetype list, and then save the benchmark results to json
+files in an output directory.
+```
[$SHELL] ./benchmark_gemm_examples.sh -s \<strategy\> -e \<example_binary_dir\>
- -g \<gemm_shape_file\> -c \<gemm_config_file\> [-o \<out_dir\>]
-
-2. Run analyser: Run the python script (GemmTuner.py) on your host device.
+-g \<gemmparam_archetype_list\> -c \<gemmconfig_search_list\> [-o \<out_dir\>]
+```
+2. Run analyser: Run the python script (GemmTuner.py) on your host machine.
You'll need to transfer all the benchmark result json files generated from the
-previous step to your host machine beforehand. Note that this requires
-python >= 3.6. The script will output the best configuration, along with some
-analysis statistics for each strategy, and optionally save the parsed benchmark
-results into csv files (one for each strategy) for further analysis.
-
+previous step to your host machine beforehand. Note that this requires python >=
+3.6. The script will output the best configuration, along with some analysis
+statistics for each strategy, and optionally save the parsed benchmark results
+into csv files (one for each strategy) for further analysis.
+An optional tolerance (in milliseconds, as measured by the OpenCL timer) sets
+how far apart in performance two GEMMConfigs have to be to be considered
+different. The default value is 0.01 ms, and it is recommended to keep this
+value below 0.1 ms.
+```
python GemmTuner.py -b \<benchmark_results_dir\> [-t \<tolerance\>]
[-o \<out_dir\>]
+```
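
The role of the tolerance described in step 2 can be illustrated with a minimal Python sketch. This is not the logic of GemmTuner.py, which this diff does not show. The function name `configs_within_tolerance`, the config labels and the timings are all hypothetical, and the sketch assumes that two GEMMConfigs count as equivalent when their measured times differ by no more than the tolerance.

```python
from typing import Dict, List, Tuple


def configs_within_tolerance(
    timings_ms: Dict[str, float], tolerance_ms: float = 0.01
) -> Tuple[str, List[str]]:
    """Return the fastest config and every config indistinguishable from it.

    timings_ms maps a hypothetical GEMMConfig label to its measured time in
    milliseconds. A config is treated as equivalent to the fastest one when
    it is slower by at most tolerance_ms (assumed comparison rule).
    """
    best = min(timings_ms, key=timings_ms.get)
    best_time = timings_ms[best]
    equivalent = sorted(
        name for name, t in timings_ms.items() if t - best_time <= tolerance_ms
    )
    return best, equivalent


# Made-up timings for three hypothetical configs of one strategy.
timings = {"cfg_a": 1.250, "cfg_b": 1.255, "cfg_c": 1.400}
best, equivalent = configs_within_tolerance(timings)
print(best, equivalent)
```

With the default 0.01 ms tolerance, `cfg_a` and `cfg_b` are treated as equally good and `cfg_c` is clearly slower. A larger tolerance merges more configs into the same group, which is one way to read the recommendation to keep the value below 0.1 ms.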