From 94e0cf960ea6116eb57fa88d9b951f859b52c602 Mon Sep 17 00:00:00 2001
From: SiCong Li <sicong.li@arm.com>
Date: Mon, 11 Nov 2019 15:12:43 +0000
Subject: COMPMID-2690 Extend Doxygen documents to include GEMM Tuner

Signed-off-by: SiCong Li <sicong.li@arm.com>
Change-Id: I998210c3c454c091cfe124f1151f0e052c83a0ef
Reviewed-on: https://review.mlplatform.org/c/2264
Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
---
 examples/gemm_tuner/README.md | 92 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 76 insertions(+), 16 deletions(-)

diff --git a/examples/gemm_tuner/README.md b/examples/gemm_tuner/README.md
index e8cb7e5193..3238a9dbda 100644
--- a/examples/gemm_tuner/README.md
+++ b/examples/gemm_tuner/README.md
@@ -1,29 +1,89 @@
 # Gemm Tuner
 
+## Introduction
+
+This is a set of 2 script tools for tuning the performance of OpenCL GEMM
+kernels (currently limited to Convolution layer functions). Specifically, we
+tune 3 GEMM kernels, each with a different implementation strategy of the GEMM
+operation: native, reshaped, reshaped only rhs. The details of these strategies
+can be found in the documentation of the corresponding kernels:
+CLGEMMMatrixMultiplyNativeKernel, CLGEMMMatrixMultiplyReshapedKernel and
+CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
+
+The output of the tuning process is 1 optimal configuration (called GEMM
+Configuration or GEMMConfig) for each of the 3 strategies.
+
+## Approach
+
+This section gives a brief description and rationale of the approach adopted by
+the current version of GEMM Tuner.
+
+As explained in the Introduction section, the output of the tuner is 1 optimal
+GEMMConfig for each strategy. This is because we can only integrate 1 GEMMConfig
+for each strategy in ACL at compile time. In theory, however, the optimal
+GEMMConfig also depends on different parameters of GEMM (called GEMM Parameter
+or GEMMParam, e.g. the shape of the operation); thus ideally, for each
+strategy, the optimal configuration should be a mapping from GEMMParam to
+GEMMConfig instead of a single GEMMConfig.
+
+To address this issue, we ensure that the single optimal GEMMConfig
+generalises well to all potential GEMMParams (or at least the ones that we care
+about). The approach we adopt involves a preliminary stage where a collection of
+common GEMMParams (GEMM shapes from popular networks) is compiled. Then, to
+reduce the final tuning time, somewhat counterintuitively, we first spend a lot
+of time searching for near-optimal GEMMConfigs for each GEMMParam, and then
+discard redundant GEMMParams which share similar optimal GEMMConfigs with
+others. The resultant list of GEMMParams is called a __GEMMParam archetype
+list__, because these GEMMParams are typical enough to capture the space of
+GEMMParams that we care about.
+
+During this preliminary stage we also produce a list of good GEMMConfigs that
+can be used to search for the optimal one in the actual tuning stage. This,
+again, is to reduce the tuning time, and the resultant list is called a
+__GEMMConfig search list__.
+
+The GEMMParam archetype list and the GEMMConfig search list are investigated and
+prepared by the developers; the users of GEMM Tuner need not worry about
+producing them, but they need to obtain them prior to running the tuner.
+
+Once these two lists (2 for each strategy, so 6 in total) are obtained, they can
+be fed to the tuner to produce the optimal GEMMConfig for each strategy.
+
 ## Pre-requisite
-(Preferably) bash shell
-Built benchmark examples
-python >= 3.6
+* A target device (Android phone, Linux board, etc.) on which to tune the
+  GEMM kernels, with the following on the device:
+    * (Preferably) Bash shell
+    * Built ACL with benchmark examples
+    * GEMMParam archetype list
+    * GEMMConfig search list
+* A host machine, with the following on the machine:
+    * Python >= 3.6
 
 ## Usage
-The tuning consists of 2 steps:
+
+The tuning stage consists of 2 steps:
 
 1. Run benchmarks: Run the runner shell script (benchmark_gemm_examples.sh) on
 your target device. Note that all the built benchmark examples have to be
 present on your target device prior to running. The script will run the selected
-strategy, over all pre-defined tunable configurations, on a set of gemm shapes
-provided by the user, and then save the benchmark results to json files in an
-output directory.
-
+strategy, over all configs defined in the GEMMConfig search list, on all
+GEMMParams inside the GEMMParam archetype list, and then save the benchmark
+results to json files in an output directory.
+```
 [$SHELL] ./benchmark_gemm_examples.sh -s \<strategy\> -e \<example_binary_dir\>
-    -g \<gemm_shape_file\> -c \<gemm_config_file\> [-o \<out_dir\>]
-
-2. Run analyser: Run the python script (GemmTuner.py) on your host device.
+-g \<gemmparam_archetype_list\> -c \<gemmconfig_search_list\> [-o \<out_dir\>]
+```
+2. Run analyser: Run the python script (GemmTuner.py) on your host machine.
 You'll need to transfer all the benchmark result json files generated from the
-previous step to your host machine beforehand. Note that this requires
-python >= 3.6. The script will output the best configuration, along with some
-analysis statistics for each strategy, and optionally save the parsed benchmark
-results into csv files (one for each strategy) for further analysis.
-
+previous step to your host machine beforehand. Note that this requires python >=
+3.6. The script will output the best configuration, along with some analysis
+statistics for each strategy, and optionally save the parsed benchmark results
+into csv files (one for each strategy) for further analysis.
+An optional tolerance, in milliseconds as measured by the OpenCL timer, can be
+provided to determine how far apart in performance two GEMMConfigs have to be
+to be considered different. The default value is 0.01 ms, and it is recommended
+that this value be < 0.1 ms.
+```
 python GemmTuner.py -b \<benchmark_results_dir\> [-t \<tolerance\>]
 [-o \<out_dir\>]
+```
-- 
cgit v1.2.1