author SiCong Li <sicong.li@arm.com> 2020-08-11 13:00:20 +0100
committer SiCong Li <sicong.li@arm.com> 2020-08-12 13:14:29 +0000
commit b10181b9b476a0b41e270472e97eb0b8e5e197d5
tree 204b49a2587de5ceaeda965024d5f14c04820e4b
parent dc12519582d06da9fac9c53300a5ab83a5b26632
COMPMID-3456 Update gemm tuner documentation
* Update README with the improvements
* Add a new step-by-step example section

Change-Id: I4d76821fb6c2f3b5edd54edfeff053e1c92fbb6e
Signed-off-by: SiCong Li <sicong.li@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/3713
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Sheri Zhang <sheri.zhang@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
-rw-r--r-- docs/00_introduction.dox 5
-rw-r--r-- examples/gemm_tuner/README.md 213
-rwxr-xr-x examples/gemm_tuner/benchmark_gemm_examples.sh 32
3 files changed, 142 insertions, 108 deletions
diff --git a/docs/00_introduction.dox b/docs/00_introduction.dox
index 906ddf27bf..90064399e7 100644
--- a/docs/00_introduction.dox
+++ b/docs/00_introduction.dox
@@ -257,6 +257,11 @@ v20.08 Public major release
- graph_yolov3_output_detector
- Removed padding from:
- @ref NEPixelWiseMultiplicationKernel
+ - GEMMTuner improvements:
+ - Added fp16 support
+ - Output json files for easier integration
+ - Enabled tuning for export_to_cl_image_rhs option for RHS tensors
+ - More robust script for running benchmarks
- Deprecated functions / interfaces:
- Non-descriptor based interfaces for @ref NEThreshold, @ref CLThreshold
- In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and @ref GCSoftmaxLayer :
diff --git a/examples/gemm_tuner/README.md b/examples/gemm_tuner/README.md
index a4cde10403..1effd2f7e1 100644
--- a/examples/gemm_tuner/README.md
+++ b/examples/gemm_tuner/README.md
@@ -2,19 +2,77 @@
## Introduction
-This is a set of 2 script tools for tuning the performance of OpenCL GEMM kernels (limited to Convolution layer
-functions only for now). Specifically, we tune 3 GEMM kernels, each has a different implementation **strategy** of the
-GEMM operation: **native**, **reshaped**, **reshaped only rhs**. The details of these strategies can be found in the
-documentations of the corresponding kernels: **CLGEMMMatrixMultiplyNativeKernel**,
-**CLGEMMMatrixMultiplyReshapedKernel** and **CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.
-
-The outputs of the tuning process are 1 optimal configuration (called **GEMM Configuration** or **GEMMConfig**, for
-more details see Approach section) for each of the 3 strategies.
+This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each
+has a different implementation **strategy** of the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
+The details of these strategies can be found in the documentation of the corresponding kernels:
+**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
+**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.
+
+The Tuner consists of 2 scripts and 3 binaries:
+* benchmark_gemm_examples.sh and GemmTuner.py under examples/gemm_tuner, and
+* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
+ build/tests/gemm_tuner (you'll need to build the library first)
+
+The inputs to the Tuner are a list of 4-valued tuples we call **GEMM shape** or **GEMMParam** (M, N, K, B, and possibly
+data type). They define the "shape" and other parameters (e.g. data type) of a GEMM operation:
+```
+LHS x RHS = DST
+```
+Where LHS is of shape MxK, RHS is of shape KxN and DST is of shape MxN, and B is the batch size.
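The GEMMParam definition above can be made concrete with a minimal Python sketch. It is not part of this commit: `GemmShape`, `parse_gemm_shape` and `operand_shapes` are hypothetical names chosen for illustration, and the sketch assumes one `M,N,K,B` tuple per csv line, as the README describes.

```python
from typing import Dict, NamedTuple, Tuple


class GemmShape(NamedTuple):
    """One GEMMParam row: M, N, K and the batch size B (hypothetical helper type)."""
    m: int
    n: int
    k: int
    b: int


def parse_gemm_shape(line: str) -> GemmShape:
    """Parse a single 'M,N,K,B' csv line into a GemmShape."""
    fields = [int(field) for field in line.strip().split(",")]
    if len(fields) != 4 or any(value <= 0 for value in fields):
        raise ValueError(f"expected 4 positive integers, got: {line!r}")
    return GemmShape(*fields)


def operand_shapes(shape: GemmShape) -> Dict[str, Tuple[int, int]]:
    """Per-batch matrix shapes for LHS x RHS = DST."""
    return {
        "LHS": (shape.m, shape.k),  # M x K
        "RHS": (shape.k, shape.n),  # K x N
        "DST": (shape.m, shape.n),  # M x N
    }
```

For example, `operand_shapes(parse_gemm_shape("100,100,30,1"))` describes a 100x30 LHS, a 30x100 RHS and a 100x100 DST for a single batch.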
+
+The outputs of the tuning process are 4 json files:
+1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
+2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
+3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
+4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam
+
+These 4 files are the current representations we use for what we call the **heuristics** of a GEMM op: given a GEMMParam,
+which kernel, and subsequently which configurations for that kernel, are the most performant.
+
+## Step-by-step example
+
+### Step1: Prepare the shape and configs files
+1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
+2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
+ some prior heuristics, but can be provided by the ACL developers upon request, based on your target device).
+
+ Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*.
+
+ Please refer to the Prerequisite section for more details.
+
+### Step2: Push relevant files to the target device
+All the files that need to be present on the target device are:
+* benchmark script: \<ACL\>/examples/gemm_tuner/benchmark_gemm_examples.sh
+* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
+* Example benchmark binaries: \<ACL\>/build/tests/gemm_tuner/benchmark_cl_gemm*
+
+### Step3: Collect benchmark data
+With these files on device, we can collect benchmark data using the script. Assume all the example binaries are pushed
+to a folder called *gemm_tuner*. While logged onto our device:
+```
+# Native
+./benchmark_gemm_examples.sh -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
+# Reshaped Only RHS
+./benchmark_gemm_examples.sh -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
+# Reshaped
+./benchmark_gemm_examples.sh -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
+```
+You can repeat the 3 commands above to have a bit of redundancy in your benchmark data (as you can imagine, measurement is noisy),
+but you may need to change the output folder for each repeat.
+
+### Step4: Generate the heuristics
+1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine
+2. We use the GemmTuner.py script to give us the heuristics
+ ```
+ python3 <ACL>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
+ ```
+ When it's finished, there should be 4 json files in the *heuristics* folder
-## Location
-The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found under $ACL_ROOT/examples/gemm_tuner.
+One thing to notice is that the config heuristics might give more than 1 recommendation for each GEMMParam, because
+we accept all good GEMMConfigs within a tolerance. If you want fewer recommendations, you can decrease the tolerance by
+passing a lower value via *-t \<tolerance\>* to the GemmTuner.py script.
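To illustrate why a tolerance can yield several recommendations per GEMMParam, here is a minimal Python sketch. It is an assumption about the general idea, not the actual logic of GemmTuner.py: `select_good_configs` is a hypothetical helper, the timings are made-up values, and "good" is taken to mean "within the tolerance of the fastest measured time".

```python
from typing import Dict, List, Tuple

# A GEMMConfig is represented here as a plain tuple of its fields, e.g. (m0, n0, k0).
Config = Tuple[int, ...]


def select_good_configs(timings_ms: Dict[Config, float], tolerance_ms: float) -> List[Config]:
    """Return every config whose time is within tolerance_ms of the best time.

    Assumption: lower time is better, and configs within the tolerance of the
    best one are treated as equivalent in performance.
    """
    if not timings_ms:
        return []
    best_ms = min(timings_ms.values())
    return sorted(cfg for cfg, t in timings_ms.items() if t - best_ms <= tolerance_ms)


# Illustrative (made-up) timings for one GEMMParam and the native strategy.
example_timings = {(4, 4, 4): 1.00, (2, 8, 4): 1.05, (8, 2, 2): 1.50}
```

With a tolerance of 0.1 ms the first two configs are both kept, while a tolerance of 0.01 ms keeps only the fastest one, which is the effect described for the *-t* option.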
-## Pre-requisite
+## Prerequisite
* A target device to be tuned, plus the following on the device:
* Android or Linux OS
* Bash shell
@@ -28,10 +86,7 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
The format is described as:
- A headerless csv file with fields separated by commas and commas only (there cannot be whitespaces around each
- field).
-
- Note also comments and extraneous empty lines are not permitted.
+ A headerless csv file with fields separated by commas.
A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and
RHS) with:
@@ -54,10 +109,10 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
The format of the file for each strategy is the same:
- A headerless csv file with fields separated by commas and commas only (there cannot be whitespaces around each
- field). Note also comments and extraneous empty lines are not permitted.
+ A headerless csv file with fields separated by commas.
However the fields of GEMMConfig differ for each strategy:
+
* Strategy **native**:
A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with:
@@ -78,9 +133,7 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
...
```
* Strategy **reshaped_rhs_only**:
-
- A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 2 boolean values interleave_rhs and
- transpose_rhs, with:
+ A gemm config is a list of 4 positive integers <m0, n0, k0, h0> and 3 boolean values:
m0 - Number of rows processed by the matrix multiplication
n0 - Number of columns processed by the matrix multiplication
@@ -88,6 +141,9 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
+ export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+ with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+ for more details
Only the following configurations of M0, N0 and K0 are currently supported:
@@ -98,14 +154,12 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
An example gemm config file looks like:
```
- 4,4,4,1,1,1
- 4,4,4,3,1,0
+ 4,4,4,1,1,1,0
+ 4,4,4,3,1,0,1
...
```
* Strategy **reshaped**:
-
- A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 3 boolean values interleave_lhs,
- interleave_rhs and transpose_rhs, with:
+ A gemm config is a list of 5 positive integers <m0, n0, k0, v0, h0> and 4 boolean values:
m0 - Number of rows processed by the matrix multiplication
n0 - Number of columns processed by the matrix multiplication
@@ -114,29 +168,31 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
- transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose
- lhs matrix (0)
+ transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
+ export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+ with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+ for more details
- * If rhs matrix is transposed only the following configurations are currently supported:
+ If rhs matrix is transposed only the following configurations are currently supported:
- M0 = 2, 3, 4, 5, 6, 7, 8
- N0 = 2, 3, 4, 8, 16
- K0 = 2, 3, 4, 8, 16
- V0 >= 1
- H0 >= 1
+ M0 = 2, 3, 4, 5, 6, 7, 8
+ N0 = 2, 3, 4, 8, 16
+ K0 = 2, 3, 4, 8, 16
+ V0 >= 1
+ H0 >= 1
- * If lhs matrix is transposed only the following configurations are currently supported:
+ If lhs matrix is transposed only the following configurations are currently supported:
- M0 = 2, 3, 4, 8
- N0 = 2, 3, 4, 8, 16
- K0 = 2, 3, 4, 8, 16
- V0 >= 1
- H0 >= 1
+ M0 = 2, 3, 4, 8
+ N0 = 2, 3, 4, 8, 16
+ K0 = 2, 3, 4, 8, 16
+ V0 >= 1
+ H0 >= 1
An example gemm config file looks like:
```
- 4,4,4,1,3,1,1,1
- 4,4,4,3,3,1,1,0
+ 4,4,4,1,3,1,1,1,0
+ 4,4,4,3,3,1,1,0,1
...
```
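The constraints listed for the **reshaped** strategy can be checked before pushing a config file to the device. The following Python sketch is illustrative only: `parse_reshaped_config` is a hypothetical helper, it encodes just the M0/N0/K0/V0/H0 rules quoted above, and it does not check the extra conditions on export_to_cl_image_rhs (see CLGEMMReshapeRHSMatrixKernel for those).

```python
from typing import Dict

# Supported values as listed in the README for the reshaped strategy.
M0_RHS_TRANSPOSED = {2, 3, 4, 5, 6, 7, 8}
M0_LHS_TRANSPOSED = {2, 3, 4, 8}
N0_SUPPORTED = {2, 3, 4, 8, 16}
K0_SUPPORTED = {2, 3, 4, 8, 16}

FIELD_NAMES = ("m0", "n0", "k0", "v0", "h0", "interleave_lhs",
               "interleave_rhs", "transpose_rhs", "export_to_cl_image_rhs")


def parse_reshaped_config(line: str) -> Dict[str, int]:
    """Parse and validate one line of a reshaped-strategy gemm config csv."""
    values = [int(field) for field in line.strip().split(",")]
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    cfg = dict(zip(FIELD_NAMES, values))
    # The last 4 fields are booleans encoded as 0 or 1.
    if any(cfg[name] not in (0, 1) for name in FIELD_NAMES[5:]):
        raise ValueError("boolean fields must be 0 or 1")
    # transpose_rhs = 1: rhs transposed; transpose_rhs = 0: lhs transposed instead.
    allowed_m0 = M0_RHS_TRANSPOSED if cfg["transpose_rhs"] == 1 else M0_LHS_TRANSPOSED
    if (cfg["m0"] not in allowed_m0 or cfg["n0"] not in N0_SUPPORTED
            or cfg["k0"] not in K0_SUPPORTED or cfg["v0"] < 1 or cfg["h0"] < 1):
        raise ValueError(f"unsupported block configuration: {line.strip()!r}")
    return cfg
```

Both lines of the example config file above pass this check, whereas a row such as `5,4,4,1,1,1,1,0,0` is rejected because M0 = 5 is only listed as supported when the rhs matrix is transposed.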
* A host machine, plus these on the machine:
@@ -144,45 +200,53 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
* GemmTuner.py script
## Usage
-The tuning stage consists of 2 steps:
+The usage of the 2 scripts:
-1. Run benchmarks:
+1. benchmark_gemm_examples.sh
Run the shell script (**benchmark_gemm_examples.sh**) on your **target device**. Note that all the built benchmark
- examples have to be present on your target device prior to running. The benchmark results will be saved to json
- files in an output directory.
+ examples: build/tests/gemm_tuner/benchmark_cl_gemm*, have to be present on your target device prior to running.
+ The benchmark results will be saved to json files in an output directory.
```
Usage: benchmark_gemm_examples.sh [-h] -s \<strategy\> -e \<example_binary_dir\> -g \<gemm_shape_file\>
- -c \<gemm_config_file\> [-o \<out_dir\>]
+ -c \<gemm_config_file\> [-d \<data_type\>] [-o \<out_dir\>]
Options:
-h
- Print help messages. If a strategy is specified with -s \<strategy\>, then only display messages relevant
- to that strategy. Otherwise if no strategy is specified, display messages for all available strategies.
+ Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that
+ strategy. Otherwise if no strategy is specified, display messages for all available strategies.
- -s \<strategy\>
+ -s <strategy>
Strategy option.
- Options: native reshaped_rhs_only reshaped.
+ Options: ${ALL_STRATEGY_OPTIONS[@]}.
- -e \<example_binary_dir\>
+ -e <example_binary_dir>
Path to directory that holds all example binaries
- -g \<gemm_shape_file\>
+ -g <gemm_shape_file>
Path to gemm shape csv file
- -c \<gemm_config_file\>
+ -c <gemm_config_file>
Path to gemm config csv file
- -o \<out_dir\>
+ -d <data_type>
+ Data type option with which to run benchmark examples
+ Default: ${DEFAULT_DATA_TYPE}
+ Supported options:
+ Strategy : Data Types
+ Native : F32
+ Reshaped : F16, F32
+ Reshaped RHS Only : F16, F32
+
+ -o <out_dir>
Path to output directory that holds output json files
- Default: out
+ Default: ${DEFAULT_OUT_DIR}
```
-2. Run analyser:
+2. GemmTuner.py:
Run the python script (**GemmTuner.py**) on your **host machine**.
You'll need to transfer all the benchmark result json files generated from the previous step to your host machine
- beforehand. The script will output the best configuration, along with some analysis statistics for each strategy, and
- optionally save the parsed benchmark results into csv files (one for each strategy) for further analysis.
+ beforehand. The script will output the best kernel and gemm configurations for each gemm param in the 4 output json files.
```
Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]
@@ -194,40 +258,11 @@ The tuning stage consists of 2 steps:
result json files have a file extension of
'gemmtuner_benchmark'
-o PATH, --output_dir PATH
- Path to directory that holds output csv files. One per
- strategy
+ Path to directory that holds output json files.
-t TOLERANCE, --tolerance TOLERANCE
For testing if two GEMMConfigs are equivalent in terms
of performance. The tolerance is OpenCL timer in
milliseconds. Recommended value: <= 0.1 ms
-D, --debug Enable script debugging output
- ```
-
-## Approach
-
-This section gives a brief description and rationale of the approach adopted by the current version of GEMM Tuner.
-
-As explained in the Introduction section, the outputs of the tuner are 1 optimal GEMMConfig for each strategy.
-This is because we can only integrate 1 GEMMConfig for each strategy in ACL at compile time. In theory, however, the
-optimal GEMMConfig also depends on different parameters of GEMM (called GEMM Parameter or GEMMParam, e.g.: the shape
-of the operation); thus ideally, for each strategy, the optimal configurations should be a mapping from GEMMParam to
-GEMMConfig instead of a single GEMMConfig.
-
-To address this issue, we ensure the one single optimal GEMMConfig can generalise well to all potential GEMMParams
-(or at least the ones that we care about). The approach we adopt involves a preliminary stage where a collection of
-common GEMMParams (GEMM shapes from popular networks) are compiled. Then, to reduce the final tuning time, rather
-contradictorily, we spend a lot of time searching for near-optimal GEMMConfigs for each GEMMParam first, and then
-discard redundant GEMMParams which share similar optimal GEMMConfigs with others. The resultant list of GEMMParams is
-called a __GEMMParam search list__, as in these GEMMParams are typical enough to capture the space of GEMMParams that
-we care about.
-
-During this preliminary stage we also produce a list of good GEMMConfigs that can be used to search for the optimal one
-in the actual tuning stage. This, again, is to reduce the tuning time, and the resultant list is called a
-__GEMMConfig search list__.
-
-The GEMMParam search list and the GEMMConfig search list are investigated and prepared by the developers; the users of
-GEMM tuner need not worry about producing them, but they need to obtain them prior to running the tuner.
-
-Once these two lists (2 for each strategy, so 6 in total) are obtained, they can be fed to the tuner, to produce the
-optimal GEMMConfig(s). \ No newline at end of file
+ ``` \ No newline at end of file
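When the three benchmark runs are driven from a host-side script, the command lines can be assembled programmatically. This Python sketch is a hedged illustration rather than part of the tuner: `build_benchmark_command` is a hypothetical helper, it uses only the options shown in the usage text above, and it assumes the data type is passed to `-d` exactly as spelled in the supported-options table (F16 or F32).

```python
from typing import List, Optional

# Strategy -> data types, copied from the supported-options table in the usage text.
SUPPORTED_DATA_TYPES = {
    "native": {"F32"},
    "reshaped": {"F16", "F32"},
    "reshaped_rhs_only": {"F16", "F32"},
}


def build_benchmark_command(strategy: str, example_dir: str, shape_file: str,
                            config_file: str, out_dir: str,
                            data_type: Optional[str] = None,
                            script: str = "./benchmark_gemm_examples.sh") -> List[str]:
    """Build the argv list for one run of benchmark_gemm_examples.sh."""
    if strategy not in SUPPORTED_DATA_TYPES:
        raise ValueError(f"unknown strategy: {strategy}")
    cmd = [script, "-s", strategy, "-e", example_dir, "-g", shape_file, "-c", config_file]
    if data_type is not None:
        # Assumption: the script accepts the spelling used in its help table.
        if data_type not in SUPPORTED_DATA_TYPES[strategy]:
            raise ValueError(f"{data_type} is not listed for strategy {strategy}")
        cmd += ["-d", data_type]
    cmd += ["-o", out_dir]
    return cmd
```

The returned list can be printed for review or handed to whatever mechanism runs commands on the target device; omitting `data_type` leaves the script's own default in effect.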
diff --git a/examples/gemm_tuner/benchmark_gemm_examples.sh b/examples/gemm_tuner/benchmark_gemm_examples.sh
index bb9ec0f3ab..f764cfaef6 100755
--- a/examples/gemm_tuner/benchmark_gemm_examples.sh
+++ b/examples/gemm_tuner/benchmark_gemm_examples.sh
@@ -59,10 +59,7 @@ NUM_ITERATION=5
function help_gemm_shape_file() {
cat >&2 << EOF
Gemm shape file:
- Gemm shape file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
- around each field).
-
- Note also comments and extraneous empty lines are not permitted.
+ Gemm shape file is a headerless csv file with fields separated by commas
A gemm shape is a list of 4 positive integers <M, N, K, B> describing the shapes of the two matrices (LHS and RHS)
with:
@@ -91,10 +88,7 @@ EOF
function help_gemm_config_file_native() {
cat >&2 << EOF
Gemm config file (Strategy native):
- Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
- around each field).
-
- Note also comments and extraneous empty lines are not permitted.
+ Gemm config file is a headerless csv file with fields separated by commas
A gemm config is a list of 3 positive integers <m0, n0, k0>, with:
m0 - Number of rows processed by the matrix multiplication
@@ -126,19 +120,20 @@ EOF
function help_gemm_config_file_reshaped_rhs_only() {
cat >&2 << EOF
Gemm config file (Strategy reshaped_rhs_only):
- Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
- around each field).
+ Gemm config file is a headerless csv file with fields separated by commas.
Note also comments and extraneous empty lines are not permitted.
- A gemm config is a list of 4 positive integers <m0, n0, k0, h0> and 2 boolean values interleave_rhs and transpose_rhs, with:
+ A gemm config is a list of 4 positive integers <m0, n0, k0, h0> and 3 boolean values:
m0 - Number of rows processed by the matrix multiplication
n0 - Number of columns processed by the matrix multiplication
k0 - Number of partial accumulations performed by the matrix multiplication
h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
- export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0)
+ export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+ with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+ for more details
Only the following configurations of M0, N0 and K0 are currently supported:
M0 = 1, 2, 3, 4, 5, 6, 7, 8
@@ -166,12 +161,9 @@ EOF
function help_gemm_config_file_reshaped() {
cat >&2 << EOF
Gemm config file (Strategy reshaped):
- Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
- around each field).
-
- Note also comments and extraneous empty lines are not permitted.
+ Gemm config file is a headerless csv file with fields separated by commas
- A gemm config is a list of 5 positive integers <m0, n0, k0, v0, h0> and 3 boolean values interleave_lhs, interleave_rhs and transpose_rhs, with:
+ A gemm config is a list of 5 positive integers <m0, n0, k0, v0, h0> and 4 boolean values:
m0 - Number of rows processed by the matrix multiplication
n0 - Number of columns processed by the matrix multiplication
k0 - Number of partial accumulations performed by the matrix multiplication
@@ -180,7 +172,9 @@ Gemm config file (Strategy reshaped):
interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
- export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0)
+ export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+ with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+ for more details
If rhs matrix is transposed only the following configurations are currently supported:
M0 = 2, 3, 4, 5, 6, 7, 8
@@ -218,7 +212,7 @@ function usage() {
Run gemm examples of a selected strategy, over provided tunable configurations and gemm shapes.
Save the benchmark results to json files in an output directory.
-Usage: ${CMD} [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file> -c <gemm_config_file> [-o <out_dir>]
+Usage: ${CMD} [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file> -c <gemm_config_file> [-d <data_type>] [-o <out_dir>]
Options:
-h