From b10181b9b476a0b41e270472e97eb0b8e5e197d5 Mon Sep 17 00:00:00 2001
From: SiCong Li
Date: Tue, 11 Aug 2020 13:00:20 +0100
Subject: COMPMID-3456 Update gemm tuner documentation

* Update README with the improvements
* Add a new step-by-step example section

Change-Id: I4d76821fb6c2f3b5edd54edfeff053e1c92fbb6e
Signed-off-by: SiCong Li
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/3713
Comments-Addressed: Arm Jenkins
Reviewed-by: Sheri Zhang
Tested-by: Arm Jenkins
---
 docs/00_introduction.dox                       |   5 +
 examples/gemm_tuner/README.md                  | 213 ++++++++++++++-----------
 examples/gemm_tuner/benchmark_gemm_examples.sh |  32 ++--
 3 files changed, 142 insertions(+), 108 deletions(-)

diff --git a/docs/00_introduction.dox b/docs/00_introduction.dox
index 906ddf27bf..90064399e7 100644
--- a/docs/00_introduction.dox
+++ b/docs/00_introduction.dox
@@ -257,6 +257,11 @@ v20.08 Public major release
    - graph_yolov3_output_detector
  - Removed padding from:
    - @ref NEPixelWiseMultiplicationKernel
+ - GEMMTuner improvements:
+   - Added fp16 support
+   - Output json files for easier integration
+   - Enabled tuning for export_to_cl_image_rhs option for RHS tensors
+   - More robust script for running benchmarks
  - Deprecated functions / interfaces:
    - Non-descriptor based interfaces for @ref NEThreshold, @ref CLThreshold
    - In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and @ref GCSoftmaxLayer :
diff --git a/examples/gemm_tuner/README.md b/examples/gemm_tuner/README.md
index a4cde10403..1effd2f7e1 100644
--- a/examples/gemm_tuner/README.md
+++ b/examples/gemm_tuner/README.md
@@ -2,19 +2,77 @@

 ## Introduction

-This is a set of 2 script tools for tuning the performance of OpenCL GEMM kernels (limited to Convolution layer
-functions only for now). Specifically, we tune 3 GEMM kernels, each has a different implementation **strategy** of the
-GEMM operation: **native**, **reshaped**, **reshaped only rhs**. The details of these strategies can be found in the
-documentations of the corresponding kernels: **CLGEMMMatrixMultiplyNativeKernel**,
-**CLGEMMMatrixMultiplyReshapedKernel** and **CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.
-
-The outputs of the tuning process are 1 optimal configuration (called **GEMM Configuration** or **GEMMConfig**, for
-more details see Approach section) for each of the 3 strategies.
+This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each
+has a different implementation **strategy** of the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
+The details of these strategies can be found in the documentations of the corresponding kernels:
+**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
+**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.
+
+The Tuner consists of 2 scripts and 3 binaries:
+* benchmark_gemm_examples.sh and GemmTuner.py under examples/gemm_tuner, and
+* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
+  build/tests/gemm_tuner (you'll need to build the library first)
+
+The inputs to the Tuner are a list of 4 valued tuples we call **GEMM shape** or **GEMMParam** (M, N, K, B, and possibly
+data type). They define the "shape" and other parameters (eg. data type) of a GEMM operation:
+```
+LHS x RHS = DST
+```
+Where LHS is of shape MxK, RHS is of shape KxN and DST is of shape MxN, and B is the batch size.
+
+The outputs of the tuning process are 4 json files:
+1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
+2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
+3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
+4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam
+
+These 4 files are the current representations we use for what we call the **heuristics** of a GEMM op: given a GEMMParam,
+what kernel and subsequently what configurations for that kernels are the most performant.
+
+## Step-by-step example
+
+### Step1: Prepare the shape and configs files
+1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
+2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
+    some prior heuristics, but can be provided by the ACL developers upon requests, based on your target device).
+
+   Say we have *gemm_configs_native.csv", "gemm_configs_reshaped.csv" and "gemm_configs_reshaped_only_rhs.csv".
+
+   Please refer to the Prerequisite section for more details
+
+### Step2: Push relevant files to the target device
+All the files that need to be present on the target device are:
+* benchmark script: \/examples/gemm_tuner/benchmark_gemm_examples.sh
+* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
+* Example benchmark binaries: \/build/tests/gemm_tuner/benchmark_cl_gemm*
+
+### Step3: Collect benchmark data
+With these files on device, we can collect benchmark data using the script. Assume all the example binaries are pushed
+to a folder called *gemm_tuner*. While logged onto our device:
+```
+# Native
+./benchmark_gemm_examples.sh -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
+# Reshaped Only RHS
+./benchmark_gemm_examples.sh -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
+# Reshaped
+./benchmark_gemm_examples.sh -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
+```
+You can repeat the 3 commands above to have a bit redundancy in your benchmark data (as you can imagine, measurement is noisy),
+but you may need to change the output folder for each repeat
+
+### Step4: Generate the heuristics
+1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine
+2. We use the GemmTuner.py script to give us the heuristics
+   ```
+   python3 /examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
+   ```
+   When it's finished, there should be 4 json files in the *heuristics* folder

-## Location
-The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found under $ACL_ROOT/examples/gemm_tuner.
+One thing to notice is that the config heuristics might give more than 1 recommendations for each GEMMParam, because
+we accept all good GEMMConfigs with a tolerance. If you want fewer recommendations, you can decrease the tolerance by
+passing a lower value to *-t \* to the GemmTuner.py script.

-## Pre-requisite
+## Prerequisite
 * A target device to be tuned, plus the following on the device:
     * Android or Linux OS
     * Bash shell
@@ -28,10 +86,7 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u

         The format is described as:

-        A headerless csv file with fields separated by commas and commas only (there cannot be whitespaces around each
-        field).
-
-        Note also comments and extraneous empty lines are not permitted.
+        A headerless csv file with fields separated by commas.

         A gemm shape is a list of 4 positive integers \ describing the shapes of the two matrices (LHS and RHS) with:
@@ -54,10 +109,10 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u

         The format of the file for each strategy is the same:

-        A headerless csv file with fields separated by commas and commas only (there cannot be whitespaces around each
-        field). Note also comments and extraneous empty lines are not permitted.
+        A headerless csv file with fields separated by commas.

         However the fields of GEMMConfig differ for each strategy:
+
         * Strategy **native**:

             A gemm config is a list of 3 positive integers \, with:
@@ -78,9 +133,7 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
             ...
             ```
         * Strategy **reshaped_rhs_only**:
-
-            A gemm config is a list of 4 positive integers \ and 2 boolean values interleave_rhs and
-            transpose_rhs, with:
+            A gemm config is a list of 4 positive integers and 3 boolean values:

             m0 - Number of rows processed by the matrix multiplication
             n0 - Number of columns processed by the matrix multiplication
@@ -88,6 +141,9 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
             h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
             interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
             transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
+            export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+                                     with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+                                     for more details

             Only the following configurations of M0, N0 and K0 are currently supported:

@@ -98,14 +154,12 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u

             An example gemm config file looks like:
             ```
-            4,4,4,1,1,1
-            4,4,4,3,1,0
+            4,4,4,1,1,1,0
+            4,4,4,3,1,0,1
             ...
             ```
         * Strategy **reshaped**:
-
-            A gemm config is a list of 5 positive integers \ and 3 boolean values interleave_lhs,
-            interleave_rhs and transpose_rhs, with:
+            A gemm config is a list of 5 positive integers and 4 boolean values:

             m0 - Number of rows processed by the matrix multiplication
             n0 - Number of columns processed by the matrix multiplication
@@ -114,29 +168,31 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
             h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
             interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
             interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
-            transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose
-            lhs matrix (0)
+            transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
+            export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+                                     with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+                                     for more details

-            * If rhs matrix is transposed only the following configurations are currently supported:
+            If rhs matrix is transposed only the following configurations are currently supported:

-                M0 = 2, 3, 4, 5, 6, 7, 8
-                N0 = 2, 3, 4, 8, 16
-                K0 = 2, 3, 4, 8, 16
-                V0 >= 1
-                H0 >= 1
+            M0 = 2, 3, 4, 5, 6, 7, 8
+            N0 = 2, 3, 4, 8, 16
+            K0 = 2, 3, 4, 8, 16
+            V0 >= 1
+            H0 >= 1

-            * If lhs matrix is transposed only the following configurations are currently supported:
+            If lhs matrix is transposed only the following configurations are currently supported:

-                M0 = 2, 3, 4, 8
-                N0 = 2, 3, 4, 8, 16
-                K0 = 2, 3, 4, 8, 16
-                V0 >= 1
-                H0 >= 1
+            M0 = 2, 3, 4, 8
+            N0 = 2, 3, 4, 8, 16
+            K0 = 2, 3, 4, 8, 16
+            V0 >= 1
+            H0 >= 1

             An example gemm config file looks like:
             ```
-            4,4,4,1,3,1,1,1
-            4,4,4,3,3,1,1,0
+            4,4,4,1,3,1,1,1,0
+            4,4,4,3,3,1,1,0,1
             ...
             ```
 * A host machine, plus these on the machine:
@@ -144,45 +200,53 @@ The 2 scripts **benchmark_gemm_examples.sh** and **GemmTuner.py** can be found u
     * GemmTuner.py script

 ## Usage
-The tuning stage consists of 2 steps:
+The usage of the 2 scripts:

-1. Run benchmarks:
+1. benchmark_gemm_examples.sh

    Run the shell script (**benchmark_gemm_examples.sh**) on your **target device**. Note that all the built benchmark
-   examples have to be present on your target device prior to running. The benchmark results will be saved to json
-   files in an output directory.
+   examples: build/tests/gemm_tuner/benchmark_cl_gemm*, have to be present on your target device prior to running.
+   The benchmark results will be saved to json files in an output directory.
    ```
    Usage: benchmark_gemm_examples.sh [-h] -s \ -e \ -g \
-   -c \ [-o \]
+   -c \ [-d \] [-o \]

    Options:
        -h
-       Print help messages. If a strategy is specified with -s \, then only display messages relevant
-       to that strategy. Otherwise if no strategy is specified, display messages for all available strategies.
+       Print help messages. If a strategy is specified with -s , then only display messages relevant to that
+       strategy. Otherwise if no strategy is specified, display messages for all available strategies.

-       -s \
+       -s
        Strategy option.
-       Options: native reshaped_rhs_only reshaped.
+       Options: ${ALL_STRATEGY_OPTIONS[@]}.

-       -e \
+       -e
        Path to directory that holds all example binaries

-       -g \
+       -g
        Path to gemm shape csv file

-       -c \
+       -c
        Path to gemm config csv file

-       -o \
+       -d
+       Data type option with which to run benchmark examples
+       Default: ${DEFAULT_DATA_TYPE}
+       Supported options:
+       Strategy : Data Types
+       Native : F32
+       Reshaped : F16, F32
+       Reshaped RHS Only : F16, F32
+
+       -o
        Path to output directory that holds output json files
-       Default: out
+       Default: ${DEFAULT_OUT_DIR}
    ```
-2. Run analyser:
+2. GemmTuner.py:

    Run the python script (**GemmTuner.py**) on your **host machine**.
    You'll need to transfer all the benchmark result json files generated from the previous step to your host machine
-   beforehand. The script will output the best configuration, along with some analysis statistics for each strategy, and
-   optionally save the parsed benchmark results into csv files (one for each strategy) for further analysis.
+   beforehand. The script will output the best kernel and gemm configurations for each gemm param in the 4 output json files
    ```
    Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]
@@ -194,40 +258,11 @@ The tuning stage consists of 2 steps:
                          result json files have a file extension of 'gemmtuner_benchmark'

    -o PATH, --output_dir PATH
-                         Path to directory that holds output csv files. One per
-                         strategy
+                         Path to directory that holds output json files.

    -t TOLERANCE, --tolerance TOLERANCE
                          For testing if two GEMMConfigs are equivalent in terms of performance.
                          The tolerance is OpenCL timer in milliseconds. Recommended value: <= 0.1 ms

    -D, --debug           Enable script debugging output
-   ```
-
-## Approach
-
-This section gives a brief description and rationale of the approach adopted by the current version of GEMM Tuner.
-
-As explained in the Introduction section, the outputs of the tuner are 1 optimal GEMMConfig for each strategy.
-This is because we can only integrate 1 GEMMConfig for each strategy in ACL at compile time. In theory, however, the
-optimal GEMMConfig also depends on different parameters of GEMM (called GEMM Parameter or GEMMParam, e.g.: the shape
-of the operation); thus ideally, for each strategy, the optimal configurations should be a mapping from GEMMParam to
-GEMMConfig instead of a single GEMMConfig.
-
-To address this issue, we ensure the one single optimal GEMMConfig can generalise well to all potential GEMMParams
-(or at least the ones that we care about). The approach we adopt involves a preliminary stage where a collection of
-common GEMMParams (GEMM shapes from popular networks) are compiled. Then, to reduce the final tuning time, rather
-contradictorily, we spend a lot of time searching for near-optimal GEMMConfigs for each GEMMParam first, and then
-discard redundant GEMMParams which share similar optimal GEMMConfigs with others. The resultant list of GEMMParams is
-called a __GEMMParam search list__, as in these GEMMParams are typical enough to capture the space of GEMMParams that
-we care about.
-
-During this preliminary stage we also produce a list of good GEMMConfigs that can be used to search for the optimal one
-in the actual tuning stage. This, again, is to reduce the tuning time, and the resultant list is called a
-__GEMMConfig search list__.
-
-The GEMMParam search list and the GEMMConfig search list are investigated and prepared by the developers; the users of
-GEMM tuner need not worry about producing them, but they need to obtain them prior to running the tuner.
-
-Once these two lists (2 for each strategy, so 6 in total) are obtained, they can be fed to the tuner, to produce the
-optimal GEMMConfig(s).
\ No newline at end of file
+   ```
\ No newline at end of file
diff --git a/examples/gemm_tuner/benchmark_gemm_examples.sh b/examples/gemm_tuner/benchmark_gemm_examples.sh
index bb9ec0f3ab..f764cfaef6 100755
--- a/examples/gemm_tuner/benchmark_gemm_examples.sh
+++ b/examples/gemm_tuner/benchmark_gemm_examples.sh
@@ -59,10 +59,7 @@ NUM_ITERATION=5
 function help_gemm_shape_file() {
   cat >&2 << EOF
 Gemm shape file:
-  Gemm shape file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
-  around each field).
-
-  Note also comments and extraneous empty lines are not permitted.
+  Gemm shape file is a headerless csv file with fields separated by commas

   A gemm shape is a list of 4 positive integers describing the shapes of the two matrices (LHS and RHS)
   with:
@@ -91,10 +88,7 @@ EOF
 function help_gemm_config_file_native() {
   cat >&2 << EOF
 Gemm config file (Strategy native):
-  Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
-  around each field).
-
-  Note also comments and extraneous empty lines are not permitted.
+  Gemm config file is a headerless csv file with fields separated by commas

   A gemm config is a list of 3 positive integers , with:
   m0 - Number of rows processed by the matrix multiplication
@@ -126,19 +120,20 @@ EOF
 function help_gemm_config_file_reshaped_rhs_only() {
   cat >&2 << EOF
 Gemm config file (Strategy reshaped_rhs_only):
-  Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
-  around each field).
+  Gemm config file is a headerless csv file with fields separated by commas.

   Note also comments and extraneous empty lines are not permitted.

-  A gemm config is a list of 4 positive integers and 2 boolean values interleave_rhs and transpose_rhs, with:
+  A gemm config is a list of 4 positive integers and 3 boolean values:
   m0 - Number of rows processed by the matrix multiplication
   n0 - Number of columns processed by the matrix multiplication
   k0 - Number of partial accumulations performed by the matrix multiplication
   h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
   interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
   transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
-  export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0)
+  export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+                           with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+                           for more details

   Only the following configurations of M0, N0 and K0 are currently supported:
   M0 = 1, 2, 3, 4, 5, 6, 7, 8
@@ -166,12 +161,9 @@ EOF
 function help_gemm_config_file_reshaped() {
   cat >&2 << EOF
 Gemm config file (Strategy reshaped):
-  Gemm config file is a headerless csv file with fields separated by commas and commas only (there cannot be whitespaces
-  around each field).
-
-  Note also comments and extraneous empty lines are not permitted.
+  Gemm config file is a headerless csv file with fields separated by commas

-  A gemm config is a list of 5 positive integers and 3 boolean values interleave_lhs, interleave_rhs and transpose_rhs, with:
+  A gemm config is a list of 5 positive integers and 4 boolean values:
   m0 - Number of rows processed by the matrix multiplication
   n0 - Number of columns processed by the matrix multiplication
   k0 - Number of partial accumulations performed by the matrix multiplication
@@ -180,7 +172,9 @@ Gemm config file (Strategy reshaped):
   interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
   interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
   transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
-  export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0)
+  export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
+                           with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
+                           for more details

   If rhs matrix is transposed only the following configurations are currently supported:
   M0 = 2, 3, 4, 5, 6, 7, 8
@@ -218,7 +212,7 @@ function usage() {
 Run gemm examples of a selected strategy, over provided tunable configurationsa and gemm shapes.
 Save the benchmark results to json files in an output directory.

-Usage: ${CMD} [-h] -s -e -g -c [-d ] [-o ]
+Usage: ${CMD} [-h] -s -e -g -c [-d ] [-o ]

 Options:
     -h
--
cgit v1.2.1
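
Note for anyone carrying config files across this patch: the reshaped_rhs_only and reshaped GEMMConfig rows each gain one trailing export_to_cl_image_rhs flag (7 and 9 fields respectively), so older 6- and 8-field rows no longer match the documented format. The sketch below is one possible way to check a config file on the host before pushing it to the device. It is not part of GemmTuner.py or benchmark_gemm_examples.sh; the names `FIELD_LAYOUT`, `check_config_row` and `check_config_text` are hypothetical, and tolerating whitespace around fields is an assumption, since the patch only drops the "commas only" wording without stating what the parser accepts. It also does not check the supported M0/N0/K0 value sets.

```python
import csv
import io

# Field layout per strategy as documented in the patched README:
# (number of leading positive integers, number of trailing 0/1 flags).
FIELD_LAYOUT = {
    "native": (3, 0),             # m0, n0, k0
    "reshaped_rhs_only": (4, 3),  # m0, n0, k0, h0 + 3 flags
    "reshaped": (5, 4),           # m0, n0, k0, v0, h0 + 4 flags
}


def check_config_row(strategy, row):
    """Return a list of problems found in one csv row (empty if none)."""
    n_ints, n_flags = FIELD_LAYOUT[strategy]
    # Assumption: surrounding whitespace is harmless, so strip it first.
    fields = [f.strip() for f in row]
    if len(fields) != n_ints + n_flags:
        return [f"expected {n_ints + n_flags} fields, got {len(fields)}"]
    problems = []
    for f in fields[:n_ints]:
        if not (f.isascii() and f.isdigit()) or int(f) < 1:
            problems.append(f"{f!r} is not a positive integer")
    for f in fields[n_ints:]:
        if f not in ("0", "1"):
            problems.append(f"{f!r} is not a 0/1 flag")
    return problems


def check_config_text(strategy, text):
    """Check every row of a headerless csv; return {line_number: problems}."""
    bad = {}
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        problems = check_config_row(strategy, row)
        if problems:
            bad[line_no] = problems
    return bad


if __name__ == "__main__":
    # The two example rows from the patched README for reshaped_rhs_only.
    new_style = "4,4,4,1,1,1,0\n4,4,4,3,1,0,1\n"
    # A row in the pre-patch 6-field format.
    old_style = "4,4,4,1,1,1\n"
    print(check_config_text("reshaped_rhs_only", new_style))
    print(check_config_text("reshaped_rhs_only", old_style))
```

Run as a script, the first call reports no problems for the README's new example rows, and the second flags the old-format row as having the wrong field count.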