authorGian Marco Iodice <gianmarco.iodice@arm.com>2018-07-30 17:21:41 +0100
committerAnthony Barbier <anthony.barbier@arm.com>2018-11-02 16:54:54 +0000
commit201cea1b40597c226bf2c8e59d90bebdf9817dd3 (patch)
treee546ccf1fcc10ca1c90132d8732a7667fc3cf206 /docs
parent366628a5f59ed30751712696f05a75a078add9e2 (diff)
downloadComputeLibrary-201cea1b40597c226bf2c8e59d90bebdf9817dd3.tar.gz
COMPMID-1290 - Update CLTuner documentation
Change-Id: Ief1b6df40623c9f304093cf1f188c86454da3f9c
Reviewed-on: https://eu-gerrit-1.euhpc.arm.com/141965
Tested-by: Jenkins <bsgcomp@arm.com>
Reviewed-by: Anthony Barbier <anthony.barbier@arm.com>
Diffstat (limited to 'docs')
-rw-r--r--  docs/00_introduction.dox  85
1 file changed, 84 insertions(+), 1 deletion(-)
diff --git a/docs/00_introduction.dox b/docs/00_introduction.dox
index e12c739f99..0a68d893ff 100644
--- a/docs/00_introduction.dox
+++ b/docs/00_introduction.dox
@@ -1058,5 +1058,88 @@ Integer dot product built-in function extensions (and therefore optimized kernel
OpenCL kernel level debugging can be simplified with the use of printf; this requires the \a cl_arm_printf extension to be supported.
SVM allocations are supported for all the underlying allocations in Compute Library. To enable this, OpenCL 2.0 or above is required.
+
+@subsection S3_9_cl_tuner OpenCL Tuner
+
+The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS).
+The optimal LWS for each unique OpenCL kernel configuration is stored in a table, which can be imported from or exported to a file.
+The OpenCL tuner uses a brute-force approach: it runs the same OpenCL kernel over a range of local workgroup sizes and keeps the fastest one for use in subsequent calls to the kernel.
+In order for the performance numbers to be meaningful, you must disable GPU power management and set the GPU to a fixed frequency for the entire duration of the tuning phase.
+
+If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms" available at the following link:
+
+https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice
+
+Tuning a network from scratch can take a long time and considerably affects the execution time of the first run of your network. For this reason, it is recommended to store the CLTuner's results in a file, so that this cost is amortized when you re-use the same network or functions with the same configurations. The tuning is performed only once for each OpenCL kernel.
+
+CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (e.g. Convolution Layer, Pooling Layer, Fully Connected Layer, ...) can be called multiple times with different parameters, we associate an "id" (called "config_id") with each kernel to distinguish the unique configurations.
+
+ #Example: 2 unique Matrix Multiply configurations
+@code{.cpp}
+ TensorShape a0 = TensorShape(32,32);
+ TensorShape b0 = TensorShape(32,32);
+ TensorShape c0 = TensorShape(32,32);
+ TensorShape a1 = TensorShape(64,64);
+ TensorShape b1 = TensorShape(64,64);
+ TensorShape c1 = TensorShape(64,64);
+
+ CLTensor a0_tensor;
+ CLTensor b0_tensor;
+ CLTensor c0_tensor;
+ CLTensor a1_tensor;
+ CLTensor b1_tensor;
+ CLTensor c1_tensor;
+
+ a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
+ b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
+ c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
+ a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
+ b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
+ c1_tensor.allocator()->init(TensorInfo(c1, 1, DataType::F32));
+
+ CLGEMM gemm0;
+ CLGEMM gemm1;
+
+ // Configuration 0
+ gemm0.configure(&a0_tensor, &b0_tensor, nullptr, &c0_tensor, 1.0f, 0.0f);
+
+ // Configuration 1
+ gemm1.configure(&a1_tensor, &b1_tensor, nullptr, &c1_tensor, 1.0f, 0.0f);
+@endcode
+
+@subsubsection S3_9_1_cl_tuner_how_to How to use it
+
+All the graph examples in the ACL "examples" folder and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file:
+
+ #Enable CL tuner
+ ./graph_mobilenet --enable-tuner --target=CL
+ ./arm_compute_benchmark --enable-tuner
+
+ #Export/Import to/from a file
+ ./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
+ ./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv
+
+If you are importing the CLTuner's results from a file, the newly tuned LWS values will be appended to it.
+
+Whether you are benchmarking the graph examples or the test cases in the arm_compute_benchmark, remember to:
+
+ -# Disable the power management
+ -# Keep the GPU frequency constant
+ -# Run the network multiple times (e.g. 10 times).
+
+If you are not using the graph API or the benchmark infrastructure, you will need to manually pass a CLTuner object to CLScheduler before configuring any function:
+
+@code{.cpp}
+CLTuner tuner;
+
+// Setup Scheduler
+CLScheduler::get().default_init(&tuner);
+@endcode
+
+After the first run, the CLTuner's results can be exported to a file using the method save_to_file():
+- tuner.save_to_file("results.csv");
+
+This file can also be imported using the method load_from_file():
+- tuner.load_from_file("results.csv");
*/
-} // namespace arm_compute
+} // namespace arm_compute \ No newline at end of file