Restructure documentation

The documentation has been restructured for better grouping and readability.

Resolves: COMPMID-4198
Signed-off-by: Sheri Zhang <sheri.zhang@arm.com>
Change-Id: I8c8bc77f0aab8d63f1659f2235dbab634422a68c
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/5568
Tested-by: Georgios Pinitas <georgios.pinitas@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
author: Sheri Zhang <sheri.zhang@arm.com> 2021-04-30 16:53:41 +0100
committer: Sang-Hoon Park <sang-hoon.park@arm.com> 2021-05-05 09:40:36 +0000
commit: d813bab10bb4fe954fa0e962e1402ed1377617da
tree: 6e107f6788fc7f396087e8efa29161bfeb2099cc /docs/user_guide
parent: 6124ce60b54eb5639ed19d46c79fce21cca2c83b
11 files changed, 3274 insertions, 0 deletions
diff --git a/docs/user_guide/advanced.dox b/docs/user_guide/advanced.dox
new file mode 100644
index 0000000000..86ee2ce756
--- /dev/null
+++ b/docs/user_guide/advanced.dox
@@ -0,0 +1,114 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page advanced Advanced
+
+@tableofcontents
+
+@section S1_8_cl_tuner OpenCL Tuner
+
+The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS).
+The optimal LWS for each unique OpenCL kernel configuration is stored in a table. This table can be either imported or exported from/to a file.
+The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run to use in subsequent calls to the kernel. It supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
+For the performance numbers to be meaningful, you must disable GPU power management and fix the GPU frequency for the entire duration of the tuning phase.
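The search described above (run the kernel once per candidate LWS, keep the fastest) can be sketched in plain C++. This is a minimal illustration and not the CLTuner implementation: the `Lws` alias, `pick_fastest_lws` and the `measure` callback are hypothetical names, and `measure` stands in for enqueuing the kernel with a given LWS and reading back its execution time.

```cpp
#include <array>
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical local workgroup size: one value per OpenCL dimension.
using Lws = std::array<unsigned int, 3>;

// Calls `measure` once for every candidate and returns the candidate with the
// lowest reported time. `measure` is an assumption of this sketch: it stands in
// for "run the kernel with this LWS and return the elapsed time".
inline Lws pick_fastest_lws(const std::vector<Lws> &candidates,
                            const std::function<double(const Lws &)> &measure)
{
    assert(!candidates.empty());
    Lws    best      = candidates.front();
    double best_time = std::numeric_limits<double>::max();
    for(const Lws &lws : candidates)
    {
        const double elapsed = measure(lws);
        if(elapsed < best_time)
        {
            best_time = elapsed;
            best      = lws;
        }
    }
    return best;
}
```

In this picture, the three tuning modes differ only in how many candidates are passed in: more candidates mean a longer tuning phase and a better chance of finding the optimal LWS.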
+
+If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms", available at the following link:
+
+https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice
+
+Tuning a network from scratch can take a long time and considerably increase the execution time of the first run of your network. For this reason, it is recommended to store the CLTuner's results in a file to amortize this time when you re-use either the same network or functions with the same configurations. The tuning is performed only once for each OpenCL kernel.
+
+CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (e.g. Convolution Layer, Pooling Layer, Fully Connected Layer, ...) can be called multiple times with different parameters, we associate an "id" (called "config_id") with each kernel to distinguish the unique configurations.
+
+    #Example: 2 unique Matrix Multiply configurations
+@code{.cpp}
+    TensorShape a0 = TensorShape(32,32);
+    TensorShape b0 = TensorShape(32,32);
+    TensorShape c0 = TensorShape(32,32);
+    TensorShape a1 = TensorShape(64,64);
+    TensorShape b1 = TensorShape(64,64);
+    TensorShape c1 = TensorShape(64,64);
+
+    CLTensor a0_tensor; // CLGEMM operates on OpenCL tensors
+    CLTensor b0_tensor;
+    CLTensor c0_tensor;
+    CLTensor a1_tensor;
+    CLTensor b1_tensor;
+    CLTensor c1_tensor;
+
+    a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
+    b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
+    c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
+    a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
+    b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
+    c1_tensor.allocator()->init(TensorInfo(c1, 1, DataType::F32));
+
+    CLGEMM gemm0;
+    CLGEMM gemm1;
+
+    // Configuration 0
+    gemm0.configure(&a0_tensor, &b0_tensor, nullptr, &c0_tensor, 1.0f, 0.0f);
+
+    // Configuration 1
+    gemm1.configure(&a1_tensor, &b1_tensor, nullptr, &c1_tensor, 1.0f, 0.0f);
+@endcode
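The role of the "config_id" can be pictured as a key into a table of tuned values: the first time a configuration is seen it is tuned, and later calls with the same id reuse the stored LWS. The sketch below illustrates that idea only. `LwsTable`, `find_or_tune` and the id strings used with it are hypothetical and do not reflect the library's actual data structures or id format.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical local workgroup size: one value per OpenCL dimension.
using Lws = std::array<unsigned int, 3>;

// Hypothetical table that maps a kernel configuration id to its tuned LWS.
class LwsTable
{
public:
    // Returns the stored LWS for `config_id`. On a miss, `tune` is called once
    // and its result is stored, so each unique configuration is tuned only once.
    const Lws &find_or_tune(const std::string &config_id, const std::function<Lws()> &tune)
    {
        auto it = _table.find(config_id);
        if(it == _table.end())
        {
            it = _table.emplace(config_id, tune()).first;
        }
        return it->second;
    }

    // Number of unique configurations seen so far.
    std::size_t size() const
    {
        return _table.size();
    }

private:
    std::map<std::string, Lws> _table;
};
```

With the two GEMM configurations above, such a table would end up with two entries, one per unique set of shapes, no matter how many times each function is run.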
+
+@subsection S1_8_1_cl_tuner_how_to How to use it
+
+All the graph examples in the Compute Library's folder "examples" and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file:
+
+    #Enable CL tuner
+    ./graph_mobilenet --enable-tuner --target=CL
+    ./arm_compute_benchmark --enable-tuner
+
+    #Export/Import to/from a file
+    ./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
+    ./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv
+
+If you are importing the CLTuner's results from a file, the newly tuned LWS values will be appended to it.
+
+Whether you are benchmarking the graph examples or the test cases in arm_compute_benchmark, remember to:
+
+    -# Disable the power management
+    -# Keep the GPU frequency constant
+    -# Run the network multiple times (e.g. 10).
+
+If you are not using the graph API or the benchmark infrastructure, you will need to manually pass a CLTuner object to CLScheduler before configuring any function.
+
+@code{.cpp}
+CLTuner tuner;
+
+// Setup Scheduler
+CLScheduler::get().default_init(&tuner);
+@endcode
+
+After the first run, the CLTuner's results can be exported to a file using the method "save_to_file()".
+- tuner.save_to_file("results.csv");
+
+This file can also be imported using the method "load_from_file()".
+- tuner.load_from_file("results.csv");
+
+*/
+} // namespace
+\ No newline at end of file
diff --git a/docs/user_guide/api.dox b/docs/user_guide/api.dox
new file mode 100644
index 0000000000..39282046a9
--- /dev/null
+++ b/docs/user_guide/api.dox
@@ -0,0 +1,135 @@
+///
+/// Copyright (c) 2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page api Application Programming Interface
+
+@tableofcontents
+
+@section api_overview Overview
+
+In this section we present Compute Library's application programming interface (API) architecture along with
+a detailed explanation of its components. Compute Library's API consists of multiple high-level operators and
+even more internally distinct computational blocks that can be executed on a command queue.
+Operators can be bound to multiple Tensor objects and executed concurrently or asynchronously if needed.
+All operators and associated objects are encapsulated in a Context-based mechanism, which provides all related
+construction services.
+
+@section api_objects Fundamental objects
+
+Compute Library consists of a list of fundamental objects that are responsible for creating and orchestrating operator execution.
+Below we present these objects in more detail.
+
+@subsection api_objects_context AclContext or Context
+
+AclContext or Context acts as a central creational aggregate service. All other objects are bound to or created from a context.
+It provides, internally, common facilities such as
+- allocators for object creation or backing memory allocation
+- serialization interfaces
+- any other modules that affect the construction of objects (e.g., program cache for OpenCL).
+
+The following sections describe the parameters that can be given when creating a Context.
+
+@subsubsection api_object_context_target AclTarget
+Context is initialized with a backend target (AclTarget) as different backends might have a different subset of services.
+Currently the following targets are supported:
+- #AclCpu: a generic CPU target that accelerates primitives through SIMD technologies
+- #AclGpuOcl: a target for GPU acceleration using OpenCL
+
+@subsubsection api_object_context_execution_mode AclExecutionMode
+An execution mode (AclExecutionMode) can be passed as an argument that affects the operator creation.
+At the moment the following execution modes are supported:
+- #AclPreferFastRerun: Provides faster re-run. It can be used when the operators are expected to be executed multiple
+times under the same execution context
+- #AclPreferFastStart: Provides faster single execution. It can be used when the operators will be executed only once,
+thus reducing their latency is important (Currently, it is not implemented)
+
+@subsubsection api_object_context_capabilitys AclTargetCapabilities
+Context creation can also have a list of capabilities of hardware as one of its parameters. This is currently
+available only for the CPU backend. A list of architecture capabilities can be passed to influence the selection
+of the underlying kernels. Such capabilities can be for example the enablement of SVE or the dot product
+instruction explicitly.
+@note The underlying hardware should support the given capability list.
+
+@subsubsection api_object_context_allocator Allocator
+An allocator object that implements @ref AclAllocator can be passed to the Context upon its creation.
+This user-provided allocator will be used for allocation of any internal backing memory.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClContext) or set (@ref AclSetClContext) the internal OpenCL context.
+
+@subsection api_objects_tensor AclTensor or Tensor
+
+A tensor is a mathematical object that can describe physical properties like matrices.
+It can be also considered a generalization of matrices that can represent arbitrary
+dimensionalities. AclTensor is an abstracted interface that represents a tensor.
+
+In addition to the elements of the physical properties it represents, an AclTensor
+also contains information such as shape, data type, data layout and strides. This metadata not only
+fully describes the characteristics of the physical properties but also indicates
+how the object stored in memory should be traversed. @ref AclTensorDescriptor is a dedicated
+object to represent such metadata.
+
+@note The allocation of an AclTensor can be deferred until external memory is imported
+as backing memory to accomplish a zero-copy context.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClMem) the internal OpenCL memory object.
+
+As Tensors can reside in different memory spaces, @ref AclMapTensor and @ref AclUnmapTensor entrypoints
+are provided to map Tensors in and out of the host memory system, respectively.
+
+@subsection api_objects_queue AclQueue or Queue
+
+AclQueue acts as a runtime aggregate service. It provides facilities to schedule
+and execute operators using underlying hardware. It also contains services like
+tuning mechanisms (e.g., Local workgroup size tuning for OpenCL) that can be specified
+during operator execution.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClQueue) or set (@ref AclSetClQueue) the internal OpenCL queue.
+
+@section api_internal Internal
+@subsection api_internal_operator_vs_kernels Operators vs Kernels
+
+Internally, Compute Library separates the executable primitives into two categories, kernels and operators,
+which operate in a hierarchical way.
+
+A kernel is the lowest-level computation block whose responsibility is performing a task on a given group of data.
+For design simplicity, kernel computation does NOT involve the following:
+
+- Memory allocation: All the memory manipulation should be handled by the caller.
+- Multi-threading: The information on how the workload can be split is provided by kernels,
+so the caller can effectively distribute the workload to multiple threads.
+
+On the other hand, operators combine one or multiple kernels to achieve more complex calculations.
+The responsibilities of the operators can be summarized as follows:
+
+- Defining the scheduling policy and dispatching of the underlying kernels to the hardware backend
+- Providing information to the caller required by the computation (e.g., memory requirements)
+- Allocation of any required auxiliary memory if it isn't given by its caller explicitly
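The split of responsibilities above can be sketched with a toy kernel and operator. This is an illustration under stated assumptions rather than the library's design: `ScaleKernel`, `ScaleOperator` and `split_window` are hypothetical names, and the operator here runs the sub-ranges one after another where a real scheduler would hand each one to a thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical kernel: performs one task on a given range of data. It neither
// allocates memory nor creates threads; it only reports how much work there is
// so that the caller can decide how to split it.
struct ScaleKernel
{
    float factor;

    std::size_t window_size(const std::vector<float> &data) const
    {
        return data.size();
    }

    void run(std::vector<float> &data, std::size_t begin, std::size_t end) const
    {
        for(std::size_t i = begin; i < end; ++i)
        {
            data[i] *= factor;
        }
    }
};

// Splits [0, total) into at most `parts` contiguous sub-ranges.
inline std::vector<std::pair<std::size_t, std::size_t>> split_window(std::size_t total, std::size_t parts)
{
    assert(parts > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t chunk = (total + parts - 1) / parts;
    for(std::size_t begin = 0; begin < total; begin += chunk)
    {
        ranges.emplace_back(begin, std::min(begin + chunk, total));
    }
    return ranges;
}

// Hypothetical operator: owns the scheduling policy and dispatches the kernel.
struct ScaleOperator
{
    ScaleKernel kernel;
    std::size_t num_splits;

    void run(std::vector<float> &data) const
    {
        // Sequential dispatch keeps the sketch simple; each range is independent.
        for(const auto &range : split_window(kernel.window_size(data), num_splits))
        {
            kernel.run(data, range.first, range.second);
        }
    }
};
```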
+
+*/
+} // namespace arm_compute
diff --git a/docs/user_guide/data_layout.dox b/docs/user_guide/data_layout.dox
new file mode 100644
index 0000000000..48f15acd63
--- /dev/null
+++ b/docs/user_guide/data_layout.dox
@@ -0,0 +1,41 @@
+///
+/// Copyright (c) 2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+
+namespace arm_compute
+{
+/**
+@page data_layout_support Data Layout Support
+
+@section data_layout_support_supported_data_layout Supported Data Layouts
+
+Compute Library supports the following data layouts, where
+the right-most letter represents the fastest changing dimension:
+
+- NHWC: The native layout of Compute Library that delivers the best performance where channels are in the fastest changing dimension
+- NCHW: Legacy layout where width is in the fastest changing dimension
+
+Here N = batch, C = channel, H = height, W = width.
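To make "fastest changing dimension" concrete, the sketch below computes the linear element offset of the same (n, c, h, w) coordinate under both layouts. It assumes a densely packed buffer with no padding or custom strides, which real tensors may not satisfy. `Shape4D`, `offset_nchw` and `offset_nhwc` are hypothetical helpers, not library functions.

```cpp
#include <cstddef>

// Hypothetical dense 4D shape: batch, channels, height, width.
struct Shape4D
{
    std::size_t n, c, h, w;
};

// NCHW: width is the right-most letter, so consecutive w values are adjacent in memory.
inline std::size_t offset_nchw(const Shape4D &s, std::size_t n, std::size_t c, std::size_t h, std::size_t w)
{
    return ((n * s.c + c) * s.h + h) * s.w + w;
}

// NHWC: channel is the right-most letter, so consecutive c values are adjacent in memory.
inline std::size_t offset_nhwc(const Shape4D &s, std::size_t n, std::size_t c, std::size_t h, std::size_t w)
{
    return ((n * s.h + h) * s.w + w) * s.c + c;
}
```

For a 1x3x2x2 tensor, stepping one channel moves 4 elements in NCHW but only 1 element in NHWC, while stepping one pixel in width moves 1 element in NCHW and 3 elements in NHWC.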
+
+*/
+} // namespace
diff --git a/docs/user_guide/data_type.dox b/docs/user_guide/data_type.dox
new file mode 100644
index 0000000000..7083270a07
--- /dev/null
+++ b/docs/user_guide/data_type.dox
@@ -0,0 +1,47 @@
+///
+/// Copyright (c) 2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page data_type_support Data Type Support
+
+@tableofcontents
+
+@section data_type_support_supported_data_type Supported Data Types
+
+Compute Library supports the following list of data types. More detailed information
+can be found in the documentation of each operator since the data types supported
+by each operator vary.
+
+- BFLOAT16: 16-bit non-standard brain floating point
+- QASYMM8: 8-bit unsigned asymmetric quantized
+- QASYMM8_SIGNED: 8-bit signed asymmetric quantized
+- QSYMM8_PER_CHANNEL: 8-bit signed symmetric quantized (Used for the weights)
+- QSYMM8: 8-bit signed symmetric quantized
+- QSYMM16: 16-bit signed symmetric quantized
+- F32: 32-bit single precision floating point
+- F16: 16-bit half precision floating point
+- S32: 32-bit signed integer
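As an illustration of what "asymmetric quantized" means for a type such as QASYMM8, the sketch below maps between a real value and an 8-bit unsigned value using a scale and an offset, with real = scale * (q - offset). The names `QuantParams`, `quantize_u8` and `dequantize_u8` are hypothetical, and the round-to-nearest policy is an assumption of this sketch; the library's own helpers and rounding behaviour may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical per-tensor quantization parameters.
struct QuantParams
{
    float scale;  // size of one quantization step in real units
    int   offset; // quantized value that represents real 0
};

// Real value -> 8-bit unsigned, saturating to [0, 255].
// Rounding to nearest is an assumption made for this sketch.
inline std::uint8_t quantize_u8(float value, const QuantParams &qp)
{
    const int q = static_cast<int>(std::lround(value / qp.scale)) + qp.offset;
    return static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
}

// 8-bit unsigned -> real value: real = scale * (q - offset).
inline float dequantize_u8(std::uint8_t q, const QuantParams &qp)
{
    return qp.scale * static_cast<float>(static_cast<int>(q) - qp.offset);
}
```

A symmetric type corresponds to the special case where the offset is zero, so only the scale has to be stored.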
+*/
+} // namespace
diff --git a/docs/user_guide/errata.dox b/docs/user_guide/errata.dox
new file mode 100644
index 0000000000..0c8d684017
--- /dev/null
+++ b/docs/user_guide/errata.dox
@@ -0,0 +1,76 @@
+///
+/// Copyright (c) 2019-2020 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page errata Errata
+
+@tableofcontents
+
+@section S7_1_errata Errata
+
+- Under certain conditions, CLFullyConnectedLayer quantized tests may fail due to an issue in the test framework.
+    - Versions Affected: 21.02
+    - OSs Affected: Linux
+    - Conditions:
+        - armv7a architecture
+        - release mode
+        - asserts enabled
+
+- A wrong test configuration has been found in the CLGEMMMatrixMultiplyReshapedOnlyRHS set of tests.
+    - Versions Affected: >= 20.11
+    - Conditions:
+        - Data type input: F32/F16
+        - Fused bounded relu activation with coefficient 'a' being negative
+
+- Under certain conditions, the validation test case 'CL/DirectConvolutionLayer/Float/FP32/RunSmall9x9\@InputShape=32x37x3x4:StrideX=1:StrideY=1:PadX=0:PadY=0:KernelSize=9:NumKernels=1:DataType=F32:ActivationInfo=LU_BOUNDED_RELU:DataLayout=NHWC' may fail.
+    - Versions Affected: >= v20.08
+    - Conditions:
+        - The validation suite has to run in nightly mode and execute 40k+ test cases before the test mentioned above
+
+- Under certain conditions, benchmark examples can hang when OpenCL profiling queues are enabled.
+    - Versions Affected: >= v19.11
+    - OSs Affected: Linux
+    - Conditions:
+        - Arm® Mali™ DDK r1p0 - r8p0, and
+        - Linux kernel >= 4.4
+
+- On Android with arm64-v8a/arm64-v8.2-a architecture, Arm® Neon™ validation tests can fail when compiled using Android NDK
+  >= r18b in debug mode (https://github.com/android/ndk/issues/1135).
+    - Versions Affected: >= v19.11
+    - OSs Affected: Android
+    - Conditions:
+        - arm64-v8a/arm64-v8.2-a architecture, and
+        - Compiled using Android NDK >= r18b in debug mode.
+
+- An issue has been identified with CLCast.
+    - Versions Affected: >= 18.11
+    - Conditions:
+        - Data type input: F32
+        - Data type output: All integer types
+        - Conversion policy: SATURATE
+    - Result: OpenCL backend will always wrap around instead of saturating for out-of-range inputs
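The difference between the two behaviours named in the CLCast erratum can be illustrated for a float to 8-bit unsigned conversion. This is a plain C++ sketch of saturating versus wrap-around semantics, not the OpenCL kernel code; `cast_saturate_u8` and `cast_wrap_u8` are hypothetical helpers, and the wrap-around variant assumes the input fits in a 64-bit signed integer.

```cpp
#include <algorithm>
#include <cstdint>

// Saturating conversion: out-of-range inputs are clamped to [0, 255].
inline std::uint8_t cast_saturate_u8(float value)
{
    const float clamped = std::min(255.0f, std::max(0.0f, value));
    return static_cast<std::uint8_t>(clamped);
}

// Wrap-around conversion: the truncated integer is reduced modulo 256.
// Going through a wide signed integer keeps the conversion well defined,
// provided the input fits in std::int64_t (an assumption of this sketch).
inline std::uint8_t cast_wrap_u8(float value)
{
    return static_cast<std::uint8_t>(static_cast<std::int64_t>(value));
}
```

For an input of 300.0f the saturating version yields 255 while the wrap-around version yields 44, which is the kind of mismatch the erratum describes for out-of-range inputs.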
+
+*/
+} // namespace
diff --git a/docs/user_guide/how_to_build_and_run_examples.dox b/docs/user_guide/how_to_build_and_run_examples.dox
new file mode 100644
index 0000000000..e57183e891
--- /dev/null
+++ b/docs/user_guide/how_to_build_and_run_examples.dox
@@ -0,0 +1,541 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page how_to_build How to Build and Run Examples
+
+@tableofcontents
+
+@section S1_1_build_options Build options
+
+scons 2.3 or above is required to build the library.
+To see the build options available simply run ```scons -h```:
+
+        debug: Debug (yes|no)
+            default: False
+
+        asserts: Enable asserts (this flag is forced to 1 for debug=1) (yes|no)
+            default: False
+
+        logging: Logging (this flag is forced to 1 for debug=1) (yes|no)
+            default: False
+
+        arch: Target Architecture (armv7a|arm64-v8a|arm64-v8.2-a|arm64-v8.2-a-sve|arm64-v8.2-a-sve2|x86_32|x86_64|armv8a|armv8.2-a|armv8.2-a-sve|armv8.6-a|armv8.6-a-sve|armv8.6-a-sve2|armv8r64|x86)
+            default: armv7a
+
+        estate: Execution State (auto|32|64)
+            default: auto
+
+        os: Target OS (linux|android|macos|tizen|bare_metal)
+            default: linux
+
+        build: Build type (native|cross_compile|embed_only)
+            default: cross_compile
+
+        examples: Build example programs (yes|no)
+            default: True
+
+        gemm_tuner: Build gemm_tuner programs (yes|no)
+            default: True
+
+        Werror: Enable/disable the -Werror compilation flag (yes|no)
+            default: True
+
+        standalone: Builds the tests as standalone executables, links statically with libgcc, libstdc++ and libarm_compute (yes|no)
+            default: False
+
+        opencl: Enable OpenCL support (yes|no)
+            default: True
+
+        neon: Enable Arm® Neon™ support (yes|no)
+            default: False
+
+        embed_kernels: Embed OpenCL kernels in library binary (yes|no)
+            default: True
+
+        compress_kernels: Compress embedded OpenCL kernels in library binary. Note embed_kernels should be enabled as well (yes|no)
+            default: False
+
+        set_soname: Set the library's soname and shlibversion (requires SCons 2.4 or above) (yes|no)
+            default: False
+
+        openmp: Enable OpenMP backend (yes|no)
+            default: False
+
+        cppthreads: Enable C++11 threads backend (yes|no)
+            default: True
+
+        build_dir: Specify sub-folder for the build ( /path/to/build_dir )
+            default: .
+
+        install_dir: Specify sub-folder for the install ( /path/to/install_dir )
+            default:
+
+        exceptions: Enable/disable C++ exception support (yes|no)
+            default: True
+
+        linker_script: Use an external linker script ( /path/to/linker_script )
+            default:
+
+        custom_options: Custom options that can be used to turn on/off features
+            (all|none|comma-separated list of names)
+            allowed names: disable_mmla_fp
+            default: none
+
+        data_type_support: Enable a list of data types to support
+            (all|none|comma-separated list of names)
+            allowed names: qasymm8 qasymm8_signed qsymm16 fp16 fp32
+            default: all
+
+        toolchain_prefix: Override the toolchain prefix
+            default:
+
+        compiler_prefix: Override the compiler prefix
+            default:
+
+        extra_cxx_flags: Extra CXX flags to be appended to the build command
+            default:
+
+        extra_link_flags: Extra LD flags to be appended to the build command
+            default:
+
+        compiler_cache: Command to prefix to the C and C++ compiler (e.g ccache)
+            default:
+
+        specs_file: Specs file to use
+            default: rdimon.specs
+
+        benchmark_examples: Build benchmark examples programs (yes|no)
+            default: False
+
+        validate_examples: Build validate examples programs (yes|no)
+            default: False
+
+        reference_openmp: Build reference validation with openmp (yes|no)
+            default: True
+
+        validation_tests: Build validation test programs (yes|no)
+            default: False
+
+        benchmark_tests: Build benchmark test programs (yes|no)
+            default: False
+
+        test_filter: Pattern to specify the tests' filenames to be compiled
+            default: *.cpp
+
+        pmu: Enable PMU counters (yes|no)
+            default: False
+
+        mali: Enable Arm® Mali™ hardware counters (yes|no)
+            default: False
+
+        external_tests_dir: Add examples, benchmarks and tests to the tests suite from an external path ( /path/to/external_tests_dir )
+            default:
+
+@b debug / @b asserts:
+ - With debug=1 asserts are enabled, and the library is built with symbols and no optimisations enabled.
+ - With debug=0 and asserts=1: Optimisations are enabled and symbols are removed, however all the asserts are still present (This is about 20% slower than the release build)
+ - With debug=0 and asserts=0: All optimisations are enabled and no validation is performed. If the application misuses the library, it is likely to result in a crash. (Only use this mode once you are sure your application is working as expected).
+
+@b arch: The x86_32 and x86_64 targets can only be used with neon=0 and opencl=1.
+
+@b os: Choose the operating system you are targeting: Linux, Android or bare metal.
+@note bare metal can only be used for Arm® Neon™ (not OpenCL), only static libraries get built and Neon's multi-threading support is disabled.
+
+@b build: You can either build directly on your device (native) or cross-compile from your desktop machine (cross_compile). In both cases make sure the compiler is available in your path.
+
+@note If you want to natively compile for 32bit on a 64bit Arm device running a 64bit OS then you will have to use cross-compile too.
+
+There is also an 'embed_only' option which will generate all the .embed files for the OpenCL kernels. This might be useful if using a different build system to compile the library.
+
+In addition, the option 'compress_kernels' will compress the embedded OpenCL kernel files using zlib and inject them in the library. This is useful for reducing the binary size. Note, this option is only available for Android when 'embed_kernels' is enabled.
+
+@b Werror: If you are compiling using the same toolchains as the ones used in this guide then there shouldn't be any warning and therefore you should be able to keep Werror=1. If the library fails to build with a different compiler version because of warnings treated as errors and you are sure the warnings are not important, you might want to try building with Werror=0 (but please do report the issue on GitHub).
+
+@b opencl / @b neon: Choose which SIMD technology you want to target. (Neon for Arm Cortex-A CPUs or OpenCL for Arm® Mali™ GPUs)
+
+@b embed_kernels: For OpenCL only: set embed_kernels=1 if you want the OpenCL kernels to be built in the library's binaries instead of being read from separate ".cl" / ".cs" files. If embed_kernels is set to 0 then the application can set the path to the folder containing the OpenCL kernel files by calling CLKernelLibrary::init(). By default the path is set to "./cl_kernels".
+
+@b set_soname: Whether to build the versioned version of the library.
+
+If enabled the library will contain a SONAME and SHLIBVERSION and some symlinks will automatically be created between the objects.
+Example:
+  libarm_compute_core.so -> libarm_compute_core.so.1.0.0
+  libarm_compute_core.so.1 -> libarm_compute_core.so.1.0.0
+  libarm_compute_core.so.1.0.0
+
+@note This option is disabled by default as it requires SCons version 2.4 or above.
+
+@b extra_cxx_flags: Custom CXX flags which will be appended to the end of the build command.
+
+@b build_dir: Build the library in a subfolder of the "build" folder. (This allows several configurations to be built in parallel).
+
+@b examples: Whether to build the examples.
+
+@b validation_tests: Enable the build of the validation suite.
+
+@b benchmark_tests: Enable the build of the benchmark tests
+
+@b pmu: Enable the PMU cycle counter to measure execution time in benchmark tests. (Your device needs to support it)
+
+@b mali: Enable the collection of Arm® Mali™ hardware counters to measure execution time in benchmark tests. (Your device needs to have an Arm® Mali™ driver that supports it)
+
+@b openmp: Build in the OpenMP scheduler for Neon.
+
+@note Only works when building with g++ not clang++
+
+@b cppthreads: Build in the C++11 scheduler for Neon.
+
+@sa Scheduler::set
+
+@b external_tests_dir: Add examples, benchmarks and tests to the test suite from an external path ( /path/to/external_tests_dir )
+
+In order to use this option, the external tests directory must have the following structure:
+
+    EXTERNAL_TESTS_DIR:
+    └── tests
+        ├── benchmark
+        │   ├── CL
+        │   ├── datasets
+        │   ├── fixtures
+        │   └── Neon
+        └── validation
+            ├── CL
+            ├── datasets
+            ├── fixtures
+            └── Neon
+
+Then, build the library with `external_tests_dir=<PATH_TO_EXTERNAL_TESTS_DIR>`.
+
+@section S1_2_linux Building for Linux
+
+@subsection S1_2_1_library How to build the library ?
+
+For Linux, the library was successfully built and tested using the following Linaro GCC toolchain:
+
+ - gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
+ - gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu
+
+To cross-compile the library in debug mode, with Arm® Neon™ only support, for Linux 32bit:
+
+	scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=linux arch=armv7a
+
+To cross-compile the library in asserts mode, with OpenCL only support, for Linux 64bit:
+
+	scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a
+
+You can also compile the library natively on an Arm device by using <b>build=native</b>:
+
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=arm64-v8a build=native
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=native
+
+@note g++ for Arm is mono-arch, therefore if you want to compile for Linux 32bit on a Linux 64bit platform you will have to use a cross compiler.
+
+For example on a 64bit Debian based system you would have to install <b>g++-arm-linux-gnueabihf</b>
+
+	apt-get install g++-arm-linux-gnueabihf
+
+Then run
+
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=cross_compile
+
+or simply remove the build parameter as build=cross_compile is the default value:
+
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a
+
+@subsection S1_2_2_examples How to manually build the examples ?
+
+The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
+
+@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
+
+To cross compile an Arm® Neon™ example for Linux 32bit:
+
+	arm-linux-gnueabihf-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute -larm_compute_core -o neon_convolution
+
+To cross compile an Arm® Neon™ example for Linux 64bit:
+
+	aarch64-linux-gnu-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -L. -larm_compute -larm_compute_core -o neon_convolution
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+To cross compile an OpenCL example for Linux 32bit:
+
+	arm-linux-gnueabihf-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
+
+To cross compile an OpenCL example for Linux 64bit:
+
+	aarch64-linux-gnu-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+To cross compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
+
+For example, to cross compile the "graph_lenet" example for Linux 32bit:
+
+	arm-linux-gnueabihf-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
+
+For example, to cross compile the "graph_lenet" example for Linux 64bit:
+
+	aarch64-linux-gnu-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core
+
+To compile natively (i.e. directly on an Arm device) for Arm® Neon™ for Linux 32bit:
+
+	g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -larm_compute -larm_compute_core -o neon_convolution
+
+To compile natively (i.e. directly on an Arm device) for Arm® Neon™ for Linux 64bit:
+
+	g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute -larm_compute_core -o neon_convolution
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option)
+
+To compile natively (i.e. directly on an Arm device) for OpenCL for Linux 32bit or Linux 64bit:
+
+	g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
+
+To compile natively the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
+
+For example, to natively compile the "graph_lenet" example for Linux 32bit:
+
+	g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
+
+For example, to natively compile the "graph_lenet" example for Linux 64bit:
+
+	g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option)
+
+@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core
+
+@note These two commands assume libarm_compute.so is available in your library path. If it is not, add the path to it using -L (e.g. -Llib/linux-arm64-v8a-neon-cl-asserts/)
+@note You might need to export the path to OpenCL library as well in your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.
+
+To run the built executable simply run:
+
+	LD_LIBRARY_PATH=build ./neon_convolution
+
+or
+
+	LD_LIBRARY_PATH=build ./cl_convolution
+
+@note Examples accept different types of arguments. To find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
+
+For example:
+
+	LD_LIBRARY_PATH=. ./graph_lenet --help
+
+Below is a list of the common parameters among the graph examples :
+@snippet utils/CommonGraphOptions.h Common graph examples parameters
+
+@subsection S1_2_3_sve Build for SVE or SVE2
+
+In order to build for SVE or SVE2 you need a compiler that supports them. You can find more information at the following links:
+    -# GCC: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/sve-support
+    -# LLVM: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/llvm-toolchain/sve-support
+
+@note You then need to indicate the toolchain using the scons "toolchain_prefix" parameter.
+
+An example build command with SVE is:
+
+        scons arch=arm64-v8.2-a-sve os=linux build_dir=arm64 -j55 standalone=0 opencl=0 openmp=0 validation_tests=1 neon=1 cppthreads=1 toolchain_prefix=aarch64-none-linux-gnu-
+
+@section S1_3_android Building for Android
+
+For Android, the library was successfully built and tested using Google's standalone toolchains:
+ - clang++ from NDK r18b for armv7a
+ - clang++ from NDK r20b for arm64-v8a
+ - clang++ from NDK r20b for arm64-v8.2-a with FP16 support
+
+For NDK r18 or older, here is a guide to <a href="https://developer.android.com/ndk/guides/standalone_toolchain.html">create your Android standalone toolchains from the NDK</a>:
+- Download the NDK r18b from here: https://developer.android.com/ndk/downloads/index.html to directory $NDK
+- Make sure you have Python 2.7 installed on your machine.
+- Generate the 32-bit and/or 64-bit toolchains in your toolchain directory $MY_TOOLCHAINS by running the following commands:
+
+	$NDK/build/tools/make_standalone_toolchain.py --arch arm64 --install-dir $MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b --stl libc++ --api 21
+	$NDK/build/tools/make_standalone_toolchain.py --arch arm --install-dir $MY_TOOLCHAINS/arm-linux-android-ndk-r18b --stl libc++ --api 21
+
+For NDK r19 or newer, you can directly <a href="https://developer.android.com/ndk/downloads">Download</a> the NDK package for your development platform, without the need to launch the make_standalone_toolchain.py script. You can find all the prebuilt binaries inside $NDK/toolchains/llvm/prebuilt/$OS_ARCH/bin/.
+@attention The build script will look for a binary named "aarch64-linux-android-clang++", while the prebuilt binaries have their API version as a suffix in their filename (e.g. "aarch64-linux-android21-clang++"). You should copy or rename the binary to remove this suffix or, alternatively, create an alias for it.
+
+@attention We used to use gnustl but as of NDK r17 it is deprecated so we switched to libc++
+
+@note Make sure to add the toolchains to your PATH:
+
+	export PATH=$PATH:$MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b/bin:$MY_TOOLCHAINS/arm-linux-android-ndk-r18b/bin
+
+@subsection S1_3_1_library How to build the library ?
+
+To cross-compile the library in debug mode, with Arm® Neon™ only support, for Android 32bit:
+
+	CXX=clang++ CC=clang scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=android arch=armv7a
+
+To cross-compile the library in asserts mode, with OpenCL only support, for Android 64bit:
+
+	CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=arm64-v8a
+
+@subsection S1_3_2_examples How to manually build the examples ?
+
+The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
+
+@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
+
+Once you've got your Android standalone toolchain built and added to your path you can do the following:
+
+To cross compile an Arm® Neon™ example:
+
+	#32 bit:
+	arm-linux-androideabi-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_arm -static-libstdc++ -pie
+	#64 bit:
+	aarch64-linux-android-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_aarch64 -static-libstdc++ -pie
+
+To cross compile an OpenCL example:
+
+	#32 bit:
+	arm-linux-androideabi-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
+	#64 bit:
+	aarch64-linux-android-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
+
+To cross compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the library arm_compute_graph also.
+
+	#32 bit:
+	arm-linux-androideabi-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
+	#64 bit:
+	aarch64-linux-android-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
+
+@note Due to some issues in older versions of the Arm® Mali™ OpenCL DDK (<= r13p0), we recommend linking arm_compute statically on Android.
+@note When linked statically the arm_compute_graph library currently needs the --whole-archive linker flag in order to work properly
+
+Then all you need to do is upload the executable and the shared library to the device using ADB:
+
+	adb push neon_convolution_arm /data/local/tmp/
+	adb push cl_convolution_arm /data/local/tmp/
+	adb push gc_absdiff_arm /data/local/tmp/
+	adb shell chmod 777 -R /data/local/tmp/
+
+And finally to run the example:
+
+	adb shell /data/local/tmp/neon_convolution_arm
+	adb shell /data/local/tmp/cl_convolution_arm
+	adb shell /data/local/tmp/gc_absdiff_arm
+
+For 64bit:
+
+	adb push neon_convolution_aarch64 /data/local/tmp/
+	adb push cl_convolution_aarch64 /data/local/tmp/
+	adb push gc_absdiff_aarch64 /data/local/tmp/
+	adb shell chmod 777 -R /data/local/tmp/
+
+And finally to run the example:
+
+	adb shell /data/local/tmp/neon_convolution_aarch64
+	adb shell /data/local/tmp/cl_convolution_aarch64
+	adb shell /data/local/tmp/gc_absdiff_aarch64
+
+@note Examples accept different types of arguments. To find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
+
+For example:
+
+	adb shell /data/local/tmp/graph_lenet --help
+
+In this case the first argument of LeNet (like all the graph examples) is the target (i.e. 0 to run on Neon, 1 to run on OpenCL if available, 2 to run on OpenCL using the CLTuner), the second argument is the path to the folder containing the npy files for the weights, and the third argument is the number of batches to run.
+
+@section S1_4_macos Building for macOS
+
+The library was successfully natively built for Apple Silicon under macOS 11.1 using clang v12.0.0.
+
+To natively compile the library with accelerated CPU support:
+
+	scons Werror=1 -j8 neon=1 opencl=0 os=macos arch=arm64-v8a build=native
+
+@note Initial support disables feature discovery through HWCAPS and thread scheduling affinity controls
+
+@section S1_5_bare_metal Building for bare metal
+
+For bare metal, the library was successfully built using Linaro's latest (gcc-linaro-6.3.1-2017.05) bare metal toolchains:
+ - arm-eabi for armv7a
+ - aarch64-elf for arm64-v8a
+
+Download the Linaro toolchains for <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/arm-eabi/">armv7a</a> and <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/aarch64-elf/">arm64-v8a</a>.
+
+@note Make sure to add the toolchains to your PATH: export PATH=$PATH:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-elf/bin:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_arm-eabi/bin
+
+@subsection S1_5_1_library How to build the library ?
+
+To cross-compile the library with Arm® Neon™ support for baremetal arm64-v8a:
+
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=bare_metal arch=arm64-v8a build=cross_compile cppthreads=0 openmp=0 standalone=1
+
+@subsection S1_5_2_examples How to manually build the examples ?
+
+Examples are disabled when building for bare metal. If you want to build the examples you need to provide a custom bootcode depending on the target architecture and link against the compute library. More information about bare metal bootcode can be found <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a/index.html">here</a>.
+
+@section S1_6_windows_host Building on a Windows host system
+
+Using `scons` directly from the Windows command line is known to cause
+problems. The reason seems to be that if `scons` is setup for cross-compilation
+it gets confused about Windows style paths (using backslashes). Thus it is
+recommended to follow one of the options outlined below.
+
+@subsection S1_6_1_ubuntu_on_windows Bash on Ubuntu on Windows
+
+The best and easiest option is to use
+<a href="https://msdn.microsoft.com/en-gb/commandline/wsl/about">Ubuntu on Windows</a>.
+This feature is still marked as *beta* and thus might not be available.
+However, if it is, building the library is as simple as opening a *Bash on
+Ubuntu on Windows* shell and following the general guidelines given above.
+
+@subsection S1_6_2_cygwin Cygwin
+
+If the Windows subsystem for Linux is not available, <a href="https://www.cygwin.com/">Cygwin</a>
+can be used to install and run `scons`. The Cygwin version must be 3.0.7 or later. In addition
+to the default packages installed by Cygwin, `scons` has to be selected in the installer. (`git` might
+also be useful but is not strictly required if you already have got the source
+code of the library.) Linaro provides pre-built versions of
+<a href="http://releases.linaro.org/components/toolchain/binaries/">GCC cross-compilers</a>
+that can be used from the Cygwin terminal. When building for Android the
+compiler is included in the Android standalone toolchain. After everything has
+been set up in the Cygwin terminal the general guide on building the library
+can be followed.
+
+@section S1_7_cl_requirements OpenCL DDK Requirements
+
+@subsection S1_7_1_cl_hard_requirements Hard Requirements
+
+Compute Library requires OpenCL 1.1 and above with support of non uniform workgroup sizes, which is officially supported in the Arm® Mali™ OpenCL DDK r8p0 and above as an extension (respective extension flag is \a -cl-arm-non-uniform-work-group-size).
+
+Enabling 16-bit floating point calculations requires the \a cl_khr_fp16 extension to be supported. All Arm® Mali™ GPUs with compute capabilities have native support for half precision floating points.
+
+@subsection S1_7_2_cl_performance_requirements Performance improvements
+
+Integer dot product built-in function extensions (and therefore optimized kernels) are available with Arm® Mali™ OpenCL DDK r22p0 and above for the following GPUs : G71, G76. The relevant extensions are \a cl_arm_integer_dot_product_int8, \a cl_arm_integer_dot_product_accumulate_int8 and \a cl_arm_integer_dot_product_accumulate_int16.
+
+OpenCL kernel level debugging can be simplified with the use of printf, this requires the \a cl_arm_printf extension to be supported.
+
+SVM allocations are supported for all the underlying allocations in Compute Library. To enable this OpenCL 2.0 and above is a requirement.
+
+*/
+} // namespace arm_compute
diff --git a/docs/user_guide/introduction.dox b/docs/user_guide/introduction.dox
new file mode 100644
index 0000000000..a956d7dd52
--- /dev/null
+++ b/docs/user_guide/introduction.dox
@@ -0,0 +1,74 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page introduction Introduction
+
+@tableofcontents
+
+The Compute Library is a collection of low-level machine learning functions optimized for both Arm CPUs and GPUs using SIMD technologies.
+
+Several builds of the library are available using various configurations:
+ - OS: Linux, Android, macOS or bare metal.
+ - Architecture: armv7a (32bit) or arm64-v8a (64bit).
+ - Technology: Arm® Neon™ / OpenCL / Arm® Neon™ and OpenCL.
+ - Debug / Asserts / Release: Use a build with asserts enabled to debug your application and enable extra validation. Once you are sure your application works as expected you can switch to a release build of the library for maximum performance.
+
+@section S0_1_contact Contact / Support
+
+Please create an issue on <a href="https://github.com/ARM-software/ComputeLibrary/issues">GitHub</a>.
+
+In order to facilitate the work of the support team please provide the build information of the library you are using. To get the version of the library you are using simply run:
+
+    $ strings android-armv7a-cl-asserts/libarm_compute.so | grep arm_compute_version
+    arm_compute_version=v16.12 Build options: {'embed_kernels': '1', 'opencl': '1', 'arch': 'armv7a', 'neon': '0', 'asserts': '1', 'debug': '0', 'os': 'android', 'Werror': '1'} Git hash=f51a545d4ea12a9059fe4e598a092f1fd06dc858
+
+@section S0_2_prebuilt_binaries Pre-built binaries
+
+For each release we provide some pre-built binaries of the library [here](https://github.com/ARM-software/ComputeLibrary/releases)
+
+These binaries have been built using the following toolchains:
+            - Linux armv7a: gcc-linaro-7.2.1-2017.11-x86_64_arm-linux-gnueabihf
+            - Linux arm64-v8a: gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu
+            - Android armv7a: clang++ / libc++ NDK r18b
+            - Android arm64-v8a: clang++ / libc++ NDK r20b
+
+@warning Make sure to use a compatible toolchain to build your application or you will get some std::bad_alloc errors at runtime.
+
+@section S0_3_file_organisation File organisation
+
+This archive contains:
+ - The arm_compute header and source files
+ - The latest Khronos OpenCL 1.2 C headers from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a>
+ - The latest Khronos cl2.hpp from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a> (API version 2.1 when this document was written)
+ - The latest Khronos EGL 1.5 C headers from the <a href="https://www.khronos.org/registry/gles/">Khronos EGL registry</a>
+ - The sources for a stub version of libOpenCL.so, libGLESv1_CM.so, libGLESv2.so and libEGL.so to help you build your application.
+ - An examples folder containing a few examples to compile and link against the library.
+ - A @ref utils folder containing headers with some boiler plate code used by the examples.
+ - This documentation.
+
+ For detailed information about file organization, please refer to Files -> File List section of this documentation.
+
+*/
+} // namespace arm_compute
diff --git a/docs/user_guide/library.dox b/docs/user_guide/library.dox
new file mode 100644
index 0000000000..2e3cc967ea
--- /dev/null
+++ b/docs/user_guide/library.dox
@@ -0,0 +1,402 @@
+///
+/// Copyright (c) 2017-2020 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page architecture Library Architecture
+
+@tableofcontents
+
+@section S4_1_1 Core vs Runtime libraries
+
+The Core library is a low-level collection of algorithm implementations. It is designed to be embedded in existing projects and applications:
+
+- It doesn't allocate any memory (All the memory allocations/mappings have to be handled by the caller).
+- It doesn't perform any kind of multi-threading (but provides information to the caller about how the workload can be split).
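+
+To make the idea of a caller-side workload split concrete, here is a minimal sketch that divides a 1D range of rows into contiguous chunks, one per worker. The function name `split_range` is hypothetical and is not part of the Compute Library API; the library's own splitting information is not reproduced here.
+
```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not a Compute Library function): split [0, total)
// into `parts` contiguous half-open chunks whose sizes differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>> split_range(std::size_t total, std::size_t parts)
{
    assert(parts > 0);
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    const std::size_t base      = total / parts; // minimum chunk size
    const std::size_t remainder = total % parts; // first `remainder` chunks get one extra item
    std::size_t       begin     = 0;
    for(std::size_t i = 0; i < parts; ++i)
    {
        const std::size_t size = base + (i < remainder ? 1 : 0);
        chunks.emplace_back(begin, begin + size);
        begin += size;
    }
    return chunks;
}
```
+
+With such a helper, the caller could hand each chunk to its own thread and run the same kernel on a different sub-range.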
+
+The Runtime library is a very basic wrapper around the Core library which can be used for quick prototyping, it is basic in the sense that:
+
+- It allocates images and tensors by using standard malloc().
+- It multi-threads Arm® Neon™ code in a very basic way using a very simple pool of threads.
+- For OpenCL it uses the default CLScheduler command queue for all mapping operations and kernels.
+
+For maximum performance, users are expected to re-implement an equivalent of the Runtime library that better suits their needs (with a more clever multi-threading strategy, load-balancing between Arm® Neon™ and OpenCL, etc.)
+
+@section S4_1_3 Fast-math support
+
+Compute Library supports different types of convolution methods. The fast-math flag is only used for the Winograd algorithm.
+When the fast-math flag is enabled, both Arm® Neon™ and CL convolution layers will try to dispatch the fastest implementation available, which may introduce a drop in accuracy as well. The different scenarios involving the fast-math flag are presented below:
+- For FP32:
+    - no-fast-math: Only supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7
+    - fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
+- For FP16:
+    - no-fast-math: No Winograd support
+    - fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
+
+@section S4_1_4 Thread-safety
+
+Although the library supports multi-threading during workload dispatch, thus parallelizing the execution of the workload across multiple threads, the current runtime module implementation is not thread-safe in the sense of executing different functions from separate threads.
+This is due to the fact that the provided scheduling mechanism wasn't designed with thread-safety in mind.
+As is true of the rest of the runtime library, a custom scheduling mechanism can be re-implemented to account for thread-safety if needed and be injected as the library's default scheduler.
+
+@section S4_5_algorithms Algorithms
+
+All computer vision algorithms in this library have been implemented following the [OpenVX 1.1 specifications](https://www.khronos.org/registry/vx/specs/1.1/html/). Please refer to the Khronos documentation for more information.
+
+@section S4_6_images_tensors Images, padding, border modes and tensors
+
+Most kernels and functions in the library process images, however, in order to be future proof most of the kernels actually accept tensors. See below for more information about how they are related.
+
+@attention Each memory object can be written by only one kernel, however it can be read by several kernels. Writing to the same object from several kernels will result in undefined behavior. The kernel writing to an object must be configured before the kernel(s) reading from it.
+
+@subsection S4_6_1_padding_and_border Padding and border modes
+
+Several algorithms require a neighborhood around the current pixel to compute its value. This means the algorithm will not be able to process the borders of the image unless you give it more information about how those border pixels should be processed. The @ref BorderMode enum is used for this purpose.
+
+You have 3 types of @ref BorderMode :
+
+- @ref BorderMode::UNDEFINED : Neighbor pixels outside of the image are treated as undefined. As a result all the pixels which are on the border will have a value which is undefined.
+- @ref BorderMode::REPLICATE : Neighbor pixels outside of the image are treated as having the same value as the closest valid pixel.
+- @ref BorderMode::CONSTANT : Neighbor pixels outside of the image are treated as having the same constant value. (The user can choose what this value should be).
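+
+As an illustration of the three modes, the following sketch resolves a read at column `x` of a single image row, including positions outside the row. The enum `SketchBorderMode` and the function `read_pixel` are hypothetical stand-ins written for this example; they are not the library's @ref BorderMode or any library function.
+
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the library's border mode enum (illustration only).
enum class SketchBorderMode
{
    UNDEFINED,
    REPLICATE,
    CONSTANT
};

// Read column x of a row. Returns true and writes `value` when the pixel is defined,
// returns false when the border mode leaves it undefined.
bool read_pixel(const std::vector<uint8_t> &row, int x, SketchBorderMode mode, uint8_t constant_value, uint8_t &value)
{
    assert(!row.empty());
    const int width = static_cast<int>(row.size());
    if(x >= 0 && x < width)
    {
        value = row[static_cast<std::size_t>(x)]; // inside the image: no border handling needed
        return true;
    }
    switch(mode)
    {
        case SketchBorderMode::REPLICATE:
        {
            // Closest valid pixel: clamp x into [0, width - 1]
            const int clamped = std::min(std::max(x, 0), width - 1);
            value             = row[static_cast<std::size_t>(clamped)];
            return true;
        }
        case SketchBorderMode::CONSTANT:
            value = constant_value; // user-chosen constant
            return true;
        case SketchBorderMode::UNDEFINED:
        default:
            return false; // the caller must not rely on any value here
    }
}
```
+
+For a row `{10, 20, 30}`, reading column -1 gives 10 with REPLICATE, the chosen constant with CONSTANT, and no defined value with UNDEFINED.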
+
+Moreover, both OpenCL and Arm® Neon™ use vector load and store instructions to access the data in buffers, so in order to avoid having special cases to handle for the borders, all the images and tensors used in this library must be padded.
+
+@subsubsection padding Padding
+
+There are different ways padding can be calculated:
+
+- Accurate padding:
+
+@note It's important to call allocate @b after the function is configured: if the image / tensor is already allocated then the function will shrink its execution window instead of increasing the padding. (See below for more details).
+
+- Manual padding / no padding / auto padding: You can allocate your images / tensors up front (before configuring your functions). In that case the function will use whatever padding is available and will shrink its execution window if there isn't enough padding available (which translates into a smaller valid region for the output; see also @ref valid_region).
+If you don't want to manually set the padding but still want to allocate your objects upfront then you can use auto_padding. It guarantees that the allocation will have enough padding to run any of the provided functions.
+
+@code{.cpp}
+Image     src, dst;
+
+// Use auto padding for the input:
+src.info()->init_auto_padding(TensorShape(640u,480u), Format::U8);
+
+// Use manual padding for the destination image
+dst.info()->init(src.info()->tensor_shape(), Format::U8, strides_in_bytes, offset_first_element_in_bytes, total_size_in_bytes);
+
+// Allocate all the images
+src.allocator()->allocate();
+dst.allocator()->allocate();
+// Fill the input image with the content of the PPM image if a filename was provided:
+fill_image(src);
+
+NEGaussian3x3 gauss;
+
+// Apply a Gaussian 3x3 filter to the source image (Note: if the padding provided is not enough then the execution window and valid region of the output will be shrunk)
+gauss.configure(&src, &dst, BorderMode::UNDEFINED);
+
+// Execute the function:
+gauss.run();
+@endcode
+
+@warning Some kernels need up to 3 neighbor values to calculate the value of a given pixel. Therefore, to be safe, we use a 4-pixel padding all around the image. In addition, some kernels read and write up to 32 pixels at the same time. To cover that case as well we add an extra 32 pixels of padding at the end of each row. As a result auto padded buffers waste a lot of memory and are less cache friendly. It is therefore recommended to use accurate padding or manual padding wherever possible.
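+
+To make the cost of auto padding concrete, the sketch below computes the size of a padded single-channel 8-bit buffer using only the figures quoted in the warning above (4 pixels on every side plus 32 extra pixels at the end of each row). The `PaddedFootprint` struct and `auto_padded_footprint` function are hypothetical names used for illustration; they are not part of the library, and the padding the library actually applies may differ.
+
```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only; not part of the library API.
struct PaddedFootprint
{
    std::size_t row_stride; // elements per padded row
    std::size_t rows;       // number of padded rows
    std::size_t total;      // total elements in the padded buffer
};

// Size of a buffer padded as described in the warning:
// 4 pixels all around the image, plus 32 extra pixels at the end of each row.
inline PaddedFootprint auto_padded_footprint(std::size_t width, std::size_t height)
{
    const std::size_t border      = 4;
    const std::size_t extra_right = 32;

    PaddedFootprint f;
    f.row_stride = border + width + border + extra_right;
    f.rows       = border + height + border;
    f.total      = f.row_stride * f.rows;
    return f;
}
```
+
+Under these assumptions a 640x480 image needs 331840 bytes instead of 307200, while a 16x16 image needs 1344 bytes instead of 256, which is why the overhead matters most for small images and tensors.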
+
+@subsubsection valid_region Valid regions
+
+Some kernels (like edge detectors, for example) need to read values of neighboring pixels to calculate the value of a given pixel. It is therefore not possible to calculate the values of the pixels on the edges.
+
+Another case: if a kernel processes 8 pixels per iteration, the image's dimensions are not a multiple of 8 and not enough padding is available, then the kernel will not be able to process the pixels near the right edge. As a result, these pixels will be left undefined.
+
+In order to know which pixels have been calculated, each kernel sets a valid region for each output image or tensor. See also @ref TensorInfo::valid_region(), @ref ValidRegion.
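+
+The two cases above can be sketched with a simplified one-dimensional model. The helpers `inset_by_border` and `processed_width` are hypothetical and only mirror the reasoning in this section; they are not how the library computes a @ref ValidRegion.
+
```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers; a simplified model, not the library's implementation.
struct Extent
{
    std::size_t start; // first valid index along the axis
    std::size_t size;  // number of valid elements along the axis
};

// Valid extent along one axis when `border` neighbours are needed on each side
// and the pixels on the edges cannot be calculated.
inline Extent inset_by_border(std::size_t size, std::size_t border)
{
    if (2 * border >= size)
    {
        return Extent{0, 0};
    }
    return Extent{border, size - 2 * border};
}

// Number of pixels of a row that can be processed when the kernel handles
// `step` pixels per iteration and may run past the end of the row by at most
// `right_padding` pixels.
inline std::size_t processed_width(std::size_t width, std::size_t step, std::size_t right_padding)
{
    if (step == 0)
    {
        return 0;
    }
    const std::size_t remainder = width % step;
    if (remainder == 0)
    {
        return width;
    }
    const std::size_t overrun = step - remainder;
    return overrun <= right_padding ? width : width - remainder;
}
```
+
+In this model a 3x3 filter on a 480-pixel-high image leaves rows 1 to 478 valid, and a kernel processing 8 pixels per iteration on a 100-pixel-wide row with no right padding only covers the first 96 pixels.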
+
+@subsection S4_6_2_tensors Tensors
+
+Tensors are multi-dimensional arrays with a maximum of @ref Coordinates::num_max_dimensions dimensions.
+
+Depending on the number of dimensions, tensors can be interpreted as various objects. A scalar can be represented as a zero-dimensional tensor and a vector of numbers can be represented as a one-dimensional tensor. Further, an image is just a 2D tensor, a 3D tensor can be seen as an array of images and a 4D tensor as a 2D array of images, etc.
+
+@note Most algorithms process images (i.e. a 2D slice of the tensor), therefore only padding along the X and Y axes is required (2D slices can be stored contiguously in memory).
+
+@subsection S4_6_3_description_conventions Images and Tensors description conventions
+
+Image objects are defined by a @ref Format and dimensions expressed as [width, height, batch].
+
+Tensors are defined by a @ref DataType plus a number of channels (always expected to be 1 for now), and their dimensions are expressed as [width, height, feature_maps, batch].
+
+In other words, the lower three dimensions of a tensor specify a single input in [width, height, feature_maps], while any other specified dimension represents a batch in the appropriate dimension space.
+For example, a tensor with dimensions [128, 128, 64, 16] represents a 1D batch space with 16 batches of 128 elements in width and height and 64 feature maps each.
+Each kernel specifies the expected layout of each of its tensors in its documentation.
+
+@note Unless specified otherwise in the kernel's or function's documentation, all tensor and image parameters passed must have identical dimensions.
+
+@note Unless specified otherwise in the kernel's or function's documentation the number of channels for tensors is expected to be 1 (For images, the number of channels is inferred from the @ref Format).
+
+@attention Regardless of the @ref DataType used by a tensor the @ref ITensor::buffer() method will always return a uint8_t pointer, and all the metadata in @ref TensorInfo will be expressed in bytes. It is the user's responsibility to cast the pointer to the correct type.
+
+For example, to read the element located at the coordinates (x,y) of a float tensor:
+
+@code{.cpp}
+float value = *reinterpret_cast<float*>(input.buffer() + input.info()->offset_element_in_bytes(Coordinates(x,y)));
+@endcode
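+
+As a rough illustration of what a byte offset means when all metadata is expressed in bytes, the sketch below models a 2D layout with explicit strides. `Layout2D`, `offset_in_bytes` and `read_float` are hypothetical names introduced here; the formula is the usual strided-addressing rule and is an assumption about how such an offset is typically formed, not a copy of the library's code.
+
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of a padded 2D buffer, with every field in bytes.
struct Layout2D
{
    std::size_t offset_first_element; // bytes before element (0, 0)
    std::size_t stride_x;             // bytes between neighbours along X
    std::size_t stride_y;             // bytes between neighbours along Y (includes row padding)
};

// Byte offset of element (x, y) from the start of the buffer.
inline std::size_t offset_in_bytes(const Layout2D &layout, std::size_t x, std::size_t y)
{
    return layout.offset_first_element + x * layout.stride_x + y * layout.stride_y;
}

// Read a float at (x, y) from a raw byte buffer.
// memcpy avoids any alignment or aliasing assumptions about the buffer.
inline float read_float(const std::uint8_t *buffer, const Layout2D &layout, std::size_t x, std::size_t y)
{
    float value = 0.0f;
    std::memcpy(&value, buffer + offset_in_bytes(layout, x, y), sizeof(value));
    return value;
}
```
+
+With a 4-byte element, a 24-byte row stride and 8 bytes before the first element, element (2, 1) lives at byte 8 + 2 * 4 + 1 * 24 = 40.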
+
+@subsection S4_6_4_working_with_objects Working with Images and Tensors using iterators
+
+The library provides some iterators to access objects' data.
+Iterators are created by associating a data object (An image or a tensor for example) with an iteration window.
+
+Iteration windows are defined by an array of dimensions, each of which consists of a start, end and step.
+
+The @ref execute_window_loop function takes an execution window, a lambda function and one or more iterators.
+It will iterate through every element of the execution window and for each element it will update the iterators accordingly and call the lambda function.
+
+Here are a couple of examples of how to use the iterators to fill / read tensors:
+
+@snippet examples/neon_copy_objects.cpp Copy objects example
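+
+The idea of an iteration window (a start, end and step per dimension) driving a lambda can also be sketched without the library. The `Dimension` struct and `for_each_in_window` function below are hypothetical stand-ins written for this illustration; they do not reproduce the signature or behaviour of @ref execute_window_loop.
+
```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical window dimension: iterate from start (inclusive) to end (exclusive).
struct Dimension
{
    int start;
    int end;
    int step;
};

// Call fn(id) for every coordinate of the window, with dimension 0 varying fastest.
template <std::size_t N, typename F>
void for_each_in_window(const std::array<Dimension, N> &window, F &&fn)
{
    std::array<int, N> id{};
    for (std::size_t d = 0; d < N; ++d)
    {
        // An empty or ill-formed dimension means there is nothing to iterate over.
        if (window[d].step <= 0 || window[d].start >= window[d].end)
        {
            return;
        }
        id[d] = window[d].start;
    }

    while (true)
    {
        fn(id);

        // Advance like an odometer: bump the lowest dimension, carry on overflow.
        std::size_t d = 0;
        for (; d < N; ++d)
        {
            id[d] += window[d].step;
            if (id[d] < window[d].end)
            {
                break;
            }
            id[d] = window[d].start;
        }
        if (d == N)
        {
            return;
        }
    }
}
```
+
+A window of {0, 8, 4} along X and {1, 3, 1} along Y visits (0,1), (4,1), (0,2) and (4,2), which matches a kernel that processes 4 elements per iteration over two rows.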
+
+@subsection S4_6_5_sub_tensors Sub-tensors
+
+Sub-tensors are aliases to existing tensors. As a result, creating a sub-tensor does not result in any underlying memory allocation.
+
+Sub-tensors can be used to access a sub-set of the parent tensor, something that can be useful in case different operations need to be performed on different parts of a tensor.
+
+Moreover, sub-tensors can be used to perform zero copy tensor concatenation.
+
+The API for creating a sub-tensor is the following:
+@code{.cpp}
+SubTensor(ITensor *parent, const TensorShape &tensor_shape, const Coordinates &coords)
+@endcode
+
+Where \a parent is the parent tensor which we want to create an alias for, \a tensor_shape is the shape of the sub-tensor and \a coords are the starting indexing coordinates of the sub-tensor within the parent tensor.
+
+@note Two concrete sub-tensor classes for different targets are currently supported: @ref CLSubTensor and @ref SubTensor
+
+@warning A limitation of sub-tensors is that they cannot be extracted spatially, meaning sub-tensors should have the same width and height as the parent tensor. The main reason for this is that individual kernels might need to operate with a step size that is not a multiple of the sub-tensor spatial dimension. This could lead to elements being overwritten by different kernels operating on different sub-tensors of the same underlying tensor.
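+
+The aliasing idea behind zero-copy concatenation can be sketched with a plain view over a dense buffer. `TensorView`, `sub_tensor` and `fill` are hypothetical names for this illustration and ignore padding and strides entirely; they only show that views sharing the parent's width and height write straight into the parent's memory.
+
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over a dense [width, height, depth] float tensor.
struct TensorView
{
    float      *data;
    std::size_t width;
    std::size_t height;
    std::size_t depth;
};

// Alias `depth` planes of the parent starting at plane z_start.
// Width and height are inherited, mirroring the limitation described above.
inline TensorView sub_tensor(const TensorView &parent, std::size_t z_start, std::size_t depth)
{
    if (z_start > parent.depth || depth > parent.depth - z_start)
    {
        return TensorView{nullptr, 0, 0, 0}; // out of range: empty view
    }
    const std::size_t plane = parent.width * parent.height;
    return TensorView{parent.data + z_start * plane, parent.width, parent.height, depth};
}

// Write the same value to every element of a view.
inline void fill(const TensorView &view, float value)
{
    const std::size_t count = view.width * view.height * view.depth;
    for (std::size_t i = 0; i < count; ++i)
    {
        view.data[i] = value;
    }
}
```
+
+Two producers writing into views that cover planes [0, 1) and [1, 3) of a three-plane parent leave the parent holding the concatenated result, with no copy performed.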
+
+@section S4_7_memory_manager MemoryManager
+
+@ref IMemoryManager is a memory managing interface that can be used to reduce the memory requirements of a given pipeline by recycling temporary buffers.
+
+@subsection S4_7_1_memory_manager_components MemoryGroup, MemoryPool and MemoryManager Components
+
+@subsubsection S4_7_1_1_memory_group MemoryGroup
+
+@ref IMemoryGroup defines the memory managing granularity.
+
+MemoryGroup binds a number of objects to a bucket of memory requirements that need to be fulfilled in order for an operation or list of operations to be executed.
+
+Requesting backing memory for a specific group can be done using @ref IMemoryGroup::acquire and releasing the memory back using @ref IMemoryGroup::release.
+
+@subsubsection S4_7_1_2_memory_pool MemoryPool
+
+@ref IMemoryPool defines a pool of memory that can be used to provide backing memory to a memory group.
+
+@note @ref BlobMemoryPool is currently implemented which models the memory requirements as a vector of distinct memory blobs.
+
+@subsubsection S4_7_1_2_memory_manager_components MemoryManager Components
+
+@ref IMemoryManager consists of two components:
+- @ref ILifetimeManager that keeps track of the lifetime of the registered objects of the memory groups and given an @ref IAllocator creates an appropriate memory pool that fulfils the memory requirements of all the registered memory groups.
+- @ref IPoolManager that safely manages the registered memory pools.
+
+@note @ref BlobLifetimeManager is currently implemented which models the memory requirements as a vector of distinct memory blobs.
+
+@subsection S4_7_2_working_with_memory_manager Working with the Memory Manager
+Using a memory manager to reduce the memory requirements of a pipeline can be summarized in the following steps:
+
+Initially a memory manager must be set up:
+@code{.cpp}
+Allocator  allocator{};                                                               // Create an allocator to use for the backing memory allocation
+auto lifetime_mgr  = std::make_shared<BlobLifetimeManager>();                         // Create Lifetime Manager
+auto pool_mgr      = std::make_shared<PoolManager>();                                 // Create Pool Manager
+auto mm            = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr); // Create Memory Manager
+@endcode
+
+Once done, memory groups can be registered to use the memory manager:
+@code{.cpp}
+MemoryGroup memory_group(mm); // Create a memory group and set the memory manager to use
+@endcode
+
+@note If a memory manager is not specified, then all allocations will be immediate instead of deferred through the memory manager.
+
+The next step is to set the objects to be managed by the memory group. Note that the lifetime of an object is tracked between the @ref MemoryGroup::manage() and @ref TensorAllocator::allocate calls.
+@ref MemoryGroup::manage flags that the object will be needed from that point on, and the call to @ref TensorAllocator::allocate signals the end of the object's lifetime.
+@code{.cpp}
+Tensor tmp1, tmp2, tmp3;            // Create example tensors
+memory_group.manage(&tmp1);         // Start managing object tmp1 and start its lifetime
+memory_group.manage(&tmp2);         // Start managing object tmp2 and start its lifetime
+
+operation1.configure(&tmp1, &tmp2); // Configure a function/kernel using tmp1 and tmp2
+
+tmp1.allocator()->allocate();       // Flag that the lifetime of object tmp1 has ended
+
+memory_group.manage(&tmp3);         // Start managing object tmp3 and start its lifetime
+
+operation2.configure(&tmp2, &tmp3); // Configure a function/kernel using tmp2 and tmp3
+
+tmp2.allocator()->allocate();       // Flag that the lifetime of object tmp2 has ended
+tmp3.allocator()->allocate();       // Flag that the lifetime of object tmp3 has ended
+@endcode
+
+@warning The configuration step should be done sequentially by a single thread so that all the lifetimes are captured correctly.
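+
+To see why tracking lifetimes reduces memory, the sketch below replays the manage / allocate sequence from the example and assigns each object to a reusable blob. `BlobPlanner` is a hypothetical class written for this illustration and the sizes are made up; it is a simplified model of blob-based planning, not the implementation of @ref BlobLifetimeManager.
+
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical planner: objects whose lifetimes do not overlap share a blob.
class BlobPlanner
{
public:
    // Called where the example calls memory_group.manage().
    void start_lifetime(const std::string &name, std::size_t size)
    {
        std::size_t blob = 0;
        if (!free_blobs_.empty())
        {
            blob = free_blobs_.back(); // reuse a blob that is no longer live
            free_blobs_.pop_back();
        }
        else
        {
            blob = blob_sizes_.size(); // no free blob: create a new one
            blob_sizes_.push_back(0);
        }
        // A blob must be large enough for the biggest object it ever backs.
        blob_sizes_[blob] = std::max(blob_sizes_[blob], size);
        active_[name]     = blob;
    }

    // Called where the example calls allocator()->allocate().
    void end_lifetime(const std::string &name)
    {
        const auto it = active_.find(name);
        if (it == active_.end())
        {
            return;
        }
        free_blobs_.push_back(it->second);
        active_.erase(it);
    }

    const std::vector<std::size_t> &blob_sizes() const
    {
        return blob_sizes_;
    }

    std::size_t total_bytes() const
    {
        std::size_t total = 0;
        for (const std::size_t size : blob_sizes_)
        {
            total += size;
        }
        return total;
    }

private:
    std::vector<std::size_t>           blob_sizes_;
    std::vector<std::size_t>           free_blobs_;
    std::map<std::string, std::size_t> active_;
};
```
+
+Replaying the sequence above with assumed sizes of 100, 300 and 200 bytes for tmp1, tmp2 and tmp3 yields two blobs of 200 and 300 bytes, because tmp3 can reuse the blob released by tmp1. That is 500 bytes instead of the 600 needed if each tensor had its own allocation.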
+
+When configuration of all the operations is finished, the memory manager has to be populated:
+@code{.cpp}
+mm->populate(allocator, 2 /* num_pools */); // Populate memory manager pools
+@endcode
+
+Finally, during execution of the pipeline the memory of the appropriate memory group should be requested before running:
+@code{.cpp}
+memory_group.acquire(); // Request memory for the group
+
+operation1.run();       // Run operation1
+operation2.run();       // Run operation2
+
+memory_group.release(); // Release memory so that it can be reused
+@endcode
+@note Execution of a pipeline can be done in a multi-threading environment as memory acquisition/release are thread safe.
+@note If you are handling sensitive data and it's required to zero out the memory buffers before freeing, make sure to also zero out the intermediate buffers. You can access the buffers through the memory group's mappings.
+
+@subsection S4_7_3_memory_manager_function_support Function support
+
+Most of the library's functions have been ported to use @ref IMemoryManager for their internal temporary buffers.
+
+If that is the case, a memory manager can be passed to them during construction to reuse memory among these functions.
+@code{.cpp}
+// Setup Memory Manager
+CLBufferAllocator  allocator{};                                                       // Create an allocator to use for the backing memory allocation
+auto lifetime_mgr  = std::make_shared<BlobLifetimeManager>();                         // Create Lifetime Manager
+auto pool_mgr      = std::make_shared<PoolManager>();                                 // Create Pool Manager
+auto mm            = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr); // Create Memory Manager
+
+// Create two convolution layers and use the memory manager to manage their internal temporary buffers
+CLConvolutionLayer conv1(mm), conv2(mm);
+
+// Configure layers
+conv1.configure(...);
+conv2.configure(...);
+
+// Populate memory manager
+mm->populate(allocator, 1 /* num_pools */); // Populate memory manager pools
+
+// Run layers (memory will be recycled for the internal buffers of conv1 and conv2)
+conv1.run();
+conv2.run();
+@endcode
+
+@section S4_8_import_memory Import Memory Interface
+
+The implemented @ref TensorAllocator and @ref CLTensorAllocator objects provide an interface capable of importing existing memory to a tensor as backing memory.
+
+A simple Arm® Neon™ example can be the following:
+@code{.cpp}
+// External backing memory
+void* external_ptr = ...;
+
+// Create and initialize tensor
+Tensor tensor;
+tensor.allocator()->init(tensor_info);
+
+// Import existing pointer as backing memory
+tensor.allocator()->import_memory(external_ptr);
+@endcode
+
+It is important to note the following:
+- Ownership of the backing memory is not transferred to the tensor itself.
+- The tensor mustn't be memory managed.
+- Padding requirements should be accounted for by the client code. In other words, if padding is required by the tensor after the function configuration step, then the imported backing memory should account for it. Padding can be checked through the @ref TensorInfo::padding() interface.
+
+@section S4_9_opencl_tuner OpenCL Tuner
+
+OpenCL kernels when dispatched to the GPU take two arguments:
+- The Global Workgroup Size (GWS): That's the number of times to run an OpenCL kernel to process all the elements we want to process.
+- The Local Workgroup Size (LWS): That's the number of elements we want to run in parallel on a GPU core at a given point in time.
+
+The LWS can be required by an algorithm (for example if it contains memory barriers or uses local memory), but it can also be used to tune the performance of a kernel: the execution time of the overall kernel might vary significantly depending on how the GWS is broken down.
+
+However, there is no universal rule regarding which LWS is best for a given kernel, so instead we created the @ref CLTuner.
+
+When the @ref CLTuner is enabled (Target = 2 for the graph examples), the first time an OpenCL kernel is executed the Compute Library will try to run it with a variety of LWS values and will remember which one performed best for subsequent runs. At the end of the run the @ref graph::Graph will try to save these tuning parameters to a file.
+
+However, this process takes quite a lot of time, which is why it cannot be enabled all the time. @ref CLTuner supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
+
+When the @ref CLTuner is disabled (Target = 1 for the graph examples), the @ref graph::Graph will try to reload the file containing the tuning parameters. Then, for each executed kernel, the Compute Library will use the fine-tuned LWS if it was present in the file, or a default LWS value if it was not.
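+
+The search itself can be pictured as a loop over candidate LWS values that keeps the fastest one. The sketch below is a hypothetical model only: `Lws`, `pick_best_lws` and the `measure` callback are names introduced here, and the way @ref CLTuner chooses candidates or times kernels is not shown.
+
```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical local workgroup size in three dimensions.
struct Lws
{
    unsigned x;
    unsigned y;
    unsigned z;
};

// Return the candidate with the lowest measured time, starting from the default.
// Because the default is measured first and only strictly faster candidates
// replace it, the result is never slower than the default under `measure`.
inline Lws pick_best_lws(const std::vector<Lws>                  &candidates,
                         const Lws                               &default_lws,
                         const std::function<double(const Lws &)> &measure)
{
    Lws    best      = default_lws;
    double best_time = measure(default_lws);
    for (const Lws &candidate : candidates)
    {
        const double time = measure(candidate);
        if (time < best_time)
        {
            best_time = time;
            best      = candidate;
        }
    }
    return best;
}
```
+
+In this model the tuning modes would differ only in how many candidates are passed in: a longer list costs more measurements but is more likely to contain the optimal value.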
+
+@section S4_10_cl_queue_prioritites OpenCL Queue Priorities
+
+OpenCL 2.1 exposes the `cl_khr_priority_hints` extension which, if supported by the underlying implementation, allows the user to specify priority hints for the created command queues.
+It is important to note that this does not specify guarantees or the explicit scheduling behavior; this is something that each implementation needs to expose.
+
+In some cases, priority queues can be used when there is an implicit internal priority between graphics and compute queues and thus allow some level of priority control between them.
+At the moment three priority levels can be specified:
+- CL_QUEUE_PRIORITY_HIGH_KHR
+- CL_QUEUE_PRIORITY_MED_KHR
+- CL_QUEUE_PRIORITY_LOW_KHR
+
+Compute Library allows extraction of the internal OpenCL queue or the ability to inject directly a user-defined queue to the @ref CLScheduler.
+This way the user can utilize this extension to define priorities between the queues and setup the OpenCL scheduler mechanism to utilize them.
+
+@code{.cpp}
+cl_queue_properties queue_properties[] = {CL_QUEUE_PRIORITY_KHR, CL_QUEUE_PRIORITY_HIGH_KHR, 0};
+cl_command_queue priority_queue = clCreateCommandQueueWithProperties(ctx, dev, queue_properties, &error);
+CLScheduler::get().set_queue(::cl::CommandQueue(priority_queue));
+@endcode
+
+@section S4_11_weights_manager Weights Manager
+
+@ref IWeightsManager is a weights managing interface that can be used to reduce the memory requirements of a given pipeline by reusing transformed weights across multiple function executions.
+@ref IWeightsManager is responsible for managing weight tensors along with their transformations.
+@ref ITransformWeights provides an interface for running the desired transform function. This interface is used by the weights manager.
+
+@subsection S4_10_1_working_with_weights_manager Working with the Weights Manager
+Following is a simple example that uses the weights manager:
+
+Initially a weights manager must be set up:
+@code{.cpp}
+auto  wm = std::make_shared<IWeightsManager>(); // Create a weights manager
+@endcode
+
+Once done, weights can be managed, configured and run:
+@code{.cpp}
+wm->manage(weights); // Manage the weights
+wm->acquire(weights, &_reshape_weights_managed_function); // Acquire the address of the transformed weights based on the transform function
+wm->run(weights, &_reshape_weights_managed_function);     // Run the transpose function
+@endcode
+
+@section S5_0_experimental Experimental Features
+
+@subsection S5_1_run_time_context Run-time Context
+
+Some of the Compute Library components are modelled as singletons, which limits support for some use-cases and makes it harder to offer a more client-controlled API.
+Thus, we are introducing an aggregate service interface @ref IRuntimeContext which will encapsulate the services that the singletons were providing and allow better control of these by the client code.
+The run-time context encapsulates several mechanisms, including scheduling, memory management and kernel caching.
+Consequently, this will allow finer control of these services among pipelines when Compute Library is integrated in higher level frameworks.
+
+This feature introduces some changes to our API.
+All the kernels/functions will now accept a Runtime Context object which will allow the function to use the mentioned services.
+
+Finally, we will try to adapt our code-base progressively to use the new mechanism but will continue supporting the legacy mechanism to allow a smooth transition. Changes will apply to all three of our backends: Neon, OpenCL and OpenGL ES.
+
+@subsection S5_2_clvk CLVK
+
+Compute Library offers experimental support for [CLVK](https://github.com/kpet/clvk). If CLVK is installed in the system, users can select the backend when running a graph example with --target=clvk.
+If no target is specified and more than one OpenCL implementation is present, Compute Library will pick the first available.
+*/
+} // namespace arm_compute
diff --git a/docs/user_guide/programming_model.dox b/docs/user_guide/programming_model.dox
new file mode 100644
index 0000000000..7990231ba9
--- /dev/null
+++ b/docs/user_guide/programming_model.dox
@@ -0,0 +1,70 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page programming_model Programming Model
+
+@tableofcontents
+
+@section programming_model_functions Functions
+
+Functions will automatically allocate the temporary buffers mentioned above, and will automatically multi-thread kernels' executions using the very basic scheduler described in the previous section.
+
+Simple functions only call a single kernel (e.g. NEConvolution3x3), while more complex ones consist of several kernels pipelined together (e.g. @ref NEFullyConnectedLayer). Check their documentation to find out which kernels are used by each function.
+
+@code{.cpp}
+// Create a function object:
+MyFunction function;
+// Initialize the function with the input/output and options you want to use:
+function.configure(input, output, option0, option1);
+// Execute the function:
+function.run();
+@endcode
+
+@warning The Compute Library requires Arm® Mali™ OpenCL DDK r8p0 or higher (OpenCL kernels are compiled using the -cl-arm-non-uniform-work-group-size flag).
+
+@note All OpenCL functions and objects in the runtime library use the command queue associated with CLScheduler for all operations. A real implementation would be expected to use different queues for mapping operations and kernels in order to reach a better GPU utilization.
+
+@section programming_model_scheduler OpenCL Scheduler
+
+The Compute Library runtime uses a single command queue and context for all the operations.
+
+The user can get / set this context and command queue through CLScheduler's interface.
+
+The user can get / set the target GPU device through the CLScheduler's interface.
+
+@attention Make sure the application is using the same context as the library, because in OpenCL it is forbidden to share objects across contexts. This is done by calling @ref CLScheduler::init() or @ref CLScheduler::default_init() at the beginning of your application.
+
+@attention Make sure the scheduler's target is not changed after function classes are created.
+
+@section programming_model__events_sync OpenCL events and synchronization
+
+In order to block until all the jobs in the CLScheduler's command queue are done executing, the user can call @ref CLScheduler::sync() or create a sync event using @ref CLScheduler::enqueue_sync_event().
+
+@section programming_model_cl_neon OpenCL / Arm® Neon™ interoperability
+
+You can mix OpenCL and Arm® Neon™ kernels and functions. However, it is the user's responsibility to handle the mapping/unmapping of OpenCL objects.
+*/
+} // namespace arm_compute
diff --git a/docs/user_guide/release_version_and_change_log.dox b/docs/user_guide/release_version_and_change_log.dox
new file mode 100644
index 0000000000..b9e3b37263
--- /dev/null
+++ b/docs/user_guide/release_version_and_change_log.dox
@@ -0,0 +1,1389 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page versions_changelogs Release Versions and Changelog
+
+@tableofcontents
+
+@section S2_1_versions Release versions
+
+All releases are numbered vYY.MM, where YY is the last two digits of the year and MM is the month number.
+If there is more than one release in a month then an extra sequential number is appended at the end:
+
+	v17.03 (First release of March 2017)
+	v17.03.1 (Second release of March 2017)
+	v17.04 (First release of April 2017)
+
+@note We're aiming at releasing one major public release with new features per quarter. All releases in between will only contain bug fixes.
+
+@section S2_2_changelog Changelog
+
+v21.05 Public major release
+ - Removed computer vision support from Arm® Neon™ backend
+ - Removed the following functions:
+   - NEAbsoluteDifference
+   - NEAccumulate
+   - NEBox3x3
+   - NECannyEdge
+   - NEChannelCombine
+   - NEChannelExtract
+   - NEColorConvert
+   - NEConvolution
+   - NEDerivative
+   - NEDilate
+   - NEEqualizeHistogram
+   - NEErode
+   - NEFastCorners
+   - NEGaussian3x3
+   - NEGaussian5x5
+   - NEGaussianPyramid
+   - NEHOGDescriptor
+   - NEHOGDetector
+   - NEHOGGradient
+   - NEHOGMultiDetection
+   - NEHarrisCorners
+   - NEHistogram
+   - NEIntegralImage
+   - NELaplacianPyramid
+   - NELaplacianReconstruct
+   - NEMagnitude
+   - NEMeanStdDev
+   - NEMedian3x3
+   - NEMinMaxLocation
+   - NENonLinearFilter
+   - NEOpticalFlow
+   - NEPhase
+   - NEScharr3x3
+   - NESobel3x3
+   - NESobel5x5
+   - NESobel7x7
+   - NETableLookup
+   - NEThreshold
+   - NEWarpAffine
+   - NEWarpPerspectiveKernel
+
+ - Removed all GLES kernels / functions / tests / examples
+ - Removed computer vision support from CL backend
+ - Removed the following functions:
+   - CLAbsoluteDifference
+   - CLAccumulate
+   - CLBox3x3
+   - CLCannyEdge
+   - CLChannelCombine
+   - CLChannelExtract
+   - CLColorConvert
+   - CLConvolution
+   - CLDerivative
+   - CLDilate
+   - CLEqualizeHistogram
+   - CLErode
+   - CLFastCorners
+   - CLGaussian3x3
+   - CLGaussian5x5
+   - CLGaussianPyramid
+   - CLHOGDescriptor
+   - CLHOGDetector
+   - CLHOGGradient
+   - CLHOGMultiDetection
+   - CLHarrisCorners
+   - CLHistogram
+   - CLIntegralImage
+   - CLLaplacianPyramid
+   - CLLaplacianReconstruct
+   - CLMagnitude
+   - CLMeanStdDev
+   - CLMedian3x3
+   - CLMinMaxLocation
+   - CLNonLinearFilter
+   - CLOpticalFlow
+   - CLPhase
+   - CLScharr3x3
+   - CLSobel3x3
+   - CLSobel5x5
+   - CLSobel7x7
+   - CLTableLookup
+   - CLThreshold
+   - CLWarpAffine
+   - CLWarpPerspective
+ 
+v21.02 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Upgrade C++ standard to C++14
+ - Add macOS support
+ - Add Armv8-R AArch64 architecture support
+ - Add SVE/SVE2 support for:
+   - NEScaleKernel
+   - @ref NEActivationLayer
+   - @ref NEArithmeticAddition
+   - @ref NEBatchNormalizationLayerKernel
+   - @ref cpu::kernels::CpuLogits1DSoftmaxKernel
+   - @ref cpu::kernels::CpuLogits1DMaxKernel
+   - @ref cpu::kernels::CpuElementwiseUnaryKernel
+ - Remove padding from OpenCL kernels:
+   - CLDirectConvolutionLayerKernel
+   - @ref CLArgMinMaxLayerKernel
+   - @ref CLPadLayerKernel
+   - @ref CLROIAlignLayerKernel
+   - @ref CLRangeKernel
+   - CLScaleKernel
+   - @ref CLSelectKernel
+   - @ref CLBitwiseKernel
+   - @ref opencl::kernels::ClFloorKernel
+   - CLTransposeKernel
+ - Deprecate functions in CLTuner:
+    - add_lws_to_table
+    - import_lws_table
+    - lws_table
+ - Remove functions:
+   - NELocallyConnectedLayer / CLLocallyConnectedLayer
+   - NEIm2Col
+   - NECol2Im
+   - NEGEMMInterleave4x4
+   - NEGEMMTranspose1xW
+   - NEComputeAllAnchors / CLComputeAllAnchors
+   - NEGEMMAssemblyDispatch
+   - NEUpsampleLayer / CLUpsampleLayer
+ - Remove kernels:
+   - NEGEMMMatrixVectorMultiplyKernel
+   - NELocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedMatrixMultiplyKernel
+   - NEUpsampleLayerKernel / CLUpsampleLayerKernel
+ - Extend OpenCL tuner with workgroup batch size support
+   - Experimental extension for the OpenCL tuner to tune the batches of work groups distributed to compute units
+ - Add functionality to load the OpenCL GEMM heuristics at runtime
+   - The GEMM heuristic file (MLGO) can be used to update the default GEMM heuristics available for OpenCL
+ - Note: there might be performance regressions against v20.08 in Inception v3 using int8 data types on Arm Mali-G77 GPUs. Currently under investigation
+ - Note: data-type decoupling is in progress and experimental. Warnings about unused symbols might be raised
+
+v20.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Performance regressions can be noted when executing Depthwise Convolution on Arm® Neon™ with a depth multiplier > 1 for quantized data type.
+   This is planned to be resolved in 21.02 release.
+ - Added new data type QASYMM8_SIGNED support for @ref NEROIAlignLayer.
+ - Added new data type S32 support for:
+   - NEArithmeticSubtraction
+   - NEArithmeticSubtractionKernel
+   - @ref NEPixelWiseMultiplication
+   - NEPixelWiseMultiplicationKernel
+   - NEElementwiseDivision
+   - NEDivisionOperationKernel
+ - Interface change
+   - Properly support softmax axis to have the same meaning as other major frameworks. That is, axis now defines the dimension
+     on which Softmax/Logsoftmax is performed. E.g. for input of shape 4x5x6 and axis=1, softmax will be applied to 4x6=24 vectors of size 5.
+     The supported value range of axis is [-rank, rank).
+     This change applies to the following functions:
+      - @ref NESoftmaxLayer
+      - @ref NELogSoftmaxLayer
+      - @ref CLSoftmaxLayer
+      - @ref CLLogSoftmaxLayer
+      - GCSoftmaxLayer
+ - New OpenCL kernels / functions:
+   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
+   - @ref CLLogicalNot
+   - @ref CLLogicalAnd
+   - @ref CLLogicalOr
+ - New Arm® Neon™ kernels / functions:
+   - @ref NELogicalNot
+   - @ref NELogicalAnd
+   - @ref NELogicalOr
+ - Removed padding from Arm® Neon™ kernels:
+   - NEComplexPixelWiseMultiplicationKernel
+   - NENonMaximaSuppression3x3Kernel
+   - @ref NERemapKernel
+   - @ref NEGEMMInterleave4x4Kernel
+   - NEDirectConvolutionLayerKernel
+   - NEScaleKernel
+   - NELocallyConnectedMatrixMultiplyKernel
+   - @ref NEGEMMLowpOffsetContributionKernel
+   - @ref NEGEMMTranspose1xWKernel
+   - NEPoolingLayerKernel
+   - NEConvolutionKernel
+   - NEDepthwiseConvolutionLayerNativeKernel
+   - @ref NEGEMMLowpMatrixMultiplyKernel
+   - @ref NEGEMMMatrixMultiplyKernel
+   - NEDirectConvolutionLayerOutputStageKernel
+   - @ref NEReductionOperationKernel
+   - @ref NEGEMMLowpMatrixAReductionKernel
+   - @ref NEGEMMLowpMatrixBReductionKernel
+ - Removed padding from OpenCL kernels:
+   - CLBatchConcatenateLayerKernel
+   - CLElementwiseOperationKernel
+   - @ref CLBatchNormalizationLayerKernel
+   - CLPoolingLayerKernel
+   - @ref CLWinogradInputTransformKernel
+   - @ref CLGEMMLowpMatrixMultiplyNativeKernel
+   - @ref CLGEMMLowpMatrixAReductionKernel
+   - @ref CLGEMMLowpMatrixBReductionKernel
+   - @ref CLGEMMLowpOffsetContributionOutputStageKernel
+   - @ref CLGEMMLowpOffsetContributionKernel
+   - @ref CLWinogradOutputTransformKernel
+   - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
+   - @ref CLFuseBatchNormalizationKernel
+   - @ref CLDepthwiseConvolutionLayerNativeKernel
+   - @ref CLDepthConvertLayerKernel
+   - CLCopyKernel
+   - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
+   - CLActivationLayerKernel
+   - @ref CLWinogradFilterTransformKernel
+   - CLWidthConcatenateLayerKernel
+   - CLWidthConcatenate4TensorsKernel
+   - CLWidthConcatenate2TensorsKernel
+   - CLLogits1DMaxShiftExpSumKernel
+   - CLLogits1DNormKernel
+   - CLHeightConcatenateLayerKernel
+   - @ref CLGEMMMatrixMultiplyKernel
+   - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
+   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
+   - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+   - CLDepthConcatenateLayerKernel
+   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
+ - Removed OpenCL kernels / functions:
+   - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+   - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel
+   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel
+ - Deprecated OpenCL kernels / functions (if a kernel is used only by a function that is being deprecated, the kernel is deprecated along with it):
+     - CLLocallyConnectedLayer
+     - CLLocallyConnectedMatrixMultiplyKernel
+     - CLAbsoluteDifference
+     - CLAbsoluteDifferenceKernel
+     - CLAccumulate
+     - CLAccumulateKernel
+     - CLAccumulateSquared
+     - CLAccumulateSquaredKernel
+     - CLAccumulateWeighted
+     - CLAccumulateWeightedKernel
+     - CLAccumulateWeightedFP16Kernel
+     - CLBox3x3
+     - CLBox3x3Kernel
+     - CLBox3x3FP16Kernel
+     - CLCannyEdge
+     - CLChannelCombine
+     - CLChannelCombineKernel
+     - CLChannelExtract
+     - CLChannelExtractKernel
+     - CLColorConvert
+     - CLColorConvertKernel
+     - CLConvolution3x3
+     - CLConvolutionRectangle
+     - CLConvolutionRectangleKernel
+     - CLConvolutionSquare
+     - CLConvolutionKernel
+     - CLDerivative
+     - CLDerivativeKernel
+     - CLDilate
+     - CLDilateKernel
+     - CLEqualizeHistogram
+     - CLErode
+     - CLErodeKernel
+     - CLFastCorners
+     - CLFastCornersKernel
+     - CLGaussian3x3
+     - CLGaussian3x3Kernel
+     - CLGaussian5x5
+     - CLGaussian5x5HorKernel
+     - CLGaussian5x5VertKernel
+     - CLGaussianPyramid
+     - CLGaussianPyramidHalf
+     - CLGaussianPyramidOrb
+     - CLHarrisCorners
+     - CLHarrisScoreKernel
+     - CLHarrisScoreFP16Kernel
+     - CLHistogram
+     - CLHistogramKernel
+     - CLHOGOrientationBinningKernel
+     - CLHOGBlockNormalizationKernel
+     - CLHOGDetectorKernel
+     - CLHOGNonMaximaSuppressionKernel
+     - CLHOGDescriptor
+     - CLHOGDetector
+     - CLHOGGradient
+     - CLHOGMultiDetection
+     - CLIntegralImage
+     - CLIntegralImageKernel
+     - CLLaplacianReconstruct
+     - CLLaplacianPyramid
+     - CLMagnitude
+     - CLMagnitudePhaseKernel
+     - CLMedian3x3
+     - CLMedian3x3Kernel
+     - CLMinMaxLocation
+     - CLMinMaxLocationKernel
+     - CLNonLinearFilter
+     - CLNonLinearFilterKernel
+     - CLNonMaximaSuppression3x3
+     - CLNonMaximaSuppression3x3FP16Kernel
+     - CLNonMaximaSuppression3x3Kernel
+     - CLOpticalFlow
+     - CLPhase
+     - CLRemap
+     - CLRemapKernel
+     - CLScharr3x3
+     - CLScharr3x3Kernel
+     - CLSobel3x3
+     - CLSobel3x3Kernel
+     - CLSobel5x5
+     - CLSobel5x5HorKernel
+     - CLSobel5x5VertKernel
+     - CLSobel7x7
+     - CLSobel7x7HorKernel
+     - CLSobel7x7VertKernel
+     - CLThreshold
+     - CLThresholdKernel
+     - CLWarpAffine
+     - CLWarpAffineKernel
+     - CLWarpPerspective
+     - CLWarpPerspectiveKernel
+ - Deprecated Arm® Neon™ kernels / functions (if a kernel is used only by a function that is being deprecated, the kernel is deprecated along with it):
+     - NELocallyConnectedLayer
+     - NELocallyConnectedMatrixMultiplyKernel
+     - NEAbsoluteDifference
+     - NEAbsoluteDifferenceKernel
+     - NEAccumulate
+     - NEAccumulateKernel
+     - NEAccumulateSquared
+     - NEAccumulateSquaredKernel
+     - NEAccumulateWeighted
+     - NEAccumulateWeightedKernel
+     - NEAccumulateWeightedFP16Kernel
+     - NEBox3x3
+     - NEBox3x3Kernel
+     - NEBox3x3FP16Kernel
+     - NECannyEdge
+     - NEChannelCombine
+     - NEChannelCombineKernel
+     - NEChannelExtract
+     - NEChannelExtractKernel
+     - NEColorConvert
+     - NEColorConvertKernel
+     - NEConvolution3x3
+     - NEConvolutionRectangle
+     - NEConvolutionRectangleKernel
+     - NEConvolutionSquare
+     - NEConvolutionKernel
+     - NEDerivative
+     - NEDerivativeKernel
+     - NEDilate
+     - NEDilateKernel
+     - NEEqualizeHistogram
+     - NEErode
+     - NEErodeKernel
+     - NEFastCorners
+     - NEFastCornersKernel
+     - NEGaussian3x3
+     - NEGaussian3x3Kernel
+     - NEGaussian5x5
+     - NEGaussian5x5HorKernel
+     - NEGaussian5x5VertKernel
+     - NEGaussianPyramid
+     - NEGaussianPyramidHalf
+     - NEGaussianPyramidOrb
+     - NEHarrisCorners
+     - NEHarrisScoreKernel
+     - NEHarrisScoreFP16Kernel
+     - NEHistogram
+     - NEHistogramKernel
+     - NEHOGOrientationBinningKernel
+     - NEHOGBlockNormalizationKernel
+     - NEHOGDetectorKernel
+     - NEHOGNonMaximaSuppressionKernel
+     - NEHOGDescriptor
+     - NEHOGDetector
+     - NEHOGGradient
+     - NEHOGMultiDetection
+     - NEIntegralImage
+     - NEIntegralImageKernel
+     - NELaplacianReconstruct
+     - NELaplacianPyramid
+     - NEMagnitude
+     - NEMagnitudePhaseKernel
+     - NEMedian3x3
+     - NEMedian3x3Kernel
+     - NEMinMaxLocation
+     - NEMinMaxLocationKernel
+     - NENonLinearFilter
+     - NENonLinearFilterKernel
+     - NENonMaximaSuppression3x3
+     - NENonMaximaSuppression3x3FP16Kernel
+     - NENonMaximaSuppression3x3Kernel
+     - NEOpticalFlow
+     - NEPhase
+     - NERemap
+     - NERemapKernel
+     - NEScharr3x3
+     - NEScharr3x3Kernel
+     - NESobel3x3
+     - NESobel3x3Kernel
+     - NESobel5x5
+     - NESobel5x5HorKernel
+     - NESobel5x5VertKernel
+     - NESobel7x7
+     - NESobel7x7HorKernel
+     - NESobel7x7VertKernel
+     - NEThreshold
+     - NEThresholdKernel
+     - NEWarpAffine
+     - NEWarpAffineKernel
+     - NEWarpPerspective
+     - NEWarpPerspectiveKernel
+ - Deprecated GLES kernels / functions (if a kernel is used only by a function that is being deprecated, the kernel is deprecated along with it):
+     - GCAbsoluteDifference
+     - GCActivationLayer
+     - GCArithmeticAddition
+     - GCBatchNormalizationLayer
+     - GCConcatenateLayer
+     - GCConvolutionLayer
+     - GCDepthwiseConvolutionLayer
+     - GCDirectConvolutionLayer
+     - GCDropoutLayer
+     - GCFillBorder
+     - GCFullyConnectedLayer
+     - GCGEMM
+     - GCGEMMInterleave4x4
+     - GCGEMMTranspose1xW
+     - GCNormalizationLayer
+     - GCNormalizePlanarYUVLayer
+     - GCPixelWiseMultiplication
+     - GCPoolingLayer
+     - GCScale
+     - GCSoftmaxLayer
+     - GCTensorShift
+     - GCTranspose
+
+
+v20.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Added new data type QASYMM8_SIGNED support for:
+   - @ref CLArgMinMaxLayer
+   - @ref CLArgMinMaxLayerKernel
+ - Added new data type U8 support for:
+   - @ref NECropKernel
+   - CLCropKernel
+ - Added align_corner support for nearest neighbor interpolation in:
+   - NEScaleKernel
+   - CLScaleKernel
+ - New OpenCL kernels / functions:
+   - @ref CLMaxUnpoolingLayerKernel
+ - New Arm® Neon™ kernels / functions:
+   - @ref NEMaxUnpoolingLayerKernel
+ - New graph example:
+   - graph_yolov3_output_detector
+ - GEMMTuner improvements:
+   - Added fp16 support
+   - Output json files for easier integration
+   - Enabled tuning for export_to_cl_image_rhs option for RHS tensors
+   - More robust script for running benchmarks
+ - Removed padding from:
+   - NEPixelWiseMultiplicationKernel
+   - NEHeightConcatenateLayerKernel
+   - NEThresholdKernel
+   - NEBatchConcatenateLayerKernel
+   - NETransposeKernel
+   - @ref NEBatchNormalizationLayerKernel
+   - NEArithmeticSubtractionKernel
+   - @ref NEBoundingBoxTransformKernel
+   - NELogits1DMaxKernel
+   - NELogits1DSoftmaxKernel
+   - @ref NEROIPoolingLayerKernel
+   - @ref NEROIAlignLayerKernel
+   - NEYOLOLayerKernel
+   - NEUpsampleLayerKernel
+   - NEFloorKernel
+   - NEWidthConcatenateLayerKernel
+   - NEDepthConcatenateLayerKernel
+   - @ref NENormalizationLayerKernel
+   - @ref NEL2NormalizeLayerKernel
+   - NEFillArrayKernel
+   - @ref NEDepthConvertLayerKernel
+   - @ref NERangeKernel
+   - @ref NEPriorBoxLayer
+ - Removed OpenCL kernels / functions:
+   - CLGEMMLowpQuantizeDownInt32ToUint8Scale
+   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
+ - Removed Arm® Neon™ kernels / functions:
+   - NEGEMMLowpQuantizeDownInt32ToUint8Scale
+   - NEGEMMMatrixAccumulateBiasesKernel
+ - Deprecated functions / interfaces:
+   - Non-descriptor based interfaces for NEThreshold, CLThreshold
+   - Non-descriptor based interfaces for @ref NEScale, @ref CLScale and GCScale
+   - In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and GCSoftmaxLayer:
+      The default "axis" value is changed from 1 to 0.
+      Only axis 0 is supported.
+ - The support for quantized data types has been removed from @ref CLLogSoftmaxLayer due to implementation complexity.
+ - Removed padding requirement for the input (e.g. LHS of GEMM) and output in @ref CLGEMMMatrixMultiplyNativeKernel, @ref CLGEMMMatrixMultiplyReshapedKernel, @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and @ref CLIm2ColKernel (NHWC only)
+   - This change allows @ref CLGEMMConvolutionLayer to be used without extra padding for the input and output.
+   - Only the weights/bias of @ref CLGEMMConvolutionLayer could require padding for the computation.
+   - Only on Arm® Mali™ Midgard GPUs, @ref CLGEMMConvolutionLayer could require padding since @ref CLGEMMMatrixMultiplyKernel is called and currently requires padding.
+ - Added support for exporting the OpenCL buffer object to the OpenCL image object in @ref CLGEMMMatrixMultiplyReshapedKernel and @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
+   - This support allows the OpenCL buffer used for the reshaped RHS matrix to be exported to the OpenCL image object.
+   - The padding requirement for the OpenCL image object is taken into account in @ref CLGEMMReshapeRHSMatrixKernel.
+   - The reshaped RHS matrix stores the weights when GEMM is used to accelerate @ref CLGEMMConvolutionLayer.
+
+v20.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r18b.
+ - Updated recommended gcc version to Linaro 6.3.1.
+ - Added Bfloat16 type support
+ - Added Bfloat16 support in:
+     - @ref NEWeightsReshapeKernel
+     - @ref NEConvolutionLayerReshapeWeights
+     - @ref NEIm2ColKernel
+     - NEIm2Col
+     - @ref NEDepthConvertLayerKernel
+     - @ref NEDepthConvertLayer
+     - @ref NEGEMMConvolutionLayer
+     - NEGEMMAssemblyDispatch
+ - Added new data type QASYMM8_SIGNED support for:
+     - @ref CLDirectConvolutionLayer
+     - @ref CLDeconvolutionLayer
+     - @ref CLDirectDeconvolutionLayer
+     - @ref CLGEMMDeconvolutionLayer
+     - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
+     - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
+     - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
+     - @ref CLReductionOperation
+     - @ref CLReduceMean
+     - @ref NEScale
+     - NEScaleKernel
+     - NEUpsampleLayer
+     - @ref NECast
+     - @ref NEReductionOperation
+     - @ref NEReduceMean
+     - @ref NEArgMinMaxLayer
+     - @ref NEDeconvolutionLayer
+     - @ref NEGEMMLowpQuantizeDownInt32ScaleKernel
+     - @ref CPPBoxWithNonMaximaSuppressionLimit
+     - @ref CPPDetectionPostProcessLayer
+     - @ref CPPPermuteKernel
+     - @ref CPPPermute
+     - @ref CPPTopKVKernel
+     - @ref CPPTopKV
+     - @ref CPPUpsample
+     - @ref CPPUpsampleKernel
+ - New OpenCL kernels / functions:
+     - @ref CLQLSTMLayer
+     - @ref CLQLSTMLayerNormalizationKernel
+ - New Arm® Neon™ kernels / functions:
+     - @ref NEQLSTMLayer
+     - @ref NEQLSTMLayerNormalizationKernel
+ - Added HARD_SWISH support in:
+     - CLActivationLayerKernel
+     - NEActivationLayerKernel
+ - Deprecated OpenCL kernels / functions:
+     - CLGEMMLowpQuantizeDownInt32ToUint8Scale
+     - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
+ - Deprecated Arm® Neon™ kernels / functions:
+     - NEGEMMLowpQuantizeDownInt32ToUint8Scale
+ - Removed CPP kernels / functions:
+     - CPPFlipWeightsKernel
+ - Removed PoolingLayerInfo constructors without Data Layout.
+ - Removed CLDepthwiseConvolutionLayer3x3
+ - Removed NEDepthwiseConvolutionLayerOptimized
+ - Added support for Winograd 3x3 and 4x4 on Arm® Neon™ FP16:
+     - @ref NEWinogradConvolutionLayer
+     - @ref NEWinogradLayerTransformInputKernel
+     - @ref NEWinogradLayerTransformOutputKernel
+     - @ref NEWinogradLayerTransformWeightsKernel
+ - Added CLCompileContext
+ - Added Arm® Neon™ GEMM kernel with 2D window support
+
+v20.02.1 Maintenance release
+ - Added Android-NN build script.
+
+v20.02 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Added new data type QASYMM8_SIGNED support for:
+     - @ref CLDepthwiseConvolutionLayer
+     - CLDepthwiseConvolutionLayer3x3
+     - @ref CLGEMMConvolutionLayer
+     - @ref CLGEMMLowpMatrixMultiplyCore
+     - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+     - @ref CLGEMMLowpMatrixMultiplyNativeKernel
+     - @ref NEActivationLayer
+     - NEComparisonOperationKernel
+     - @ref NEConvolutionLayer
+     - @ref NEDepthwiseConvolutionLayer
+     - NEDepthwiseConvolutionLayer3x3Kernel
+     - NEDirectConvolutionLayerOutputStageKernel
+     - @ref NEElementwiseComparison
+     - @ref NEElementwiseMax
+     - @ref NEElementwiseMin
+     - @ref NEElementwiseSquaredDiff
+     - @ref NEFullyConnectedLayer
+     - NEGEMMMatrixVectorMultiplyKernel
+     - @ref NEPixelWiseMultiplication
+     - @ref NEPoolingLayer
+     - @ref NEPReluLayer
+ - Added support for QSYMM8_PER_CHANNEL in:
+     - NEDepthwiseConvolutionLayer3x3Kernel
+ - Added support for split sizes in:
+     - @ref CLSplit
+     - @ref NESplit
+ - New OpenCL kernels / functions:
+     - @ref CLFill
+     - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
+ - New Arm® Neon™ kernels / functions:
+     - @ref NEFill
+     - @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
+ - Deprecated functions / interfaces:
+     - CLDepthwiseConvolutionLayer3x3
+     - NEDepthwiseConvolutionLayerOptimized
+     - PoolingLayerInfo constructors without Data Layout.
+ - Added support for quantization with multiplier greater than 1 on Arm® Neon™ and CL.
+ - Added support for quantized inputs of type QASYMM8_SIGNED and QASYMM8 to @ref CLQuantizationLayer.
+ - Added the ability to build bootcode for bare metal.
+ - Added support for generating synthetic QASYMM8 graphs.
+ - Added support for F16 datatype in VGG16.
+ - Removed pre-built binaries for GLES.
+
+v19.11.1 Public maintenance release
+ - Fix offset calculation in NEReductionOperationKernel.
+ - Fix data layout in NEScaleKernel for nhwc.
+ - Retain configuration step data layout to avoid side-effects.
+ - Perform sqrt in double domain for L2 pooling.
+ - Fix output shape calculation for Reduce Mean.
+ - Restrict cases where optimized NEPadLayer runs.
+
+v19.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r17c.
+ - Deprecated OpenCL kernels / functions:
+    - CLDepthwiseConvolutionLayerReshapeWeightsGenericKernel
+    - CLDepthwiseIm2ColKernel
+    - CLDepthwiseSeparableConvolutionLayer
+    - CLDepthwiseVectorToTensorKernel
+    - CLDirectConvolutionLayerOutputStageKernel
+ - Deprecated Arm® Neon™ kernels / functions:
+    - NEDepthwiseWeightsReshapeKernel
+    - NEDepthwiseIm2ColKernel
+    - NEDepthwiseSeparableConvolutionLayer
+    - NEDepthwiseVectorToTensorKernel
+    - NEDepthwiseConvolutionLayer3x3
+ - New OpenCL kernels / functions:
+    - @ref CLInstanceNormalizationLayerKernel / @ref CLInstanceNormalizationLayer
+    - @ref CLDepthwiseConvolutionLayerNativeKernel to replace the old generic depthwise convolution (see Deprecated
+      OpenCL kernels / functions)
+    - @ref CLLogSoftmaxLayer
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEBoundingBoxTransformKernel / @ref NEBoundingBoxTransform
+    - @ref NEComputeAllAnchorsKernel / NEComputeAllAnchors
+    - @ref NEDetectionPostProcessLayer
+    - @ref NEGenerateProposalsLayer
+    - @ref NEInstanceNormalizationLayerKernel / @ref NEInstanceNormalizationLayer
+    - @ref NELogSoftmaxLayer
+    - @ref NEROIAlignLayerKernel / @ref NEROIAlignLayer
+ - Added QASYMM8 support for:
+    - @ref CLGenerateProposalsLayer
+    - @ref CLROIAlignLayer
+    - @ref CPPBoxWithNonMaximaSuppressionLimit
+ - Added QASYMM16 support for:
+    - @ref CLBoundingBoxTransform
+ - Added FP16 support for:
+    - @ref CLGEMMMatrixMultiplyReshapedKernel
+ - Added new data type QASYMM8_PER_CHANNEL support for:
+    - CLDequantizationLayer
+    - @ref NEDequantizationLayer
+ - Added new data type QSYMM8_PER_CHANNEL support for:
+    - @ref CLConvolutionLayer
+    - @ref NEConvolutionLayer
+    - @ref CLDepthwiseConvolutionLayer
+    - @ref NEDepthwiseConvolutionLayer
+ - Added FP16 mixed-precision support for:
+    - @ref CLGEMMMatrixMultiplyReshapedKernel
+    - CLPoolingLayerKernel
+ - Added FP32 and FP16 ELU activation for:
+    - @ref CLActivationLayer
+    - @ref NEActivationLayer
+ - Added asymmetric padding support for:
+    - @ref CLDirectDeconvolutionLayer
+    - @ref CLGEMMDeconvolutionLayer
+    - @ref NEDeconvolutionLayer
+ - Added SYMMETRIC and REFLECT modes for @ref CLPadLayerKernel / @ref CLPadLayer.
+ - Replaced the calls to NECopyKernel and NEMemsetKernel with @ref NEPadLayer in @ref NEGenerateProposalsLayer.
+ - Replaced the calls to CLCopyKernel and CLMemsetKernel with @ref CLPadLayer in @ref CLGenerateProposalsLayer.
+ - Improved performance for CL Inception V3 - FP16.
+ - Improved accuracy for CL Inception V3 - FP16 by enabling FP32 accumulator (mixed-precision).
+ - Improved Arm® Neon™ performance by enabling fusing batch normalization with convolution and depth-wise convolution layer.
+ - Improved Arm® Neon™ performance for MobileNet-SSD by improving the output detection performance.
+ - Optimized @ref CLPadLayer.
+ - Optimized CL generic depthwise convolution layer by introducing @ref CLDepthwiseConvolutionLayerNativeKernel.
+ - Reduced memory consumption by implementing weights sharing.
+
+v19.08.1 Public maintenance release
+ - Fix offset calculation in NEReductionOperationKernel.
+ - Fix data layout in NEScaleKernel for nhwc.
+ - Retain configuration step data layout to avoid side-effects.
+ - Perform sqrt in double domain for L2 pooling.
+ - Fix output shape calculation for Reduce Mean.
+ - Fix broadcast CLPixelwiseMultiplication with 5D tensors.
+
+v19.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Deprecated Arm® Neon™ functions
+    - NEDepthConcatenateLayer
+    - NEWidthConcatenateLayer
+ - Deprecated OpenCL kernels / functions
+    - CLDepthConcatenateLayer
+    - CLGEMMInterleave4x4Kernel / CLGEMMInterleave4x4
+    - CLGEMMTranspose1xWKernel / CLGEMMTranspose1xW
+    - CLWidthConcatenateLayer
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEAbsLayer
+    - @ref NECast
+    - @ref NEElementwisePower
+    - @ref NELogLayer
+    - @ref NELSTMLayerQuantized
+    - @ref NENegLayer
+    - @ref NEPReluLayer
+    - @ref NESinLayer
+    - NEBatchConcatenateLayerKernel
+    - @ref NEDepthToSpaceLayerKernel / @ref NEDepthToSpaceLayer
+    - NEDepthwiseConvolutionLayerNativeKernel
+    - @ref NEGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+    - @ref NEMeanStdDevNormalizationKernel / @ref NEMeanStdDevNormalizationLayer
+    - @ref NESpaceToDepthLayerKernel / @ref NESpaceToDepthLayer
+ - New OpenCL kernels / functions:
+    - @ref CLAbsLayer
+    - @ref CLElementwisePower
+    - @ref CLLogLayer
+    - @ref CLLSTMLayerQuantized
+    - @ref CLNegLayer
+    - @ref CLPReluLayer
+    - @ref CLSinLayer
+    - CLBatchConcatenateLayerKernel
+    - @ref CLDepthToSpaceLayerKernel / @ref CLDepthToSpaceLayer
+    - @ref CLGEMMLowpMatrixMultiplyNativeKernel
+    - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+    - @ref CLGEMMMatrixMultiplyNativeKernel
+    - CLMeanStdDevNormalizationKernel / CLMeanStdDevNormalizationLayer
+    - @ref CLSpaceToDepthLayerKernel / @ref CLSpaceToDepthLayer
+ - New examples:
+    - neon_opticalflow
+    - cl_cache
+    - neon_permute
+ - Added support for FP16 in @ref NEDeconvolutionLayer
+ - Added support for FP16 in @ref CLDeconvolutionLayer
+ - Added support for REDUCE_MIN and REDUCE_MAX in @ref ReductionOperation
+ - Enable the fusion of batch normalization with convolution and depthwise convolution layer for FP32 in the graph API (OpenCL only)
+ - Added support for fusing activation function and broadcast addition with the matrix multiplication for FP32 (OpenCL only)
+ - Re-factored the depthwise convolution layer kernel on Arm® Neon™ for generic cases
+ - Added an optimized depthwise convolution layer kernel for 5x5 filters (Neon only)
+ - Added support to enable OpenCL kernel cache. Added example showing how to load the prebuilt OpenCL kernels from a binary cache file
+ - Altered @ref QuantizationInfo interface to support per-channel quantization.
+ - CLDepthwiseConvolutionLayer3x3 will be included by @ref CLDepthwiseConvolutionLayer to accommodate future optimizations.
+ - NEDepthwiseConvolutionLayerOptimized will be included by @ref NEDepthwiseConvolutionLayer to accommodate future optimizations.
+ - Removed inner_border_right and inner_border_top parameters from @ref CLDeconvolutionLayer interface
+ - Removed inner_border_right and inner_border_top parameters from @ref NEDeconvolutionLayer interface
+ - Optimized the Arm® Neon™ assembly kernel for GEMMLowp. The new implementation fuses the output stage and quantization with the matrix multiplication kernel
+
+v19.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEBatchToSpaceLayerKernel / @ref NEBatchToSpaceLayer
+    - NEComplexPixelWiseMultiplicationKernel / @ref NEComplexPixelWiseMultiplication
+    - @ref NECropKernel / @ref NECropResize
+    - NEDepthwiseConvolutionAssemblyDispatch
+    - @ref NEFFTDigitReverseKernel
+    - @ref NEFFTRadixStageKernel
+    - @ref NEFFTScaleKernel
+    - @ref NEGEMMLowpOffsetContributionOutputStageKernel
+    - NEHeightConcatenateLayerKernel
+    - @ref NESpaceToBatchLayerKernel / @ref NESpaceToBatchLayer
+    - @ref NEFFT1D
+    - @ref NEFFT2D
+    - @ref NEFFTConvolutionLayer
+ - New OpenCL kernels / functions:
+    - CLComplexPixelWiseMultiplicationKernel / @ref CLComplexPixelWiseMultiplication
+    - CLCropKernel / @ref CLCropResize
+    - @ref CLDeconvolutionReshapeOutputKernel
+    - @ref CLFFTDigitReverseKernel
+    - @ref CLFFTRadixStageKernel
+    - @ref CLFFTScaleKernel
+    - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+    - @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
+    - CLHeightConcatenateLayerKernel
+    - @ref CLDirectDeconvolutionLayer
+    - @ref CLFFT1D
+    - @ref CLFFT2D
+    - @ref CLFFTConvolutionLayer
+    - @ref CLGEMMDeconvolutionLayer
+ - New OpenGLES kernels / functions:
+    - GCConcatenateLayer
+ - Deprecated functions/interfaces
+    - GCDepthConcatenateLayer
+    - NEWidthConcatenateLayer
+    - NEDepthConcatenateLayer
+    - CLWidthConcatenateLayer
+    - CLDepthConcatenateLayer
+    - CLGEMMInterleave4x4
+    - CLGEMMTranspose1xW
+ - Support different quantization info in CLConcatLayer.
+ - Add checks for cases where different input/output quantization info is not supported.
+ - Tensors have different quantization information.
+ - Add FP16 support checks.
+ - Fix output quantization in CLDepthwiseConv3x3 when activation is fused.
+ - New graph examples:
+     - graph_convolution
+     - graph_fully_connected
+     - graph_depthwise_convolution
+     - Deepspeech v0.4.1
+ - Add support for QASYMM8 in NEArithmeticSubtractionKernel.
+ - Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
+ - Add support for QASYMM8 NEDeconvolution.
+ - Add support for DequantizationLayer for Neon/CL.
+ - Add support for dilation in CLDepthwiseConvolution.
+ - Fuse offset contribution with the output stage when we use NEGEMMLowpMatrixMultiplyCore.
+ - Optimize CLDeconvolution.
+ - Add StackLayer to the graph API.
+ - Add support for "reflect" padding mode in NEPad.
+ - Winograd 7x7 NHWC on OpenCL.
+ - Rework CL ML layers to run exclusively on CL.
+ - Support different quantization info in PoolingLayer.
+ - Implement and test import memory interfaces.
+ - Added new tests and removed old ones.
+ - Various clang-tidy fixes.
+
+v19.02 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+    - @ref NETileKernel / @ref NETile
+    - @ref NEFuseBatchNormalizationKernel / @ref NEFuseBatchNormalization
+    - NEElementwiseOperationKernel
+    - @ref NEElementwiseMax
+    - @ref NEElementwiseMin
+    - @ref NEElementwiseSquaredDiff
+    - @ref NESelectKernel / @ref NESelect
+    - @ref NESplit
+    - @ref NESlice
+    - @ref NEUnstack
+    - @ref NEStridedSliceKernel / @ref NEStridedSlice
+    - NEElementwiseUnaryKernel
+    - @ref NERsqrtLayer
+    - @ref NEExpLayer
+    - @ref NEReverseKernel / @ref NEReverse
+    - @ref NEArgMinMaxLayer
+    - @ref NEStackLayerKernel / @ref NEStackLayer
+    - @ref NERangeKernel / @ref NERange
+    - @ref NEPadLayer
+    - NEMemsetKernel
+    - @ref NEGatherKernel / @ref NEGather
+    - @ref NEElementwiseComparison
+    - @ref NEElementwiseComparisonStatic
+    - NEComparisonOperationKernel
+    - @ref NEElementwiseDivision
+ - New OpenCL kernels / functions:
+    - @ref CLSelectKernel / @ref CLSelect
+    - @ref CLTileKernel / @ref CLTile
+    - @ref CLComparisonKernel / @ref CLComparison
+    - @ref CLArgMinMaxLayer
+    - @ref CLElementwiseMax
+    - @ref CLElementwiseMin
+    - @ref CLElementwiseSquaredDiff
+    - @ref CLStackLayerKernel / @ref CLStackLayer
+    - @ref CLReverse / @ref CLReverseKernel
+    - @ref CLRsqrtLayer
+    - @ref CLExpLayer
+    - CLElementWiseUnaryLayerKernel
+    - @ref CLGEMMReshapeLHSMatrixKernel
+    - @ref CLGEMMReshapeRHSMatrixKernel
+    - @ref CLGEMMMatrixMultiplyReshapedKernel
+    - @ref CLRangeKernel / @ref CLRange
+    - @ref CLUnstack
+    - @ref CLGatherKernel / @ref CLGather
+    - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
+ - New CPP kernels / functions:
+    - @ref CPPDetectionOutputLayer
+    - @ref CPPTopKV / @ref CPPTopKVKernel
+ - Added new examples:
+    - graph_ssd_mobilenet.cpp
+    - graph_mobilenet_v2.cpp
+    - graph_resnet12.cpp
+    - graph_srcnn955.cpp
+    - graph_vgg_vdsr.cpp
+    - graph_inception_resnet_v1.cpp
+ - Add 4D tensors support to
+    - @ref NESoftmaxLayer
+ - Fused activation in @ref CLWinogradConvolutionLayer
+ - Extended @ref NEPermute to support more cases
+ - Added Neon/SVE GEMM Hybrid kernels
+ - Added u8 and s8 hybrid assembly kernels
+ - Introduced GEMM strategy name in NEGEMMAssemblyWrapper
+ - Improved @ref CLTuner
+ - Fused the bias addition within @ref CLGEMM
+ - Added support for QASYMM8 LOGISTIC activation in @ref NEActivationLayer
+ - Added NHWC data layout support to:
+    - @ref NEScale for F16
+    - @ref CLNormalizationLayer IN_MAP_2D for FP32/FP16
+    - @ref NEL2NormalizeLayer for FP32/FP16
+    - @ref NENormalizationLayer IN_MAP_2D for FP32/FP16
+    - @ref CLROIAlignLayer
+    - @ref CLGenerateProposalsLayer
+ - Added QASYMM8 support to the following kernels:
+    - NEArithmeticAdditionKernel
+    - @ref NEScale
+ - Added new tests and improved validation and benchmarking suites.
+ - Deprecated functions/interfaces
+    - Usage of inner_border_right and inner_border_top has been deprecated in @ref CLDeconvolutionLayer and @ref NEDeconvolutionLayer
+
+v18.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEChannelShuffleLayer / @ref NEChannelShuffleLayerKernel
+    - @ref NEReduceMean
+    - @ref NEReorgLayer / @ref NEReorgLayerKernel
+    - @ref NEPriorBoxLayer / @ref NEPriorBoxLayerKernel
+    - NEUpsampleLayer / NEUpsampleLayerKernel
+    - NEYOLOLayer / NEYOLOLayerKernel
+ - New OpenCL kernels / functions:
+    - @ref CLBatchToSpaceLayer / @ref CLBatchToSpaceLayerKernel
+    - @ref CLBoundingBoxTransform / @ref CLBoundingBoxTransformKernel
+    - @ref CLComputeAllAnchorsKernel
+    - @ref CLGenerateProposalsLayer
+    - @ref CLNormalizePlanarYUVLayer / @ref CLNormalizePlanarYUVLayerKernel
+    - @ref CLReorgLayer / @ref CLReorgLayerKernel
+    - @ref CLSpaceToBatchLayer / @ref CLSpaceToBatchLayerKernel
+    - @ref CLPadLayer
+    - @ref CLReduceMean
+    - @ref CLPriorBoxLayer / @ref CLPriorBoxLayerKernel
+    - @ref CLROIAlignLayer / @ref CLROIAlignLayerKernel
+    - @ref CLSlice
+    - @ref CLSplit
+    - @ref CLStridedSlice / @ref CLStridedSliceKernel
+    - CLUpsampleLayer / CLUpsampleLayerKernel
+    - CLYOLOLayer / CLYOLOLayerKernel
+ - New CPP kernels / functions:
+    - @ref CPPBoxWithNonMaximaSuppressionLimit / @ref CPPBoxWithNonMaximaSuppressionLimitKernel
+ - Added the validate method in:
+    - @ref NEDepthConvertLayer
+    - @ref NEFloor / @ref CLFloor
+    - @ref NEGEMMMatrixAdditionKernel
+    - @ref NEReshapeLayer / @ref CLReshapeLayer
+    - @ref CLScale
+ - Added new examples:
+    - graph_shufflenet.cpp
+    - graph_yolov3.cpp
+ - Added documentation on how to add a new function or kernel.
+ - Improved doxygen documentation by adding a list of the existing functions.
+ - Add 4D tensors support to
+    - CLWidthConcatenateLayer
+    - CLFlattenLayer
+    - @ref CLSoftmaxLayer
+ - Add dot product support for @ref CLDepthwiseConvolutionLayer3x3NHWCKernel non-unit stride
+ - Add SVE support
+ - Fused batch normalization into convolution layer weights in @ref CLFuseBatchNormalization
+ - Fused activation in @ref CLDepthwiseConvolutionLayer3x3NCHWKernel, @ref CLDepthwiseConvolutionLayer3x3NHWCKernel and @ref NEGEMMConvolutionLayer
+ - Added NHWC data layout support to:
+    - @ref CLChannelShuffleLayer
+    - @ref CLDeconvolutionLayer
+    - @ref CLL2NormalizeLayer
+ - Added QASYMM8 support to the following kernels:
+    - CLScaleKernel
+    - NEDepthwiseConvolutionLayer3x3Kernel
+    - CLPixelWiseMultiplicationKernel
+ - Added FP16 support to the following kernels:
+    - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
+    - NEDepthwiseConvolutionLayer3x3Kernel
+    - @ref CLNormalizePlanarYUVLayerKernel
+    - @ref CLWinogradConvolutionLayer (5x5 kernel)
+ - More tests added to both validation and benchmarking suites.
+
+v18.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r17b.
+ - Removed support for QS8/QS16 data types.
+ - Added support for grouped convolution in @ref CLConvolutionLayer.
+ - Added NHWC data layout support to:
+    - NEDepthConcatenateLayer / CLDepthConcatenateLayer
+    - @ref NEWinogradConvolutionLayer / @ref CLWinogradConvolutionLayer
+    - @ref CLDepthwiseConvolutionLayer
+    - @ref CLDirectConvolutionLayer
+    - @ref CLConvolutionLayer
+    - @ref CLScale
+    - @ref CLIm2ColKernel
+ - New Arm® Neon™ kernels / functions:
+    - @ref NERNNLayer
+ - New OpenCL kernels / functions:
+    - @ref CLArithmeticDivision
+ - Introduced prepare() stage support in the graph API for GLES.
+ - Added support for memory reuse when trying to allocate smaller CLTensors.
+ - Enabled NHWC execution on graph examples.
+ - Added JPEG accessor for validation purposes.
+ - Added validate methods to some kernels / functions.
+
+v18.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Major redesign in the interface for the neon kernels implemented in assembly.
+ - Removed arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore / arm_compute::NEHGEMMAArch64FP16Kernel
+ - Added NEGEMMAssemblyWrapper and AssemblyKernelGlue which are used to execute assembly kernels in neon functions.
+ - Minor changes to the CPUInfo type to make it compatible with the new assembly gemm interface.
+ - Moved neon assembly kernels to the folder src/core/Neon/kernels/arm_gemm.
+ - Improved doxygen documentation.
+ - Improved memory management for layer transitions.
+ - Added support for NHWC data layout in tensors.
+ - Added NHWC data layout support to:
+    - @ref NEGEMMConvolutionLayer
+    - @ref NEDirectConvolutionLayer
+    - @ref NEPoolingLayer / @ref CLPoolingLayer
+    - @ref NEBatchNormalizationLayer / @ref CLBatchNormalizationLayer
+    - @ref NEDepthwiseConvolutionLayer
+    - @ref NEScale
+    - NEIm2Col
+ - Added support for dilated convolutions in @ref NEConvolutionLayer and @ref CLConvolutionLayer.
+ - New OpenCL kernels / functions:
+    - @ref CLChannelShuffleLayer / @ref CLChannelShuffleLayerKernel
+    - CLConvertFullyConnectedWeightsKernel / @ref CLConvertFullyConnectedWeights
+    - @ref CLCopy / CLCopyKernel
+    - @ref CLLSTMLayer
+    - @ref CLRNNLayer
+    - CLWidthConcatenateLayer / CLWidthConcatenateLayerKernel
+    - @ref CLWinogradFilterTransformKernel / @ref CLWinogradInputTransformKernel / @ref CLWinogradConvolutionLayer
+    - @ref CLWinogradInputTransformKernel / @ref CLWinogradInputTransform
+ - New Arm® Neon™ kernels / functions:
+    - NEConvertFullyConnectedWeightsKernel / @ref NEConvertFullyConnectedWeights.
+ - Created the validate method in @ref CLDepthwiseConvolutionLayer.
+ - Beta and gamma are no longer mandatory arguments in @ref NEBatchNormalizationLayer and @ref CLBatchNormalizationLayer.
+ - Added depth multiplier support in @ref NEDepthwiseConvolutionLayer and @ref CLDepthwiseConvolutionLayer.
+ - Added broadcast multiply support in @ref NEPixelWiseMultiplication / NEPixelWiseMultiplicationKernel.
+ - Port mobilenet example to NHWC data layout.
+ - Enabled Winograd method in @ref CLConvolutionLayer.
+ - Renamed NEWinogradLayer to @ref NEWinogradConvolutionLayer.
+ - Updated @ref NEWinogradConvolutionLayer to use highly optimised assembly kernels in src/core/Neon/kernels/arm_gemm.
+ - Added memory manager support in GLES functions.
+ - Major refactoring of the graph API.
+ - Added GLES backend in the graph API.
+ - Added support for the memory manager in the graph API.
+ - Enabled Winograd Convolution method in the graph API.
+ - Added support for grouped convolutions in the graph API.
+ - Replaced NEDeconvolutionLayerUpsampleKernel with NEScaleKernel in @ref NEDeconvolutionLayer.
+ - Added fast maths flag in @ref CLConvolutionLayer.
+ - Added new tests and benchmarks in validation and benchmark frameworks
+ - Merge Activation layer with Convolution Layer (Neon, CL, GLES)
+ - Added support for OpenCL 2.0 SVM
+ - Added support to import memory in OpenCL tensors.
+ - Added the prepare() method to perform any one-off pre-processing before running the function.
+ - Added new examples:
+    - graph_inception_v4.cpp
+    - graph_resnext50.cpp
+ - Added memory measurement instrument for CL.
+
+v18.03 Public maintenance release
+ - Various bug fixes.
+ - Fixed bug in @ref NEActivationLayer
+ - Fix in @ref CLTuner when using batches.
+ - Updated recommended NDK version to r16b (and fixed warnings).
+ - Fixed bug in validation code.
+ - Added Inception v4 graph example.
+ - Renamed NEWinogradLayer.cpp to @ref NEWinogradConvolutionLayer
+
+v18.02 Public major release
+ - Various Arm® Neon™ / OpenCL / GLES optimisations.
+ - Various bug fixes.
+ - Changed default number of threads on big.LITTLE systems.
+ - Refactored examples and added:
+    - graph_mobilenet_qassym8
+    - graph_resnet
+    - graph_squeezenet_v1_1
+ - Renamed @ref CLConvolutionLayer into @ref CLGEMMConvolutionLayer and created a new @ref CLConvolutionLayer to select the fastest convolution method.
+ - Renamed @ref NEConvolutionLayer into @ref NEGEMMConvolutionLayer and created a new @ref NEConvolutionLayer to select the fastest convolution method.
+ - Added in place support to:
+    - @ref CLActivationLayer
+    - @ref CLBatchNormalizationLayer
+ - Added QASYMM8 support to:
+    - @ref CLActivationLayer
+    - @ref CLDepthwiseConvolutionLayer
+    - @ref NEDepthwiseConvolutionLayer
+    - @ref NESoftmaxLayer
+ - Added FP16 support to:
+    - CLDepthwiseConvolutionLayer3x3
+    - @ref CLDepthwiseConvolutionLayer
+ - Added broadcasting support to NEArithmeticAddition / @ref CLArithmeticAddition / @ref CLPixelWiseMultiplication
+ - Added fused batch normalization and activation to @ref CLBatchNormalizationLayer and @ref NEBatchNormalizationLayer
+ - Added support for non-square pooling to @ref NEPoolingLayer and @ref CLPoolingLayer
+ - New OpenCL kernels / functions:
+    - CLDirectConvolutionLayerOutputStageKernel
+ - New Arm® Neon™ kernels / functions
+    - Added name() method to all kernels.
+    - Added support for Winograd 5x5.
+    - NEPermuteKernel / @ref NEPermute
+    - @ref NEWinogradLayerTransformInputKernel / NEWinogradLayer
+    - @ref NEWinogradLayerTransformOutputKernel / NEWinogradLayer
+    - @ref NEWinogradLayerTransformWeightsKernel / NEWinogradLayer
+    - Renamed NEWinogradLayerKernel into NEWinogradLayerBatchedGEMMKernel
+ - New GLES kernels / functions:
+    - GCTensorShiftKernel / GCTensorShift
+
+v18.01 Public maintenance release
+ - Various bug fixes
+ - Added some of the missing validate() methods
+ - Added @ref CLDeconvolutionLayerUpsampleKernel / @ref CLDeconvolutionLayer @ref CLDeconvolutionLayerUpsample
+ - Added CLPermuteKernel / @ref CLPermute
+ - Added method to clean the programs cache in the CL Kernel library.
+ - Added GCArithmeticAdditionKernel / GCArithmeticAddition
+ - Added GCDepthwiseConvolutionLayer3x3Kernel / GCDepthwiseConvolutionLayer3x3
+ - Added GCNormalizePlanarYUVLayerKernel / GCNormalizePlanarYUVLayer
+ - Added GCScaleKernel / GCScale
+ - Added GCWeightsReshapeKernel / GCConvolutionLayer
+ - Added FP16 support to the following GLES compute kernels:
+    - GCCol2ImKernel
+    - GCGEMMInterleave4x4Kernel
+    - GCGEMMTranspose1xWKernel
+    - GCIm2ColKernel
+ - Refactored Arm® Neon™ Winograd (NEWinogradLayerKernel)
+ - Added NEDirectConvolutionLayerOutputStageKernel
+ - Added QASYMM8 support to the following Arm® Neon™ kernels:
+    - NEDepthwiseConvolutionLayer3x3Kernel
+    - @ref NEFillBorderKernel
+    - NEPoolingLayerKernel
+ - Added new examples:
+    - graph_cl_mobilenet_qasymm8.cpp
+    - graph_inception_v3.cpp
+    - gc_dc.cpp
+ - More tests added to both validation and benchmarking suites.
+
+v17.12 Public major release
+ - Most machine learning functions on OpenCL support the new data type QASYMM8
+ - Introduced logging interface
+ - Introduced opencl timer
+ - Reworked GEMMLowp interface
+ - Added new Arm® Neon™ assembly kernels for GEMMLowp, SGEMM and HGEMM
+ - Added validation method for most Machine Learning kernels / functions
+ - Added new graph examples such as googlenet, mobilenet, squeezenet, vgg16 and vgg19
+ - Added sgemm example for OpenCL
+ - Added absolute difference example for GLES compute
+ - Added new tests and benchmarks in validation and benchmark frameworks
+ - Added new kernels / functions for GLES compute
+
+ - New OpenGL ES kernels / functions
+    - GCAbsoluteDifferenceKernel / GCAbsoluteDifference
+    - GCActivationLayerKernel / GCActivationLayer
+    - GCBatchNormalizationLayerKernel / GCBatchNormalizationLayer
+    - GCCol2ImKernel
+    - GCDepthConcatenateLayerKernel / GCDepthConcatenateLayer
+    - GCDirectConvolutionLayerKernel / GCDirectConvolutionLayer
+    - GCDropoutLayerKernel / GCDropoutLayer
+    - GCFillBorderKernel / GCFillBorder
+    - GCGEMMInterleave4x4Kernel / GCGEMMInterleave4x4
+    - GCGEMMMatrixAccumulateBiasesKernel / GCGEMMMatrixAdditionKernel / GCGEMMMatrixMultiplyKernel / GCGEMM
+    - GCGEMMTranspose1xWKernel / GCGEMMTranspose1xW
+    - GCIm2ColKernel
+    - GCNormalizationLayerKernel / GCNormalizationLayer
+    - GCPixelWiseMultiplicationKernel / GCPixelWiseMultiplication
+    - GCPoolingLayerKernel / GCPoolingLayer
+    - GCLogits1DMaxKernel / GCLogits1DShiftExpSumKernel / GCLogits1DNormKernel / GCSoftmaxLayer
+    - GCTransposeKernel / GCTranspose
+
+ - New Arm® Neon™ kernels / functions
+    - arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore
+    - arm_compute::NEHGEMMAArch64FP16Kernel
+    - NEDepthwiseConvolutionLayer3x3Kernel / NEDepthwiseIm2ColKernel / NEGEMMMatrixVectorMultiplyKernel / NEDepthwiseVectorToTensorKernel / @ref NEDepthwiseConvolutionLayer
+    - @ref NEGEMMLowpOffsetContributionKernel / @ref NEGEMMLowpMatrixAReductionKernel / @ref NEGEMMLowpMatrixBReductionKernel / @ref NEGEMMLowpMatrixMultiplyCore
+    - @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
+    - NEWinogradLayer / NEWinogradLayerKernel
+
+ - New OpenCL kernels / functions
+    - @ref CLGEMMLowpOffsetContributionKernel / @ref CLGEMMLowpMatrixAReductionKernel / @ref CLGEMMLowpMatrixBReductionKernel / @ref CLGEMMLowpMatrixMultiplyCore
+    - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
+
+ - New graph nodes for Arm® Neon™ and OpenCL
+    - graph::BranchLayer
+    - graph::DepthConvertLayer
+    - graph::DepthwiseConvolutionLayer
+    - graph::DequantizationLayer
+    - graph::FlattenLayer
+    - graph::QuantizationLayer
+    - graph::ReshapeLayer
+
+v17.10 Public maintenance release
+ - Bug fixes:
+    - Check the maximum local workgroup size supported by OpenCL devices
+    - Minor documentation updates (Fixed instructions to build the examples)
+    - Introduced a graph::GraphContext
+    - Added a few new Graph nodes, support for branches and grouping.
+    - Automatically enable cl_printf in debug builds
+    - Fixed bare metal builds for armv7a
+    - Added AlexNet and cartoon effect examples
+    - Fixed library builds: libraries are no longer built as supersets of each other. (This means applications using the Runtime part of the library now need to link against both libarm_compute_core and libarm_compute.)
+
+v17.09 Public major release
+ - Experimental Graph support: initial implementation of a simple stream API to easily chain machine learning layers.
+ - Memory Manager (@ref BlobLifetimeManager, @ref BlobMemoryPool, @ref ILifetimeManager, @ref IMemoryGroup, @ref IMemoryManager, @ref IMemoryPool, @ref IPoolManager, @ref MemoryManagerOnDemand, @ref PoolManager)
+ - New validation and benchmark frameworks (Boost and Google frameworks replaced by homemade framework).
+ - Most machine learning functions support both fixed point 8 and 16 bit (QS8, QS16) for both Arm® Neon™ and OpenCL.
+ - New Arm® Neon™ kernels / functions:
+    - arm_compute::NEGEMMAssemblyBaseKernel arm_compute::NEGEMMAArch64Kernel
+    - NEDequantizationLayerKernel / @ref NEDequantizationLayer
+    - NEFloorKernel / @ref NEFloor
+    - @ref NEL2NormalizeLayerKernel / @ref NEL2NormalizeLayer
+    - NEQuantizationLayerKernel @ref NEMinMaxLayerKernel / @ref NEQuantizationLayer
+    - @ref NEROIPoolingLayerKernel / @ref NEROIPoolingLayer
+    - @ref NEReductionOperationKernel / @ref NEReductionOperation
+    - NEReshapeLayerKernel / @ref NEReshapeLayer
+
+ - New OpenCL kernels / functions:
+    - @ref CLDepthwiseConvolutionLayer3x3NCHWKernel @ref CLDepthwiseConvolutionLayer3x3NHWCKernel CLDepthwiseIm2ColKernel CLDepthwiseVectorToTensorKernel CLDepthwiseWeightsReshapeKernel / CLDepthwiseConvolutionLayer3x3 @ref CLDepthwiseConvolutionLayer CLDepthwiseSeparableConvolutionLayer
+    - CLDequantizationLayerKernel / CLDequantizationLayer
+    - CLDirectConvolutionLayerKernel / @ref CLDirectConvolutionLayer
+    - CLFlattenLayer
+    - CLFloorKernel / @ref CLFloor
+    - CLGEMMTranspose1xW
+    - CLGEMMMatrixVectorMultiplyKernel
+    - @ref CLL2NormalizeLayerKernel / @ref CLL2NormalizeLayer
+    - CLQuantizationLayerKernel @ref CLMinMaxLayerKernel / @ref CLQuantizationLayer
+    - @ref CLROIPoolingLayerKernel / @ref CLROIPoolingLayer
+    - @ref CLReductionOperationKernel / @ref CLReductionOperation
+    - CLReshapeLayerKernel / @ref CLReshapeLayer
+
+v17.06 Public major release
+ - Various bug fixes
+ - Added support for fixed point 8 bit (QS8) to the various Arm® Neon™ machine learning kernels.
+ - Added unit tests and benchmarks (AlexNet, LeNet)
+ - Added support for sub tensors.
+ - Added infrastructure to provide GPU specific optimisation for some OpenCL kernels.
+ - Added @ref OMPScheduler (OpenMP) scheduler for Neon
+ - Added @ref SingleThreadScheduler scheduler for Arm® Neon™ (For bare metal)
+ - Users can specify their own scheduler by implementing the @ref IScheduler interface.
+ - New OpenCL kernels / functions:
+    - @ref CLBatchNormalizationLayerKernel / @ref CLBatchNormalizationLayer
+    - CLDepthConcatenateLayerKernel / CLDepthConcatenateLayer
+    - CLHOGOrientationBinningKernel CLHOGBlockNormalizationKernel, CLHOGDetectorKernel / CLHOGDescriptor CLHOGDetector CLHOGGradient CLHOGMultiDetection
+    - CLLocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedLayer
+    - @ref CLWeightsReshapeKernel / @ref CLConvolutionLayerReshapeWeights
+ - New C++ kernels:
+    - CPPDetectionWindowNonMaximaSuppressionKernel
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEBatchNormalizationLayerKernel / @ref NEBatchNormalizationLayer
+    - NEDepthConcatenateLayerKernel / NEDepthConcatenateLayer
+    - NEDirectConvolutionLayerKernel / @ref NEDirectConvolutionLayer
+    - NELocallyConnectedMatrixMultiplyKernel / NELocallyConnectedLayer
+    - @ref NEWeightsReshapeKernel / @ref NEConvolutionLayerReshapeWeights
+
+v17.05 Public bug fixes release
+ - Various bug fixes
+ - Ported the remaining functions to use accurate padding.
+ - Library does not link against OpenCL anymore (It uses dlopen / dlsym at runtime instead to determine whether or not OpenCL is available).
+ - Added "free" method to allocator.
+ - Minimum version of g++ required for armv7 Linux changed from 4.8 to 4.9
+
+v17.04 Public bug fixes release
+
+ The following functions have been ported to use the new accurate padding:
+ -  CLColorConvertKernel
+ -  CLEdgeNonMaxSuppressionKernel
+ -  CLEdgeTraceKernel
+ -  CLGaussianPyramidHorKernel
+ -  CLGaussianPyramidVertKernel
+ -  CLGradientKernel
+ -  NEChannelCombineKernel
+ -  NEFillArrayKernel
+ -  NEGaussianPyramidHorKernel
+ -  NEGaussianPyramidVertKernel
+ -  NEHarrisScoreFP16Kernel
+ -  NEHarrisScoreKernel
+ -  NEHOGDetectorKernel
+ -  NELogits1DMaxKernel
+ -  NELogits1DShiftExpSumKernel
+ -  NELogits1DNormKernel
+ -  NENonMaximaSuppression3x3FP16Kernel
+ -  NENonMaximaSuppression3x3Kernel
+
+v17.03.1 First Major public release of the sources
+ - Renamed the library to arm_compute
+ - New CPP target introduced for C++ kernels shared between Arm® Neon™ and CL functions.
+ - New padding calculation interface introduced and ported most kernels / functions to use it.
+ - New OpenCL kernels / functions:
+   - CLGEMMLowpMatrixMultiplyKernel / CLGEMMLowp
+ - New Arm® Neon™ kernels / functions:
+   - @ref NENormalizationLayerKernel / @ref NENormalizationLayer
+   - NETransposeKernel / @ref NETranspose
+   - NELogits1DMaxKernel, NELogits1DShiftExpSumKernel, NELogits1DNormKernel / @ref NESoftmaxLayer
+   - @ref NEIm2ColKernel, @ref NECol2ImKernel, NEConvolutionLayerWeightsReshapeKernel / @ref NEConvolutionLayer
+   - NEGEMMMatrixAccumulateBiasesKernel / @ref NEFullyConnectedLayer
+   - @ref NEGEMMLowpMatrixMultiplyKernel / NEGEMMLowp
+
+v17.03 Sources preview
+ - New OpenCL kernels / functions:
+   - CLGradientKernel, CLEdgeNonMaxSuppressionKernel, CLEdgeTraceKernel / CLCannyEdge
+   - GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, @ref CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / @ref CLGEMM
+   - CLGEMMMatrixAccumulateBiasesKernel / @ref CLFullyConnectedLayer
+   - CLTransposeKernel / @ref CLTranspose
+   - CLLKTrackerInitKernel, CLLKTrackerStage0Kernel, CLLKTrackerStage1Kernel, CLLKTrackerFinalizeKernel / CLOpticalFlow
+   - @ref CLNormalizationLayerKernel / @ref CLNormalizationLayer
+   - CLLaplacianPyramid, CLLaplacianReconstruct
+ - New Arm® Neon™ kernels / functions:
+   - NEActivationLayerKernel / @ref NEActivationLayer
+   - GEMM refactoring + FP16 support (Requires armv8.2 CPU): @ref NEGEMMInterleave4x4Kernel, @ref NEGEMMTranspose1xWKernel, @ref NEGEMMMatrixMultiplyKernel, @ref NEGEMMMatrixAdditionKernel / @ref NEGEMM
+   - NEPoolingLayerKernel / @ref NEPoolingLayer
+
+v17.02.1 Sources preview
+ - New OpenCL kernels / functions:
+   - CLLogits1DMaxKernel, CLLogits1DShiftExpSumKernel, CLLogits1DNormKernel / @ref CLSoftmaxLayer
+   - CLPoolingLayerKernel / @ref CLPoolingLayer
+   - @ref CLIm2ColKernel, @ref CLCol2ImKernel, CLConvolutionLayerWeightsReshapeKernel / CLConvolutionLayer
+   - @ref CLRemapKernel / @ref CLRemap
+   - CLGaussianPyramidHorKernel, CLGaussianPyramidVertKernel / CLGaussianPyramid, CLGaussianPyramidHalf, CLGaussianPyramidOrb
+   - CLMinMaxKernel, CLMinMaxLocationKernel / CLMinMaxLocation
+   - CLNonLinearFilterKernel / CLNonLinearFilter
+ - New Arm® Neon™ FP16 kernels (Requires armv8.2 CPU)
+   - NEAccumulateWeightedFP16Kernel
+   - NEBox3x3FP16Kernel
+   - NENonMaximaSuppression3x3FP16Kernel
+
+v17.02 Sources preview
+ - New OpenCL kernels / functions:
+   - CLActivationLayerKernel / @ref CLActivationLayer
+   - CLChannelCombineKernel / CLChannelCombine
+   - CLDerivativeKernel / CLChannelExtract
+   - CLFastCornersKernel / CLFastCorners
+   - CLMeanStdDevKernel / CLMeanStdDev
+ - New Arm® Neon™ kernels / functions:
+   - HOG / SVM: NEHOGOrientationBinningKernel, NEHOGBlockNormalizationKernel, NEHOGDetectorKernel, NEHOGNonMaximaSuppressionKernel / NEHOGDescriptor, NEHOGDetector, NEHOGGradient, NEHOGMultiDetection
+   - NENonLinearFilterKernel / NENonLinearFilter
+ - Introduced a CLScheduler to manage the default context and command queue used by the runtime library and create synchronisation events.
+ - Switched all the kernels / functions to use tensors instead of images.
+ - Updated documentation to include instructions to build the library from sources.
+
+v16.12 Binary preview release
+ - Original release
+
+ */
+} // namespace arm_compute
+\ No newline at end of file
diff --git a/docs/user_guide/tests.dox b/docs/user_guide/tests.dox
new file mode 100644
index 0000000000..0d166b9693
--- /dev/null
+++ b/docs/user_guide/tests.dox
@@ -0,0 +1,385 @@
+///
+/// Copyright (c) 2017-2020 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+namespace test
+{
+/**
+@page tests Validation and Benchmarks
+
+@tableofcontents
+
+@section tests_overview Overview
+
+Benchmark and validation tests use the same framework to set up and run
+the tests. In addition to running simple, self-contained test functions, the
+framework supports fixtures and data test cases. The former allow sharing
+common setup routines between various backends, thus reducing the amount of
+duplicated code. The latter can be used to parameterize tests or fixtures with
+different inputs, e.g. different tensor shapes. One limitation is that
+tests/fixtures cannot be parameterized based on the data type if static type
+information is needed within the test (e.g. to validate the results).
+
+@note By default tests are not built. To enable them you need to add validation_tests=1 and / or benchmark_tests=1 to your SCons line.
+
+@note Tests are not included in the pre-built binary archive; you have to build them from sources.
+
+@subsection tests_overview_fixtures Fixtures
+
+Fixtures can be used to share common setup, teardown or even run tasks among
+multiple test cases. For that purpose a fixture can define a `setup`,
+`teardown` and `run` method. Additionally the constructor and destructor might
+also be customized.
+
+An instance of the fixture is created immediately before the actual test is
+executed. After construction the @ref framework::Fixture::setup method is called. Then the test
+function or the fixture's `run` method is invoked. After test execution the
+@ref framework::Fixture::teardown method is called and lastly the fixture is destructed.
+
+@subsubsection tests_overview_fixtures_fixture Fixture
+
+Fixtures for non-parameterized tests are straightforward. The custom fixture
+class has to inherit from @ref framework::Fixture and choose to implement any of the
+`setup`, `teardown` or `run` methods. None of the methods takes any arguments
+or returns anything.
+
+    class CustomFixture : public framework::Fixture
+    {
+    public:
+        void setup()
+        {
+            _ptr = malloc(4000);
+        }
+
+        void run()
+        {
+            ARM_COMPUTE_ASSERT(_ptr != nullptr);
+        }
+
+        void teardown()
+        {
+            free(_ptr);
+        }
+
+        void *_ptr;
+    };
+
+@subsubsection tests_overview_fixtures_data_fixture Data fixture
+
+The advantage of a parameterized fixture is that arguments can be passed to the setup method at runtime. To make this possible the setup method has to be a template with a type parameter for every argument (though the template parameter doesn't have to be used). All other methods remain the same.
+
+    class CustomFixture : public framework::Fixture
+    {
+    public:
+    #ifdef ALTERNATIVE_DECLARATION
+        template <typename ...>
+        void setup(size_t size)
+        {
+            _ptr = malloc(size);
+        }
+    #else
+        template <typename T>
+        void setup(T size)
+        {
+            _ptr = malloc(size);
+        }
+    #endif
+
+        void run()
+        {
+            ARM_COMPUTE_ASSERT(_ptr != nullptr);
+        }
+
+        void teardown()
+        {
+            free(_ptr);
+        }
+
+        void *_ptr;
+    };
+
+@subsection tests_overview_test_cases Test cases
+
+All following commands can be optionally prefixed with `EXPECTED_FAILURE_` or
+`DISABLED_`.
+
+@subsubsection tests_overview_test_cases_test_case Test case
+
+A simple test case function taking no inputs and having no (shared) state.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the dataset mode in which the test will be active.
+
+
+    TEST_CASE(TestCaseName, DatasetMode::PRECOMMIT)
+    {
+        ARM_COMPUTE_ASSERT_EQUAL(1 + 1, 2);
+    }
+
+@subsubsection tests_overview_test_cases_fixture_fixture_test_case Fixture test case
+
+A simple test case function taking no inputs that inherits from a fixture. The
+test case will have access to all public and protected members of the fixture.
+Only the setup and teardown methods of the fixture will be used. The body of
+this function will be used as the test function.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the class name of the fixture.
+- Third argument is the dataset mode in which the test will be active.
+
+
+    class FixtureName : public framework::Fixture
+    {
+        public:
+            void setup() override
+            {
+                _one = 1;
+            }
+
+        protected:
+            int _one;
+    };
+
+    FIXTURE_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT)
+    {
+        ARM_COMPUTE_ASSERT_EQUAL(_one + 1, 2);
+    }
+
+@subsubsection tests_overview_test_cases_fixture_register_fixture_test_case Registering a fixture as test case
+
+Allows a fixture to be used directly as a test case. Instead of defining a new test
+function the run method of the fixture will be executed.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the class name of the fixture.
+- Third argument is the dataset mode in which the test will be active.
+
+
+    class FixtureName : public framework::Fixture
+    {
+        public:
+            void setup() override
+            {
+                _one = 1;
+            }
+
+            void run() override
+            {
+                ARM_COMPUTE_ASSERT_EQUAL(_one + 1, 2);
+            }
+
+        protected:
+            int _one;
+    };
+
+    REGISTER_FIXTURE_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT);
+
+
+@subsubsection tests_overview_test_cases_data_test_case Data test case
+
+A parameterized test case function that has no (shared) state. The dataset will
+be used to generate versions of the test case with different inputs.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the dataset mode in which the test will be active.
+- Third argument is the dataset.
+- Further arguments specify names of the arguments to the test function. The
+  number must match the arity of the dataset.
+
+
+    DATA_TEST_CASE(TestCaseName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}), num)
+    {
+        ARM_COMPUTE_ASSERT(num < 4);
+    }
+
+@subsubsection tests_overview_test_cases_fixture_data_test_case Fixture data test case
+
+A parameterized test case that inherits from a fixture. The test case will have
+access to all public and protected members of the fixture. Only the setup and
+teardown methods of the fixture will be used. The setup method of the fixture
+needs to be a template and has to accept inputs from the dataset as arguments.
+The body of this function will be used as the test function. The dataset will be
+used to generate versions of the test case with different inputs.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the class name of the fixture.
+- Third argument is the dataset mode in which the test will be active.
+- Fourth argument is the dataset.
+
+
+    class FixtureName : public framework::Fixture
+    {
+        public:
+            template <typename T>
+            void setup(T num)
+            {
+                _num = num;
+            }
+
+        protected:
+            int _num;
+    };
+
+    FIXTURE_DATA_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}))
+    {
+        ARM_COMPUTE_ASSERT(_num < 4);
+    }
+
+@subsubsection tests_overview_test_cases_register_fixture_data_test_case Registering a fixture as data test case
+
+Allows a fixture to be used directly as a parameterized test case. Instead of
+defining a new test function the run method of the fixture will be executed.
+The setup method of the fixture needs to be a template and has to accept inputs
+from the dataset as arguments. The dataset will be used to generate versions of
+the test case with different inputs.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the class name of the fixture.
+- Third argument is the dataset mode in which the test will be active.
+- Fourth argument is the dataset.
+
+
+    class FixtureName : public framework::Fixture
+    {
+        public:
+            template <typename T>
+            void setup(T num)
+            {
+                _num = num;
+            }
+
+            void run() override
+            {
+                ARM_COMPUTE_ASSERT(_num < 4);
+            }
+
+        protected:
+            int _num;
+    };
+
+    REGISTER_FIXTURE_DATA_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}));
+
+@section writing_tests Writing validation tests
+
+Before starting a new test case have a look at the existing ones. They should
+provide a good overview of how test cases are structured.
+
+- The C++ reference needs to be added to `tests/validation/CPP/`. The
+  reference function is typically a template parameterized by the underlying
+  value type of the `SimpleTensor`. This makes it easy to specialise for
+  different data types.
+- If all backends have a common interface it makes sense to share the setup
+  code. This can be done by adding a fixture in
+  `tests/validation/fixtures/`. Inside of the `setup` method of a fixture
+  the tensors can be created and initialised and the function can be configured
+  and run. The actual test will only have to validate the results. To be shared
+  among multiple backends the fixture class is usually a template that accepts
+  the specific types (data, tensor class, function class etc.) as parameters.
+- The actual test cases need to be added for each backend individually.
+  Typically there will be multiple tests for different data types and for
+  different execution modes, e.g. precommit and nightly.
+
+@section tests_running_tests Running tests
+@subsection tests_running_tests_benchmark_and_validation Benchmarking and validation suites
+@subsubsection tests_running_tests_benchmarking_filter Filter tests
+All tests can be run by invoking
+
+    ./arm_compute_benchmark ./data
+
+where `./data` contains the assets needed by the tests.
+
+If only a subset of the tests has to be executed the `--filter` option takes a
+regular expression to select matching tests.
+
+    ./arm_compute_benchmark --filter='^NEON/.*AlexNet' ./data
+
+@note Filtering will be much faster if the regular expression starts from the start ("^") or end ("$") of the line.
+
+Additionally each test has a test id which can be used as a filter, too.
+However, the test id is not guaranteed to be stable when new tests are added.
+A test is only guaranteed to keep the same id within a specific build.
+
+    ./arm_compute_benchmark --filter-id=10 ./data
+
+All available tests can be displayed with the `--list-tests` switch.
+
+    ./arm_compute_benchmark --list-tests
+
+More options can be found in the `--help` message.
+
+@subsubsection tests_running_tests_benchmarking_runtime Runtime
+By default every test is run once on a single thread. The number of iterations
+can be controlled via the `--iterations` option and the number of threads via
+`--threads`.
+
+@subsubsection tests_running_tests_benchmarking_output Output
+By default the benchmarking results are printed in a human readable format on
+the command line. The colored output can be disabled via `--no-color-output`.
+As an alternative output format JSON is supported and can be selected via
+`--log-format=json`. To write the output to a file instead of stdout the
+`--log-file` option can be used.
+
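The runtime and output options can be combined in a single invocation. The sketch below assembles such a command line from the flags described above and only executes it if the benchmark binary is present in the current directory. The log file name `benchmark.json` and the iteration and thread counts are arbitrary choices for illustration.

```shell
# Arbitrary values chosen for this sketch.
log_file=benchmark.json
iterations=10
threads=4

# Assemble the command line from the options described above.
cmd="./arm_compute_benchmark --iterations=$iterations --threads=$threads --log-format=json --log-file=$log_file ./data"

# Run only when the binary exists here; otherwise just show the command.
if [ -x ./arm_compute_benchmark ]; then
    LD_LIBRARY_PATH=. $cmd
else
    echo "would run: $cmd"
fi
```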
+@subsubsection tests_running_tests_benchmarking_mode Mode
+Tests contain different datasets of different sizes, some of which will take several hours to run.
+You can select which datasets to use with the `--mode` option. We recommend starting with `--mode=precommit`.
+
+@subsubsection tests_running_tests_benchmarking_instruments Instruments
+You can use the `--instruments` option to select one or more instruments to measure the execution time of the benchmark tests.
+
+`PMU` will try to read the CPU PMU events from the kernel. (They need to be enabled on your platform.)
+
+`MALI` will try to collect Arm® Mali™ hardware performance counters. (You need to have a recent enough Arm® Mali™ driver.)
+
+`WALL_CLOCK_TIMER` will measure time using `gettimeofday`: this should work on all platforms.
+
+You can pass a combination of these instruments: `--instruments=PMU,MALI,WALL_CLOCK_TIMER`
+
+@note You need to make sure the instruments have been selected at compile time using the `pmu=1` or `mali=1` scons options.
+
+@subsubsection tests_running_examples Examples
+
+To run all the precommit validation tests:
+
+	LD_LIBRARY_PATH=. ./arm_compute_validation --mode=precommit
+
+To run the OpenCL precommit validation tests:
+
+	LD_LIBRARY_PATH=. ./arm_compute_validation --mode=precommit --filter="^CL.*"
+
+To run the Arm® Neon™ precommit benchmark tests with the PMU and wall clock timer (in milliseconds) instruments enabled:
+
+	LD_LIBRARY_PATH=. ./arm_compute_benchmark --mode=precommit --filter="^NEON.*" --instruments="pmu,wall_clock_timer_ms" --iterations=10
+
+To run the OpenCL precommit benchmark tests with OpenCL kernel timers (in milliseconds) enabled:
+
+	LD_LIBRARY_PATH=. ./arm_compute_benchmark --mode=precommit --filter="^CL.*" --instruments="opencl_timer_ms" --iterations=10
+
+@note You might also need to add the path to the OpenCL library to your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.
+*/
+} // namespace test
+} // namespace arm_compute