From d813bab10bb4fe954fa0e962e1402ed1377617da Mon Sep 17 00:00:00 2001
From: Sheri Zhang <sheri.zhang@arm.com>
Date: Fri, 30 Apr 2021 16:53:41 +0100
Subject: Restructure documentation

The documentation has been restructured
for better grouping and readability.

Resolves: COMPMID-4198

Signed-off-by: Sheri Zhang <sheri.zhang@arm.com>
Change-Id: I8c8bc77f0aab8d63f1659f2235dbab634422a68c
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/5568
Tested-by: Georgios Pinitas <georgios.pinitas@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
---
 docs/00_introduction.dox                           | 2028 --------------------
 docs/01_library.dox                                |  571 ------
 docs/02_tests.dox                                  |  385 ----
 docs/04_adding_operator.dox                        |  332 ----
 docs/05_contribution_guidelines.dox                |  452 -----
 docs/07_errata.dox                                 |   76 -
 docs/08_api.dox                                    |  135 --
 docs/Doxyfile                                      |   26 +-
 docs/DoxygenLayout.xml                             |  212 ++
 docs/contributor_guide/adding_operator.dox         |  334 ++++
 docs/contributor_guide/contribution_guidelines.dox |  452 +++++
 docs/contributor_guide/implementation_topics.dox   |  143 ++
 docs/user_guide/advanced.dox                       |  114 ++
 docs/user_guide/api.dox                            |  135 ++
 docs/user_guide/data_layout.dox                    |   41 +
 docs/user_guide/data_type.dox                      |   47 +
 docs/user_guide/errata.dox                         |   76 +
 docs/user_guide/how_to_build_and_run_examples.dox  |  541 ++++++
 docs/user_guide/introduction.dox                   |   74 +
 docs/user_guide/library.dox                        |  402 ++++
 docs/user_guide/programming_model.dox              |   70 +
 docs/user_guide/release_version_and_change_log.dox | 1389 ++++++++++++++
 docs/user_guide/tests.dox                          |  385 ++++
 23 files changed, 4430 insertions(+), 3990 deletions(-)
 delete mode 100644 docs/00_introduction.dox
 delete mode 100644 docs/01_library.dox
 delete mode 100644 docs/02_tests.dox
 delete mode 100644 docs/04_adding_operator.dox
 delete mode 100644 docs/05_contribution_guidelines.dox
 delete mode 100644 docs/07_errata.dox
 delete mode 100644 docs/08_api.dox
 create mode 100644 docs/DoxygenLayout.xml
 create mode 100644 docs/contributor_guide/adding_operator.dox
 create mode 100644 docs/contributor_guide/contribution_guidelines.dox
 create mode 100644 docs/contributor_guide/implementation_topics.dox
 create mode 100644 docs/user_guide/advanced.dox
 create mode 100644 docs/user_guide/api.dox
 create mode 100644 docs/user_guide/data_layout.dox
 create mode 100644 docs/user_guide/data_type.dox
 create mode 100644 docs/user_guide/errata.dox
 create mode 100644 docs/user_guide/how_to_build_and_run_examples.dox
 create mode 100644 docs/user_guide/introduction.dox
 create mode 100644 docs/user_guide/library.dox
 create mode 100644 docs/user_guide/programming_model.dox
 create mode 100644 docs/user_guide/release_version_and_change_log.dox
 create mode 100644 docs/user_guide/tests.dox

diff --git a/docs/00_introduction.dox b/docs/00_introduction.dox
deleted file mode 100644
index 68533852e6..0000000000
--- a/docs/00_introduction.dox
+++ /dev/null
@@ -1,2028 +0,0 @@
-///
-/// Copyright (c) 2017-2021 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-/** @mainpage Introduction
-
-@tableofcontents
-
-The Compute Library is a collection of low-level machine learning functions optimized for both Arm CPUs and GPUs using SIMD technologies.
-
-Several builds of the library are available using various configurations:
- - OS: Linux, Android, macOS or bare metal.
- - Architecture: armv7a (32bit) or arm64-v8a (64bit).
- - Technology: Arm® Neon™ / OpenCL / Arm® Neon™ and OpenCL.
- - Debug / Asserts / Release: Use a build with asserts enabled to debug your application and enable extra validation. Once you are sure your application works as expected you can switch to a release build of the library for maximum performance.
-
-@section S0_1_contact Contact / Support
-
-Please create an issue on <a href="https://github.com/ARM-software/ComputeLibrary/issues">GitHub</a>.
-
-In order to facilitate the work of the support team please provide the build information of the library you are using. To get the version of the library you are using simply run:
-
-    $ strings android-armv7a-cl-asserts/libarm_compute.so | grep arm_compute_version
-    arm_compute_version=v16.12 Build options: {'embed_kernels': '1', 'opencl': '1', 'arch': 'armv7a', 'neon': '0', 'asserts': '1', 'debug': '0', 'os': 'android', 'Werror': '1'} Git hash=f51a545d4ea12a9059fe4e598a092f1fd06dc858
-
-@section S0_2_prebuilt_binaries Pre-built binaries
-
-For each release we provide pre-built binaries of the library [here](https://github.com/ARM-software/ComputeLibrary/releases).
-
-These binaries have been built using the following toolchains:
-            - Linux armv7a: gcc-linaro-7.2.1-2017.11-x86_64_arm-linux-gnueabihf
-            - Linux arm64-v8a: gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu
-            - Android armv7a: clang++ / libc++ NDK r18b
-            - Android arm64-v8a: clang++ / libc++ NDK r20b
-
-@warning Make sure to use a compatible toolchain to build your application or you will get some std::bad_alloc errors at runtime.
-
-@section S1_file_organisation File organisation
-
-This archive contains:
- - The arm_compute header and source files
- - The latest Khronos OpenCL 1.2 C headers from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a>
- - The latest Khronos cl2.hpp from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a> (API version 2.1 when this document was written)
- - The latest Khronos EGL 1.5 C headers from the <a href="https://www.khronos.org/registry/gles/">Khronos EGL registry</a>
- - The sources for a stub version of libOpenCL.so, libGLESv1_CM.so, libGLESv2.so and libEGL.so to help you build your application.
- - An examples folder containing a few examples to compile and link against the library.
- - A @ref utils folder containing headers with some boilerplate code used by the examples.
- - This documentation.
-
- For detailed information about file organization, please refer to the Files -> File List section of this documentation.
-
-@section S2_versions_changelog Release versions and changelog
-
-@subsection S2_1_versions Release versions
-
-All releases are numbered vYY.MM, where YY is the last two digits of the year and MM is the month number.
-If there is more than one release in a month then an extra sequential number is appended at the end:
-
-	v17.03 (First release of March 2017)
-	v17.03.1 (Second release of March 2017)
-	v17.04 (First release of April 2017)
-
-@note We're aiming at releasing one major public release with new features per quarter. All releases in between will only contain bug fixes.
-
-@subsection S2_2_changelog Changelog
-
-v21.05 Public major release
- - Removed computer vision support from Arm® Neon™ backend
- - Removed the following functions:
-   - NEAbsoluteDifference
-   - NEAccumulate
-   - NEBox3x3
-   - NECannyEdge
-   - NEChannelCombine
-   - NEChannelExtract
-   - NEColorConvert
-   - NEConvolution
-   - NEDerivative
-   - NEDilate
-   - NEEqualizeHistogram
-   - NEErode
-   - NEFastCorners
-   - NEGaussian3x3
-   - NEGaussian5x5
-   - NEGaussianPyramid
-   - NEHOGDescriptor
-   - NEHOGDetector
-   - NEHOGGradient
-   - NEHOGMultiDetection
-   - NEHarrisCorners
-   - NEHistogram
-   - NEIntegralImage
-   - NELaplacianPyramid
-   - NELaplacianReconstruct
-   - NEMagnitude
-   - NEMeanStdDev
-   - NEMedian3x3
-   - NEMinMaxLocation
-   - NENonLinearFilter
-   - NEOpticalFlow
-   - NEPhase
-   - NEScharr3x3
-   - NESobel3x3
-   - NESobel5x5
-   - NESobel7x7
-   - NETableLookup
-   - NEThreshold
-   - NEWarpAffine
-   - NEWarpPerspectiveKernel
-
- - Remove all GLES kernels / functions / tests / examples
- - Removed computer vision support from CL backend
- - Removed the following functions:
-   - CLAbsoluteDifference
-   - CLAccumulate
-   - CLBox3x3
-   - CLCannyEdge
-   - CLChannelCombine
-   - CLChannelExtract
-   - CLColorConvert
-   - CLConvolution
-   - CLDerivative
-   - CLDilate
-   - CLEqualizeHistogram
-   - CLErode
-   - CLFastCorners
-   - CLGaussian3x3
-   - CLGaussian5x5
-   - CLGaussianPyramid
-   - CLHOGDescriptor
-   - CLHOGDetector
-   - CLHOGGradient
-   - CLHOGMultiDetection
-   - CLHarrisCorners
-   - CLHistogram
-   - CLIntegralImage
-   - CLLaplacianPyramid
-   - CLLaplacianReconstruct
-   - CLMagnitude
-   - CLMeanStdDev
-   - CLMedian3x3
-   - CLMinMaxLocation
-   - CLNonLinearFilter
-   - CLOpticalFlow
-   - CLPhase
-   - CLScharr3x3
-   - CLSobel3x3
-   - CLSobel5x5
-   - CLSobel7x7
-   - CLTableLookup
-   - CLThreshold
-   - CLWarpAffine
-   - CLWarpPerspective
- 
-v21.02 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Upgrade C++ standard to C++14
- - Add macOS support
- - Add Armv8-R AArch64 architecture support
- - Add SVE/SVE2 support for:
-   - NEScaleKernel
-   - @ref NEActivationLayer
-   - @ref NEArithmeticAddition
-   - @ref NEBatchNormalizationLayerKernel
-   - @ref cpu::kernels::CpuLogits1DSoftmaxKernel
-   - @ref cpu::kernels::CpuLogits1DMaxKernel
-   - @ref cpu::kernels::CpuElementwiseUnaryKernel
- - Remove padding from OpenCL kernels:
-   - CLDirectConvolutionLayerKernel
-   - @ref CLArgMinMaxLayerKernel
-   - @ref CLPadLayerKernel
-   - @ref CLROIAlignLayerKernel
-   - @ref CLRangeKernel
-   - CLScaleKernel
-   - @ref CLSelectKernel
-   - @ref CLBitwiseKernel
-   - @ref opencl::kernels::ClFloorKernel
-   - CLTransposeKernel
- - Deprecate functions in CLTuner:
-    - add_lws_to_table
-    - import_lws_table
-    - lws_table
- - Remove functions:
-   - NELocallyConnectedLayer / CLLocallyConnectedLayer
-   - NEIm2Col
-   - NECol2Im
-   - NEGEMMInterleave4x4
-   - NEGEMMTranspose1xW
-   - NEComputeAllAnchors / CLComputeAllAnchors
-   - NEGEMMAssemblyDispatch
-   - NEUpsampleLayer / CLUpsampleLayer
- - Remove kernels:
-   - NEGEMMMatrixVectorMultiplyKernel
-   - NELocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedMatrixMultiplyKernel
-   - NEUpsampleLayerKernel / CLUpsampleLayerKernel
- - Extend OpenCL tuner with workgroup batch size support
-   - Experimental extension for the OpenCL tuner to tune the batches of work groups distributed to compute units
- - Add functionality to load the OpenCL GEMM heuristics at runtime
-   - The GEMM heuristic file (MLGO) can be used to update the default GEMM heuristics available for OpenCL
- - Note: there might be performance regressions against v20.08 in Inception v3 using int8 data types on Arm Mali-G77 GPUs. Currently under investigation
- - Note: data-type decoupling is in progress and experimental. Warnings about unused symbols might be raised
-
-v20.11 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Performance regressions can be noted when executing Depthwise Convolution on Arm® Neon™ with a depth multiplier > 1 for quantized data type.
-   This is planned to be resolved in 21.02 release.
- - Added new data type QASYMM8_SIGNED support for @ref NEROIAlignLayer.
- - Added new data type S32 support for:
-   - NEArithmeticSubtraction
-   - NEArithmeticSubtractionKernel
-   - @ref NEPixelWiseMultiplication
-   - NEPixelWiseMultiplicationKernel
-   - NEElementwiseDivision
-   - NEDivisionOperationKernel
- - Interface change
-   - Properly support softmax axis to have the same meaning as other major frameworks. That is, axis now defines the dimension
-     on which Softmax/Logsoftmax is performed. E.g. for input of shape 4x5x6 and axis=1, softmax will be applied to 4x6=24 vectors of size 5.
-     The supported value range of axis is [-rank, rank).
-     This change applies to the following functions:
-      - @ref NESoftmaxLayer
-      - @ref NELogSoftmaxLayer
-      - @ref CLSoftmaxLayer
-      - @ref CLLogSoftmaxLayer
-      - GCSoftmaxLayer
- - New OpenCL kernels / functions:
-   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
-   - @ref CLLogicalNot
-   - @ref CLLogicalAnd
-   - @ref CLLogicalOr
- - New Arm® Neon™ kernels / functions:
-   - @ref NELogicalNot
-   - @ref NELogicalAnd
-   - @ref NELogicalOr
- - Removed padding from Arm® Neon™ kernels:
-   - NEComplexPixelWiseMultiplicationKernel
-   - NENonMaximaSuppression3x3Kernel
-   - @ref NERemapKernel
-   - @ref NEGEMMInterleave4x4Kernel
-   - NEDirectConvolutionLayerKernel
-   - NEScaleKernel
-   - NELocallyConnectedMatrixMultiplyKernel
-   - @ref NEGEMMLowpOffsetContributionKernel
-   - @ref NEGEMMTranspose1xWKernel
-   - NEPoolingLayerKernel
-   - NEConvolutionKernel
-   - NEDepthwiseConvolutionLayerNativeKernel
-   - @ref NEGEMMLowpMatrixMultiplyKernel
-   - @ref NEGEMMMatrixMultiplyKernel
-   - NEDirectConvolutionLayerOutputStageKernel
-   - @ref NEReductionOperationKernel
-   - @ref NEGEMMLowpMatrixAReductionKernel
-   - @ref NEGEMMLowpMatrixBReductionKernel
- - Removed padding from OpenCL kernels:
-   - CLBatchConcatenateLayerKernel
-   - CLElementwiseOperationKernel
-   - @ref CLBatchNormalizationLayerKernel
-   - CLPoolingLayerKernel
-   - @ref CLWinogradInputTransformKernel
-   - @ref CLGEMMLowpMatrixMultiplyNativeKernel
-   - @ref CLGEMMLowpMatrixAReductionKernel
-   - @ref CLGEMMLowpMatrixBReductionKernel
-   - @ref CLGEMMLowpOffsetContributionOutputStageKernel
-   - @ref CLGEMMLowpOffsetContributionKernel
-   - @ref CLWinogradOutputTransformKernel
-   - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
-   - @ref CLFuseBatchNormalizationKernel
-   - @ref CLDepthwiseConvolutionLayerNativeKernel
-   - @ref CLDepthConvertLayerKernel
-   - CLCopyKernel
-   - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
-   - CLActivationLayerKernel
-   - @ref CLWinogradFilterTransformKernel
-   - CLWidthConcatenateLayerKernel
-   - CLWidthConcatenate4TensorsKernel
-   - CLWidthConcatenate2TensorsKernel
-   - CLLogits1DMaxShiftExpSumKernel
-   - CLLogits1DNormKernel
-   - CLHeightConcatenateLayerKernel
-   - @ref CLGEMMMatrixMultiplyKernel
-   - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
-   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
-   - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
-   - CLDepthConcatenateLayerKernel
-   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
- - Removed OpenCL kernels / functions:
-   - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
-   - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel
-   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel
- - Deprecated OpenCL kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
-     - CLLocallyConnectedLayer
-     - CLLocallyConnectedMatrixMultiplyKernel
-     - CLAbsoluteDifference
-     - CLAbsoluteDifferenceKernel
-     - CLAccumulate
-     - CLAccumulateKernel
-     - CLAccumulateSquared
-     - CLAccumulateSquaredKernel
-     - CLAccumulateWeighted
-     - CLAccumulateWeightedKernel
-     - CLAccumulateWeightedFP16Kernel
-     - CLBox3x3
-     - CLBox3x3Kernel
-     - CLBox3x3FP16Kernel
-     - CLCannyEdge
-     - CLChannelCombine
-     - CLChannelCombineKernel
-     - CLChannelExtract
-     - CLChannelExtractKernel
-     - CLColorConvert
-     - CLColorConvertKernel
-     - CLConvolution3x3
-     - CLConvolutionRectangle
-     - CLConvolutionRectangleKernel
-     - CLConvolutionSquare
-     - CLConvolutionKernel
-     - CLDerivative
-     - CLDerivativeKernel
-     - CLDilate
-     - CLDilateKernel
-     - CLEqualizeHistogram
-     - CLErode
-     - CLErodeKernel
-     - CLFastCorners
-     - CLFastCornersKernel
-     - CLGaussian3x3
-     - CLGaussian3x3Kernel
-     - CLGaussian5x5
-     - CLGaussian5x5HorKernel
-     - CLGaussian5x5VertKernel
-     - CLGaussianPyramid
-     - CLGaussianPyramidHalf
-     - CLGaussianPyramidOrb
-     - CLHarrisCorners
-     - CLHarrisScoreKernel
-     - CLHarrisScoreFP16Kernel
-     - CLHistogram
-     - CLHistogramKernel
-     - CLHOGOrientationBinningKernel
-     - CLHOGBlockNormalizationKernel
-     - CLHOGDetectorKernel
-     - CLHOGNonMaximaSuppressionKernel
-     - CLHOGDescriptor
-     - CLHOGDetector
-     - CLHOGGradient
-     - CLHOGMultiDetection
-     - CLHOGOrientationBinningKernel
-     - CLHOGBlockNormalizationKernel
-     - CLHOGDetectorKernel
-     - CLIntegralImage
-     - CLIntegralImageKernel
-     - CLLaplacianReconstruct
-     - CLLaplacianPyramid
-     - CLMagnitude
-     - CLMagnitudePhaseKernel
-     - CLMedian3x3
-     - CLMedian3x3Kernel
-     - CLMinMaxLocation
-     - CLMinMaxLocationKernel
-     - CLNonLinearFilter
-     - CLNonLinearFilterKernel
-     - CLNonMaximaSuppression3x3
-     - CLNonMaximaSuppression3x3FP16Kernel
-     - CLNonMaximaSuppression3x3Kernel
-     - CLOpticalFlow
-     - CLPhase
-     - CLRemap
-     - CLRemapKernel
-     - CLScharr3x3
-     - CLScharr3x3Kernel
-     - CLSobel3x3
-     - CLSobel3x3Kernel
-     - CLSobel5x5
-     - CLSobel5x5HorKernel
-     - CLSobel5x5VertKernel
-     - CLSobel7x7
-     - CLSobel7x7HorKernel
-     - CLSobel7x7VertKernel
-     - CLThreshold
-     - CLThresholdKernel
-     - CLWarpAffine
-     - CLWarpAffineKernel
-     - CLWarpPerspective
-     - CLWarpPerspectiveKernel
- - Deprecated Arm® Neon™ kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
-     - NELocallyConnectedLayer
-     - NELocallyConnectedMatrixMultiplyKernel
-     - NEAbsoluteDifference
-     - NEAbsoluteDifferenceKernel
-     - NEAccumulate
-     - NEAccumulateKernel
-     - NEAccumulateSquared
-     - NEAccumulateSquaredKernel
-     - NEAccumulateWeighted
-     - NEAccumulateWeightedKernel
-     - NEAccumulateWeightedFP16Kernel
-     - NEBox3x3
-     - NEBox3x3Kernel
-     - NEBox3x3FP16Kernel
-     - NECannyEdge
-     - NEChannelCombine
-     - NEChannelCombineKernel
-     - NEChannelExtract
-     - NEChannelExtractKernel
-     - NEColorConvert
-     - NEColorConvertKernel
-     - NEConvolution3x3
-     - NEConvolutionRectangle
-     - NEConvolutionRectangleKernel
-     - NEConvolutionSquare
-     - NEConvolutionKernel
-     - NEDerivative
-     - NEDerivativeKernel
-     - NEDilate
-     - NEDilateKernel
-     - NEEqualizeHistogram
-     - NEErode
-     - NEErodeKernel
-     - NEFastCorners
-     - NEFastCornersKernel
-     - NEGaussian3x3
-     - NEGaussian3x3Kernel
-     - NEGaussian5x5
-     - NEGaussian5x5HorKernel
-     - NEGaussian5x5VertKernel
-     - NEGaussianPyramid
-     - NEGaussianPyramidHalf
-     - NEGaussianPyramidOrb
-     - NEHarrisCorners
-     - NEHarrisScoreKernel
-     - NEHarrisScoreFP16Kernel
-     - NEHistogram
-     - NEHistogramKernel
-     - NEHOGOrientationBinningKernel
-     - NEHOGBlockNormalizationKernel
-     - NEHOGDetectorKernel
-     - NEHOGNonMaximaSuppressionKernel
-     - NEHOGDescriptor
-     - NEHOGDetector
-     - NEHOGGradient
-     - NEHOGMultiDetection
-     - NEHOGOrientationBinningKernel
-     - NEHOGBlockNormalizationKernel
-     - NEHOGDetectorKernel
-     - NEIntegralImage
-     - NEIntegralImageKernel
-     - NELaplacianReconstruct
-     - NELaplacianPyramid
-     - NEMagnitude
-     - NEMagnitudePhaseKernel
-     - NEMedian3x3
-     - NEMedian3x3Kernel
-     - NEMinMaxLocation
-     - NEMinMaxLocationKernel
-     - NENonLinearFilter
-     - NENonLinearFilterKernel
-     - NENonMaximaSuppression3x3
-     - NENonMaximaSuppression3x3FP16Kernel
-     - NENonMaximaSuppression3x3Kernel
-     - NEOpticalFlow
-     - NEPhase
-     - NERemap
-     - NERemapKernel
-     - NEScharr3x3
-     - NEScharr3x3Kernel
-     - NESobel3x3
-     - NESobel3x3Kernel
-     - NESobel5x5
-     - NESobel5x5HorKernel
-     - NESobel5x5VertKernel
-     - NESobel7x7
-     - NESobel7x7HorKernel
-     - NESobel7x7VertKernel
-     - NEThreshold
-     - NEThresholdKernel
-     - NEWarpAffine
-     - NEWarpAffineKernel
-     - NEWarpPerspective
-     - NEWarpPerspectiveKernel
- - Deprecated GLES kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
-     - GCAbsoluteDifference
-     - GCActivationLayer
-     - GCArithmeticAddition
-     - GCBatchNormalizationLayer
-     - GCConcatenateLayer
-     - GCConvolutionLayer
-     - GCDepthwiseConvolutionLayer
-     - GCDirectConvolutionLayer
-     - GCDropoutLayer
-     - GCFillBorder
-     - GCFullyConnectedLayer
-     - GCGEMM
-     - GCGEMMInterleave4x4
-     - GCGEMMTranspose1xW
-     - GCNormalizationLayer
-     - GCNormalizePlanarYUVLayer
-     - GCPixelWiseMultiplication
-     - GCPoolingLayer
-     - GCScale
-     - GCSoftmaxLayer
-     - GCTensorShift
-     - GCTranspose
-
-
-v20.08 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Added new data type QASYMM8_SIGNED support for:
-   - @ref CLArgMinMaxLayer
-   - @ref CLArgMinMaxLayerKernel
- - Added new data type U8 support for:
-   - @ref NECropKernel
-   - CLCropKernel
- - Added align_corners support for nearest neighbor interpolation in:
-   - NEScaleKernel
-   - CLScaleKernel
- - New OpenCL kernels / functions:
-   - @ref CLMaxUnpoolingLayerKernel
- - New Arm® Neon™ kernels / functions:
-   - @ref NEMaxUnpoolingLayerKernel
- - New graph example:
-   - graph_yolov3_output_detector
- - GEMMTuner improvements:
-   - Added fp16 support
-   - Output json files for easier integration
-   - Enabled tuning for export_to_cl_image_rhs option for RHS tensors
-   - More robust script for running benchmarks
- - Removed padding from:
-   - NEPixelWiseMultiplicationKernel
-   - NEHeightConcatenateLayerKernel
-   - NEThresholdKernel
-   - NEBatchConcatenateLayerKernel
-   - NETransposeKernel
-   - @ref NEBatchNormalizationLayerKernel
-   - NEArithmeticSubtractionKernel
-   - @ref NEBoundingBoxTransformKernel
-   - NELogits1DMaxKernel
-   - NELogits1DSoftmaxKernel
-   - @ref NEROIPoolingLayerKernel
-   - @ref NEROIAlignLayerKernel
-   - NEYOLOLayerKernel
-   - NEUpsampleLayerKernel
-   - NEFloorKernel
-   - NEWidthConcatenateLayerKernel
-   - NEDepthConcatenateLayerKernel
-   - @ref NENormalizationLayerKernel
-   - @ref NEL2NormalizeLayerKernel
-   - NEFillArrayKernel
-   - @ref NEDepthConvertLayerKernel
-   - @ref NERangeKernel
-   - @ref NEPriorBoxLayer
- - Removed OpenCL kernels / functions:
-   - CLGEMMLowpQuantizeDownInt32ToUint8Scale
-   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
- - Removed Arm® Neon™ kernels / functions:
-   - NEGEMMLowpQuantizeDownInt32ToUint8Scale
-   - NEGEMMMatrixAccumulateBiasesKernel
- - Deprecated functions / interfaces:
-   - Non-descriptor based interfaces for NEThreshold, CLThreshold
-   - Non-descriptor based interfaces for @ref NEScale, @ref CLScale and GCScale
-   - In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and GCSoftmaxLayer :
-      The default "axis" value for @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and GCSoftmaxLayer is changed from 1 to 0.
-      Only axis 0 is supported.
-      The default "axis" value for @ref NESoftmaxLayer, @ref NELogSoftmaxLayer is changed from 1 to 0.
-      Only axis 0 is supported.
- - The support for quantized data types has been removed from @ref CLLogSoftmaxLayer due to implementation complexity.
- - Removed padding requirement for the input (e.g. LHS of GEMM) and output in @ref CLGEMMMatrixMultiplyNativeKernel, @ref CLGEMMMatrixMultiplyReshapedKernel, @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and @ref CLIm2ColKernel (NHWC only)
-   - This change allows @ref CLGEMMConvolutionLayer to be used without extra padding for the input and output.
-   - Only the weights/bias of @ref CLGEMMConvolutionLayer could require padding for the computation.
-   - Only on Arm® Mali™ Midgard GPUs, @ref CLGEMMConvolutionLayer could require padding since @ref CLGEMMMatrixMultiplyKernel is called and currently requires padding.
- - Added support for exporting the OpenCL buffer object to the OpenCL image object in @ref CLGEMMMatrixMultiplyReshapedKernel and @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
-   - This support allows exporting the OpenCL buffer used for the reshaped RHS matrix to the OpenCL image object.
-   - The padding requirement for the OpenCL image object is taken into account in @ref CLGEMMReshapeRHSMatrixKernel.
-   - The reshaped RHS matrix stores the weights when GEMM is used to accelerate @ref CLGEMMConvolutionLayer.
-
-v20.05 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Updated recommended NDK version to r18b.
- - Updated recommended gcc version to Linaro 6.3.1.
- - Added Bfloat16 type support
- - Added Bfloat16 support in:
-     - @ref NEWeightsReshapeKernel
-     - @ref NEConvolutionLayerReshapeWeights
-     - @ref NEIm2ColKernel
-     - NEIm2Col
-     - @ref NEDepthConvertLayerKernel
-     - @ref NEDepthConvertLayer
-     - @ref NEGEMMConvolutionLayer
-     - NEGEMMAssemblyDispatch
- - Added new data type QASYMM8_SIGNED support for:
-     - @ref CLDirectConvolutionLayer
-     - @ref CLDeconvolutionLayer
-     - @ref CLDirectDeconvolutionLayer
-     - @ref CLGEMMDeconvolutionLayer
-     - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
-     - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
-     - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
-     - @ref CLReductionOperation
-     - @ref CLReduceMean
-     - @ref NEScale
-     - NEScaleKernel
-     - NEUpsampleLayer
-     - @ref NECast
-     - @ref NEReductionOperation
-     - @ref NEReduceMean
-     - @ref NEArgMinMaxLayer
-     - @ref NEDeconvolutionLayer
-     - @ref NEGEMMLowpQuantizeDownInt32ScaleKernel
-     - @ref CPPBoxWithNonMaximaSuppressionLimit
-     - @ref CPPDetectionPostProcessLayer
-     - @ref CPPPermuteKernel
-     - @ref CPPPermute
-     - @ref CPPTopKVKernel
-     - @ref CPPTopKV
-     - @ref CPPUpsample
-     - @ref CPPUpsampleKernel
- - New OpenCL kernels / functions:
-     - @ref CLQLSTMLayer
-     - @ref CLQLSTMLayerNormalizationKernel
- - New Arm® Neon™ kernels / functions:
-     - @ref NEQLSTMLayer
-     - @ref NEQLSTMLayerNormalizationKernel
- - Added HARD_SWISH support in:
-     - CLActivationLayerKernel
-     - NEActivationLayerKernel
- - Deprecated OpenCL kernels / functions:
-     - CLGEMMLowpQuantizeDownInt32ToUint8Scale
-     - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
- - Deprecated Arm® Neon™ kernels / functions:
-     - NEGEMMLowpQuantizeDownInt32ToUint8Scale
- - Removed CPP kernels / functions:
-     - CPPFlipWeightsKernel
- - Removed PoolingLayerInfo constructors without Data Layout.
- - Removed CLDepthwiseConvolutionLayer3x3
- - Removed NEDepthwiseConvolutionLayerOptimized
- - Added support for Winograd 3x3,4x4 on Arm® Neon™ FP16:
-     - @ref NEWinogradConvolutionLayer
-     - @ref NEWinogradLayerTransformInputKernel
-     - @ref NEWinogradLayerTransformOutputKernel
-     - @ref NEWinogradLayerTransformWeightsKernel
- - Added CLCompileContext
- - Added Arm® Neon™ GEMM kernel with 2D window support
-
-v20.02.1 Maintenance release
- - Added Android-NN build script.
-
-v20.02 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Added new data type QASYMM8_SIGNED support for:
-     - @ref CLDepthwiseConvolutionLayer
-     - CLDepthwiseConvolutionLayer3x3
-     - @ref CLGEMMConvolutionLayer
-     - @ref CLGEMMLowpMatrixMultiplyCore
-     - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
-     - @ref CLGEMMLowpMatrixMultiplyNativeKernel
-     - @ref NEActivationLayer
-     - NEComparisonOperationKernel
-     - @ref NEConvolutionLayer
-     - @ref NEDepthwiseConvolutionLayer
-     - NEDepthwiseConvolutionLayer3x3Kernel
-     - NEDirectConvolutionLayerOutputStageKernel
-     - @ref NEElementwiseComparison
-     - @ref NEElementwiseMax
-     - @ref NEElementwiseMin
-     - @ref NEElementwiseSquaredDiff
-     - @ref NEFullyConnectedLayer
-     - NEGEMMMatrixVectorMultiplyKernel
-     - @ref NEPixelWiseMultiplication
-     - @ref NEPoolingLayer
-     - @ref NEPReluLayer
- - Added support for QSYMM8_PER_CHANNEL in:
-     - NEDepthwiseConvolutionLayer3x3Kernel
- - Added support for split sizes in:
-     - @ref CLSplit
-     - @ref NESplit
- - New OpenCL kernels / functions:
-     - @ref CLFill
-     - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
- - New Arm® Neon™ kernels / functions:
-     - @ref NEFill
-     - @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
- - Deprecated functions / interfaces:
-     - CLDepthwiseConvolutionLayer3x3
-     - NEDepthwiseConvolutionLayerOptimized
-     - PoolingLayerInfo constructors without Data Layout.
- - Added support for quantization with multiplier greater than 1 on Arm® Neon™ and CL.
- - Added support for quantized inputs of type QASYMM8_SIGNED and QASYMM8 to @ref CLQuantizationLayer.
- - Added the ability to build bootcode for bare metal.
- - Added support for generating synthetic QASYMM8 graphs.
- - Added support for F16 datatype in VGG16.
- - Removed pre-built binaries for GLES.
-
-v19.11.1 Public maintenance release
- - Fix offset calculation in NEReductionOperationKernel.
- - Fix data layout in NEScaleKernel for nhwc.
- - Retain configuration step data layout to avoid side-effects.
- - Perform sqrt in double domain for L2 pooling.
- - Fix output shape calculation for Reduce Mean
- - Restrict cases where optimized NEPadLayer runs.
-
-v19.11 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Updated recommended NDK version to r17c.
- - Deprecated OpenCL kernels / functions:
-    - CLDepthwiseConvolutionLayerReshapeWeightsGenericKernel
-    - CLDepthwiseIm2ColKernel
-    - CLDepthwiseSeparableConvolutionLayer
-    - CLDepthwiseVectorToTensorKernel
-    - CLDirectConvolutionLayerOutputStageKernel
- - Deprecated Arm® Neon™ kernels / functions:
-    - NEDepthwiseWeightsReshapeKernel
-    - NEDepthwiseIm2ColKernel
-    - NEDepthwiseSeparableConvolutionLayer
-    - NEDepthwiseVectorToTensorKernel
-    - NEDepthwiseConvolutionLayer3x3
- - New OpenCL kernels / functions:
-    - @ref CLInstanceNormalizationLayerKernel / @ref CLInstanceNormalizationLayer
-    - @ref CLDepthwiseConvolutionLayerNativeKernel to replace the old generic depthwise convolution (see Deprecated
-      OpenCL kernels / functions)
-    - @ref CLLogSoftmaxLayer
- - New Arm® Neon™ kernels / functions:
-    - @ref NEBoundingBoxTransformKernel / @ref NEBoundingBoxTransform
-    - @ref NEComputeAllAnchorsKernel / NEComputeAllAnchors
-    - @ref NEDetectionPostProcessLayer
-    - @ref NEGenerateProposalsLayer
-    - @ref NEInstanceNormalizationLayerKernel / @ref NEInstanceNormalizationLayer
-    - @ref NELogSoftmaxLayer
-    - @ref NEROIAlignLayerKernel / @ref NEROIAlignLayer
- - Added QASYMM8 support for:
-    - @ref CLGenerateProposalsLayer
-    - @ref CLROIAlignLayer
-    - @ref CPPBoxWithNonMaximaSuppressionLimit
- - Added QASYMM16 support for:
-    - @ref CLBoundingBoxTransform
- - Added FP16 support for:
-    - @ref CLGEMMMatrixMultiplyReshapedKernel
- - Added new data type QASYMM8_PER_CHANNEL support for:
-    - CLDequantizationLayer
-    - @ref NEDequantizationLayer
- - Added new data type QSYMM8_PER_CHANNEL support for:
-    - @ref CLConvolutionLayer
-    - @ref NEConvolutionLayer
-    - @ref CLDepthwiseConvolutionLayer
-    - @ref NEDepthwiseConvolutionLayer
- - Added FP16 mixed-precision support for:
-    - @ref CLGEMMMatrixMultiplyReshapedKernel
-    - CLPoolingLayerKernel
- - Added FP32 and FP16 ELU activation for:
-    - @ref CLActivationLayer
-    - @ref NEActivationLayer
- - Added asymmetric padding support for:
-    - @ref CLDirectDeconvolutionLayer
-    - @ref CLGEMMDeconvolutionLayer
-    - @ref NEDeconvolutionLayer
- - Added SYMMETRIC and REFLECT modes for @ref CLPadLayerKernel / @ref CLPadLayer.
- - Replaced the calls to NECopyKernel and NEMemsetKernel with @ref NEPadLayer in @ref NEGenerateProposalsLayer.
- - Replaced the calls to CLCopyKernel and CLMemsetKernel with @ref CLPadLayer in @ref CLGenerateProposalsLayer.
- - Improved performance for CL Inception V3 - FP16.
- - Improved accuracy for CL Inception V3 - FP16 by enabling FP32 accumulator (mixed-precision).
- - Improved Arm® Neon™ performance by enabling fusing batch normalization with convolution and depth-wise convolution layer.
- - Improved Arm® Neon™ performance for MobileNet-SSD by improving the output detection performance.
- - Optimized @ref CLPadLayer.
- - Optimized CL generic depthwise convolution layer by introducing @ref CLDepthwiseConvolutionLayerNativeKernel.
- - Reduced memory consumption by implementing weights sharing.
-
-v19.08.1 Public maintenance release
- - Fix offset calculation in NEReductionOperationKernel.
- - Fix data layout in NEScaleKernel for nhwc.
- - Retain configuration step data layout to avoid side-effects.
- - Perform sqrt in double domain for L2 pooling.
- - Fix output shape calculation for Reduce Mean
- - Fix broadcast CLPixelwiseMultiplication with 5D tensors
-
-v19.08 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Deprecated Arm® Neon™ functions
-    - NEDepthConcatenateLayer
-    - NEWidthConcatenateLayer
- - Deprecated OpenCL kernels / functions
-    - CLDepthConcatenateLayer
-    - CLGEMMInterleave4x4Kernel / CLGEMMInterleave4x4
-    - CLGEMMTranspose1xWKernel / CLGEMMTranspose1xW
-    - CLWidthConcatenateLayer
- - New Arm® Neon™ kernels / functions:
-    - @ref NEAbsLayer
-    - @ref NECast
-    - @ref NEElementwisePower
-    - @ref NELogLayer
-    - @ref NELSTMLayerQuantized
-    - @ref NENegLayer
-    - @ref NEPReluLayer
-    - @ref NESinLayer
-    - NEBatchConcatenateLayerKernel
-    - @ref NEDepthToSpaceLayerKernel / @ref NEDepthToSpaceLayer
-    - NEDepthwiseConvolutionLayerNativeKernel
-    - @ref NEGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
-    - @ref NEMeanStdDevNormalizationKernel / @ref NEMeanStdDevNormalizationLayer
-    - @ref NESpaceToDepthLayerKernel / @ref NESpaceToDepthLayer
- - New OpenCL kernels / functions:
-    - @ref CLAbsLayer
-    - @ref CLElementwisePower
-    - @ref CLLogLayer
-    - @ref CLLSTMLayerQuantized
-    - @ref CLNegLayer
-    - @ref CLPReluLayer
-    - @ref CLSinLayer
-    - CLBatchConcatenateLayerKernel
-    - @ref CLDepthToSpaceLayerKernel / @ref CLDepthToSpaceLayer
-    - @ref CLGEMMLowpMatrixMultiplyNativeKernel
-    - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
-    - @ref CLGEMMMatrixMultiplyNativeKernel
-    - CLMeanStdDevNormalizationKernel / CLMeanStdDevNormalizationLayer
-    - @ref CLSpaceToDepthLayerKernel / @ref CLSpaceToDepthLayer
- - New examples:
-    - neon_opticalflow
-    - cl_cache
-    - neon_permute
- - Added support for FP16 in @ref NEDeconvolutionLayer
- - Added support for FP16 in @ref CLDeconvolutionLayer
- - Added support for REDUCE_MIN and REDUCE_MAX in @ref ReductionOperation
- - Enabled the fusion of batch normalization with convolution and depthwise convolution layer for FP32 in the graph API (OpenCL only)
- - Added support for fusing activation function and broadcast addition with the matrix multiplication for FP32 (OpenCL only)
- - Re-factored the depthwise convolution layer kernel on Arm® Neon™ for generic cases
- - Added an optimized depthwise convolution layer kernel for 5x5 filters (Neon only)
- - Added support to enable OpenCL kernel cache. Added example showing how to load the prebuilt OpenCL kernels from a binary cache file
- - Altered @ref QuantizationInfo interface to support per-channel quantization.
- - The CLDepthwiseConvolutionLayer3x3 will be included by @ref CLDepthwiseConvolutionLayer to accommodate future optimizations.
- - The NEDepthwiseConvolutionLayerOptimized will be included by @ref NEDepthwiseConvolutionLayer to accommodate future optimizations.
- - Removed inner_border_right and inner_border_top parameters from @ref CLDeconvolutionLayer interface
- - Removed inner_border_right and inner_border_top parameters from @ref NEDeconvolutionLayer interface
- - Optimized the Arm® Neon™ assembly kernel for GEMMLowp. The new implementation fuses the output stage and quantization with the matrix multiplication kernel
-
-v19.05 Public major release
- - Various bug fixes.
- - Various optimisations.
- - New Arm® Neon™ kernels / functions:
-    - @ref NEBatchToSpaceLayerKernel / @ref NEBatchToSpaceLayer
-    - NEComplexPixelWiseMultiplicationKernel / @ref NEComplexPixelWiseMultiplication
-    - @ref NECropKernel / @ref NECropResize
-    - NEDepthwiseConvolutionAssemblyDispatch
-    - @ref NEFFTDigitReverseKernel
-    - @ref NEFFTRadixStageKernel
-    - @ref NEFFTScaleKernel
-    - @ref NEGEMMLowpOffsetContributionOutputStageKernel
-    - NEHeightConcatenateLayerKernel
-    - @ref NESpaceToBatchLayerKernel / @ref NESpaceToBatchLayer
-    - @ref NEFFT1D
-    - @ref NEFFT2D
-    - @ref NEFFTConvolutionLayer
- - New OpenCL kernels / functions:
-    - CLComplexPixelWiseMultiplicationKernel / @ref CLComplexPixelWiseMultiplication
-    - CLCropKernel / @ref CLCropResize
-    - @ref CLDeconvolutionReshapeOutputKernel
-    - @ref CLFFTDigitReverseKernel
-    - @ref CLFFTRadixStageKernel
-    - @ref CLFFTScaleKernel
-    - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
-    - @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
-    - CLHeightConcatenateLayerKernel
-    - @ref CLDirectDeconvolutionLayer
-    - @ref CLFFT1D
-    - @ref CLFFT2D
-    - @ref CLFFTConvolutionLayer
-    - @ref CLGEMMDeconvolutionLayer
- - New OpenGLES kernels / functions:
-    - GCConcatenateLayer
- - Deprecated functions/interfaces
-    - GCDepthConcatenateLayer
-    - NEWidthConcatenateLayer
-    - NEDepthConcatenateLayer
-    - CLWidthConcatenateLayer
-    - CLDepthConcatenateLayer
-    - CLGEMMInterleave4x4
-    - CLGEMMTranspose1xW
- - Support different quantization info in CLConcatLayer.
- - Add checks where different input/output quantization info is not supported.
- - Tensors have different quantization information.
- - Add FP16 support checks.
- - Fix output quantization in CLDepthwiseConv3x3 when activation is fused.
- - New graph examples:
-     - graph_convolution
-     - graph_fully_connected
-     - graph_depthwise_convolution
-     - Deepspeech v0.4.1
- - Add support for QASYMM8 in NEArithmeticSubtractionKernel.
- - Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
- - Add support for QASYMM8 NEDeconvolution.
- - Add support for DequantizationLayer for Neon/CL.
- - Add support for dilation in CLDepthwiseConvolution.
- - Fuse offset contribution with the output stage when we use NEGEMMLowpMatrixMultiplyCore.
- - Optimize CLDeconvolution.
- - Add StackLayer to the graph API.
- - Add support for "reflect" padding mode in NEPad.
- - Winograd 7x7 NHWC on OpenCL.
- - Rework CL ML layers to run exclusively on CL.
- - Support different quantization info in PoolingLayer.
- - Implement and test import memory interfaces.
- - Added new tests and removed old ones.
- - Various clang-tidy fixes.
-
-v19.02 Public major release
- - Various bug fixes.
- - Various optimisations.
- - New Arm® Neon™ kernels / functions:
-    - @ref NETileKernel / @ref NETile
-    - @ref NEFuseBatchNormalizationKernel / @ref NEFuseBatchNormalization
-    - NEElementwiseOperationKernel
-    - @ref NEElementwiseMax
-    - @ref NEElementwiseMin
-    - @ref NEElementwiseSquaredDiff
-    - @ref NESelectKernel / @ref NESelect
-    - @ref NESplit
-    - @ref NESlice
-    - @ref NEUnstack
-    - @ref NEStridedSliceKernel / @ref NEStridedSlice
-    - NEElementwiseUnaryKernel
-    - @ref NERsqrtLayer
-    - @ref NEExpLayer
-    - @ref NEReverseKernel / @ref NEReverse
-    - @ref NEArgMinMaxLayer
-    - @ref NEStackLayerKernel / @ref NEStackLayer
-    - @ref NERangeKernel / @ref NERange
-    - @ref NEPadLayer
-    - NEMemsetKernel
-    - @ref NEGatherKernel / @ref NEGather
-    - @ref NEElementwiseComparison
-    - @ref NEElementwiseComparisonStatic
-    - NEComparisonOperationKernel
-    - @ref NEElementwiseDivision
- - New OpenCL kernels / functions:
-    - @ref CLSelectKernel / @ref CLSelect
-    - @ref CLTileKernel / @ref CLTile
-    - @ref CLComparisonKernel / @ref CLComparison
-    - @ref CLArgMinMaxLayer
-    - @ref CLElementwiseMax
-    - @ref CLElementwiseMin
-    - @ref CLElementwiseSquaredDiff
-    - @ref CLStackLayerKernel / @ref CLStackLayer
-    - @ref CLReverse / @ref CLReverseKernel
-    - @ref CLRsqrtLayer
-    - @ref CLExpLayer
-    - CLElementWiseUnaryLayerKernel
-    - @ref CLGEMMReshapeLHSMatrixKernel
-    - @ref CLGEMMReshapeRHSMatrixKernel
-    - @ref CLGEMMMatrixMultiplyReshapedKernel
-    - @ref CLRangeKernel / @ref CLRange
-    - @ref CLUnstack
-    - @ref CLGatherKernel / @ref CLGather
-    - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
- - New CPP kernels / functions:
-    - @ref CPPDetectionOutputLayer
-    - @ref CPPTopKV / @ref CPPTopKVKernel
- - Added new examples:
-    - graph_ssd_mobilenet.cpp
-    - graph_mobilenet_v2.cpp
-    - graph_resnet12.cpp
-    - graph_srcnn955.cpp
-    - graph_vgg_vdsr.cpp
-    - graph_inception_resnet_v1.cpp
- - Added 4D tensor support to:
-    - @ref NESoftmaxLayer
- - Fused activation in @ref CLWinogradConvolutionLayer
- - Extended @ref NEPermute to support more cases
- - Added Neon/SVE GEMM Hybrid kernels
- - Added u8 and s8 hybrid assembly kernels
- - Introduced GEMM strategy name in NEGEMMAssemblyWrapper
- - Improved @ref CLTuner
- - Fused the bias addition within @ref CLGEMM
- - Added support for QASYMM8 LOGISTIC activation in @ref NEActivationLayer
- - Added NHWC data layout support to:
-    - @ref NEScale for F16
-    - @ref CLNormalizationLayer IN_MAP_2D for FP32/FP16
-    - @ref NEL2NormalizeLayer for FP32/FP16
-    - @ref NENormalizationLayer IN_MAP_2D for FP32/FP16
-    - @ref CLROIAlignLayer
-    - @ref CLGenerateProposalsLayer
- - Added QASYMM8 support to the following kernels:
-    - NEArithmeticAdditionKernel
-    - @ref NEScale
- - Added new tests and improved validation and benchmarking suites.
- - Deprecated functions/interfaces
-    - Usage of inner_border_right and inner_border_top has been deprecated in @ref CLDeconvolutionLayer and @ref NEDeconvolutionLayer
-
-v18.11 Public major release
- - Various bug fixes.
- - Various optimisations.
- - New Arm® Neon™ kernels / functions:
-    - @ref NEChannelShuffleLayer / @ref NEChannelShuffleLayerKernel
-    - @ref NEReduceMean
-    - @ref NEReorgLayer / @ref NEReorgLayerKernel
-    - @ref NEPriorBoxLayer / @ref NEPriorBoxLayerKernel
-    - NEUpsampleLayer / NEUpsampleLayerKernel
-    - NEYOLOLayer / NEYOLOLayerKernel
- - New OpenCL kernels / functions:
-    - @ref CLBatchToSpaceLayer / @ref CLBatchToSpaceLayerKernel
-    - @ref CLBoundingBoxTransform / @ref CLBoundingBoxTransformKernel
-    - @ref CLComputeAllAnchorsKernel
-    - @ref CLGenerateProposalsLayer
-    - @ref CLNormalizePlanarYUVLayer / @ref CLNormalizePlanarYUVLayerKernel
-    - @ref CLReorgLayer / @ref CLReorgLayerKernel
-    - @ref CLSpaceToBatchLayer / @ref CLSpaceToBatchLayerKernel
-    - @ref CLPadLayer
-    - @ref CLReduceMean
-    - @ref CLPriorBoxLayer / @ref CLPriorBoxLayerKernel
-    - @ref CLROIAlignLayer / @ref CLROIAlignLayerKernel
-    - @ref CLSlice
-    - @ref CLSplit
-    - @ref CLStridedSlice / @ref CLStridedSliceKernel
-    - CLUpsampleLayer / CLUpsampleLayerKernel
-    - CLYOLOLayer / CLYOLOLayerKernel
- - New CPP kernels / functions:
-    - @ref CPPBoxWithNonMaximaSuppressionLimit / @ref CPPBoxWithNonMaximaSuppressionLimitKernel
- - Added the validate method in:
-    - @ref NEDepthConvertLayer
-    - @ref NEFloor / @ref CLFloor
-    - @ref NEGEMMMatrixAdditionKernel
-    - @ref NEReshapeLayer / @ref CLReshapeLayer
-    - @ref CLScale
- - Added new examples:
-    - graph_shufflenet.cpp
-    - graph_yolov3.cpp
- - Added documentation for adding a new function or kernel.
- - Improved doxygen documentation by adding a list of the existing functions.
- - Added 4D tensor support to:
-    - CLWidthConcatenateLayer
-    - CLFlattenLayer
-    - @ref CLSoftmaxLayer
- - Add dot product support for @ref CLDepthwiseConvolutionLayer3x3NHWCKernel non-unit stride
- - Add SVE support
- - Fused batch normalization into convolution layer weights in @ref CLFuseBatchNormalization
- - Fused activation in @ref CLDepthwiseConvolutionLayer3x3NCHWKernel, @ref CLDepthwiseConvolutionLayer3x3NHWCKernel and @ref NEGEMMConvolutionLayer
- - Added NHWC data layout support to:
-    - @ref CLChannelShuffleLayer
-    - @ref CLDeconvolutionLayer
-    - @ref CLL2NormalizeLayer
- - Added QASYMM8 support to the following kernels:
-    - CLScaleKernel
-    - NEDepthwiseConvolutionLayer3x3Kernel
-    - CLPixelWiseMultiplicationKernel
- - Added FP16 support to the following kernels:
-    - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
-    - NEDepthwiseConvolutionLayer3x3Kernel
-    - @ref CLNormalizePlanarYUVLayerKernel
-    - @ref CLWinogradConvolutionLayer (5x5 kernel)
- - More tests added to both validation and benchmarking suites.
-
-v18.08 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Updated recommended NDK version to r17b.
- - Removed support for QS8/QS16 data types.
- - Added support for grouped convolution in @ref CLConvolutionLayer.
- - Added NHWC data layout support to:
-    - NEDepthConcatenateLayer / CLDepthConcatenateLayer
-    - @ref NEWinogradConvolutionLayer / @ref CLWinogradConvolutionLayer
-    - @ref CLDepthwiseConvolutionLayer
-    - @ref CLDirectConvolutionLayer
-    - @ref CLConvolutionLayer
-    - @ref CLScale
-    - @ref CLIm2ColKernel
- - New Arm® Neon™ kernels / functions:
-    - @ref NERNNLayer
- - New OpenCL kernels / functions:
-    - @ref CLArithmeticDivision
- - Introduced prepare() stage support in the graph API for GLES.
- - Added support for memory reuse when trying to allocate smaller CLTensors.
- - Enabled NHWC execution on graph examples.
- - Added JPEG accessor for validation purposes.
- - Added validate methods to some kernels / functions.
-
-v18.05 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Major redesign of the interface for the Neon kernels implemented in assembly.
- - Removed arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore / arm_compute::NEHGEMMAArch64FP16Kernel
- - Added NEGEMMAssemblyWrapper and AssemblyKernelGlue, which are used to execute assembly kernels in Neon functions.
- - Minor changes to the CPUInfo type to make it compatible with the new assembly gemm interface.
- - Moved neon assembly kernels to the folder src/core/Neon/kernels/arm_gemm.
- - Improved doxygen documentation.
- - Improved memory management for layer's transitions.
- - Added support for NHWC data layout in tensors.
- - Added NHWC data layout support to:
-    - @ref NEGEMMConvolutionLayer
-    - @ref NEDirectConvolutionLayer
-    - @ref NEPoolingLayer / @ref CLPoolingLayer
-    - @ref NEBatchNormalizationLayer / @ref CLBatchNormalizationLayer
-    - @ref NEDepthwiseConvolutionLayer
-    - @ref NEScale
-    - NEIm2Col
- - Added support for dilated convolutions in @ref NEConvolutionLayer and @ref CLConvolutionLayer.
- - New OpenCL kernels / functions:
-    - @ref CLChannelShuffleLayer / @ref CLChannelShuffleLayerKernel
-    - CLConvertFullyConnectedWeightsKernel / @ref CLConvertFullyConnectedWeights
-    - @ref CLCopy / CLCopyKernel
-    - @ref CLLSTMLayer
-    - @ref CLRNNLayer
-    - CLWidthConcatenateLayer / CLWidthConcatenateLayerKernel
-    - @ref CLWinogradFilterTransformKernel / @ref CLWinogradInputTransformKernel / @ref CLWinogradConvolutionLayer
-    - @ref CLWinogradInputTransformKernel / @ref CLWinogradInputTransform
- - New Arm® Neon™ kernels / functions:
-    - NEConvertFullyConnectedWeightsKernel / @ref NEConvertFullyConnectedWeights.
- - Created the validate method in @ref CLDepthwiseConvolutionLayer.
- - Beta and gamma are no longer mandatory arguments in @ref NEBatchNormalizationLayer and @ref CLBatchNormalizationLayer.
- - Added depth multiplier support in @ref NEDepthwiseConvolutionLayer and @ref CLDepthwiseConvolutionLayer.
- - Added broadcast multiply support in @ref NEPixelWiseMultiplication / NEPixelWiseMultiplicationKernel.
- - Port mobilenet example to NHWC data layout.
- - Enabled Winograd method in @ref CLConvolutionLayer.
- - Renamed NEWinogradLayer to @ref NEWinogradConvolutionLayer.
- - Updated @ref NEWinogradConvolutionLayer to use highly optimised assembly kernels in src/core/Neon/kernels/arm_gemm.
- - Added memory manager support in GLES functions.
- - Major refactoring of the graph API.
- - Added GLES backend in the graph API.
- - Added support for the memory manager in the graph API.
- - Enabled Winograd Convolution method in the graph API.
- - Added support for grouped convolutions in the graph API.
- - Replaced NEDeconvolutionLayerUpsampleKernel with NEScaleKernel in @ref NEDeconvolutionLayer.
- - Added fast maths flag in @ref CLConvolutionLayer.
- - Added new tests and benchmarks in validation and benchmark frameworks
- - Merge Activation layer with Convolution Layer (Neon, CL, GLES)
- - Added support for OpenCL 2.0 SVM
- - Added support to import memory in OpenCL tensors.
- - Added the prepare() method to perform any one-off pre-processing before running the function.
- - Added new examples:
-    - graph_inception_v4.cpp
-    - graph_resnext50.cpp
- - Added memory measurement instrument for CL.
-
-v18.03 Public maintenance release
- - Various bug fixes.
- - Fixed bug in @ref NEActivationLayer
- - Fix in @ref CLTuner when using batches.
- - Updated recommended NDK version to r16b (And fixed warnings).
- - Fixed bug in validation code.
- - Added Inception v4 graph example.
- - Renamed NEWinogradLayer.cpp to @ref NEWinogradConvolutionLayer
-
-v18.02 Public major release
- - Various Arm® Neon™ / OpenCL / GLES optimisations.
- - Various bug fixes.
- - Changed default number of threads on big.LITTLE systems.
- - Refactored examples and added:
-    - graph_mobilenet_qasymm8
-    - graph_resnet
-    - graph_squeezenet_v1_1
- - Renamed @ref CLConvolutionLayer into @ref CLGEMMConvolutionLayer and created a new @ref CLConvolutionLayer to select the fastest convolution method.
- - Renamed @ref NEConvolutionLayer into @ref NEGEMMConvolutionLayer and created a new @ref NEConvolutionLayer to select the fastest convolution method.
- - Added in place support to:
-    - @ref CLActivationLayer
-    - @ref CLBatchNormalizationLayer
- - Added QASYMM8 support to:
-    - @ref CLActivationLayer
-    - @ref CLDepthwiseConvolutionLayer
-    - @ref NEDepthwiseConvolutionLayer
-    - @ref NESoftmaxLayer
- - Added FP16 support to:
-    - CLDepthwiseConvolutionLayer3x3
-    - @ref CLDepthwiseConvolutionLayer
- - Added broadcasting support to NEArithmeticAddition / @ref CLArithmeticAddition / @ref CLPixelWiseMultiplication
- - Added fused batch normalization and activation to @ref CLBatchNormalizationLayer and @ref NEBatchNormalizationLayer
- - Added support for non-square pooling to @ref NEPoolingLayer and @ref CLPoolingLayer
- - New OpenCL kernels / functions:
-    - CLDirectConvolutionLayerOutputStageKernel
- - New Arm® Neon™ kernels / functions
-    - Added name() method to all kernels.
-    - Added support for Winograd 5x5.
-    - NEPermuteKernel / @ref NEPermute
-    - @ref NEWinogradLayerTransformInputKernel / NEWinogradLayer
-    - @ref NEWinogradLayerTransformOutputKernel / NEWinogradLayer
-    - @ref NEWinogradLayerTransformWeightsKernel / NEWinogradLayer
-    - Renamed NEWinogradLayerKernel into NEWinogradLayerBatchedGEMMKernel
- - New GLES kernels / functions:
-    - GCTensorShiftKernel / GCTensorShift
-
-v18.01 Public maintenance release
- - Various bug fixes
- - Added some of the missing validate() methods
- - Added @ref CLDeconvolutionLayerUpsampleKernel / @ref CLDeconvolutionLayer @ref CLDeconvolutionLayerUpsample
- - Added CLPermuteKernel / @ref CLPermute
- - Added method to clean the programs cache in the CL Kernel library.
- - Added GCArithmeticAdditionKernel / GCArithmeticAddition
- - Added GCDepthwiseConvolutionLayer3x3Kernel / GCDepthwiseConvolutionLayer3x3
- - Added GCNormalizePlanarYUVLayerKernel / GCNormalizePlanarYUVLayer
- - Added GCScaleKernel / GCScale
- - Added GCWeightsReshapeKernel / GCConvolutionLayer
- - Added FP16 support to the following GLES compute kernels:
-    - GCCol2ImKernel
-    - GCGEMMInterleave4x4Kernel
-    - GCGEMMTranspose1xWKernel
-    - GCIm2ColKernel
- - Refactored Arm® Neon™ Winograd (NEWinogradLayerKernel)
- - Added NEDirectConvolutionLayerOutputStageKernel
- - Added QASYMM8 support to the following Arm® Neon™ kernels:
-    - NEDepthwiseConvolutionLayer3x3Kernel
-    - @ref NEFillBorderKernel
-    - NEPoolingLayerKernel
- - Added new examples:
-    - graph_cl_mobilenet_qasymm8.cpp
-    - graph_inception_v3.cpp
-    - gc_dc.cpp
- - More tests added to both validation and benchmarking suites.
-
-v17.12 Public major release
- - Most machine learning functions on OpenCL support the new data type QASYMM8
- - Introduced logging interface
- - Introduced OpenCL timer
- - Reworked GEMMLowp interface
- - Added new Arm® Neon™ assembly kernels for GEMMLowp, SGEMM and HGEMM
- - Added validation method for most Machine Learning kernels / functions
- - Added new graph examples such as googlenet, mobilenet, squeezenet, vgg16 and vgg19
- - Added sgemm example for OpenCL
- - Added absolute difference example for GLES compute
- - Added new tests and benchmarks in validation and benchmark frameworks
- - Added new kernels / functions for GLES compute
-
- - New OpenGL ES kernels / functions
-    - GCAbsoluteDifferenceKernel / GCAbsoluteDifference
-    - GCActivationLayerKernel / GCActivationLayer
-    - GCBatchNormalizationLayerKernel / GCBatchNormalizationLayer
-    - GCCol2ImKernel
-    - GCDepthConcatenateLayerKernel / GCDepthConcatenateLayer
-    - GCDirectConvolutionLayerKernel / GCDirectConvolutionLayer
-    - GCDropoutLayerKernel / GCDropoutLayer
-    - GCFillBorderKernel / GCFillBorder
-    - GCGEMMInterleave4x4Kernel / GCGEMMInterleave4x4
-    - GCGEMMMatrixAccumulateBiasesKernel / GCGEMMMatrixAdditionKernel / GCGEMMMatrixMultiplyKernel / GCGEMM
-    - GCGEMMTranspose1xWKernel / GCGEMMTranspose1xW
-    - GCIm2ColKernel
-    - GCNormalizationLayerKernel / GCNormalizationLayer
-    - GCPixelWiseMultiplicationKernel / GCPixelWiseMultiplication
-    - GCPoolingLayerKernel / GCPoolingLayer
-    - GCLogits1DMaxKernel / GCLogits1DShiftExpSumKernel / GCLogits1DNormKernel / GCSoftmaxLayer
-    - GCTransposeKernel / GCTranspose
-
- - New Arm® Neon™ kernels / functions
-    - arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore
-    - arm_compute::NEHGEMMAArch64FP16Kernel
-    - NEDepthwiseConvolutionLayer3x3Kernel / NEDepthwiseIm2ColKernel / NEGEMMMatrixVectorMultiplyKernel / NEDepthwiseVectorToTensorKernel / @ref NEDepthwiseConvolutionLayer
-    - @ref NEGEMMLowpOffsetContributionKernel / @ref NEGEMMLowpMatrixAReductionKernel / @ref NEGEMMLowpMatrixBReductionKernel / @ref NEGEMMLowpMatrixMultiplyCore
-    - @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
-    - NEWinogradLayer / NEWinogradLayerKernel
-
- - New OpenCL kernels / functions
-    - @ref CLGEMMLowpOffsetContributionKernel / @ref CLGEMMLowpMatrixAReductionKernel / @ref CLGEMMLowpMatrixBReductionKernel / @ref CLGEMMLowpMatrixMultiplyCore
-    - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
-
- - New graph nodes for Arm® Neon™ and OpenCL
-    - graph::BranchLayer
-    - graph::DepthConvertLayer
-    - graph::DepthwiseConvolutionLayer
-    - graph::DequantizationLayer
-    - graph::FlattenLayer
-    - graph::QuantizationLayer
-    - graph::ReshapeLayer
-
-v17.10 Public maintenance release
- - Bug fixes:
-    - Check the maximum local workgroup size supported by OpenCL devices
-    - Minor documentation updates (Fixed instructions to build the examples)
-    - Introduced a graph::GraphContext
-    - Added a few new Graph nodes, support for branches and grouping.
-    - Automatically enable cl_printf in debug builds
-    - Fixed bare metal builds for armv7a
-    - Added AlexNet and cartoon effect examples
-    - Fixed library builds: libraries are no longer built as supersets of each other. (This means applications using the Runtime part of the library now need to link against both libarm_compute_core and libarm_compute.)
-
-v17.09 Public major release
- - Experimental Graph support: initial implementation of a simple stream API to easily chain machine learning layers.
- - Memory Manager (@ref BlobLifetimeManager, @ref BlobMemoryPool, @ref ILifetimeManager, @ref IMemoryGroup, @ref IMemoryManager, @ref IMemoryPool, @ref IPoolManager, @ref MemoryManagerOnDemand, @ref PoolManager)
- - New validation and benchmark frameworks (Boost and Google frameworks replaced by homemade framework).
- - Most machine learning functions support both fixed point 8 and 16 bit (QS8, QS16) for both Arm® Neon™ and OpenCL.
- - New Arm® Neon™ kernels / functions:
-    - arm_compute::NEGEMMAssemblyBaseKernel arm_compute::NEGEMMAArch64Kernel
-    - NEDequantizationLayerKernel / @ref NEDequantizationLayer
-    - NEFloorKernel / @ref NEFloor
-    - @ref NEL2NormalizeLayerKernel / @ref NEL2NormalizeLayer
-    - NEQuantizationLayerKernel, @ref NEMinMaxLayerKernel / @ref NEQuantizationLayer
-    - @ref NEROIPoolingLayerKernel / @ref NEROIPoolingLayer
-    - @ref NEReductionOperationKernel / @ref NEReductionOperation
-    - NEReshapeLayerKernel / @ref NEReshapeLayer
-
- - New OpenCL kernels / functions:
-    - @ref CLDepthwiseConvolutionLayer3x3NCHWKernel, @ref CLDepthwiseConvolutionLayer3x3NHWCKernel, CLDepthwiseIm2ColKernel, CLDepthwiseVectorToTensorKernel, CLDepthwiseWeightsReshapeKernel / CLDepthwiseConvolutionLayer3x3, @ref CLDepthwiseConvolutionLayer, CLDepthwiseSeparableConvolutionLayer
-    - CLDequantizationLayerKernel / CLDequantizationLayer
-    - CLDirectConvolutionLayerKernel / @ref CLDirectConvolutionLayer
-    - CLFlattenLayer
-    - CLFloorKernel / @ref CLFloor
-    - CLGEMMTranspose1xW
-    - CLGEMMMatrixVectorMultiplyKernel
-    - @ref CLL2NormalizeLayerKernel / @ref CLL2NormalizeLayer
-    - CLQuantizationLayerKernel, @ref CLMinMaxLayerKernel / @ref CLQuantizationLayer
-    - @ref CLROIPoolingLayerKernel / @ref CLROIPoolingLayer
-    - @ref CLReductionOperationKernel / @ref CLReductionOperation
-    - CLReshapeLayerKernel / @ref CLReshapeLayer
-
-v17.06 Public major release
- - Various bug fixes
- - Added support for fixed point 8 bit (QS8) to the various Arm® Neon™ machine learning kernels.
- - Added unit tests and benchmarks (AlexNet, LeNet)
- - Added support for sub tensors.
- - Added infrastructure to provide GPU specific optimisation for some OpenCL kernels.
- - Added @ref OMPScheduler (OpenMP) scheduler for Neon
- - Added @ref SingleThreadScheduler scheduler for Arm® Neon™ (For bare metal)
- - Users can specify their own scheduler by implementing the @ref IScheduler interface.
- - New OpenCL kernels / functions:
-    - @ref CLBatchNormalizationLayerKernel / @ref CLBatchNormalizationLayer
-    - CLDepthConcatenateLayerKernel / CLDepthConcatenateLayer
-    - CLHOGOrientationBinningKernel, CLHOGBlockNormalizationKernel, CLHOGDetectorKernel / CLHOGDescriptor, CLHOGDetector, CLHOGGradient, CLHOGMultiDetection
-    - CLLocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedLayer
-    - @ref CLWeightsReshapeKernel / @ref CLConvolutionLayerReshapeWeights
- - New C++ kernels:
-    - CPPDetectionWindowNonMaximaSuppressionKernel
- - New Arm® Neon™ kernels / functions:
-    - @ref NEBatchNormalizationLayerKernel / @ref NEBatchNormalizationLayer
-    - NEDepthConcatenateLayerKernel / NEDepthConcatenateLayer
-    - NEDirectConvolutionLayerKernel / @ref NEDirectConvolutionLayer
-    - NELocallyConnectedMatrixMultiplyKernel / NELocallyConnectedLayer
-    - @ref NEWeightsReshapeKernel / @ref NEConvolutionLayerReshapeWeights
-
-v17.05 Public bug fixes release
- - Various bug fixes
- - Ported the remaining functions to use accurate padding.
- - Library does not link against OpenCL anymore (It uses dlopen / dlsym at runtime instead to determine whether or not OpenCL is available).
- - Added "free" method to allocator.
- - Minimum version of g++ required for armv7 Linux changed from 4.8 to 4.9
-
-v17.04 Public bug fixes release
-
- The following functions have been ported to use the new accurate padding:
- -  CLColorConvertKernel
- -  CLEdgeNonMaxSuppressionKernel
- -  CLEdgeTraceKernel
- -  CLGaussianPyramidHorKernel
- -  CLGaussianPyramidVertKernel
- -  CLGradientKernel
- -  NEChannelCombineKernel
- -  NEFillArrayKernel
- -  NEGaussianPyramidHorKernel
- -  NEGaussianPyramidVertKernel
- -  NEHarrisScoreFP16Kernel
- -  NEHarrisScoreKernel
- -  NEHOGDetectorKernel
- -  NELogits1DMaxKernel
- -  NELogits1DShiftExpSumKernel
- -  NELogits1DNormKernel
- -  NENonMaximaSuppression3x3FP16Kernel
- -  NENonMaximaSuppression3x3Kernel
-
-v17.03.1 First Major public release of the sources
- - Renamed the library to arm_compute
- - New CPP target introduced for C++ kernels shared between Arm® Neon™ and CL functions.
- - New padding calculation interface introduced and ported most kernels / functions to use it.
- - New OpenCL kernels / functions:
-   - CLGEMMLowpMatrixMultiplyKernel / CLGEMMLowp
- - New Arm® Neon™ kernels / functions:
-   - @ref NENormalizationLayerKernel / @ref NENormalizationLayer
-   - NETransposeKernel / @ref NETranspose
-   - NELogits1DMaxKernel, NELogits1DShiftExpSumKernel, NELogits1DNormKernel / @ref NESoftmaxLayer
-   - @ref NEIm2ColKernel, @ref NECol2ImKernel, NEConvolutionLayerWeightsReshapeKernel / @ref NEConvolutionLayer
-   - NEGEMMMatrixAccumulateBiasesKernel / @ref NEFullyConnectedLayer
-   - @ref NEGEMMLowpMatrixMultiplyKernel / NEGEMMLowp
-
-v17.03 Sources preview
- - New OpenCL kernels / functions:
-   - CLGradientKernel, CLEdgeNonMaxSuppressionKernel, CLEdgeTraceKernel / CLCannyEdge
-   - GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, @ref CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / @ref CLGEMM
-   - CLGEMMMatrixAccumulateBiasesKernel / @ref CLFullyConnectedLayer
-   - CLTransposeKernel / @ref CLTranspose
-   - CLLKTrackerInitKernel, CLLKTrackerStage0Kernel, CLLKTrackerStage1Kernel, CLLKTrackerFinalizeKernel / CLOpticalFlow
-   - @ref CLNormalizationLayerKernel / @ref CLNormalizationLayer
-   - CLLaplacianPyramid, CLLaplacianReconstruct
- - New Arm® Neon™ kernels / functions:
-   - NEActivationLayerKernel / @ref NEActivationLayer
-   - GEMM refactoring + FP16 support (Requires armv8.2 CPU): @ref NEGEMMInterleave4x4Kernel, @ref NEGEMMTranspose1xWKernel, @ref NEGEMMMatrixMultiplyKernel, @ref NEGEMMMatrixAdditionKernel / @ref NEGEMM
-   - NEPoolingLayerKernel / @ref NEPoolingLayer
-
-v17.02.1 Sources preview
- - New OpenCL kernels / functions:
-   - CLLogits1DMaxKernel, CLLogits1DShiftExpSumKernel, CLLogits1DNormKernel / @ref CLSoftmaxLayer
-   - CLPoolingLayerKernel / @ref CLPoolingLayer
-   - @ref CLIm2ColKernel, @ref CLCol2ImKernel, CLConvolutionLayerWeightsReshapeKernel / CLConvolutionLayer
-   - @ref CLRemapKernel / @ref CLRemap
-   - CLGaussianPyramidHorKernel, CLGaussianPyramidVertKernel / CLGaussianPyramid, CLGaussianPyramidHalf, CLGaussianPyramidOrb
-   - CLMinMaxKernel, CLMinMaxLocationKernel / CLMinMaxLocation
-   - CLNonLinearFilterKernel / CLNonLinearFilter
- - New Arm® Neon™ FP16 kernels (Requires armv8.2 CPU)
-   - NEAccumulateWeightedFP16Kernel
-   - NEBox3x3FP16Kernel
-   - NENonMaximaSuppression3x3FP16Kernel
-
-v17.02 Sources preview
- - New OpenCL kernels / functions:
-   - CLActivationLayerKernel / @ref CLActivationLayer
-   - CLChannelCombineKernel / CLChannelCombine
-   - CLDerivativeKernel / CLChannelExtract
-   - CLFastCornersKernel / CLFastCorners
-   - CLMeanStdDevKernel / CLMeanStdDev
- - New Arm® Neon™ kernels / functions:
-   - HOG / SVM: NEHOGOrientationBinningKernel, NEHOGBlockNormalizationKernel, NEHOGDetectorKernel, NEHOGNonMaximaSuppressionKernel / NEHOGDescriptor, NEHOGDetector, NEHOGGradient, NEHOGMultiDetection
-   - NENonLinearFilterKernel / NENonLinearFilter
- - Introduced a CLScheduler to manage the default context and command queue used by the runtime library and create synchronisation events.
- - Switched all the kernels / functions to use tensors instead of images.
- - Updated documentation to include instructions to build the library from sources.
-
-v16.12 Binary preview release
- - Original release
-
-@section S3_how_to_build How to build the library and the examples
-
-@subsection S3_1_build_options Build options
-
-scons 2.3 or above is required to build the library.
-To see the build options available simply run ```scons -h```:
-
-        debug: Debug (yes|no)
-            default: False
-
-        asserts: Enable asserts (this flag is forced to 1 for debug=1) (yes|no)
-            default: False
-
-        logging: Logging (this flag is forced to 1 for debug=1) (yes|no)
-            default: False
-
-        arch: Target Architecture (armv7a|arm64-v8a|arm64-v8.2-a|arm64-v8.2-a-sve|arm64-v8.2-a-sve2|x86_32|x86_64|armv8a|armv8.2-a|armv8.2-a-sve|armv8.6-a|armv8.6-a-sve|armv8.6-a-sve2|armv8r64|x86)
-            default: armv7a
-
-        estate: Execution State (auto|32|64)
-            default: auto
-
-        os: Target OS (linux|android|macos|tizen|bare_metal)
-            default: linux
-
-        build: Build type (native|cross_compile|embed_only)
-            default: cross_compile
-
-        examples: Build example programs (yes|no)
-            default: True
-
-        gemm_tuner: Build gemm_tuner programs (yes|no)
-            default: True
-
-        Werror: Enable/disable the -Werror compilation flag (yes|no)
-            default: True
-
-        standalone: Builds the tests as standalone executables, links statically with libgcc, libstdc++ and libarm_compute (yes|no)
-            default: False
-
-        opencl: Enable OpenCL support (yes|no)
-            default: True
-
-        neon: Enable Arm® Neon™ support (yes|no)
-            default: False
-
-        embed_kernels: Embed OpenCL kernels in library binary (yes|no)
-            default: True
-
-        compress_kernels: Compress embedded OpenCL kernels in library binary. Note embed_kernels should be enabled as well (yes|no)
-            default: False
-
-        set_soname: Set the library's soname and shlibversion (requires SCons 2.4 or above) (yes|no)
-            default: False
-
-        openmp: Enable OpenMP backend (yes|no)
-            default: False
-
-        cppthreads: Enable C++11 threads backend (yes|no)
-            default: True
-
-        build_dir: Specify sub-folder for the build ( /path/to/build_dir )
-            default: .
-
-        install_dir: Specify sub-folder for the install ( /path/to/install_dir )
-            default:
-
-        exceptions: Enable/disable C++ exception support (yes|no)
-            default: True
-
-        linker_script: Use an external linker script ( /path/to/linker_script )
-            default:
-
-        custom_options: Custom options that can be used to turn on/off features
-            (all|none|comma-separated list of names)
-            allowed names: disable_mmla_fp
-            default: none
-
-        data_type_support: Enable a list of data types to support
-            (all|none|comma-separated list of names)
-            allowed names: qasymm8 qasymm8_signed qsymm16 fp16 fp32
-            default: all
-
-        toolchain_prefix: Override the toolchain prefix
-            default:
-
-        compiler_prefix: Override the compiler prefix
-            default:
-
-        extra_cxx_flags: Extra CXX flags to be appended to the build command
-            default:
-
-        extra_link_flags: Extra LD flags to be appended to the build command
-            default:
-
-        compiler_cache: Command to prefix to the C and C++ compiler (e.g ccache)
-            default:
-
-        specs_file: Specs file to use
-            default: rdimon.specs
-
-        benchmark_examples: Build benchmark examples programs (yes|no)
-            default: False
-
-        validate_examples: Build validate examples programs (yes|no)
-            default: False
-
-        reference_openmp: Build reference validation with openmp (yes|no)
-            default: True
-
-        validation_tests: Build validation test programs (yes|no)
-            default: False
-
-        benchmark_tests: Build benchmark test programs (yes|no)
-            default: False
-
-        test_filter: Pattern to specify the tests' filenames to be compiled
-            default: *.cpp
-
-        pmu: Enable PMU counters (yes|no)
-            default: False
-
-        mali: Enable Arm® Mali™ hardware counters (yes|no)
-            default: False
-
-        external_tests_dir: Add examples, benchmarks and tests to the tests suite from an external path ( /path/to/external_tests_dir )
-            default:
-
-@b debug / @b asserts:
- - With debug=1 asserts are enabled, and the library is built with symbols and no optimisations enabled.
- - With debug=0 and asserts=1: Optimisations are enabled and symbols are removed, however all the asserts are still present (This is about 20% slower than the release build)
- - With debug=0 and asserts=0: All optimisations are enabled and no validation is performed. If the application misuses the library, it is likely to crash. (Only use this mode once you are sure your application is working as expected.)
-
-@b arch: The x86_32 and x86_64 targets can only be used with neon=0 and opencl=1.
-
-@b os: Choose the operating system you are targeting: Linux, Android, macOS, Tizen or bare metal.
-@note bare metal can only be used for Arm® Neon™ (not OpenCL), only static libraries get built and Neon's multi-threading support is disabled.
-
-@b build: You can either build directly on your device (native) or cross-compile from your desktop machine (cross_compile). In both cases make sure the compiler is available in your path.
-
-@note If you want to natively compile for 32bit on a 64bit Arm device running a 64bit OS then you will have to use build=cross_compile too.
-
-There is also an 'embed_only' option which will generate all the .embed files for the OpenCL kernels. This might be useful if using a different build system to compile the library.
-
-In addition, the option 'compress_kernels' compresses the embedded OpenCL kernel files using zlib and injects them into the library. This is useful for reducing the binary size. Note that this option is only available for Android when 'embed_kernels' is enabled.
-
-@b Werror: If you are compiling using the same toolchains as the ones used in this guide then there shouldn't be any warnings and therefore you should be able to keep Werror=1. If the library fails to build with a different compiler version because of warnings interpreted as errors then, if you are sure the warnings are not important, you might want to try to build with Werror=0 (but please do report the issue on GitHub).
-
-@b opencl / @b neon: Choose which SIMD technology you want to target. (Neon for Arm Cortex-A CPUs or OpenCL for Arm® Mali™ GPUs)
-
-@b embed_kernels: For OpenCL only: set embed_kernels=1 if you want the OpenCL kernels to be built in the library's binaries instead of being read from separate ".cl" / ".cs" files. If embed_kernels is set to 0 then the application can set the path to the folder containing the OpenCL kernel files by calling CLKernelLibrary::init(). By default the path is set to "./cl_kernels".
-
-@b set_soname: Build the versioned variant of the library.
-
-If enabled the library will contain a SONAME and SHLIBVERSION and some symlinks will automatically be created between the objects.
-Example:
-  libarm_compute_core.so -> libarm_compute_core.so.1.0.0
-  libarm_compute_core.so.1 -> libarm_compute_core.so.1.0.0
-  libarm_compute_core.so.1.0.0
-
-@note This option is disabled by default as it requires SCons version 2.4 or above.
-
-@b extra_cxx_flags: Custom CXX flags which will be appended to the end of the build command.
-
-@b build_dir: Build the library in a subfolder of the "build" folder. This allows several configurations to be built in parallel.
-
-@b examples: Enable the build of the example programs.
-
-@b validation_tests: Enable the build of the validation suite.
-
-@b benchmark_tests: Enable the build of the benchmark tests
-
-@b pmu: Enable the PMU cycle counter to measure execution time in benchmark tests. (Your device needs to support it)
-
-@b mali: Enable the collection of Arm® Mali™ hardware counters to measure execution time in benchmark tests. (Your device needs to have an Arm® Mali™ driver that supports it)
-
-@b openmp: Build in the OpenMP scheduler for Neon.
-
-@note Only works when building with g++ not clang++
-
-@b cppthreads: Build in the C++11 scheduler for Neon.
-
-@sa Scheduler::set
-
-@b external_tests_dir: Add examples, benchmarks and tests to the tests suite from an external path ( /path/to/external_tests_dir )
-
-In order to use this option, the external tests directory must have the following structure:
-
-    EXTERNAL_TESTS_DIR:
-    └── tests
-        ├── benchmark
-        │   ├── CL
-        │   ├── datasets
-        │   ├── fixtures
-        │   └── Neon
-        └── validation
-            ├── CL
-            ├── datasets
-            ├── fixtures
-            └── Neon
-
-Then, build the library with `external_tests_dir=<PATH_TO_EXTERNAL_TESTS_DIR>`.
-
-@subsection S3_2_linux Building for Linux
-
-@subsubsection S3_2_1_library How to build the library ?
-
-For Linux, the library was successfully built and tested using the following Linaro GCC toolchains:
-
- - gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
- - gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu
-
-To cross-compile the library in debug mode, with Arm® Neon™ only support, for Linux 32bit:
-
-	scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=linux arch=armv7a
-
-To cross-compile the library in asserts mode, with OpenCL only support, for Linux 64bit:
-
-	scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a
-
-You can also compile the library natively on an Arm device by using <b>build=native</b>:
-
-	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=arm64-v8a build=native
-	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=native
-
-@note g++ for Arm is mono-arch, therefore if you want to compile for Linux 32bit on a Linux 64bit platform you will have to use a cross compiler.
-
-For example on a 64bit Debian based system you would have to install <b>g++-arm-linux-gnueabihf</b>
-
-	apt-get install g++-arm-linux-gnueabihf
-
-Then run
-
-	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=cross_compile
-
-or simply remove the build parameter as build=cross_compile is the default value:
-
-	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a
-
-@subsubsection S3_2_2_examples How to manually build the examples ?
-
-The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
-
-@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
-
-To cross compile an Arm® Neon™ example for Linux 32bit:
-
-	arm-linux-gnueabihf-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute -larm_compute_core -o neon_convolution
-
-To cross compile an Arm® Neon™ example for Linux 64bit:
-
-	aarch64-linux-gnu-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -L. -larm_compute -larm_compute_core -o neon_convolution
-
-(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
-
-To cross compile an OpenCL example for Linux 32bit:
-
-	arm-linux-gnueabihf-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
-
-To cross compile an OpenCL example for Linux 64bit:
-
-	aarch64-linux-gnu-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
-
-(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
-
-To cross compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
-
-For example, to cross compile the "graph_lenet" example for Linux 32bit:
-
-	arm-linux-gnueabihf-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
-
-For example, to cross compile the "graph_lenet" example for Linux 64bit:
-
-	aarch64-linux-gnu-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
-
-(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
-
-@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core
-
-To compile natively (i.e. directly on an Arm device) for Arm® Neon™ for Linux 32bit:
-
-	g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -larm_compute -larm_compute_core -o neon_convolution
-
-To compile natively (i.e. directly on an Arm device) for Arm® Neon™ for Linux 64bit:
-
-	g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute -larm_compute_core -o neon_convolution
-
-(notice the only difference with the 32 bit command is that we don't need the -mfpu option)
-
-To compile natively (i.e. directly on an Arm device) for OpenCL for Linux 32bit or Linux 64bit:
-
-	g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
-
-To compile natively the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
-
-For example, to natively compile the "graph_lenet" example for Linux 32bit:
-
-	g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
-
-For example, to natively compile the "graph_lenet" example for Linux 64bit:
-
-	g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
-
-(notice the only difference with the 32 bit command is that we don't need the -mfpu option)
-
-@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core
-
-@note These two commands assume libarm_compute.so is available in your library path. If it is not, add the path to it using -L (e.g. -Llib/linux-arm64-v8a-neon-cl-asserts/)
-@note You might need to export the path to OpenCL library as well in your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.
-
-To run the built executable simply run:
-
-	LD_LIBRARY_PATH=build ./neon_convolution
-
-or
-
-	LD_LIBRARY_PATH=build ./cl_convolution
-
-@note Examples accept different types of arguments. To find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
-
-For example:
-
-	LD_LIBRARY_PATH=. ./graph_lenet --help
-
-Below is a list of the common parameters among the graph examples:
-@snippet utils/CommonGraphOptions.h Common graph examples parameters
-
-@subsubsection S3_2_3_sve Build for SVE or SVE2
-
-In order to build for SVE or SVE2 you need a compiler that supports them. You can find more information in the following links:
-    -# GCC: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/sve-support
-    -# LLVM: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/llvm-toolchain/sve-support
-
-@note You then need to indicate the toolchain using the scons "toolchain_prefix" parameter.
-
-An example build command with SVE is:
-
-        scons arch=arm64-v8.2-a-sve os=linux build_dir=arm64 -j55 standalone=0 opencl=0 openmp=0 validation_tests=1 neon=1 cppthreads=1 toolchain_prefix=aarch64-none-linux-gnu-
-
-@subsection S3_3_android Building for Android
-
-For Android, the library was successfully built and tested using Google's standalone toolchains:
- - clang++ from NDK r18b for armv7a
- - clang++ from NDK r20b for arm64-v8a
- - clang++ from NDK r20b for arm64-v8.2-a with FP16 support
-
-For NDK r18 or older, here is a guide to <a href="https://developer.android.com/ndk/guides/standalone_toolchain.html">create your Android standalone toolchains from the NDK</a>:
-- Download the NDK r18b from here: https://developer.android.com/ndk/downloads/index.html to directory $NDK
-- Make sure you have Python 2.7 installed on your machine.
-- Generate the 32 and/or 64 bit toolchains in your toolchain directory $MY_TOOLCHAINS by running the following commands:
-
-	$NDK/build/tools/make_standalone_toolchain.py --arch arm64 --install-dir $MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b --stl libc++ --api 21
-	$NDK/build/tools/make_standalone_toolchain.py --arch arm --install-dir $MY_TOOLCHAINS/arm-linux-android-ndk-r18b --stl libc++ --api 21
-
-For NDK r19 or newer, you can directly <a href="https://developer.android.com/ndk/downloads">download</a> the NDK package for your development platform, without the need to launch the make_standalone_toolchain.py script. You can find all the prebuilt binaries inside $NDK/toolchains/llvm/prebuilt/$OS_ARCH/bin/.
-@attention The build script will look for a binary named "aarch64-linux-android-clang++", while the prebuilt binaries have their API version as a suffix in their filename (e.g. "aarch64-linux-android21-clang++"). You should copy or rename the binary to remove this suffix or, alternatively, create an alias for it.
-
-@attention We used to use gnustl but as of NDK r17 it is deprecated so we switched to libc++
-
-@note Make sure to add the toolchains to your PATH:
-
-	export PATH=$PATH:$MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b/bin:$MY_TOOLCHAINS/arm-linux-android-ndk-r18b/bin
-
-@subsubsection S3_3_1_library How to build the library ?
-
-To cross-compile the library in debug mode, with Arm® Neon™ only support, for Android 32bit:
-
-	CXX=clang++ CC=clang scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=android arch=armv7a
-
-To cross-compile the library in asserts mode, with OpenCL only support, for Android 64bit:
-
-	CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=arm64-v8a
-
-@subsubsection S3_3_2_examples How to manually build the examples ?
-
-The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
-
-@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
-
-Once you've got your Android standalone toolchain built and added to your path you can do the following:
-
-To cross compile an Arm® Neon™ example:
-
-	#32 bit:
-	arm-linux-androideabi-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_arm -static-libstdc++ -pie
-	#64 bit:
-	aarch64-linux-android-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_aarch64 -static-libstdc++ -pie
-
-To cross compile an OpenCL example:
-
-	#32 bit:
-	arm-linux-androideabi-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
-	#64 bit:
-	aarch64-linux-android-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
-
-To cross compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the library arm_compute_graph also.
-
-	#32 bit:
-	arm-linux-androideabi-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
-	#64 bit:
-	aarch64-linux-android-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
-
-@note Due to some issues in older versions of the Arm® Mali™ OpenCL DDK (<= r13p0), we recommend linking arm_compute statically on Android.
-@note When linked statically the arm_compute_graph library currently needs the --whole-archive linker flag in order to work properly
-
-Then all you need to do is upload the executable and the shared library to the device using ADB:
-
-	adb push neon_convolution_arm /data/local/tmp/
-	adb push cl_convolution_arm /data/local/tmp/
-	adb push gc_absdiff_arm /data/local/tmp/
-	adb shell chmod 777 -R /data/local/tmp/
-
-And finally to run the example:
-
-	adb shell /data/local/tmp/neon_convolution_arm
-	adb shell /data/local/tmp/cl_convolution_arm
-	adb shell /data/local/tmp/gc_absdiff_arm
-
-For 64bit:
-
-	adb push neon_convolution_aarch64 /data/local/tmp/
-	adb push cl_convolution_aarch64 /data/local/tmp/
-	adb push gc_absdiff_aarch64 /data/local/tmp/
-	adb shell chmod 777 -R /data/local/tmp/
-
-And finally to run the example:
-
-	adb shell /data/local/tmp/neon_convolution_aarch64
-	adb shell /data/local/tmp/cl_convolution_aarch64
-	adb shell /data/local/tmp/gc_absdiff_aarch64
-
-@note Examples accept different types of arguments. To find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
-
-For example:
-
-	adb shell /data/local/tmp/graph_lenet --help
-
-In this case the first argument of LeNet (like all the graph examples) is the target (i.e. 0 to run on Neon, 1 to run on OpenCL if available, 2 to run on OpenCL using the CLTuner), the second argument is the path to the folder containing the npy files for the weights and finally the third argument is the number of batches to run.
-
-@subsection S3_4_macos Building for macOS
-
-The library was successfully natively built for Apple Silicon under macOS 11.1 using clang v12.0.0.
-
-To natively compile the library with accelerated CPU support:
-
-	scons Werror=1 -j8 neon=1 opencl=0 os=macos arch=arm64-v8a build=native
-
-@note Initial support disables feature discovery through HWCAPS and thread scheduling affinity controls
-
-@subsection S3_5_bare_metal Building for bare metal
-
-For bare metal, the library was successfully built using Linaro's latest (gcc-linaro-6.3.1-2017.05) bare metal toolchains:
- - arm-eabi for armv7a
- - aarch64-elf for arm64-v8a
-
-Download the Linaro toolchains for <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/arm-eabi/">armv7a</a> and <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/aarch64-elf/">arm64-v8a</a>.
-
-@note Make sure to add the toolchains to your PATH: export PATH=$PATH:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-elf/bin:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_arm-eabi/bin
-
-@subsubsection S3_5_1_library How to build the library ?
-
-To cross-compile the library with Arm® Neon™ support for bare metal arm64-v8a:
-
-	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=bare_metal arch=arm64-v8a build=cross_compile cppthreads=0 openmp=0 standalone=1
-
-@subsubsection S3_5_2_examples How to manually build the examples ?
-
-Examples are disabled when building for bare metal. If you want to build the examples you need to provide a custom bootcode depending on the target architecture and link against the compute library. More information about bare metal bootcode can be found <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a/index.html">here</a>.
-
-@subsection S3_6_windows_host Building on a Windows host system
-
-Using `scons` directly from the Windows command line is known to cause
-problems. The reason seems to be that if `scons` is set up for cross-compilation
-it gets confused by Windows-style paths (using backslashes). Thus it is
-recommended to follow one of the options outlined below.
-
-@subsubsection S3_6_1_ubuntu_on_windows Bash on Ubuntu on Windows
-
-The best and easiest option is to use
-<a href="https://msdn.microsoft.com/en-gb/commandline/wsl/about">Ubuntu on Windows</a>.
-This feature is still marked as *beta* and thus might not be available.
-However, if it is, building the library is as simple as opening a *Bash on
-Ubuntu on Windows* shell and following the general guidelines given above.
-
-@subsubsection S3_6_2_cygwin Cygwin
-
-If the Windows subsystem for Linux is not available, <a href="https://www.cygwin.com/">Cygwin</a>
-can be used to install and run `scons`. The Cygwin version must be 3.0.7 or later. In addition
-to the default packages installed by Cygwin, `scons` has to be selected in the installer. (`git` might
-also be useful but is not strictly required if you already have the source
-code of the library.) Linaro provides pre-built versions of
-<a href="http://releases.linaro.org/components/toolchain/binaries/">GCC cross-compilers</a>
-that can be used from the Cygwin terminal. When building for Android the
-compiler is included in the Android standalone toolchain. After everything has
-been set up in the Cygwin terminal the general guide on building the library
-can be followed.
-
-@subsection S3_7_cl_requirements OpenCL DDK Requirements
-
-@subsubsection S3_7_1_cl_hard_requirements Hard Requirements
-
-Compute Library requires OpenCL 1.1 and above with support of non uniform workgroup sizes, which is officially supported in the Arm® Mali™ OpenCL DDK r8p0 and above as an extension (respective extension flag is \a -cl-arm-non-uniform-work-group-size).
-
-Enabling 16-bit floating point calculations requires the \a cl_khr_fp16 extension to be supported. All Arm® Mali™ GPUs with compute capabilities have native support for half precision floating point.
-
-@subsubsection S3_7_2_cl_performance_requirements Performance improvements
-
-Integer dot product built-in function extensions (and therefore optimized kernels) are available with Arm® Mali™ OpenCL DDK r22p0 and above for the following GPUs: G71, G76. The relevant extensions are \a cl_arm_integer_dot_product_int8, \a cl_arm_integer_dot_product_accumulate_int8 and \a cl_arm_integer_dot_product_accumulate_int16.
-
-OpenCL kernel level debugging can be simplified with the use of printf; this requires the \a cl_arm_printf extension to be supported.
-
-SVM allocations are supported for all the underlying allocations in Compute Library. This requires OpenCL 2.0 or above.
-
-@subsection S3_8_cl_tuner OpenCL Tuner
-
-The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS).
-The optimal LWS for each unique OpenCL kernel configuration is stored in a table. This table can be either imported from or exported to a file.
-The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run to use in subsequent calls to the kernel. It supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
-For the performance numbers to be meaningful, you must disable the GPU power management and set the GPU to a fixed frequency for the entire duration of the tuning phase.
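-
-The search described above can be sketched in a few lines of plain C++. This is only an illustration under stated assumptions, not the CLTuner implementation: the names `Lws`, `Mode`, `candidates_for` and `tune` are hypothetical, the timing callback stands in for running the real kernel, and the strides used to thin out the candidate list for the Normal and Rapid modes are made up for the example.
-
```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a local workgroup size (x, y, z).
using Lws = std::array<unsigned int, 3>;

// Hypothetical enum mirroring the three tuning modes named in the text.
enum class Mode { Exhaustive, Normal, Rapid };

// Pick which candidates are tried. The strides are illustrative assumptions,
// not the subsets the real tuner searches.
std::vector<Lws> candidates_for(Mode mode, const std::vector<Lws> &all)
{
    const std::size_t step = (mode == Mode::Exhaustive) ? 1 : (mode == Mode::Normal) ? 2 : 4;
    std::vector<Lws>  subset;
    for(std::size_t i = 0; i < all.size(); i += step)
    {
        subset.push_back(all[i]);
    }
    return subset;
}

// Time the default LWS first, then every candidate, and keep the fastest.
// The default is returned unchanged when no candidate beats it, so the
// result is never slower than the default.
Lws tune(const std::vector<Lws> &candidates, const Lws &default_lws,
         const std::function<double(const Lws &)> &run_and_time)
{
    assert(static_cast<bool>(run_and_time)); // a timing callback is required

    Lws    best      = default_lws;
    double best_time = run_and_time(default_lws);
    for(const Lws &lws : candidates)
    {
        const double t = run_and_time(lws);
        if(t < best_time)
        {
            best_time = t;
            best      = lws;
        }
    }
    return best;
}
```
-
-In this sketch a smaller candidate list means fewer timed runs, which is the trade-off between tuning time and the quality of the LWS found.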
-
-If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms", available at the following link:
-
-https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice
-
-Tuning a network from scratch can be long and affect considerably the execution time for the first run of your network. It is recommended for this reason to store the CLTuner's result in a file to amortize this time when you either re-use the same network or the functions with the same configurations. The tuning is performed only once for each OpenCL kernel.
-
-CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (e.g. Convolution Layer, Pooling Layer, Fully Connected Layer ...) can be called multiple times but with different parameters, we associate an "id" (called "config_id") with each kernel to distinguish the unique configurations.
-
-    #Example: 2 unique Matrix Multiply configurations
-@code{.cpp}
-    TensorShape a0 = TensorShape(32,32);
-    TensorShape b0 = TensorShape(32,32);
-    TensorShape c0 = TensorShape(32,32);
-    TensorShape a1 = TensorShape(64,64);
-    TensorShape b1 = TensorShape(64,64);
-    TensorShape c1 = TensorShape(64,64);
-
-    CLTensor a0_tensor;
-    CLTensor b0_tensor;
-    CLTensor c0_tensor;
-    CLTensor a1_tensor;
-    CLTensor b1_tensor;
-    CLTensor c1_tensor;
-
-    a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
-    b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
-    c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
-    a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
-    b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
-    c1_tensor.allocator()->init(TensorInfo(c1, 1, DataType::F32));
-
-    CLGEMM gemm0;
-    CLGEMM gemm1;
-
-    // Configuration 0
-    gemm0.configure(&a0_tensor, &b0_tensor, nullptr, &c0_tensor, 1.0f, 0.0f);
-
-    // Configuration 1
-    gemm1.configure(&a1_tensor, &b1_tensor, nullptr, &c1_tensor, 1.0f, 0.0f);
-@endcode
-
-@subsubsection S3_8_1_cl_tuner_how_to How to use it
-
-All the graph examples in the Compute Library's folder "examples" and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file.
-
-    #Enable CL tuner
-    ./graph_mobilenet --enable-tuner --target=CL
-    ./arm_compute_benchmark --enable-tuner
-
-    #Export/Import to/from a file
-    ./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
-    ./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv
-
-If you are importing the CLTuner's results from a file, the new tuned LWS values will be appended to it.
-
-Whether you are benchmarking the graph examples or the test cases in the arm_compute_benchmark, remember to:
-
-    -# Disable the power management
-    -# Keep the GPU frequency constant
-    -# Run the network multiple times (e.g. 10).
-
-If you are not using the graph API or the benchmark infrastructure, you will need to manually pass a CLTuner object to CLScheduler before configuring any function.
-
-@code{.cpp}
-CLTuner tuner;
-
-// Setup Scheduler
-CLScheduler::get().default_init(&tuner);
-@endcode
-
-After the first run, the CLTuner's results can be exported to a file using the method "save_to_file()".
-- tuner.save_to_file("results.csv");
-
-This file can also be imported using the method "load_from_file()".
-- tuner.load_from_file("results.csv");
-*/
-} // namespace arm_compute
diff --git a/docs/01_library.dox b/docs/01_library.dox
deleted file mode 100644
index 25535d111a..0000000000
--- a/docs/01_library.dox
+++ /dev/null
@@ -1,571 +0,0 @@
-///
-/// Copyright (c) 2017-2020 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-/**
-@page architecture Library architecture
-
-@tableofcontents
-
-@section S4_1_1 Core vs Runtime libraries
-
-The Core library is a low-level collection of algorithm implementations. It is designed to be embedded in existing projects and applications:
-
-- It doesn't allocate any memory (All the memory allocations/mappings have to be handled by the caller).
-- It doesn't perform any kind of multi-threading (but provides information to the caller about how the workload can be split).
-
-The Runtime library is a very basic wrapper around the Core library which can be used for quick prototyping. It is basic in the sense that:
-
-- It allocates images and tensors by using standard malloc().
-- It multi-threads Arm® Neon™ code in a very basic way using a very simple pool of threads.
-- For OpenCL it uses the default CLScheduler command queue for all mapping operations and kernels.
-
-For maximum performance, it is expected that users will re-implement an equivalent of the runtime library that better suits their needs (with a smarter multi-threading strategy, load-balancing between Arm® Neon™ and OpenCL, etc.).
-
-@section S4_1_2 Data-type and Data-layout support
-
-Compute Library supports a wide range of data-types. Detailed information can be found directly in the documentation of each kernel/function.
-The main data-types that the Machine Learning functions support are the following:
-- BFLOAT16: 16-bit non-standard brain floating point
-- F16: 16-bit half precision floating point
-- F32: 32-bit single precision floating point
-- QASYMM8: 8-bit unsigned asymmetric quantized
-- QASYMM8_SIGNED: 8-bit signed asymmetric quantized
-- QSYMM8_PER_CHANNEL: 8-bit signed symmetric quantized (Used for the weights)
-
-Moreover, Compute Library supports the following data layouts (fast changing dimension from right to left):
-- NHWC: The native layout of Compute Library that delivers the best performance where channels are in the fastest changing dimension
-- NCHW: Legacy layout where width is in the fastest changing dimension
-where N = batches, C = channels, H = height, W = width
-
-@section S4_1_3 Fast-math support
-
-Compute Library supports different types of convolution methods; the fast-math flag is only used for the Winograd algorithm.
-When the fast-math flag is enabled, both Arm® Neon™ and CL convolution layers will try to dispatch the fastest implementation available, which may introduce a drop in accuracy as well. The different scenarios involving the fast-math flag are presented below:
-- For FP32:
-    - no-fast-math: Only supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7
-    - fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
-- For FP16:
-    - no-fast-math: No Winograd support
-    - fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
-
-@section S4_1_4 Thread-safety
-
-Although the library supports multi-threading during workload dispatch, thus parallelizing the execution of the workload across multiple threads, the current runtime module implementation is not thread-safe in the sense of executing different functions from separate threads.
-This is because the provided scheduling mechanism wasn't designed with thread-safety in mind.
-As with the rest of the runtime library, a custom scheduling mechanism can be re-implemented to account for thread-safety if needed and be injected as the library's default scheduler.
-
-@section S4_2_windows_kernels_mt_functions Windows, kernels, multi-threading and functions
-
-@subsection S4_2_1_windows Windows
-
-A @ref Window represents a workload to execute. It can handle up to @ref Coordinates::num_max_dimensions dimensions.
-Each dimension is defined by a start, end and step.
-
-It can be split into subwindows as long as *all* the following rules remain true for all the dimensions:
-
-- max[n].start() <= sub[n].start() < max[n].end()
-- sub[n].start() < sub[n].end() <= max[n].end()
-- max[n].step() == sub[n].step()
-- (sub[n].start() - max[n].start()) % max[n].step() == 0
-- (sub[n].end() - sub[n].start()) % max[n].step() == 0
-
-@subsection S4_2_2 Kernels
-
-Each implementation of the @ref IKernel interface (base class of all the kernels in the core library) works in the same way:
-
-OpenCL kernels:
-
-@code{.cpp}
-// Initialize the CLScheduler with the default context and default command queue
-// Implicitly initializes the CLKernelLibrary to use ./cl_kernels as location for OpenCL kernels files and sets a default device for which OpenCL programs are built.
-CLScheduler::get().default_init();
-
-cl::CommandQueue q = CLScheduler::get().queue();
-//Create a kernel object:
-MyKernel kernel;
-// Initialize the kernel with the input/output and options you want to use:
-kernel.configure( input, output, option0, option1);
-// Retrieve the execution window of the kernel:
-const Window& max_window = kernel.window();
-// Run the whole kernel in the current thread:
-kernel.run( q, max_window ); // Enqueue the kernel to process the full window on the default queue
-
-// Wait for the processing to complete:
-q.finish();
-@endcode
-
-Neon / CPP kernels:
-
-@code{.cpp}
-//Create a kernel object:
-MyKernel kernel;
-// Initialize the kernel with the input/output and options you want to use:
-kernel.configure( input, output, option0, option1);
-// Retrieve the execution window of the kernel:
-const Window& max_window = kernel.window();
-// Run the whole kernel in the current thread:
-kernel.run( max_window ); // Run the kernel on the full window
-@endcode
-
-@subsection S4_2_3 Multi-threading
-
-The previous section shows how to run an Arm® Neon™ / CPP kernel in the current thread. However, if your system has several CPU cores, you will probably want the kernel to use several cores. Here is how this can be done:
-
-@code{.cpp}
-    ThreadInfo info;
-    info.cpu_info = &_cpu_info;
-
-    const Window      &max_window     = kernel->window();
-    const unsigned int num_iterations = max_window.num_iterations(split_dimension);
-    info.num_threads                  = std::min(num_iterations, _num_threads);
-
-    if(num_iterations == 0)
-    {
-        return;
-    }
-
-    if(!kernel->is_parallelisable() || info.num_threads == 1)
-    {
-        kernel->run(max_window, info);
-    }
-    else
-    {
-        int  t         = 0;
-        auto thread_it = _threads.begin();
-
-        for(; t < info.num_threads - 1; ++t, ++thread_it)
-        {
-            Window win     = max_window.split_window(split_dimension, t, info.num_threads);
-            info.thread_id = t;
-            thread_it->start(kernel, win, info);
-        }
-
-        // Run last part on main thread
-        Window win     = max_window.split_window(split_dimension, t, info.num_threads);
-        info.thread_id = t;
-        kernel->run(win, info);
-
-        try
-        {
-            for(auto &thread : _threads)
-            {
-                thread.wait();
-            }
-        }
-        catch(const std::system_error &e)
-        {
-            std::cerr << "Caught system_error with code " << e.code() << " meaning " << e.what() << '\n';
-        }
-    }
-@endcode
-
-This is a very basic implementation which was originally used in the Arm® Neon™ runtime library by all the Arm® Neon™ functions.
-
-@sa CPPScheduler
-
-@note Some kernels need a local temporary buffer to perform their calculations. In order to avoid memory corruption between threads, the local buffer must be of size: ```memory_needed_per_thread * num_threads``` and a unique thread_id in the range [0, num_threads) must be assigned to the @ref ThreadInfo object passed to the ```run``` function.
-
-@subsection S4_2_4 Functions
-
-Functions will automatically allocate the temporary buffers mentioned above, and will automatically multi-thread kernels' executions using the very basic scheduler described in the previous section.
-
-Simple functions only call a single kernel (e.g. NEConvolution3x3), while more complex ones consist of several kernels pipelined together (e.g. @ref NEFullyConnectedLayer ). Check their documentation to find out which kernels are used by each function.
-
-@code{.cpp}
-//Create a function object:
-MyFunction function;
-// Initialize the function with the input/output and options you want to use:
-function.configure( input, output, option0, option1);
-// Execute the function:
-function.run();
-@endcode
-
-@warning The Compute Library requires Arm® Mali™ OpenCL DDK r8p0 or higher (OpenCL kernels are compiled using the -cl-arm-non-uniform-work-group-size flag)
-
-@note All OpenCL functions and objects in the runtime library use the command queue associated with CLScheduler for all operations. A real implementation would be expected to use different queues for mapping operations and kernels in order to reach a better GPU utilization.
-
-@subsection S4_4_1_cl_scheduler OpenCL Scheduler and kernel library
-
-The Compute Library runtime uses a single command queue and context for all the operations.
-
-The user can get / set this context and command queue through CLScheduler's interface.
-
-The user can get / set the target GPU device through the CLScheduler's interface.
-
-@attention Make sure the application is using the same context as the library as in OpenCL it is forbidden to share objects across contexts. This is done by calling @ref CLScheduler::init() or @ref CLScheduler::default_init() at the beginning of your application.
-
-@attention Make sure the scheduler's target is not changed after function classes are created.
-
-All OpenCL kernels used by the library are built and stored in @ref CLKernelLibrary.
-If the library is compiled with embed_kernels=0, the application can set the path to the OpenCL kernels by calling @ref CLKernelLibrary::init(). By default, the path is set to "./cl_kernels".
-
-@subsection S4_4_2_events_sync OpenCL events and synchronization
-
-In order to block until all the jobs in the CLScheduler's command queue are done executing, the user can call @ref CLScheduler::sync() or create a sync event using @ref CLScheduler::enqueue_sync_event().
-
-@subsection S4_4_2_cl_neon OpenCL / Arm® Neon™ interoperability
-
-You can mix OpenCL and Arm® Neon™ kernels and functions. However, it is the user's responsibility to handle the mapping/unmapping of OpenCL objects.
-
-@section S4_5_algorithms Algorithms
-
-All computer vision algorithms in this library have been implemented following the [OpenVX 1.1 specifications](https://www.khronos.org/registry/vx/specs/1.1/html/). Please refer to the Khronos documentation for more information.
-
-@section S4_6_images_tensors Images, padding, border modes and tensors
-
-Most kernels and functions in the library process images, however, in order to be future proof most of the kernels actually accept tensors. See below for more information about how they are related.
-
-@attention Each memory object can be written by only one kernel, however it can be read by several kernels. Writing to the same object from several kernels will result in undefined behavior. The kernel writing to an object must be configured before the kernel(s) reading from it.
-
-@subsection S4_6_1_padding_and_border Padding and border modes
-
-Several algorithms require a neighborhood around the current pixel to compute its value. This means the algorithm will not be able to process the borders of the image unless you give it more information about how those border pixels should be processed. The @ref BorderMode enum is used for this purpose.
-
-You have 3 types of @ref BorderMode :
-
-- @ref BorderMode::UNDEFINED : Neighbor pixels outside of the image are treated as undefined. As a result all the pixels which are on the border will have a value which is undefined.
-- @ref BorderMode::REPLICATE : Neighbor pixels outside of the image are treated as having the same value as the closest valid pixel.
-- @ref BorderMode::CONSTANT : Neighbor pixels outside of the image are treated as having the same constant value. (The user can choose what this value should be).
-
-Moreover, both OpenCL and Arm® Neon™ use vector load and store instructions to access the data in buffers, so in order to avoid having special cases to handle for the borders, all the images and tensors used in this library must be padded.
-
-@subsubsection padding Padding
-
-There are different ways padding can be calculated:
-
-- Accurate padding:
-
-@note It's important to call allocate @b after the function is configured: if the image / tensor is already allocated then the function will shrink its execution window instead of increasing the padding. (See below for more details).
-
-- Manual padding / no padding / auto padding: You can allocate your images / tensors up front (before configuring your functions). In that case the function will use whatever padding is available and will shrink its execution window if there isn't enough padding available (which translates into a smaller valid region for the output). See also @ref valid_region.
-If you don't want to manually set the padding but still want to allocate your objects upfront then you can use auto_padding. It guarantees that the allocation will have enough padding to run any of the provided functions.
-
-@code{.cpp}
-Image     src, dst;
-
-// Use auto padding for the input:
-src.info()->init_auto_padding(TensorShape(640u,480u), Format::U8);
-
-// Use manual padding for the destination image
-dst.info()->init(src.info()->tensor_shape(), Format::U8, strides_in_bytes, offset_first_element_in_bytes, total_size_in_bytes);
-
-// Allocate all the images
-src.allocator()->allocate();
-dst.allocator()->allocate();
-// Fill the input image with the content of the PPM image if a filename was provided:
-fill_image(src);
-
-NEGaussian3x3 gauss;
-
-// Apply a Gaussian 3x3 filter to the source image (Note: if the padding provided is not enough then the execution window and valid region of the output will be shrunk)
-gauss.configure(&src, &dst, BorderMode::UNDEFINED);
-
-//Execute the functions:
-gauss.run();
-@endcode
-
-@warning Some kernels need up to 3 neighbor values to calculate the value of a given pixel. Therefore, to be safe, we use a 4-pixel padding all around the image. In addition, some kernels read and write up to 32 pixels at the same time. To cover that case as well we add an extra 32 pixels of padding at the end of each row. As a result auto padded buffers waste a lot of memory and are less cache friendly. It is therefore recommended to use accurate padding or manual padding wherever possible.
-
-@subsubsection valid_region Valid regions
-
-Some kernels (like edge detectors for example) need to read values of neighboring pixels to calculate the value of a given pixel; it is therefore not possible to calculate the values of the pixels on the edges.
-
-Another case is: if a kernel processes 8 pixels per iteration and the image's dimensions are not a multiple of 8 and not enough padding is available then the kernel will not be able to process the pixels near the right edge. As a result these pixels will be left undefined.
-
-In order to know which pixels have been calculated, each kernel sets a valid region for each output image or tensor. See also @ref TensorInfo::valid_region(), @ref ValidRegion
-
-@subsection S4_6_2_tensors Tensors
-
-Tensors are multi-dimensional arrays with a maximum of @ref Coordinates::num_max_dimensions dimensions.
-
-Depending on the number of dimensions, tensors can be interpreted as various objects. A scalar can be represented as a zero-dimensional tensor and a vector of numbers can be represented as a one-dimensional tensor. Further, an image is actually just a 2D tensor, a 3D tensor can be seen as an array of images and a 4D tensor as a 2D array of images, etc.
-
-@note Most algorithms process images (i.e. a 2D slice of the tensor), therefore only padding along the X and Y axes is required (2D slices can be stored contiguously in memory).
-
-@subsection S4_6_3_description_conventions Images and Tensors description conventions
-
-Image objects are defined by a @ref Format and dimensions expressed as [width, height, batch].
-
-Tensors are defined by a @ref DataType plus a number of channels (Always expected to be 1 for now) and their dimensions are expressed as [width, height, feature_maps, batch].
-
-In other words, the lower three dimensions of a tensor specify a single input in [width, height, feature_maps], while any other specified dimension represents a batch in the appropriate dimension space.
-For example, a tensor with dimensions [128, 128, 64, 16] represents a 1D batch space with 16 batches of 128 elements in width and height and 64 feature maps each.
-Each kernel specifies the expected layout of each of its tensors in its documentation.
-
-@note Unless specified otherwise in the kernel's or function's documentation, all tensor and image parameters passed must have identical dimensions.
-
-@note Unless specified otherwise in the kernel's or function's documentation the number of channels for tensors is expected to be 1 (For images, the number of channels is inferred from the @ref Format).
-
-@attention Regardless of the @ref DataType used by a tensor the @ref ITensor::buffer() method will always return a uint8_t pointer, and all the metadata in @ref TensorInfo will be expressed in bytes. It is the user's responsibility to cast the pointer to the correct type.
-
-For example, to read the element located at the coordinates (x,y) of a float tensor:
-
-@code{.cpp}
-float value = *reinterpret_cast<float*>(input.buffer() + input.info()->offset_element_in_bytes(Coordinates(x,y)));
-@endcode
-
-@subsection S4_6_4_working_with_objects Working with Images and Tensors using iterators
-
-The library provides some iterators to access objects' data.
-Iterators are created by associating a data object (An image or a tensor for example) with an iteration window.
-
-Iteration windows are defined by an array of dimensions, each of which consists of a start, end and step.
-
-The @ref execute_window_loop function takes an execution window, a lambda function and one or more iterators.
-It will iterate through every element of the execution window and for each element it will update the iterators accordingly and call the lambda function.
-
-Here are a couple of examples of how to use the iterators to fill / read tensors:
-
-@snippet examples/neon_copy_objects.cpp Copy objects example
-
-@subsection S4_6_5_sub_tensors Sub-tensors
-
-Sub-tensors are aliases to existing Tensors, as a result creating a sub-tensor does not result in any underlying memory allocation.
-
-Sub-tensors can be used to access a sub-set of the parent tensor, something that can be useful in case different operations need to be performed on different parts of a tensor.
-
-Moreover, sub-tensors can be used to perform zero copy tensor concatenation.
-
-The API for creating a sub-tensor is the following:
-@code{.cpp}
-SubTensor(ITensor *parent, const TensorShape &tensor_shape, const Coordinates &coords)
-@endcode
-
-Where \a parent is the parent tensor which we want to create an alias for, \a tensor_shape is the shape of the sub-tensor and \a coords are the starting indexing coordinates of the sub-tensor within the parent tensor.
-
-@note Two sub-tensor concrete classes for different targets are currently supported : @ref CLSubTensor and @ref SubTensor
-
-@warning A limitation of sub-tensors is that they cannot be extracted spatially, meaning sub-tensors should have the same width and height as the parent tensor. The main reason for this is that individual kernels might need to operate with a step size that is not a multiple of the sub-tensor spatial dimension. This could lead to elements being overwritten by different kernels operating on different sub-tensors of the same underlying tensor.
-
-@section S4_7_memory_manager MemoryManager
-
-@ref IMemoryManager is a memory managing interface that can be used to reduce the memory requirements of a given pipeline by recycling temporary buffers.
-
-@subsection S4_7_1_memory_manager_components MemoryGroup, MemoryPool and MemoryManager Components
-
-@subsubsection S4_7_1_1_memory_group MemoryGroup
-
-@ref IMemoryGroup defines the memory managing granularity.
-
-MemoryGroup binds a number of objects to a bucket of memory requirements that need to be fulfilled in order for an operation or list of operations to be executed.
-
-Requesting backing memory for a specific group can be done using @ref IMemoryGroup::acquire and releasing the memory back using @ref IMemoryGroup::release.
-
-@subsubsection S4_7_1_2_memory_pool MemoryPool
-
-@ref IMemoryPool defines a pool of memory that can be used to provide backing memory to a memory group.
-
-@note @ref BlobMemoryPool is currently implemented which models the memory requirements as a vector of distinct memory blobs.
-
-@subsubsection S4_7_1_2_memory_manager_components MemoryManager Components
-
-@ref IMemoryManager consists of two components:
-- @ref ILifetimeManager that keeps track of the lifetime of the registered objects of the memory groups and given an @ref IAllocator creates an appropriate memory pool that fulfils the memory requirements of all the registered memory groups.
-- @ref IPoolManager that safely manages the registered memory pools.
-
-@note @ref BlobLifetimeManager is currently implemented which models the memory requirements as a vector of distinct memory blobs.
-
-@subsection S4_7_2_working_with_memory_manager Working with the Memory Manager
-Using a memory manager to reduce the memory requirements of a pipeline can be summarized in the following steps:
-
-Initially a memory manager must be set-up:
-@code{.cpp}
-Allocator  allocator{};                                                               // Create an allocator to use for the backing memory allocation
-auto lifetime_mgr  = std::make_shared<BlobLifetimeManager>();                         // Create Lifetime Manager
-auto pool_mgr      = std::make_shared<PoolManager>();                                 // Create Pool Manager
-auto mm            = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr); // Create Memory Manager
-@endcode
-
-Once done, memory groups can be registered to use the memory manager:
-@code{.cpp}
-MemoryGroup memory_group(mm); // Create a memory group and set the memory manager to use
-@endcode
-
-@note If a memory manager is not specified, then all allocations will be immediate instead of deferred through the memory manager.
-
-Next step is to set objects to be managed by the memory group. It is important though to note that the lifetime of an object is tracked from the @ref MemoryGroup::manage() and the @ref TensorAllocator::allocate calls.
-@ref MemoryGroup::manage flags that the object will be needed starting now and when @ref TensorAllocator::allocate is called it signals the end of the object lifetime.
-@code{.cpp}
-Tensor tmp1, tmp2, tmp3;            // Create example tensors
-memory_group.manage(&tmp1);         // Start managing object tmp1 and start its lifetime
-memory_group.manage(&tmp2);         // Start managing object tmp2 and start its lifetime
-
-operation1.configure(&tmp1, &tmp2); // Configure a function/kernel using tmp1 and tmp2
-
-tmp1.allocator()->allocate();       // Flag that the lifetime of object tmp1 has ended
-
-memory_group.manage(&tmp3);         // Start managing object tmp3 and start its lifetime
-
-operation2.configure(&tmp2, &tmp3); // Configure a function/kernel using tmp2 and tmp3
-
-tmp2.allocator()->allocate();       // Flag that the lifetime of object tmp2 has ended
-tmp3.allocator()->allocate();       // Flag that the lifetime of object tmp3 has ended
-@endcode
-
-@warning The configuration step should be done sequentially by a single thread so that all the lifetimes are captured correctly.
-
-When configuration of all the operations is finished, the memory manager has to be populated:
-@code{.cpp}
-mm->populate(allocator, 2 /* num_pools */); // Populate memory manager pools
-@endcode
-
-Finally, during execution of the pipeline the memory of the appropriate memory group should be requested before running:
-@code{.cpp}
-memory_group.acquire(); // Request memory for the group
-
-operation1.run();       // Run operation1
-operation2.run();       // Run operation2
-
-memory_group.release(); // Release memory so that it can be reused
-@endcode
-@note Execution of a pipeline can be done in a multi-threading environment as memory acquisition/release are thread safe.
-@note If you are handling sensitive data and it's required to zero out the memory buffers before freeing, make sure to also zero out the intermediate buffers. You can access the buffers through the memory group's mappings.
-
-@subsection S4_7_3_memory_manager_function_support Function support
-
-Most of the library's functions have been ported to use @ref IMemoryManager for their internal temporary buffers.
-
-If that is the case, a memory manager can be passed to them during construction to reuse memory among these functions.
-@code{.cpp}
-// Setup Memory Manager
-CLBufferAllocator  allocator{};                                                       // Create an allocator to use for the backing memory allocation
-auto lifetime_mgr  = std::make_shared<BlobLifetimeManager>();                         // Create Lifetime Manager
-auto pool_mgr      = std::make_shared<PoolManager>();                                 // Create Pool Manager
-auto mm            = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr); // Create Memory Manager
-
-// Create two convolution layers and use the memory manager to manage their internal temporary buffers
-CLConvolutionLayer conv1(mm), conv2(mm);
-
-// Configure layers
-conv1.configure(...);
-conv2.configure(...);
-
-// Populate memory manager
-mm->populate(allocator, 1 /* num_pools */); // Populate memory manager pools
-
-// Run layers (memory will be recycled for the internal buffers of conv1 and conv2)
-conv1.run();
-conv2.run();
-@endcode
-
-@section S4_8_import_memory Import Memory Interface
-
-The implemented @ref TensorAllocator and @ref CLTensorAllocator objects provide an interface capable of importing existing memory to a tensor as backing memory.
-
-A simple Arm® Neon™ example can be the following:
-@code{.cpp}
-// External backing memory
-void* external_ptr = ...;
-
-// Create and initialize tensor
-Tensor tensor;
-tensor.allocator()->init(tensor_info);
-
-// Import existing pointer as backing memory
-tensor.allocator()->import_memory(external_ptr);
-@endcode
-
-It is important to note the following:
-- Ownership of the backing memory is not transferred to the tensor itself.
-- The tensor mustn't be memory managed.
-- Padding requirements should be accounted for by the client code. In other words, if padding is required by the tensor after the function configuration step, then the imported backing memory should account for it. Padding can be checked through the @ref TensorInfo::padding() interface.
-
-@section S4_9_opencl_tuner OpenCL Tuner
-
-OpenCL kernels when dispatched to the GPU take two arguments:
-- The Global Workgroup Size (GWS): That's the number of times to run an OpenCL kernel to process all the elements we want to process.
-- The Local Workgroup Size (LWS): That's the number of elements we want to run in parallel on a GPU core at a given point in time.
-
-The LWS can be required by an algorithm (For example if it contains memory barriers or uses local memory) but it can also be used for performance reasons to tweak the performance of a kernel: the execution time of the overall kernel might vary significantly depending on how the GWS is broken down.
-
-However, there is no universal rule regarding which LWS is best for a given kernel, so instead we created the @ref CLTuner.
-
-When the @ref CLTuner is enabled (Target = 2 for the graph examples), the first time an OpenCL kernel is executed the Compute Library will try to run it with a variety of LWS values and will remember which one performed best for subsequent runs. At the end of the run the @ref graph::Graph will try to save these tuning parameters to a file.
-
-However, this process takes quite a lot of time, which is why it cannot be enabled all the time. @ref CLTuner supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
-
-Conversely, when the @ref CLTuner is disabled (Target = 1 for the graph examples), the @ref graph::Graph will try to reload the file containing the tuning parameters; then, for each executed kernel, the Compute Library will use the fine-tuned LWS if it is present in the file, or a default LWS value if it is not.
-
-@section S4_10_cl_queue_prioritites OpenCL Queue Priorities
-
-OpenCL 2.1 exposes the `cl_khr_priority_hints` extension that, if supported by the underlying implementation, allows the user to specify priority hints for the created command queues.
-It is important to note that this does not guarantee any explicit scheduling behavior; that is something each implementation needs to expose.
-
-In some cases, priority queues can be used when there is an implicit internal priority between graphics and compute queues and thus allow some level of priority control between them.
-At the moment three priority levels can be specified:
-- CL_QUEUE_PRIORITY_HIGH_KHR
-- CL_QUEUE_PRIORITY_MED_KHR
-- CL_QUEUE_PRIORITY_LOW_KHR
-
-Compute Library allows extraction of the internal OpenCL queue or the ability to inject directly a user-defined queue to the @ref CLScheduler.
-This way the user can utilize this extension to define priorities between the queues and setup the OpenCL scheduler mechanism to utilize them.
-
-@code{.cpp}
-cl_queue_properties queue_properties[] = {CL_QUEUE_PRIORITY_KHR, CL_QUEUE_PRIORITY_HIGH_KHR, 0};
-cl_command_queue priority_queue = clCreateCommandQueueWithProperties(ctx, dev, queue_properties, &error);
-CLScheduler::get().set_queue(::cl::CommandQueue(priority_queue));
-@endcode
-
-@section S4_11_weights_manager Weights Manager
-
-@ref IWeightsManager is a weights managing interface that can be used to reduce the memory requirements of a given pipeline by reusing transformed weights across multiple function executions.
-@ref IWeightsManager is responsible for managing weight tensors along with their transformations.
-@ref ITransformWeights provides an interface for running the desired transform function. This interface is used by the weights manager.
-
-@subsection S4_10_1_working_with_weights_manager Working with the Weights Manager
-Following is a simple example that uses the weights manager:
-
-Initially a weights manager must be set up:
-@code{.cpp}
-auto  wm = std::make_shared<IWeightsManager>(); // Create a weights manager
-@endcode
-
-Once done, weights can be managed, configured and run:
-@code{.cpp}
-wm->manage(weights); // Manage the weights
-wm->acquire(weights, &_reshape_weights_managed_function); // Acquire the address of the transformed weights based on the transform function
-wm->run(weights, &_reshape_weights_managed_function);     // Run the transform function
-@endcode
-
-@section S5_0_experimental Experimental Features
-
-@subsection S5_1_run_time_context Run-time Context
-
-Some of the Compute Library components are modelled as singletons thus posing limitations to supporting some use-cases and ensuring a more client-controlled API.
-Thus, we are introducing an aggregate service interface @ref IRuntimeContext which will encapsulate the services that the singletons were providing and allow better control of these by the client code.
-The run-time context encapsulates a number of mechanisms, including scheduling, memory management and kernel caching.
-Consequently, this will allow finer control of these services among pipelines when Compute Library is integrated in higher level frameworks.
-
-This feature introduces some changes to our API.
-All the kernels/functions will now accept a Runtime Context object which will allow the function to use the mentioned services.
-
-Finally, we will try to adapt our code-base progressively to use the new mechanism but will continue supporting the legacy mechanism to allow a smooth transition. Changes will apply to all three of our backends: Neon, OpenCL and OpenGL ES.
-
-@subsection S5_2_clvk CLVK
-
-Compute Library offers experimental support for [CLVK](https://github.com/kpet/clvk). If CLVK is installed in the system, users can select the backend when running a graph example with --target=clvk.
-If no target is specified and more than one OpenCL implementation is present, Compute Library will pick the first available.
-*/
-} // namespace arm_compute
diff --git a/docs/02_tests.dox b/docs/02_tests.dox
deleted file mode 100644
index 70d2f3d67b..0000000000
--- a/docs/02_tests.dox
+++ /dev/null
@@ -1,385 +0,0 @@
-///
-/// Copyright (c) 2017-2020 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-namespace test
-{
-/**
-@page tests Validation and benchmarks tests
-
-@tableofcontents
-
-@section tests_overview Overview
-
-Benchmark and validation tests are based on the same framework to set up and run
-the tests. In addition to running simple, self-contained test functions the
-framework supports fixtures and data test cases. The former allow common setup
-routines to be shared between various backends, thus reducing the amount of
-duplicated code. The latter can be used to parameterize tests or fixtures with
-different inputs, e.g. different tensor shapes. One limitation is that
-tests/fixtures cannot be parameterized based on the data type if static type
-information is needed within the test (e.g. to validate the results).
-
-@note By default tests are not built. To enable them you need to add validation_tests=1 and / or benchmark_tests=1 to your SCons line.
-
-@note Tests are not included in the pre-built binary archive; you have to build them from sources.
-
-@subsection tests_overview_fixtures Fixtures
-
-Fixtures can be used to share common setup, teardown or even run tasks among
-multiple test cases. For that purpose a fixture can define a `setup`,
-`teardown` and `run` method. Additionally the constructor and destructor might
-also be customized.
-
-An instance of the fixture is created immediately before the actual test is
-executed. After construction the @ref framework::Fixture::setup method is called. Then the test
-function or the fixture's `run` method is invoked. After test execution the
-@ref framework::Fixture::teardown method is called and lastly the fixture is destructed.
-
-@subsubsection tests_overview_fixtures_fixture Fixture
-
-Fixtures for non-parameterized tests are straightforward. The custom fixture
-class has to inherit from @ref framework::Fixture and choose to implement any of the
-`setup`, `teardown` or `run` methods. None of the methods takes any arguments
-or returns anything.
-
-    class CustomFixture : public framework::Fixture
-    {
-        void setup()
-        {
-            _ptr = malloc(4000);
-        }
-
-        void run()
-        {
-            ARM_COMPUTE_ASSERT(_ptr != nullptr);
-        }
-
-        void teardown()
-        {
-            free(_ptr);
-        }
-
-        void *_ptr;
-    };
-
-@subsubsection tests_overview_fixtures_data_fixture Data fixture
-
-The advantage of a parameterized fixture is that arguments can be passed to the setup method at runtime. To make this possible the setup method has to be a template with a type parameter for every argument (though the template parameter doesn't have to be used). All other methods remain the same.
-
-    class CustomFixture : public framework::Fixture
-    {
-    #ifdef ALTERNATIVE_DECLARATION
-        template <typename ...>
-        void setup(size_t size)
-        {
-            _ptr = malloc(size);
-        }
-    #else
-        template <typename T>
-        void setup(T size)
-        {
-            _ptr = malloc(size);
-        }
-    #endif
-
-        void run()
-        {
-            ARM_COMPUTE_ASSERT(_ptr != nullptr);
-        }
-
-        void teardown()
-        {
-            free(_ptr);
-        }
-
-        void *_ptr;
-    };
-
-@subsection tests_overview_test_cases Test cases
-
-All following commands can be optionally prefixed with `EXPECTED_FAILURE_` or
-`DISABLED_`.
-
-@subsubsection tests_overview_test_cases_test_case Test case
-
-A simple test case function taking no inputs and having no (shared) state.
-
-- First argument is the name of the test case (has to be unique within the
-  enclosing test suite).
-- Second argument is the dataset mode in which the test will be active.
-
-
-    TEST_CASE(TestCaseName, DatasetMode::PRECOMMIT)
-    {
-        ARM_COMPUTE_ASSERT_EQUAL(1 + 1, 2);
-    }
-
-@subsubsection tests_overview_test_cases_fixture_fixture_test_case Fixture test case
-
-A simple test case function taking no inputs that inherits from a fixture. The
-test case will have access to all public and protected members of the fixture.
-Only the setup and teardown methods of the fixture will be used. The body of
-this function will be used as test function.
-
-- First argument is the name of the test case (has to be unique within the
-  enclosing test suite).
-- Second argument is the class name of the fixture.
-- Third argument is the dataset mode in which the test will be active.
-
-
-    class FixtureName : public framework::Fixture
-    {
-        public:
-            void setup() override
-            {
-                _one = 1;
-            }
-
-        protected:
-            int _one;
-    };
-
-    FIXTURE_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT)
-    {
-        ARM_COMPUTE_ASSERT_EQUAL(_one + 1, 2);
-    }
-
-@subsubsection tests_overview_test_cases_fixture_register_fixture_test_case Registering a fixture as test case
-
-Allows a fixture to be used directly as a test case. Instead of defining a new test
-function the run method of the fixture will be executed.
-
-- First argument is the name of the test case (has to be unique within the
-  enclosing test suite).
-- Second argument is the class name of the fixture.
-- Third argument is the dataset mode in which the test will be active.
-
-
-    class FixtureName : public framework::Fixture
-    {
-        public:
-            void setup() override
-            {
-                _one = 1;
-            }
-
-            void run() override
-            {
-                ARM_COMPUTE_ASSERT_EQUAL(_one + 1, 2);
-            }
-
-        protected:
-            int _one;
-    };
-
-    REGISTER_FIXTURE_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT);
-
-
-@subsubsection tests_overview_test_cases_data_test_case Data test case
-
-A parameterized test case function that has no (shared) state. The dataset will
-be used to generate versions of the test case with different inputs.
-
-- First argument is the name of the test case (has to be unique within the
-  enclosing test suite).
-- Second argument is the dataset mode in which the test will be active.
-- Third argument is the dataset.
-- Further arguments specify names of the arguments to the test function. The
-  number must match the arity of the dataset.
-
-
-    DATA_TEST_CASE(TestCaseName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}), num)
-    {
-        ARM_COMPUTE_ASSERT(num < 4);
-    }
-
-@subsubsection tests_overview_test_cases_fixture_data_test_case Fixture data test case
-
-A parameterized test case that inherits from a fixture. The test case will have
-access to all public and protected members of the fixture. Only the setup and
-teardown methods of the fixture will be used. The setup method of the fixture
-needs to be a template and has to accept inputs from the dataset as arguments.
-The body of this function will be used as test function. The dataset will be
-used to generate versions of the test case with different inputs.
-
-- First argument is the name of the test case (has to be unique within the
-  enclosing test suite).
-- Second argument is the class name of the fixture.
-- Third argument is the dataset mode in which the test will be active.
-- Fourth argument is the dataset.
-
-
-    class FixtureName : public framework::Fixture
-    {
-        public:
-            template <typename T>
-            void setup(T num)
-            {
-                _num = num;
-            }
-
-        protected:
-            int _num;
-    };
-
-    FIXTURE_DATA_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}))
-    {
-        ARM_COMPUTE_ASSERT(_num < 4);
-    }
-
-@subsubsection tests_overview_test_cases_register_fixture_data_test_case Registering a fixture as data test case
-
-Allows a fixture to be used directly as a parameterized test case. Instead of
-defining a new test function the run method of the fixture will be executed.
-The setup method of the fixture needs to be a template and has to accept inputs
-from the dataset as arguments. The dataset will be used to generate versions of
-the test case with different inputs.
-
-- First argument is the name of the test case (has to be unique within the
-  enclosing test suite).
-- Second argument is the class name of the fixture.
-- Third argument is the dataset mode in which the test will be active.
-- Fourth argument is the dataset.
-
-
-    class FixtureName : public framework::Fixture
-    {
-        public:
-            template <typename T>
-            void setup(T num)
-            {
-                _num = num;
-            }
-
-            void run() override
-            {
-                ARM_COMPUTE_ASSERT(_num < 4);
-            }
-
-        protected:
-            int _num;
-    };
-
-    REGISTER_FIXTURE_DATA_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}));
-
-@section writing_tests Writing validation tests
-
-Before starting a new test case, have a look at the existing ones. They should
-provide a good overview of how test cases are structured.
-
-- The C++ reference needs to be added to `tests/validation/CPP/`. The
-  reference function is typically a template parameterized by the underlying
-  value type of the `SimpleTensor`. This makes it easy to specialise for
-  different data types.
-- If all backends have a common interface it makes sense to share the setup
-  code. This can be done by adding a fixture in
-  `tests/validation/fixtures/`. Inside of the `setup` method of a fixture
-  the tensors can be created and initialised and the function can be configured
-  and run. The actual test will only have to validate the results. To be shared
-  among multiple backends the fixture class is usually a template that accepts
-  the specific types (data, tensor class, function class etc.) as parameters.
-- The actual test cases need to be added for each backend individually.
-  Typically there will be multiple tests for different data types and for
-  different execution modes, e.g. precommit and nightly.
-
-@section tests_running_tests Running tests
-@subsection tests_running_tests_benchmark_and_validation Benchmarking and validation suites
-@subsubsection tests_running_tests_benchmarking_filter Filter tests
-All tests can be run by invoking
-
-    ./arm_compute_benchmark ./data
-
-where `./data` contains the assets needed by the tests.
-
-If only a subset of the tests has to be executed the `--filter` option takes a
-regular expression to select matching tests.
-
-    ./arm_compute_benchmark --filter='^NEON/.*AlexNet' ./data
-
-@note Filtering will be much faster if the regular expression starts from the start ("^") or end ("$") of the line.
-
-Additionally each test has a test id which can be used as a filter, too.
-However, the test id is not guaranteed to be stable when new tests are added.
-A test is only guaranteed to keep the same id within a specific build.
-
-    ./arm_compute_benchmark --filter-id=10 ./data
-
-All available tests can be displayed with the `--list-tests` switch.
-
-    ./arm_compute_benchmark --list-tests
-
-More options can be found in the `--help` message.
-
-@subsubsection tests_running_tests_benchmarking_runtime Runtime
-By default every test is run once on a single thread. The number of iterations
-can be controlled via the `--iterations` option and the number of threads via
-`--threads`.
-
-@subsubsection tests_running_tests_benchmarking_output Output
-By default the benchmarking results are printed in a human readable format on
-the command line. The colored output can be disabled via `--no-color-output`.
-As an alternative output format JSON is supported and can be selected via
-`--log-format=json`. To write the output to a file instead of stdout the
-`--log-file` option can be used.
-
-@subsubsection tests_running_tests_benchmarking_mode Mode
-Tests contain different datasets of different sizes, some of which will take several hours to run.
-You can select which datasets to use with the `--mode` option; we recommend you use `--mode=precommit` to start with.
-
-@subsubsection tests_running_tests_benchmarking_instruments Instruments
-You can use the `--instruments` option to select one or more instruments to measure the execution time of the benchmark tests.
-
-`PMU` will try to read the CPU PMU events from the kernel (They need to be enabled on your platform)
-
-`MALI` will try to collect Arm® Mali™ hardware performance counters. (You need to have a recent enough Arm® Mali™ driver)
-
-`WALL_CLOCK_TIMER` will measure time using `gettimeofday`: this should work on all platforms.
-
-You can pass a combination of these instruments: `--instruments=PMU,MALI,WALL_CLOCK_TIMER`
-
-@note You need to make sure the instruments have been selected at compile time using the `pmu=1` or `mali=1` scons options.
-
-@subsubsection tests_running_examples Examples
-
-To run all the precommit validation tests:
-
-	LD_LIBRARY_PATH=. ./arm_compute_validation --mode=precommit
-
-To run the OpenCL precommit validation tests:
-
-	LD_LIBRARY_PATH=. ./arm_compute_validation --mode=precommit --filter="^CL.*"
-
-To run the Arm® Neon™ precommit benchmark tests with the PMU and wall clock timer (in milliseconds) instruments enabled:
-
-	LD_LIBRARY_PATH=. ./arm_compute_benchmark --mode=precommit --filter="^NEON.*" --instruments="pmu,wall_clock_timer_ms" --iterations=10
-
-To run the OpenCL precommit benchmark tests with OpenCL kernel timers in milliseconds enabled:
-
-	LD_LIBRARY_PATH=. ./arm_compute_benchmark --mode=precommit --filter="^CL.*" --instruments="opencl_timer_ms" --iterations=10
-
-@note You might need to export the path to OpenCL library as well in your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.
-*/
-} // namespace test
-} // namespace arm_compute
diff --git a/docs/04_adding_operator.dox b/docs/04_adding_operator.dox
deleted file mode 100644
index aef1bb4af0..0000000000
--- a/docs/04_adding_operator.dox
+++ /dev/null
@@ -1,332 +0,0 @@
-///
-/// Copyright (c) 2018-2019 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-
-namespace arm_compute
-{
-/**
-@page add_operator Adding new operators
-
-@tableofcontents
-
-@section S4_1_introduction Introduction
-In Compute Library there are two main parts or modules:
-- The core library consists of a low-level collection of algorithms implemented in C++ and optimized for Arm CPUs and GPUs. The core module is designed to be embedded in other projects and it doesn't perform any memory management or scheduling.
-- The runtime library is a wrapper of the core library and provides other additional features like memory management, multithreaded execution of workloads and allocation of the intermediate tensors.
-
-The library can be integrated in an existing external library or application that provides its own scheduler or a specific memory manager. In that case, the right solution is to use only the core library which means that the user must also manage all the memory allocation not only for the input/output tensor but also for the intermediate tensors/variables necessary. On the other hand, if the user doesn't want to care about allocation and multithreading then the right choice is to use the functions from the runtime library.
-
-Apart from these components that get linked into the application, the sources also include the validation test suite and the C++ reference implementations against which all the operators are validated.
-
-
-@section S4_1_supporting_new_operators Supporting new operators
-
-The following steps are involved in adding support for a new operator in Compute Library:
-- Add new data types (if required).
-- Add the kernel to the core library.
-- Add the function to the runtime library.
-- Add validation tests.
-    - Add the reference implementation.
-    - Add the fixture.
-    - Register the tests.
-
-@subsection S4_1_1_add_datatypes Adding new data types
-
-Compute Library declares a few new datatypes related to its domain. Kernels and functions in the library process Tensors and Images (Computer Vision functions). Tensors are multi-dimensional arrays with a maximum of Coordinates::num_max_dimensions dimensions; depending on the number of dimensions, tensors can be interpreted as various objects. A scalar can be represented as a zero-dimensional tensor and a vector of numbers can be represented as a one-dimensional tensor. Furthermore, an image is just a 2D tensor, a 3D tensor can be seen as an array of images and a 4D tensor as a 2D array of images, etc.
-All the datatype classes or structures, such as @ref ITensor, @ref ITensorInfo (all the information of a tensor) and TensorShape, are grouped in the core library folder arm_compute/core; simpler types are in arm_compute/core/Types.h.
-
-If an operator handles a new datatype, it must be added to the library. While adding a new data type to the library, it's necessary to implement the functions that enable printing: the to_string() method and the output stream insertion (<<) operator. Every datatype implements these two functions in utils/TypePrinter.h.
-
-A quick example, in <a href="https://github.com/ARM-software/ComputeLibrary/blob/master/arm_compute/core/Types.h">Types.h</a> we add:
-
-@snippet arm_compute/core/Types.h DataLayout enum definition
-
-And for printing:
-
-@snippet utils/TypePrinter.h Print DataLayout type
-
-In Compute Library, we use namespaces to group all the operators, functions, classes and interfaces. The main namespace to use is arm_compute. In the test suite, the test framework and the individual tests use nested namespaces like @ref test::validation or @ref test::benchmark to group the different purposes of various parts of the suite.
-Utility functions, like conversion or type cast operators, that are shared by multiple operators are in arm_compute/core/Utils.h. Non-inlined function definitions go in the corresponding .cpp files in the src folder.
-Similarly, all common functions that process shapes, like calculating output shapes of an operator or shape conversions, are in arm_compute/core/utils/misc/ShapeCalculator.h.
-
-
-@subsection S4_1_2_add_kernel Add a kernel
-As we mentioned at the beginning, the kernel is the implementation of the operator or algorithm partially using a specific programming language related to the backend we want to use. Adding a kernel in the library means implementing the algorithm in a SIMD technology like Arm® Neon™ or OpenCL. All kernels in Compute Library must implement a common interface IKernel or one of the specific subinterfaces.
-IKernel is the common interface for all the kernels in the core library. It contains the main methods to configure and run the kernel itself, such as window(), which returns the maximum window the kernel can be executed on, and is_parallelisable(), which indicates whether or not the kernel is parallelizable. If the kernel is parallelizable, then the window returned by the window() method can be split into sub-windows which can then be run in parallel; otherwise, only the window returned by window() can be passed to the run method.
-There are specific interfaces for OpenCL and Neon: @ref ICLKernel, INEKernel (using INEKernel = @ref ICPPKernel).
-
-- @ref ICLKernel is the common interface for all the OpenCL kernels. It implements the inherited methods and adds all the methods necessary to configure the CL kernel, such as setting/returning the Local-Workgroup-Size hint, adding a single, array or tensor argument, and setting the targeted GPU architecture according to the CL device. All these methods are used during the configuration and the run of the operator.
-- INEKernel inherits from @ref IKernel as well and is the common interface for all kernels implemented in Neon; it adds just the run and the name methods.
-
-There are two other implementations of @ref IKernel, called @ref ICLSimpleKernel and INESimpleKernel; they are the interfaces for simple kernels that have just one input tensor and one output tensor.
-Creating a new kernel implies adding or updating the following files. For an OpenCL kernel:
-- src/core/CL/kernels/CLReshapeLayerKernel.h
-- src/core/CL/cl_kernels/reshape_layer.cl
-- src/core/CL/kernels/CLReshapeLayerKernel.cpp
-- src/core/CL/CLKernelLibrary.cpp
-
-Neon kernel
-- arm_compute/core/NEON/kernels/NEReshapeLayerKernel.h
-- src/core/NEON/kernels/NEReshapeLayerKernel.cpp
-
-We must register the new layer in the respective libraries:
-- src/core/CL/CLKernels.h
-- arm_compute/core/NEON/NEKernels.h
-
-These files contain the list of all kernels available in the corresponding Compute Library's backend, for example CLKernels:
-@code{.cpp}
-... 
-#include "src/core/CL/kernels/CLMinMaxLayerKernel.h"
-#include "src/core/CL/kernels/CLMinMaxLocationKernel.h"
-... 
-#include "src/core/CL/kernels/CLReshapeLayerKernel.h"
-... 
-
-@endcode
-
-For OpenCL we need to update CLKernelLibrary.cpp, adding the appropriate code to embed the .cl kernel in the library. The OpenCL code can be compiled offline and embedded in the library as a binary.
-The essential operations we want to perform with a kernel are:
-- Create the kernel object.
-- Initialize the kernel with the input/output and any other parameters that may be required.
-- Retrieve the execution window of the kernel and run the whole kernel window in the current thread or use multithreading.
-
-Each kernel will have to implement the following methods:
-- %validate: a static function that checks if the given info will lead to a valid configuration of the kernel.
-- configure: configures the kernel, its window, accessor, valid region, etc. for the given set of tensors and other parameters.
-- run: executes the kernel in the window.
-
-The structure of the kernel .cpp file should be similar to the following examples.
-For OpenCL:
-@snippet src/core/gpu/cl/kernels/ClReshapeKernel.cpp ClReshapeKernel Kernel
-The run will call the function defined in the .cl file.
-
-For the Arm® Neon™ backend case:
-@snippet src/core/cpu/kernels/CpuReshapeKernel.cpp NEReshapeLayerKernel Kernel
-
-In the Arm® Neon™ case, there is no need to add an extra file and we implement the kernel in the same NEReshapeLayerKernel.cpp file.
-If the tests are already in place, the new kernel can be tested using the existing tests by adding the configure and run of the kernel to the compute_target() in the fixture.
-
-
-@subsection S4_1_3_add_function Add a function
-
-%Memory management and scheduling of the underlying kernel(s) must be handled by the function implementation. A kernel class must support the window() API, which returns the execution window for the configuration that the kernel is configured for. A window specifies the dimensions of a workload. It has a start and end on each of the dimensions. A maximum of Coordinates::num_max_dimensions is supported. The runtime layer is expected to query the kernel for the window size and schedule the window as it sees fit. It could choose to split the window into sub-windows so that it could be run in parallel. The split must adhere to the following rules:
-
-- max[n].start() <= sub[n].start() < max[n].end()
-- sub[n].start() < sub[n].end() <= max[n].end()
-- max[n].step() == sub[n].step()
-- (sub[n].start() - max[n].start()) % max[n].step() == 0
-- (sub[n].end() - sub[n].start()) % max[n].step() == 0
-
-@ref CPPScheduler::schedule provides a sample implementation that is used for Arm® Neon™ kernels.
-%Memory management is the other aspect that the runtime layer is supposed to handle. %Memory management of the tensors is abstracted using TensorAllocator. Each tensor holds a pointer to a TensorAllocator object, which is used to allocate and free the memory at runtime. The implementation that is currently supported in Compute Library allows memory blocks, required to be fulfilled for a given operator, to be grouped together under a @ref MemoryGroup. Each group can be acquired and released. The underlying implementation of memory groups varies depending on whether Arm® Neon™ or CL is used. The memory group class uses a memory pool to provide the required memory. It also uses the memory manager to manage the lifetime and an IPoolManager to manage the memory pools registered with the memory manager.
-
-
-We have seen the various interfaces for a kernel in the core library; the same file structure design exists in the runtime module. IFunction is the base class for all the functions. It has two child interfaces, ICLSimpleFunction and INESimpleFunction, that are used as base classes for functions which call a single kernel.
-
-The new operator has to implement %validate(), configure() and run(). These methods call the respective functions in the kernel. Multi-threading is used for the kernels which are parallelizable; by default std::thread::hardware_concurrency() threads are used. For Arm® Neon™ functions, CPPScheduler::set_num_threads() can be used to manually set the number of threads, whereas for OpenCL all the kernels are enqueued on the queue associated with CLScheduler and the queue is then flushed.
-For the runtime functions, there is an extra method: prepare(). This method prepares the function for the run by doing all the heavy operations that are done only once (reshaping the weights, releasing memory that is not needed after the reshape, etc). The prepare method can be called standalone; if it has not been called before the first run, it is called as part of that run. After that, the function is marked as prepared.
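-
-The prepare-once behaviour of prepare() can be sketched as follows. ExampleFunction, its members and the "reshape" it performs are hypothetical and only illustrate the pattern; they are not taken from the library.
-
-@code{.cpp}
-#include <utility>
-#include <vector>
-
-// Hypothetical function object that needs a one-off transformation of its weights.
-class ExampleFunction
-{
-public:
-    explicit ExampleFunction(std::vector<float> weights)
-        : _weights(std::move(weights))
-    {
-    }
-
-    // Does the heavy one-off work. Safe to call more than once.
-    void prepare()
-    {
-        if(!_is_prepared)
-        {
-            // Stand-in for a weight reshape: store the weights in reverse order
-            _reshaped_weights.assign(_weights.rbegin(), _weights.rend());
-
-            // The original weights are no longer needed after the reshape
-            _weights.clear();
-            _weights.shrink_to_fit();
-
-            ++_prepare_count;
-            _is_prepared = true;
-        }
-    }
-
-    // Calls prepare() in case the user has not done so explicitly
-    float run()
-    {
-        prepare();
-        return _reshaped_weights.empty() ? 0.f : _reshaped_weights.front();
-    }
-
-    int prepare_count() const
-    {
-        return _prepare_count;
-    }
-
-private:
-    std::vector<float> _weights;
-    std::vector<float> _reshaped_weights{};
-    int                _prepare_count{ 0 };
-    bool               _is_prepared{ false };
-};
-@endcode
-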
-The files we add are:
-
-OpenCL function
-- arm_compute/runtime/CL/functions/CLReshapeLayer.h
-- src/runtime/CL/functions/CLReshapeLayer.cpp
-
-Neon function
-- arm_compute/runtime/NEON/functions/NEReshapeLayer.h
-- src/runtime/NEON/functions/NEReshapeLayer.cpp
-
-As we did for the kernel, we have to edit the runtime libraries to register the new operator by modifying the relevant library file:
-- arm_compute/runtime/CL/CLFunctions.h
-- arm_compute/runtime/NEON/NEFunctions.h
-
-For the special case where the new function calls only one kernel, we can use ICLSimpleFunction or INESimpleFunction as the base class. The configure and the validate methods will simply call the corresponding functions of the kernel. The structure will be:
-@snippet src/runtime/CL/functions/CLReshapeLayer.cpp CLReshapeLayer snippet
-
-
-If the function is more complicated and calls more than one kernel, we have to use the memory manager to manage the intermediate tensors. In the configure() method we call the manage() function, passing the tensor to keep track of; in the run method we have to acquire all the managed buffers and release them at the end.
-For OpenCL, if we want to add two input tensors and reshape the result:
-
-@code{.cpp}
-using namespace arm_compute;
-
-CLAddReshapeLayer::CLAddReshapeLayer(std::shared_ptr<IMemoryManager> memory_manager)
-    : _memory_group(std::move(memory_manager))
-{
-}
-
-void CLAddReshapeLayer::configure(const ICLTensor *input1, const ICLTensor *input2, ICLTensor *output)
-{
-    // Initialise the metadata of the intermediate tensor
-    TensorInfo info{};
-    _add_output.allocator()->init(info);
-
-    // Manage intermediate buffers
-    _memory_group.manage(&_add_output);
-
-    // Initialise kernels
-    _add_kernel.configure(input1, input2, &_add_output);
-    _reshape_kernel.configure(&_add_output, output);
-
-    // Allocate intermediate tensors
-    _add_output.allocator()->allocate();
-}
-
-Status CLAddReshapeLayer::validate(const ITensorInfo *input1, const ITensorInfo *input2, const ITensorInfo *output)
-{
-    TensorInfo add_output{};
-    ARM_COMPUTE_RETURN_ON_ERROR(CLAddLayerKernel::validate(input1, input2, &add_output));
-    ARM_COMPUTE_RETURN_ON_ERROR(CLReshapeLayerKernel::validate(&add_output, output));
-    return Status{};
-}
-
-void CLAddReshapeLayer::run()
-{
-    _memory_group.acquire();
-
-    // Run Add
-    CLScheduler::get().enqueue(_add_kernel);
-
-    // Run Reshape
-    CLScheduler::get().enqueue(_reshape_kernel);
-
-    _memory_group.release();
-}
-
-@endcode
-
-For Neon:
-
-@code{.cpp}
-using namespace arm_compute;
-
-NEAddReshapeLayer::NEAddReshapeLayer(std::shared_ptr<IMemoryManager> memory_manager)
-    : _memory_group(std::move(memory_manager))
-{
-}
-
-void NEAddReshapeLayer::configure(const ITensor *input1, const ITensor *input2, ITensor *output)
-{
-    // Initialise the metadata of the intermediate tensor
-    TensorInfo info{};
-    _add_output.allocator()->init(info);
-
-    // Manage intermediate buffers
-    _memory_group.manage(&_add_output);
-
-    // Initialise kernels
-    _add_kernel.configure(input1, input2, &_add_output);
-    _reshape_kernel.configure(&_add_output, output);
-
-    // Allocate intermediate tensors
-    _add_output.allocator()->allocate();
-}
-
-void NEAddReshapeLayer::run()
-{
-    _memory_group.acquire();
-
-    // Run Add
-    NEScheduler::get().schedule(&_add_kernel, Window::DimY);
-
-    // Run Reshape
-    NEScheduler::get().schedule(&_reshape_kernel, Window::DimY);
-
-    _memory_group.release();
-}
-@endcode
-
-
-At this point, everything is in place at the library level. If you are following a test-driven implementation and all the tests are already in place, we can call the function configuration in the fixture and remove any redundant code, such as the allocation of the intermediate tensors, since this is done in the function. Run the final tests to check that the results match the expected results from the reference implementation.
-
-@subsection S4_1_4_add_validation Add validation artifacts
-
-@subsubsection S4_1_4_1_add_reference Add the reference implementation and the tests
-As mentioned in the introduction, the reference implementation is a pure C++ implementation without any optimization or backend specific instruction.
-The reference implementation consists of two files in the folder tests/validation/reference:
-- tests/validation/reference/ReshapeLayer.h
-- tests/validation/reference/ReshapeLayer.cpp
-
-where we put the declaration and the definition of the new operator respectively.
-All the utility functions that are used ONLY in the tests are in tests/validation/helpers.h; for all the others, as mentioned before, there are helpers in the library.
-Compute Library and the tests both use templates. The reference implementation is a generic implementation that is independent of the data type, and templates are used to generalize over the data type.
-Following the example, let's have a look at the ReshapeLayer operator:
-
-- tests/validation/reference/ReshapeLayer.h
-
-@snippet tests/validation/reference/ReshapeLayer.h ReshapeLayer
-
-- tests/validation/reference/ReshapeLayer.cpp
-
-@snippet tests/validation/reference/ReshapeLayer.cpp ReshapeLayer
-
-An explicit instantiation of the template for the required datatypes must be added in the .cpp file.
-
-@subsubsection S4_1_4_2_add_dataset Add dataset
-One of the parameters of the tests is the dataset. It is used to generate versions of the test case with different inputs.
-There are three ways to pass the dataset to the fixture data test case:
-- the operator dataset is simple, so it can be added directly in the test case data declaration
-- we can create a class that returns tuples to the test framework
-
-@snippet tests/datasets/PoolingTypesDataset.h PoolingTypes datasets
-
-- if we want to create the dataset dynamically by combining different parameters, we can create the dataset using iterators.
-For example, the dataset for ReshapeLayer:
-
-@snippet tests/datasets/ReshapeLayerDataset.h ReshapeLayer datasets
-
-@subsubsection S4_1_4_3_add_fixture  Add a fixture and a data test case
-
-Benchmark and validation tests are based on the same framework to set up and run the tests. In addition to running simple, self-contained test functions, the framework supports fixtures and data test cases.
-Fixtures can be used to share common setup, teardown or even run tasks among multiple test cases. For that purpose a fixture can define a "setup", "teardown" and "run" method.
-When adding tests for the new operator in the runtime library, we need to implement at least the setup method, which calls two methods that configure, run and return the output of the target (CL or Neon) and of the reference (C++ implementation) respectively.
-
-For example, let's have a look at the ReshapeLayer fixture:
-
-@snippet tests/validation/fixtures/ReshapeLayerFixture.h ReshapeLayer fixture
-
-In the fixture class above we can see that the setup method computes the target and the reference and stores them in the two members _target and _reference, which will be used later to check for correctness.
-The compute_target method reflects the exact behavior expected when we call a function. The input and output tensor must be declared, function configured, tensors allocated, the input tensor filled with required data, and finally, the function must be run and the results returned.
-This fixture is used in the test case, that is a parameterized test case that inherits from a fixture. The test case will have access to all public and protected members of the fixture. Only the setup and teardown methods of the fixture will be used. The setup method of the fixture needs to be a template and must accept inputs from the dataset as arguments.
-The body of this function will be used as a test function.
-For the fixture test case, the first argument is the name of the test case (it has to be unique within the enclosing test suite), the second argument is the class name of the fixture, the third argument is the dataset mode in which the test will be active (PRECOMMIT or NIGHTLY) and the fourth argument is the dataset.
-For example:
-
-@snippet tests/validation/CL/ActivationLayer.cpp CLActivationLayerFixture snippet
-
-@code{.cpp}
-TEST_SUITE(CL)
-TEST_SUITE(ActivationLayer)
-TEST_SUITE(Float)
-TEST_SUITE(FP16)
-@endcode
-@snippet tests/validation/CL/ActivationLayer.cpp CLActivationLayer Test snippet
-@code{.cpp}
-TEST_SUITE_END()
-TEST_SUITE_END()
-TEST_SUITE_END()
-TEST_SUITE_END()
-@endcode
-
-This will produce a set of tests that can be filtered with "CL/ActivationLayer/Float/FP16/RunSmall". Each test produced from the cartesian product of the dataset is associated with a number and can be filtered by specifying all the parameters.
-*/
-} // namespace arm_compute
diff --git a/docs/05_contribution_guidelines.dox b/docs/05_contribution_guidelines.dox
deleted file mode 100644
index 35b9f49dbc..0000000000
--- a/docs/05_contribution_guidelines.dox
+++ /dev/null
@@ -1,452 +0,0 @@
-///
-/// Copyright (c) 2019 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-/**
-@page contribution_guidelines Contribution guidelines
-
-@tableofcontents
-
-If you want to contribute to Arm Compute Library, be sure to review the following guidelines.
-
-The development is structured in the following way:
-- Release repository: https://github.com/arm-software/ComputeLibrary
-- Development repository: https://review.mlplatform.org/#/admin/projects/ml/ComputeLibrary
-- Please report issues here: https://github.com/ARM-software/ComputeLibrary/issues
-
-@section S5_1_coding_standards Coding standards and guidelines
-
-Best practices (as suggested by clang-tidy):
-
-- No uninitialised values
-
-Helps to prevent undefined behaviour and allows variables to be declared const if they are not changed after initialisation. See http://clang.llvm.org/extra/clang-tidy/checks/cppcoreguidelines-pro-type-member-init.html
-
-@code{.cpp}
-const float32x4_t foo = vdupq_n_f32(0.f);
-const float32x4_t bar = foo;
-
-const int32x4x2_t i_foo = {{
-    vcvtq_s32_f32(foo),
-    vcvtq_s32_f32(foo)
-}};
-const int32x4x2_t i_bar = i_foo;
-@endcode
-
-- No C-style casts (in C++ source code)
-
-Only use static_cast, dynamic_cast, and (if required) reinterpret_cast and const_cast. See http://en.cppreference.com/w/cpp/language/explicit_cast for more information on when to use which type of cast. C-style casts do not differentiate between the different cast types and thus make it easy to violate type safety. Also, due to the prefix notation it is less clear which part of an expression is going to be cast. See http://clang.llvm.org/extra/clang-tidy/checks/cppcoreguidelines-pro-type-cstyle-cast.html
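-
-A short sketch of how the named casts make the intent explicit. The function names are hypothetical and only serve as an illustration.
-
-@code{.cpp}
-#include <cstdint>
-
-// Value conversion: static_cast makes the float-to-int truncation explicit
-int truncate_to_int(float value)
-{
-    return static_cast<int>(value);
-}
-
-// Reinterpreting memory as raw bytes: reinterpret_cast flags the operation as low level
-const uint8_t *as_bytes(const float *ptr)
-{
-    return reinterpret_cast<const uint8_t *>(ptr);
-}
-
-// A C-style cast, e.g. (int)value or (const uint8_t *)ptr, looks the same in both cases
-// and hides which kind of conversion takes place
-@endcode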
-
-- No implicit casts to bool
-
-Helps to increase readability and might help to catch bugs during refactoring. See http://clang.llvm.org/extra/clang-tidy/checks/readability-implicit-bool-cast.html
-
-@code{.cpp}
-extern int *ptr;
-if(ptr){} // Bad
-if(ptr != nullptr) {} // Good
-
-extern int foo;
-if(foo) {} // Bad
-if(foo != 0) {} // Good
-@endcode
-
-- Use nullptr instead of NULL or 0
-
-The nullptr literal is type-checked and is therefore safer to use. See http://clang.llvm.org/extra/clang-tidy/checks/modernize-use-nullptr.html
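-
-The following sketch shows one situation where the difference matters: overload resolution. The describe() overloads are hypothetical.
-
-@code{.cpp}
-#include <string>
-
-std::string describe(int)
-{
-    return "int";
-}
-
-std::string describe(int *)
-{
-    return "pointer";
-}
-
-// describe(0) selects the int overload, even if a null pointer was intended
-// describe(nullptr) selects the pointer overload
-@endcode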
-
-- No need to explicitly initialise std::string with an empty string
-
-The default constructor of std::string creates an empty string. In general it is therefore not necessary to specify it explicitly. See http://clang.llvm.org/extra/clang-tidy/checks/readability-redundant-string-init.html
-
-@code{.cpp}
-// Instead of
-std::string foo("");
-std::string bar = "";
-
-// The following has the same effect
-std::string foo;
-std::string bar;
-@endcode
-
-- Braces for all control blocks and loops (which have a body)
-
-To increase readability and protect against refactoring errors, the body of control blocks and loops must be wrapped in braces. See http://clang.llvm.org/extra/clang-tidy/checks/readability-braces-around-statements.html
-
-For now, loops with an empty body do not need empty braces. This exception might be revoked in the future. In any case, situations in which this exception applies should be rare.
-
-@code{.cpp}
-Iterator it;
-while(it.next()); // No need for braces here
-
-// Make more use of it
-@endcode
-
-- Only one declaration per line
-
-Increase readability and thus prevent errors.
-
-@code{.cpp}
-int a, b; // BAD
-int c, *d; // EVEN WORSE
-
-int e = 0; // GOOD
-int *p = nullptr; // GOOD
-@endcode
-
-- Pass primitive types (and those that are cheap to copy or move) by value
-
-For primitive types it is more efficient to pass them by value instead of by const reference because:
-
- - the data type might be smaller than the "reference type"
- - pass by value avoids aliasing and thus allows for better optimisations
- - pass by value is likely to avoid one level of indirection (references are often implemented as auto dereferenced pointers)
-
-This advice also applies to non-primitive types that have cheap copy or move operations and the function needs a local copy of the argument anyway.
-
-More information:
-
- - http://stackoverflow.com/a/14013189
- - http://stackoverflow.com/a/270435
- - http://web.archive.org/web/20140113221447/http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/
-
-@code{.cpp}
-void foo(int i, long l, float32x4_t f); // Pass-by-value for builtin types
-void bar(const float32x4x4_t &f); // As this is a struct pass-by-const-reference is probably better
-void foobar(const MyLargeCustomTypeClass &m); // Definitely better as const-reference except if a copy has to be made anyway.
-@endcode
-
-- Don't use unions
-
-Unions cannot be used to convert values between different types because (in C++) it is undefined behaviour to read from a member other than the last one that has been assigned to. This limits the use of unions to a few corner cases and therefore the general advice is not to use unions. See http://releases.llvm.org/3.8.0/tools/clang/tools/extra/docs/clang-tidy/checks/cppcoreguidelines-pro-type-union-access.html
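-
-When the bit pattern of a value has to be reinterpreted, std::memcpy is a well-defined alternative to reading from an inactive union member. A minimal sketch, assuming that float is a 32-bit type (the function name float_bits is hypothetical):
-
-@code{.cpp}
-#include <cstdint>
-#include <cstring>
-
-// Returns the object representation of a float as a 32-bit unsigned integer
-uint32_t float_bits(float value)
-{
-    static_assert(sizeof(uint32_t) == sizeof(float), "This sketch assumes a 32-bit float");
-
-    uint32_t bits = 0;
-    std::memcpy(&bits, &value, sizeof(bits));
-    return bits;
-}
-@endcode
-
-On platforms that use the IEEE 754 single precision format, float_bits(1.0f) returns 0x3F800000.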
-
-- Use pre-increment/pre-decrement whenever possible
-
-In contrast to the pre-increment, the post-increment has to make a copy of the incremented object. This might not be a problem for primitive types like int, but for class-like objects that overload the operators, like iterators, it can have a huge impact on the performance. See http://stackoverflow.com/a/9205011
-
-To be consistent across the different cases the general advice is to use the pre-increment operator unless post-increment is explicitly required. The same rules apply for the decrement operator.
-
-@code{.cpp}
-for(size_t i = 0; i < 9; i++); // BAD
-for(size_t i = 0; i < 9; ++i); // GOOD
-@endcode
-
-- Don't use uint in C/C++
-
-The C and C++ standards don't define a uint type. Though some compilers seem to support it by default, it requires including the header sys/types.h. Instead we use the slightly more verbose unsigned int type.
-
-- Don't use unsigned int in function signatures
-
-Unsigned integers are good for representing bitfields and modular arithmetic. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. Mixing signedness of integer types is responsible for an equally large class of problems.
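-
-A small sketch of the kind of bug this rule is meant to avoid. Both functions are hypothetical and only differ in the signedness of their signature.
-
-@code{.cpp}
-// BAD: the subtraction silently wraps around when used > total
-unsigned int remaining_unsigned(unsigned int total, unsigned int used)
-{
-    return total - used;
-}
-
-// GOOD: a negative result is representable and can be checked by the caller
-int remaining_signed(int total, int used)
-{
-    return total - used;
-}
-@endcode
-
-remaining_unsigned(2, 3) yields the largest value representable by unsigned int, whereas remaining_signed(2, 3) yields -1.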
-
-- No "Yoda-style" comparisons
-
-As compilers are now able to warn about accidental assignments where the intention was most likely to compare values, it is no longer required to place literals on the left-hand side of the comparison operator. Sticking to the natural order increases readability and thus helps prevent logical errors (which cannot be spotted by the compiler). In the rare case that the desired result is to assign a value and check it, the expression has to be surrounded by parentheses.
-
-@code{.cpp}
-if(nullptr == ptr || false == cond) // BAD
-{
-    //...
-}
-
-if(ptr == nullptr || cond == false) // GOOD
-{
-    //...
-}
-
-if(ptr = nullptr || cond = false) // Most likely a mistake. Will cause a compiler warning
-{
-    //...
-}
-
-if((ptr = nullptr) || (cond = false)) // Trust me, I know what I'm doing. No warning.
-{
-    //...
-}
-@endcode
-
-@subsection S5_1_1_rules Rules
-
- - Use spaces for indentation and alignment. No tabs! Indentation should be done with 4 spaces.
- - Unix line returns in all the files.
- - Pointers and reference symbols attached to the variable name, not the type (i.e. char \&foo;, and not char& foo).
- - No trailing spaces or tabs at the end of lines.
- - No spaces or tabs on empty lines.
- - Put { and } on a new line and increase the indentation level for code inside the scope (except for namespaces).
- - Single space before and after comparison operators ==, <, >, !=.
- - No space around parentheses.
- - No space before, one space after ; (unless it is at the end of a line).
-
-@code{.cpp}
-for(int i = 0; i < width * height; ++i)
-{
-    void *d = foo(ptr, i, &addr);
-    static_cast<uint8_t *>(data)[i] = static_cast<uint8_t *>(d)[0];
-}
-@endcode
-
- - Put a comment after \#else, \#endif, and namespace closing brace indicating the related name
-
-@code{.cpp}
-namespace mali
-{
-#ifdef MALI_DEBUG
-    ...
-#else // MALI_DEBUG
-    ...
-#endif // MALI_DEBUG
-} // namespace mali
-@endcode
-
-- CamelCase for class names only and lower case words separated with _ (snake_case) for all the functions / methods / variables / arguments / attributes.
-
-@code{.cpp}
-class ClassName
-{
-    public:
-        void my_function();
-        int my_attribute() const; // Accessor = attribute name minus '_', const if it's a simple type
-    private:
-        int _my_attribute; // '_' in front of name
-};
-@endcode
-
-- Use quotes instead of angle brackets to include local headers. Use angle brackets for system headers.
-- Also include the module header first, then local headers, and lastly system headers. All groups should be separated by a blank line and sorted lexicographically within each group.
-- Where applicable the C++ version of system headers has to be included, e.g. cstddef instead of stddef.h.
-- See http://llvm.org/docs/CodingStandards.html#include-style
-
-@code{.cpp}
-#include "MyClass.h"
-
-#include "arm_cv/core/Helpers.h"
-#include "arm_cv/core/Types.h"
-
-#include <cstddef>
-#include <numeric>
-@endcode
-
-- Only use "auto" when the type can be explicitly deduced from the assignment.
-
-@code{.cpp}
-auto a = static_cast<float *>(bar); // OK: there is an explicit cast
-auto b = std::make_unique<Image>(foo); // OK: we can see it's going to be an std::unique_ptr<Image>
-auto c = img.ptr(); // NO: Can't tell what the type is without knowing the API.
-auto d = vdup_n_u8(0); // NO: It's not obvious what type this function returns.
-@endcode
-
-- OpenCL:
-    - Use __ in front of the memory types qualifiers and kernel: __kernel, __constant, __private, __global, __local.
-    - Indicate how the global workgroup size / offset / local workgroup size are being calculated.
-
-- Doxygen:
-    - No '*' in front of argument names
-    - [in], [out] or [in,out] *in front* of arguments
-    - Skip a line between the description and params and between params and @return (if there is a return)
-    - Align params names and params descriptions (using spaces), with a single space between the widest column and the next one.
-    - Use an upper case letter at the beginning of the description
-
-@snippet arm_compute/runtime/NEON/functions/NEActivationLayer.h NEActivationLayer snippet
-
-@subsection S5_1_2_how_to_check_the_rules How to check the rules
-
-astyle (http://astyle.sourceforge.net/) and clang-format (https://clang.llvm.org/docs/ClangFormat.html) can check and help you apply some of these rules.
-
-@subsection S5_1_3_library_size_guidelines Library size: best practices and guidelines
-
-@subsubsection S5_1_3_1_template_suggestions Template suggestions
-
-When writing a new patch we should also keep in mind the effect it will have on the final library size. We can try some of the following things:
-
- - Place non-dependent template code in a different non-templated class/method
-
-@code{.cpp}
-template<typename T>
-class Foo
-{
-public:
-    enum { v1, v2 };
-    // ...
-};
-@endcode
-
-    can be converted to:
-
-@code{.cpp}
-struct Foo_base
-{
-    enum { v1, v2 };
-    // ...
-};
-
-template<typename T>
-class Foo : public Foo_base
-{
-public:
-    // ...
-};
-@endcode
-
- - In some cases it's preferable to use runtime switches instead of template parameters
-
- - Sometimes we can rewrite the code without templates and without any (significant) performance loss. Let's say that we've written a function where the templated argument is only used for casting:
-
-@code{.cpp}
-template <typename T>
-void NETemplatedKernel::run(const Window &window)
-{
-...
- *(reinterpret_cast<T *>(out.ptr())) = *(reinterpret_cast<const T *>(in.ptr()));
-...
-}
-@endcode
-
-The above snippet can be transformed to:
-
-@code{.cpp}
-void NENonTemplatedKernel::run(const Window &window)
-{
-...
-std::memcpy(out.ptr(), in.ptr(), element_size);
-...
-}
-@endcode
-
-@subsection S5_1_4_secure_coding_practices Secure coding practices
-
-@subsubsection S5_1_4_1_general_coding_practices General Coding Practices
-
-- **Use tested and approved managed code** rather than creating new unmanaged code for common tasks.
-- **Utilize locking to prevent multiple simultaneous requests** or use a synchronization mechanism to prevent race conditions.
-- **Protect shared variables and resources** from inappropriate concurrent access.
-- **Explicitly initialize all your variables and other data stores**, either during declaration or just before the first usage.
-- **In cases where the application must run with elevated privileges, raise privileges as late as possible, and drop them as soon as possible**.
-- **Avoid calculation errors** by understanding your programming language's underlying representation and how it interacts with numeric calculation. Pay close attention to byte size discrepancies, precision, signed/unsigned distinctions, truncation, conversion and casting between types, "not-a-number" calculations, and how your language handles numbers that are too large or too small for its underlying representation.
-- **Restrict users from generating new code** or altering existing code.
-
-
-@subsubsection S5_1_4_2_secure_coding_best_practices Secure Coding Best Practices
-
-- **Validate input**. Validate input from all untrusted data sources. Proper input validation can eliminate the vast majority of software vulnerabilities. Be suspicious of most external data sources, including command line arguments, network interfaces, environmental variables, and user controlled files.
-- **Heed compiler warnings**. Compile code using the default compiler flags that exist in the SConstruct file.
-- Use **static analysis tools** to detect and eliminate additional security flaws.
-- **Keep it simple**. Keep the design as simple and small as possible. Complex designs increase the likelihood that errors will be made in their implementation, configuration, and use. Additionally, the effort required to achieve an appropriate level of assurance increases dramatically as security mechanisms become more complex.
-- **Default deny**. Base access decisions on permission rather than exclusion. This means that, by default, access is denied and the protection scheme identifies conditions under which access is permitted
-- **Adhere to the principle of least privilege**. Every process should execute with the least set of privileges necessary to complete the job. Any elevated permission should only be accessed for the least amount of time required to complete the privileged task. This approach reduces the opportunities an attacker has to execute arbitrary code with elevated privileges.
-- **Sanitize data sent to other systems**. Sanitize all data passed to complex subsystems such as command shells, relational databases, and commercial off-the-shelf (COTS) components. Attackers may be able to invoke unused functionality in these components through the use of various injection attacks. This is not necessarily an input validation problem because the complex subsystem being invoked does not understand the context in which the call is made. Because the calling process understands the context, it is responsible for sanitizing the data before invoking the subsystem.
-- **Practice defense in depth**. Manage risk with multiple defensive strategies, so that if one layer of defense turns out to be inadequate, another layer of defense can prevent a security flaw from becoming an exploitable vulnerability and/or limit the consequences of a successful exploit. For example, combining secure programming techniques with secure runtime environments should reduce the likelihood that vulnerabilities remaining in the code at deployment time can be exploited in the operational environment.
-
-@subsection S5_1_5_guidelines_for_stable_api_abi Guidelines for stable API/ABI
-
-The Application Programming Interface (API) and Application Binary Interface (ABI) are the interfaces exposed
-to users so their programs can interact with the library efficiently and effectively. Even though changing the API/ABI
-in a way that breaks backward compatibility is not necessarily bad if it can improve other users' experience and the library,
-contributions should be made with an awareness of API/ABI stability. If you'd like to make changes that affect
-the library's API/ABI, please review and follow the guidelines shown in this section. Also, please note that
-these guidelines are not an exhaustive list; they discuss things that might be easily overlooked.
-
-@subsubsection S5_1_5_1_guidelines_for_api Guidelines for API
-
-- When adding new arguments, consider grouping arguments (including the old ones) into a struct rather than adding arguments with default values.
-Introducing a new struct might break the API/ABI once, but it will be helpful to keep the stability.
-- When new member variables are added, please make sure they are initialized.
-- Avoid adding enum elements in the middle.
-- When removing arguments, follow the deprecation process described in the following section.
-- When changing behavior affecting API contracts, follow the deprecation process described in the following section.
-
-@subsubsection S5_1_5_2_guidelines_for_abi Guidelines for ABI
-
-We recommend reading through <a href="https://community.kde.org/Policies/Binary_Compatibility_Issues_With_C%2B%2B">this page</a>
-and double-checking your contributions to see if they include any of the changes listed.
-
-Also, for classes that require strong ABI stability, consider using the <a href="https://en.cppreference.com/w/cpp/language/pimpl">pImpl idiom</a>.
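-
-A minimal sketch of the idiom is shown below. The Widget class is hypothetical; in a real project the first part would live in a public header and the second part in a source file, so that changes to Impl do not alter the layout of Widget seen by users.
-
-@code{.cpp}
-#include <memory>
-
-// Public header: users only see an opaque pointer to the implementation
-class Widget
-{
-public:
-    Widget();
-    ~Widget(); // Defined in the source file, where Impl is a complete type
-
-    int  value() const;
-    void set_value(int value);
-
-private:
-    struct Impl;
-    std::unique_ptr<Impl> _impl;
-};
-
-// Source file: members can be added to Impl without changing sizeof(Widget)
-struct Widget::Impl
-{
-    int value{ 0 };
-};
-
-Widget::Widget()
-    : _impl(std::make_unique<Impl>())
-{
-}
-
-Widget::~Widget() = default;
-
-int Widget::value() const
-{
-    return _impl->value;
-}
-
-void Widget::set_value(int value)
-{
-    _impl->value = value;
-}
-@endcode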
-
-@subsubsection S5_1_5_3_api_deprecation_process API deprecation process
-
-In order to deprecate an existing API, these rules should be followed.
-
-- Removal of a deprecated API should wait at least for one official release.
-- Deprecation of runtime APIs should strictly follow the aforementioned period, whereas core APIs can have more flexibility as they are mostly used internally rather than user-facing.
-- Any API changes (update, addition and deprecation) in all components should be well documented by the contribution itself.
-
-Also, it is recommended to use the following utility macros, which are designed to work with both clang and gcc using C++14 and later.
-
-- ARM_COMPUTE_DEPRECATED: Just deprecate the wrapped function
-- ARM_COMPUTE_DEPRECATED_REL: Deprecate the wrapped function and also capture the release in which it was deprecated
-- ARM_COMPUTE_DEPRECATED_REL_REPLACE: Deprecate the wrapped function and also capture the release in which it was deprecated, along with a possible replacement candidate
-
-@code{.cpp}
-ARM_COMPUTE_DEPRECATED_REL_REPLACE(20.08, DoNewThing)
-void DoOldThing();
-
-void DoNewThing();
-@endcode
-
-@section S5_2_how_to_submit_a_patch How to submit a patch
-
-To be able to submit a patch to our development repository you need to have a GitHub account. With that, you will be able to sign in to Gerrit where your patch will be reviewed.
-
-Next step is to clone the Compute Library repository:
-
-	git clone "ssh://<your-github-id>@review.mlplatform.org:29418/ml/ComputeLibrary"
-
-If you have cloned from GitHub or through HTTP, make sure you add a new git remote using SSH:
-
-	git remote add acl-gerrit "ssh://<your-github-id>@review.mlplatform.org:29418/ml/ComputeLibrary"
-
-After that, you will need to upload an SSH key to https://review.mlplatform.org/#/settings/ssh-keys
-
-Then, make sure to install the commit-msg Git hook in order to add a change-ID to the commit message of your patch:
-
-	cd "ComputeLibrary" && mkdir -p .git/hooks && curl -Lo `git rev-parse --git-dir`/hooks/commit-msg https://review.mlplatform.org/tools/hooks/commit-msg; chmod +x `git rev-parse --git-dir`/hooks/commit-msg
-
-When your patch is ready, remember to sign off your contribution by adding a line with your name and e-mail address to every git commit message:
-
-	Signed-off-by: John Doe <john.doe@example.org>
-
-You must use your real name; pseudonyms and anonymous contributions are not accepted.
-
-You can add this to your patch with:
-
-	git commit -s --amend
-
-You are now ready to submit your patch for review:
-
-	git push acl-gerrit HEAD:refs/for/master
-
-@section S5_3_code_review Patch acceptance and code review
-
-Once a patch is uploaded for review, there is a pre-commit test that runs on a Jenkins server for continuous integration tests. In order to be merged, a patch needs to:
-
-- get a "+1 Verified" from the pre-commit job
-- get a "+1 Comments-Addressed", in case of comments from reviewers the committer has to address them all. A comment is considered addressed when the first line of the reply contains the word "Done"
-- get a "+2" from a reviewer, that means the patch has the final approval
-
-At the moment, the Jenkins server is not publicly accessible and for security reasons patches submitted by non-whitelisted committers do not trigger the pre-commit tests. For this reason, one of the maintainers has to manually trigger the job.
-
-If the pre-commit test fails, the Jenkins job will post a comment on Gerrit with the details about the failure so that the committer will be able to reproduce the error and fix the issue, if any (sometimes there can be infrastructure issues, a test platform disconnecting for example, where the job needs to be retriggered).
-
-*/
-} // namespace arm_compute
diff --git a/docs/07_errata.dox b/docs/07_errata.dox
deleted file mode 100644
index 0c8d684017..0000000000
--- a/docs/07_errata.dox
+++ /dev/null
@@ -1,76 +0,0 @@
-///
-/// Copyright (c) 2019-2020 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-/**
-@page errata Errata
-
-@tableofcontents
-
-@section S7_1_errata Errata
-
-- Under certain conditions, CLFullyConnectedLayer quantized tests may fail due to an issue in the test framework.
-    - Versions Affected: 21.02
-    - OSs Affected: Linux
-    - Conditions:
-        - armv7a architecture
-        - release mode
-        - asserts enabled
-
-- A wrong test configuration has been found in CLGEMMMatrixMultiplyReshapedOnlyRHS set of tests.
-    - Versions Affected: >= 20.11
-    - Conditions:
-        - Data type input: F32/F16
-        - Fused bounded relu activation with coefficient 'a' being negative
-
-- Under certain conditions, the validation test case 'CL/DirectConvolutionLayer/Float/FP32/RunSmall9x9\@InputShape=32x37x3x4:StrideX=1:StrideY=1:PadX=0:PadY=0:KernelSize=9:NumKernels=1:DataType=F32:ActivationInfo=LU_BOUNDED_RELU:DataLayout=NHWC' may fail.
-    - Versions Affected: >= v20.08
-    - Conditions:
-        - The validation suite has to run in nightly mode and execute 40k+ test cases before the test mentioned above
-
-- Under certain conditions, benchmark examples can hang when OpenCL profiling queues are enabled.
-    - Versions Affected: >= v19.11
-    - OSs Affected: Linux
-    - Conditions:
-        - Arm® Mali™ DDK r1p0 - r8p0, and
-        - Linux kernel >= 4.4
-
-- On Android with arm64-v8a/arm64-v8.2-a architecture, Arm® Neon™ validation tests can fail when compiled using Android Ndk
-  >= r18b in debug mode (https://github.com/android/ndk/issues/1135).
-    - Versions Affected: >= v19.11
-    - OSs Affected: Android
-    - Conditions:
-        - arm64-v8a/arm64-v8.2-a architecture, and
-        - Compiled using Android NDK >= r18b in debug mode.
-
-- An issue has been identified with CLCast.
-    - Versions Affected: >= 18.11
-    - Conditions:
-        - Data type input: F32
-        - Data type output: All integer types
-        - Conversion policy: SATURATE
-    - Result: OpenCL backend will always wrap around instead of saturating for out-of-range inputs
-
-*/
-} // namespace
diff --git a/docs/08_api.dox b/docs/08_api.dox
deleted file mode 100644
index 39282046a9..0000000000
--- a/docs/08_api.dox
+++ /dev/null
@@ -1,135 +0,0 @@
-///
-/// Copyright (c) 2021 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-/**
-@page api Application Programming Interface
-
-@tableofcontents
-
-@section api_overview Overview
-
-In this section we present Compute Library's application programming interface (API) architecture along with
-a detailed explanation of its components. Compute Library's API consists of multiple high-level operators and
-even more internally distinct computational blocks that can be executed on a command queue.
-Operators can be bound to multiple Tensor objects and executed concurrently or asynchronously if needed.
-All operators and associated objects are encapsulated in a Context-based mechanism, which provides all related
-construction services.
-
-@section api_objects Fundamental objects
-
-Compute Library consists of a list of fundamental objects that are responsible for creating and orchestrating operator execution.
-Below we present these objects in more detail.
-
-@subsection api_objects_context AclContext or Context
-
-AclContext or Context acts as a central creational aggregate service. All other objects are bound to or created from a context.
-It provides, internally, common facilities such as
-- allocators for object creation or backing memory allocation
-- serialization interfaces
-- any other modules that affect the construction of objects (e.g., program cache for OpenCL).
-
-The followings sections will describe parameters that can be given on the creation of Context.
-
-@subsubsection api_object_context_target AclTarget
-Context is initialized with a backend target (AclTarget) as different backends might have a different subset of services.
-Currently the following targets are supported:
-- #AclCpu: a generic CPU target that accelerates primitives through SIMD technologies
-- #AclGpuOcl: a target for GPU acceleration using OpenCL
-
-@subsubsection api_object_context_execution_mode AclExecutionMode
-An execution mode (AclExecutionMode) can be passed as an argument that affects the operator creation.
-At the moment the following execution modes are supported:
-- #AclPreferFastRerun: Provides faster re-run. It can be used when the operators are expected to be executed multiple
-times under the same execution context
-- #AclPreferFastStart: Provides faster single execution. It can be used when the operators will be executed only once,
-thus reducing their latency is important (Currently, it is not implemented)
-
-@subsubsection api_object_context_capabilitys AclTargetCapabilities
-Context creation can also have a list of capabilities of hardware as one of its parameters. This is currently
-available only for the CPU backend. A list of architecture capabilities can be passed to influence the selection
-of the underlying kernels. Such capabilities can be for example the enablement of SVE or the dot product
-instruction explicitly.
-@note The underlying hardware should support the given capability list.
-
-@subsubsection api_object_context_allocator Allocator
-An allocator object that implements @ref AclAllocator can be passed to the Context upon its creation.
-This user-provided allocator will be used for allocation of any internal backing memory.
-
-@note To enable interoperability with OpenCL, additional entrypoints are provided
-to extract (@ref AclGetClContext) or set (@ref AclSetClContext) the internal OpenCL context.
-
-@subsection api_objects_tensor AclTensor or Tensor
-
-A tensor is a mathematical object that can describe physical properties like matrices.
-It can be also considered a generalization of matrices that can represent arbitrary
-dimensionalities. AclTensor is an abstracted interface that represents a tensor.
-
-AclTensor, in addition to the elements of the physical properties they represent,
-also contains the information such as shape, data type, data layout and strides to not only
-fully describe the characteristics of the physical properties but also provide information
-how the object stored in memory should be traversed. @ref AclTensorDescriptor is a dedicated
-object to represent such metadata.
-
-@note The allocation of an AclTensor can be deferred until external memory is imported
-as backing memory to accomplish a zero-copy context.
-
-@note To enable interoperability with OpenCL, additional entrypoints are provided
-to extract (@ref AclGetClMem) the internal OpenCL memory object.
-
-As Tensors can reside in different memory spaces, @ref AclMapTensor and @ref AclUnmapTensor entrypoints
-are provided to map Tensors in and out of the host memory system, respectively.
-
-@subsection api_objects_queue AclQueue or Queue
-
-AclQueue acts as a runtime aggregate service. It provides facilities to schedule
-and execute operators using underlying hardware. It also contains services like
-tuning mechanisms (e.g., Local workgroup size tuning for OpenCL) that can be specified
-during operator execution.
-
-@note To enable interoperability with OpenCL, additional entrypoints are provided
-to extract (@ref AclGetClQueue) or set (@ref AclSetClQueue) the internal OpenCL queue.
-
-@section api_internal Internal
-@subsection api_internal_operator_vs_kernels Operators vs Kernels
-
-Internally, Compute Library separates the executable primitives in two categories: kernels and operators
-which operate in a hierarchical way.
-
-A kernel is the lowest-level computation block whose responsibility is performing a task on a given group of data.
-For design simplicity, kernels computation does NOT involve the following:
-
-- Memory allocation: All the memory manipulation should be handled by the caller.
-- Multi-threading: The information on how the workload can be split is provided by kernels,
-so the caller can effectively distribute the workload to multiple threads.
-
-On the other hand, operators combine one or multiple kernels to achieve more complex calculations.
-The responsibilities of the operators can be summarized as follows:
-
-- Defining the scheduling policy and dispatching of the underlying kernels to the hardware backend
-- Providing information to the caller required by the computation (e.g., memory requirements)
-- Allocation of any required auxiliary memory if it isn't given by its caller explicitly
-
-*/
-} // namespace arm_compute
diff --git a/docs/Doxyfile b/docs/Doxyfile
index 6fb5de7020..5a76c0538f 100644
--- a/docs/Doxyfile
+++ b/docs/Doxyfile
@@ -687,7 +687,7 @@ FILE_VERSION_FILTER    =
 # DoxygenLayout.xml, doxygen will parse it automatically even if the LAYOUT_FILE
 # tag is left empty.
 
-LAYOUT_FILE            = 
+LAYOUT_FILE            = ./docs/DoxygenLayout.xml
 
 # The CITE_BIB_FILES tag can be used to specify one or more bib files containing
 # the reference definitions. This must be a list of .bib files. The .bib
@@ -768,16 +768,20 @@ WARN_LOGFILE           =
 # spaces.
 # Note: If this tag is empty the current directory is searched.
 
-INPUT                  = ./docs/00_introduction.dox \
-                         ./docs/01_library.dox \
-                         ./docs/02_tests.dox \
-                         ./docs/03_scripts.dox \
-                         ./docs/04_adding_operator.dox \
-                         ./docs/05_contribution_guidelines.dox \
-                         ./docs/06_functions_list.dox \
-                         ./docs/07_errata.dox \
-                         ./docs/08_api.dox \
-                         ./docs/09_operators_list.dox \
+INPUT                  = ./docs/user_guide/introduction.dox \
+                         ./docs/user_guide/how_to_build_and_run_examples.dox \
+                         ./docs/user_guide/library.dox \
+                         ./docs/user_guide/programming_model.dox \
+                         ./docs/user_guide/api.dox \
+                         ./docs/user_guide/data_type.dox \
+                         ./docs/user_guide/data_layout.dox \
+                         ./docs/user_guide/tests.dox \
+                         ./docs/user_guide/advanced.dox \
+                         ./docs/user_guide/release_version_and_change_log.dox \
+                         ./docs/user_guide/errata.dox \
+                         ./docs/contributor_guide/contribution_guidelines.dox \
+                         ./docs/contributor_guide/adding_operator.dox \
+                         ./docs/contributor_guide/implementation_topics.dox \
                          ./docs/ComputeLibrary.dir \
                          ./arm_compute/ \
                          ./src/ \
diff --git a/docs/DoxygenLayout.xml b/docs/DoxygenLayout.xml
new file mode 100644
index 0000000000..fe3844b60d
--- /dev/null
+++ b/docs/DoxygenLayout.xml
@@ -0,0 +1,212 @@
+<doxygenlayout version="1.0">
+  <!-- Generated by doxygen 1.8.15 -->
+  <!-- Navigation index tabs for HTML output -->
+  <navindex>
+    <tab type="usergroup" title="User Guide" url="[none]">
+        <tab type="user" url="@ref introduction" title="Introduction"/>
+        <tab type="user" url="@ref how_to_build" title="How to Build and Run Examples"/>
+        <tab type="user" url="@ref architecture" title="Library Architecture"/>
+        <tab type="user" url="@ref programming_model" title="Programming Model"/>
+        <tab type="user" url="@ref api" title="Application Programming Interface"/>
+        <tab type="user" url="@ref data_type_support" title="Data Type Support"/>
+        <tab type="user" url="@ref data_layout_support" title="Data Layout Support"/>
+        <tab type="user" url="@ref tests" title="Validation and benchmarks"/>
+        <tab type="user" url="@ref advanced" title="Advanced"/>
+        <tab type="user" url="@ref versions_changelogs" title="Release Versions and Changelog"/>
+        <tab type="user" url="@ref errata" title="Errata"/>
+    </tab>
+    <tab type="usergroup" title="Contributor Guide" url="[none]"> 
+        <tab type="user" url="@ref contribution_guidelines" title="Contribution Guidelines"/>
+        <tab type="user" url="@ref adding_operator" title="How to Add a New Operator"/>
+        <tab type="user" url="@ref implementation_topic" title="Implementation Topics"/>
+    </tab>
+    <tab type="mainpage" visible="no" title=""/>
+    <tab type="pages" visible="no" title="" intro=""/>
+    <tab type="modules" visible="yes" title="" intro=""/>
+    <tab type="namespaces" visible="yes" title="">
+      <tab type="namespacelist" visible="yes" title="" intro=""/>
+      <tab type="namespacemembers" visible="yes" title="" intro=""/>
+    </tab>
+    <tab type="classes" visible="yes" title="">
+      <tab type="classlist" visible="yes" title="" intro=""/>
+      <tab type="classindex" visible="$ALPHABETICAL_INDEX" title=""/> 
+      <tab type="hierarchy" visible="yes" title="" intro=""/>
+      <tab type="classmembers" visible="yes" title="" intro=""/>
+    </tab>
+    <tab type="files" visible="yes" title="">
+      <tab type="filelist" visible="yes" title="" intro=""/>
+      <tab type="globals" visible="yes" title="" intro=""/>
+    </tab>
+    <tab type="examples" visible="yes" title="" intro=""/>  
+  </navindex>
+
+  <!-- Layout definition for a class page -->
+  <class>
+    <briefdescription visible="yes"/>
+    <includes visible="$SHOW_INCLUDE_FILES"/>
+    <inheritancegraph visible="$CLASS_GRAPH"/>
+    <collaborationgraph visible="$COLLABORATION_GRAPH"/>
+    <memberdecl>
+      <nestedclasses visible="yes" title=""/>
+      <publictypes title=""/>
+      <services title=""/>
+      <interfaces title=""/>
+      <publicslots title=""/>
+      <signals title=""/>
+      <publicmethods title=""/>
+      <publicstaticmethods title=""/>
+      <publicattributes title=""/>
+      <publicstaticattributes title=""/>
+      <protectedtypes title=""/>
+      <protectedslots title=""/>
+      <protectedmethods title=""/>
+      <protectedstaticmethods title=""/>
+      <protectedattributes title=""/>
+      <protectedstaticattributes title=""/>
+      <packagetypes title=""/>
+      <packagemethods title=""/>
+      <packagestaticmethods title=""/>
+      <packageattributes title=""/>
+      <packagestaticattributes title=""/>
+      <properties title=""/>
+      <events title=""/>
+      <privatetypes title=""/>
+      <privateslots title=""/>
+      <privatemethods title=""/>
+      <privatestaticmethods title=""/>
+      <privateattributes title=""/>
+      <privatestaticattributes title=""/>
+      <friends title=""/>
+      <related title="" subtitle=""/>
+      <membergroups visible="yes"/>
+    </memberdecl>
+    <detaileddescription title=""/>
+    <memberdef>
+      <inlineclasses title=""/>
+      <typedefs title=""/>
+      <enums title=""/>
+      <services title=""/>
+      <interfaces title=""/>
+      <constructors title=""/>
+      <functions title=""/>
+      <related title=""/>
+      <variables title=""/>
+      <properties title=""/>
+      <events title=""/>
+    </memberdef>
+    <allmemberslink visible="yes"/>
+    <usedfiles visible="$SHOW_USED_FILES"/>
+    <authorsection visible="yes"/>
+  </class>
+
+  <!-- Layout definition for a namespace page -->
+  <namespace>
+    <briefdescription visible="yes"/>
+    <memberdecl>
+      <nestednamespaces visible="yes" title=""/>
+      <constantgroups visible="yes" title=""/>
+      <classes visible="yes" title=""/>
+      <typedefs title=""/>
+      <enums title=""/>
+      <functions title=""/>
+      <variables title=""/>
+      <membergroups visible="yes"/>
+    </memberdecl>
+    <detaileddescription title=""/>
+    <memberdef>
+      <inlineclasses title=""/>
+      <typedefs title=""/>
+      <enums title=""/>
+      <functions title=""/>
+      <variables title=""/>
+    </memberdef>
+    <authorsection visible="yes"/>
+  </namespace>
+
+  <!-- Layout definition for a file page -->
+  <file>
+    <briefdescription visible="yes"/>
+    <includes visible="$SHOW_INCLUDE_FILES"/>
+    <includegraph visible="$INCLUDE_GRAPH"/>
+    <includedbygraph visible="$INCLUDED_BY_GRAPH"/>
+    <sourcelink visible="yes"/>
+    <memberdecl>
+      <classes visible="yes" title=""/>
+      <namespaces visible="yes" title=""/>
+      <constantgroups visible="yes" title=""/>
+      <defines title=""/>
+      <typedefs title=""/>
+      <enums title=""/>
+      <functions title=""/>
+      <variables title=""/>
+      <membergroups visible="yes"/>
+    </memberdecl>
+    <detaileddescription title=""/>
+    <memberdef>
+      <inlineclasses title=""/>
+      <defines title=""/>
+      <typedefs title=""/>
+      <enums title=""/>
+      <functions title=""/>
+      <variables title=""/>
+    </memberdef>
+    <authorsection/>
+  </file>
+
+  <!-- Layout definition for a group page -->
+  <group>
+    <briefdescription visible="yes"/>
+    <groupgraph visible="$GROUP_GRAPHS"/>
+    <memberdecl>
+      <nestedgroups visible="yes" title=""/>
+      <dirs visible="yes" title=""/>
+      <files visible="yes" title=""/>
+      <namespaces visible="yes" title=""/>
+      <classes visible="yes" title=""/>
+      <defines title=""/>
+      <typedefs title=""/>
+      <enums title=""/>
+      <enumvalues title=""/>
+      <functions title=""/>
+      <variables title=""/>
+      <signals title=""/>
+      <publicslots title=""/>
+      <protectedslots title=""/>
+      <privateslots title=""/>
+      <events title=""/>
+      <properties title=""/>
+      <friends title=""/>
+      <membergroups visible="yes"/>
+    </memberdecl>
+    <detaileddescription title=""/>
+    <memberdef>
+      <pagedocs/>
+      <inlineclasses title=""/>
+      <defines title=""/>
+      <typedefs title=""/>
+      <enums title=""/>
+      <enumvalues title=""/>
+      <functions title=""/>
+      <variables title=""/>
+      <signals title=""/>
+      <publicslots title=""/>
+      <protectedslots title=""/>
+      <privateslots title=""/>
+      <events title=""/>
+      <properties title=""/>
+      <friends title=""/>
+    </memberdef>
+    <authorsection visible="yes"/>
+  </group>
+
+  <!-- Layout definition for a directory page -->
+  <directory>
+    <briefdescription visible="yes"/>
+    <directorygraph visible="yes"/>
+    <memberdecl>
+      <dirs visible="yes"/>
+      <files visible="yes"/>
+    </memberdecl>
+    <detaileddescription title=""/>
+  </directory>
+</doxygenlayout>
diff --git a/docs/contributor_guide/adding_operator.dox b/docs/contributor_guide/adding_operator.dox
new file mode 100644
index 0000000000..697cddb235
--- /dev/null
+++ b/docs/contributor_guide/adding_operator.dox
@@ -0,0 +1,334 @@
+///
+/// Copyright (c) 2018-2019 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+
+namespace arm_compute
+{
+/**
+@page adding_operator How to Add a New Operator
+
+@tableofcontents
+
+@section S4_0_introduction Adding new operators
+
+@section S4_1_introduction Introduction
+In Compute Library there are two main parts or modules:
+- The core library consists of a low-level collection of algorithms implemented in C++ and optimized for Arm CPUs and GPUs. The core module is designed to be embedded in other projects and it doesn't perform any memory management or scheduling.
+- The runtime library is a wrapper of the core library and provides other additional features like memory management, multithreaded execution of workloads and allocation of the intermediate tensors.
+
+The library can be integrated in an existing external library or application that provides its own scheduler or a specific memory manager. In that case, the right solution is to use only the core library which means that the user must also manage all the memory allocation not only for the input/output tensor but also for the intermediate tensors/variables necessary. On the other hand, if the user doesn't want to care about allocation and multithreading then the right choice is to use the functions from the runtime library.
+
+Apart from these components that get linked into the application, the sources also include the validation test suite and the C++ reference implementations against which all the operators are validated.
+
+
+@section S4_1_supporting_new_operators Supporting new operators
+
+The steps involved in adding support for a new operator in Compute Library are:
+- Add new data types (if required).
+- Add the kernel to the core library.
+- Add the function to the runtime library.
+- Add validation tests.
+    - Add the reference implementation.
+    - Add the fixture.
+    - Register the tests.
+
+@subsection S4_1_1_add_datatypes Adding new data types
+
+Compute Library declares a few new datatypes related to its domain. Kernels and functions in the library process Tensors and Images (Computer Vision functions). Tensors are multi-dimensional arrays with a maximum of Coordinates::num_max_dimensions dimensions; depending on the number of dimensions, tensors can be interpreted as various objects. A scalar can be represented as a zero-dimensional tensor and a vector of numbers can be represented as a one-dimensional tensor. Furthermore, an image is just a 2D tensor, a 3D tensor can be seen as an array of images, a 4D tensor as a 2D array of images, and so on.
+All the datatype classes or structures, such as @ref ITensor, @ref ITensorInfo (all the information of a tensor) and TensorShape, are grouped in the core library folder arm_compute/core; simpler types are in arm_compute/core/Types.h.
+
+If an operator handles a new datatype, it must be added to the library. When adding a new data type to the library, it's necessary to implement the functions that enable printing: the to_string() method and the output stream insertion (<<) operator. Every datatype implements these two functions in utils/TypePrinter.h.
+
+A quick example, in <a href="https://github.com/ARM-software/ComputeLibrary/blob/master/arm_compute/core/Types.h">Types.h</a> we add:
+
+@snippet arm_compute/core/Types.h DataLayout enum definition
+
+And for printing:
+
+@snippet utils/TypePrinter.h Print DataLayout type
+
+In Compute Library, we use namespaces to group all the operators, functions, classes and interfaces. The main namespace to use is arm_compute. In the test suite, the test framework and the individual tests use nested namespaces like @ref test::validation or @ref test::benchmark to group the different purposes of various parts of the suite.
+Utility functions that are shared by multiple operators, like conversion or type cast operators, are in arm_compute/core/Utils.h. Non-inlined function definitions go in the corresponding .cpp files in the src folder.
+Similarly, all common functions that process shapes, like calculating the output shape of an operator or shape conversions, are in arm_compute/core/utils/misc/ShapeCalculator.h.
+
+
+@subsection S4_1_2_add_kernel Add a kernel
+As we mentioned at the beginning, the kernel is the implementation of the operator or algorithm partially using a specific programming language related to the backend we want to use. Adding a kernel in the library means implementing the algorithm in a SIMD technology like Arm® Neon™ or OpenCL. All kernels in Compute Library must implement the common interface IKernel or one of the specific subinterfaces.
+IKernel is the common interface for all the kernels in the core library. It contains the main methods for configuring and running the kernel itself, such as window(), which returns the maximum window the kernel can be executed on, or is_parallelisable(), which indicates whether or not the kernel is parallelizable. If the kernel is parallelizable, the window returned by the window() method can be split into sub-windows which can then be run in parallel; otherwise, only the window returned by window() can be passed to the run method.
+There are specific interfaces for OpenCL and Neon: @ref ICLKernel, INEKernel (using INEKernel = @ref ICPPKernel).
+
+- @ref ICLKernel is the common interface for all the OpenCL kernels. It implements the inherited methods and adds all the methods necessary to configure the CL kernel, such as setting or returning the Local-Workgroup-Size hint, adding a single, array or tensor argument, and setting the targeted GPU architecture according to the CL device. All these methods are used during the configuration and the run of the operator.
+- INEKernel inherits from @ref IKernel as well and is the common interface for all kernels implemented in Neon. It adds just the run and the name methods.
+
+There are two other implementations of @ref IKernel, called @ref ICLSimpleKernel and INESimpleKernel. They are the interfaces for simple kernels that have just one input tensor and one output tensor.
+Creating a new kernel implies adding new files. For an OpenCL kernel:
+- src/core/CL/kernels/CLReshapeLayerKernel.h
+- src/core/CL/cl_kernels/reshape_layer.cl
+- src/core/CL/kernels/CLReshapeLayerKernel.cpp
+- src/core/CL/CLKernelLibrary.cpp
+
+For a Neon kernel:
+- arm_compute/core/NEON/kernels/NEReshapeLayerKernel.h
+- src/core/NEON/kernels/NEReshapeLayerKernel.cpp
+
+We must register the new layer in the respective libraries:
+- src/core/CL/CLKernels.h
+- arm_compute/core/NEON/NEKernels.h
+
+These files contain the list of all kernels available in the corresponding Compute Library's backend, for example CLKernels:
+@code{.cpp}
+... 
+#include "src/core/CL/kernels/CLMinMaxLayerKernel.h"
+#include "src/core/CL/kernels/CLMinMaxLocationKernel.h"
+... 
+#include "src/core/CL/kernels/CLReshapeLayerKernel.h"
+... 
+
+@endcode
+
+For OpenCL we need to update CLKernelLibrary.cpp, adding the appropriate code to embed the .cl kernel in the library. The OpenCL code can be compiled offline and embedded in the library as a binary.
+The essential operations we want to perform with a kernel are:
+- create the kernel object
+- initialize the kernel with the input/output and any other parameters that may be required
+- retrieve the execution window of the kernel and run the whole kernel window in the current thread or use multithreading.
+
+Each kernel will have to implement the following methods:
+- %validate: is a static function that checks if the given info will lead to a valid configuration of the kernel.
+- configure: configure the kernel, its window, accessor, valid region, etc for the given set of tensors and other parameters.
+- run: execute the kernel in the window
+
+The structure of the kernel .cpp file should be similar to the following examples.
+For OpenCL:
+@snippet src/core/gpu/cl/kernels/ClReshapeKernel.cpp ClReshapeKernel Kernel
+The run will call the function defined in the .cl file.
+
+For the Arm® Neon™ backend case:
+@snippet src/core/cpu/kernels/CpuReshapeKernel.cpp NEReshapeLayerKernel Kernel
+
+In the Arm® Neon™ case, there is no need to add an extra file and we implement the kernel in the same NEReshapeLayerKernel.cpp file.
+If the tests are already in place, the new kernel can be tested using the existing tests by adding the configure and run of the kernel to the compute_target() in the fixture.
+
+
+@subsection S4_1_3_add_function Add a function
+
+%Memory management and scheduling of the underlying kernel(s) must be handled by the function implementation. A kernel class must support the window() API, which returns the execution window for the configuration that the kernel is configured for. A window specifies the dimensions of a workload. It has a start and an end on each of the dimensions. A maximum of Coordinates::num_max_dimensions is supported. The runtime layer is expected to query the kernel for the window size and schedule the window as it sees fit. It could choose to split the window into sub-windows so that they can be run in parallel. The split must adhere to the following rules:
+
+- max[n].start() <= sub[n].start() < max[n].end()
+- sub[n].start() < sub[n].end() <= max[n].end()
+- max[n].step() == sub[n].step()
+- (sub[n].start() - max[n].start()) % max[n].step() == 0
+- (sub[n].end() - sub[n].start()) % max[n].step() == 0
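+
+The rules above can be checked one dimension at a time. The following is a minimal sketch, not library code: Dim and is_valid_sub_dimension() are hypothetical names introduced only for illustration, with Dim standing in for one dimension of a window.
+
+@code{.cpp}
+// Hypothetical stand-in for one dimension of a window (not the library type)
+struct Dim
+{
+    int start;
+    int end;
+    int step;
+};
+
+// Returns true if 'sub' satisfies all five split rules with respect to 'max'
+bool is_valid_sub_dimension(const Dim &max, const Dim &sub)
+{
+    if(max.step <= 0)
+    {
+        return false; // Guard against a division by zero in the modulo checks
+    }
+    const bool start_in_range = (max.start <= sub.start) && (sub.start < max.end);
+    const bool end_in_range   = (sub.start < sub.end) && (sub.end <= max.end);
+    const bool same_step      = (max.step == sub.step);
+    const bool start_aligned  = ((sub.start - max.start) % max.step) == 0;
+    const bool size_aligned   = ((sub.end - sub.start) % max.step) == 0;
+    return start_in_range && end_in_range && same_step && start_aligned && size_aligned;
+}
+@endcode
+
+For example, with a full dimension of [0, 16) and a step of 4, the sub-range [8, 16) passes all the checks, whereas [2, 10) fails because its start is not a multiple of the step away from the start of the full range.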
+
+@ref CPPScheduler::schedule provides a sample implementation that is used for Arm® Neon™ kernels.
+%Memory management is the other aspect that the runtime layer is supposed to handle. %Memory management of the tensors is abstracted using TensorAllocator. Each tensor holds a pointer to a TensorAllocator object, which is used to allocate and free the memory at runtime. The implementation that is currently supported in Compute Library allows the memory blocks required by a given operator to be grouped together under a @ref MemoryGroup. Each group can be acquired and released. The underlying implementation of memory groups varies depending on whether Arm® Neon™ or CL is used. The memory group class uses a memory pool to provide the required memory. It also uses the memory manager to manage the lifetime and an IPoolManager to manage the memory pools registered with the memory manager.
+
+
+We have seen the various interfaces for a kernel in the core library; the same file structure design exists in the runtime module. IFunction is the base class for all the functions. It has two child interfaces, ICLSimpleFunction and INESimpleFunction, that are used as base classes for functions which call a single kernel.
+
+The new operator has to implement %validate(), configure() and run(). These methods call the respective functions in the kernel. Multi-threading is used for the kernels that are parallelizable; by default std::thread::hardware_concurrency() threads are used. For Arm® Neon™ functions, CPPScheduler::set_num_threads() can be used to set the number of threads manually, whereas for OpenCL all the kernels are enqueued on the queue associated with CLScheduler and the queue is then flushed.
+For the runtime functions there is an extra method: prepare(). This method prepares the function for the run by doing all the heavy operations that are needed only once (reshaping the weights, releasing the memory that is not necessary after the reshape, etc.). The prepare method can be called standalone; if it has not been called before, it is called in the first run. After that the function is marked as prepared.
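+
+The prepare-once behaviour described above could be sketched as follows. This is only an illustration under stated assumptions: ExampleFunction and its members are hypothetical, and the counters stand in for the real one-off work (such as a weight reshape) and for the kernel execution.
+
+@code{.cpp}
+// Hypothetical function class showing the prepare-once pattern (not library code)
+class ExampleFunction
+{
+public:
+    void prepare()
+    {
+        if(!_is_prepared)
+        {
+            ++_reshape_count; // Stands in for the heavy one-off work
+            _is_prepared = true;
+        }
+    }
+    void run()
+    {
+        prepare(); // Has no effect if prepare() was already called
+        ++_run_count; // Stands in for scheduling the kernels
+    }
+    int reshape_count() const
+    {
+        return _reshape_count;
+    }
+    int run_count() const
+    {
+        return _run_count;
+    }
+
+private:
+    bool _is_prepared{ false };
+    int  _reshape_count{ 0 };
+    int  _run_count{ 0 };
+};
+@endcode
+
+Calling run() several times, with or without an explicit prepare() beforehand, performs the one-off work exactly once in this sketch.
+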
+The files we add are:
+
+OpenCL function
+- arm_compute/runtime/CL/functions/CLReshapeLayer.h
+- src/runtime/CL/functions/CLReshapeLayer.cpp
+
+Neon function
+- arm_compute/runtime/NEON/functions/NEReshapeLayer.h
+- src/runtime/NEON/functions/NEReshapeLayer.cpp
+
+As we did for the kernel, we have to register the new operator in the runtime library by adding it to the relevant header:
+- arm_compute/runtime/CL/CLFunctions.h
+- arm_compute/runtime/NEON/NEFunctions.h
+
+For the special case where the new function calls only one kernel, we can use ICLSimpleFunction or INESimpleFunction as the base class. The configure and the validate methods will simply call the corresponding functions of the kernel. The structure will be:
+@snippet src/runtime/CL/functions/CLReshapeLayer.cpp CLReshapeLayer snippet
+
+
+If the function is more complicated and calls more than one kernel, we have to use the memory manager to manage the intermediate tensors. In the configure() method we call the manage() function, passing the tensor to keep track of. In the run() method we have to acquire all the managed buffers and release them at the end.
+For OpenCL, if we want to add two input tensors and reshape the result:
+
+@code{.cpp}
+using namespace arm_compute;
+
+CLAddReshapeLayer::CLAddReshapeLayer(std::shared_ptr<IMemoryManager> memory_manager)
+    : _memory_group(std::move(memory_manager))
+{
+}
+
+void CLAddReshapeLayer::configure(const ICLTensor *input1, const ICLTensor *input2, ICLTensor *output)
+{
+    // Initialise the intermediate tensor (fill in shape and data type as required)
+    TensorInfo info{};
+    _add_output.allocator()->init(info);
+
+    // Manage intermediate buffers
+    _memory_group.manage(&_add_output);
+
+    // Initialise kernels
+    _add_kernel.configure(input1, input2, &_add_output);
+    _reshape_kernel.configure(&_add_output, output);
+
+    // Allocate intermediate tensors
+    _add_output.allocator()->allocate();
+}
+
+Status CLAddReshapeLayer::validate(const ITensorInfo *input1, const ITensorInfo *input2, const ITensorInfo *output)
+{
+    TensorInfo add_output{};
+    ARM_COMPUTE_RETURN_ON_ERROR(CLAddLayerKernel::validate(input1, input2, &add_output));
+    ARM_COMPUTE_RETURN_ON_ERROR(CLReshapeLayerKernel::validate(&add_output, output));
+    return Status{};
+}
+
+void CLAddReshapeLayer::run()
+{
+    _memory_group.acquire();
+
+    // Run Add (enqueue without flushing the queue)
+    CLScheduler::get().enqueue(_add_kernel, false);
+
+    // Run Reshape
+    CLScheduler::get().enqueue(_reshape_kernel);
+
+    _memory_group.release();
+}
+
+@endcode
+
+For Neon:
+
+@code{.cpp}
+using namespace arm_compute;
+
+NEAddReshapeLayer::NEAddReshapeLayer(std::shared_ptr<IMemoryManager> memory_manager)
+    : _memory_group(std::move(memory_manager))
+{
+}
+
+void NEAddReshapeLayer::configure(const ITensor *input1, const ITensor *input2, ITensor *output)
+{
+    // Initialise the intermediate tensor (fill in shape and data type as required)
+    TensorInfo info{};
+    _add_output.allocator()->init(info);
+
+    // Manage intermediate buffers
+    _memory_group.manage(&_add_output);
+
+    // Initialise kernels
+    _add_kernel.configure(input1, input2, &_add_output);
+    _reshape_kernel.configure(&_add_output, output);
+
+    // Allocate intermediate tensors
+    _add_output.allocator()->allocate();
+}
+
+void NEAddReshapeLayer::run()
+{
+    _memory_group.acquire();
+
+    // Run Add
+    NEScheduler::get().schedule(&_add_kernel, Window::DimY);
+
+    // Run Reshape
+    NEScheduler::get().schedule(&_reshape_kernel, Window::DimY);
+
+    _memory_group.release();
+}
+@endcode
+
+
+At this point, everything is in place at the library level. If you are following a test-driven implementation and all the tests are already in place, we can call the function configuration in the fixture and remove any redundant code, such as the allocation of the intermediate tensors, since it is now done in the function. Run the final tests to check that the results match the expected results from the reference implementation.
+
+@subsection S4_1_4_add_validation Add validation artifacts
+
+@subsubsection S4_1_4_1_add_reference Add the reference implementation and the tests
+As mentioned in the introduction, the reference implementation is a pure C++ implementation without any optimization or backend specific instruction.
+The reference implementation consists of two files in the folder tests/validation/reference:
+- tests/validation/reference/ReshapeLayer.h
+- tests/validation/reference/ReshapeLayer.cpp
+
+where we will put the declaration and the definition of the new operator respectively.
+All the utility functions that are used ONLY in the tests are in tests/validation/helpers.h; for all the others, as mentioned before, there are helpers in the library.
+Both Compute Library and the tests use templates. The reference implementation is a generic implementation that is independent of the data type, and we use templates to generalize over the data type.
+Following the example, let's have a look at the ReshapeLayer operator:
+
+- tests/validation/reference/ReshapeLayer.h
+
+@snippet tests/validation/reference/ReshapeLayer.h ReshapeLayer
+
+- tests/validation/reference/ReshapeLayer.cpp
+
+@snippet tests/validation/reference/ReshapeLayer.cpp ReshapeLayer
+
+An explicit instantiation of the template for the required datatypes must be added in the .cpp file.
+
+@subsubsection S4_1_4_2_add_dataset Add dataset
+One of the parameters of the tests is the dataset. It is used to generate versions of the test case with different inputs.
+To pass the dataset to the fixture data test case there are three options:
+- the operator dataset is simple, so it can be added directly in the test case data declaration
+- we can create a class that returns tuples to the test framework
+
+@snippet tests/datasets/PoolingTypesDataset.h PoolingTypes datasets
+
+- if we want to create the dataset dynamically by combining different parameters, we can create the dataset using iterators.
+For example, the dataset for ReshapeLayer:
+
+@snippet tests/datasets/ReshapeLayerDataset.h ReshapeLayer datasets
+
+@subsubsection S4_1_4_3_add_fixture  Add a fixture and a data test case
+
+Benchmark and validation tests are based on the same framework to setup and run the tests. In addition to running simple, self-contained test functions the framework supports fixtures and data test cases.
+Fixtures can be used to share common setup, teardown or even run tasks among multiple test cases, for that purpose a fixture can define a "setup", "teardown" and "run" method.
+When adding tests for the new operator in the runtime library we need to implement at least the setup method, which calls two methods that configure, run and return the output of the target (CL or Neon) and of the reference (C++ implementation) respectively.
+
+For example, let's have a look at the ReshapeLayer fixture:
+
+@snippet tests/validation/fixtures/ReshapeLayerFixture.h ReshapeLayer fixture
+
+In the fixture class above we can see that the setup method computes the target and the reference and stores them in the two members _target and _reference, which will be used later to check for correctness.
+The compute_target method reflects the exact behavior expected when we call a function. The input and output tensor must be declared, function configured, tensors allocated, the input tensor filled with required data, and finally, the function must be run and the results returned.
+This fixture is used in the test case, which is a parameterized test case that inherits from a fixture. The test case will have access to all public and protected members of the fixture. Only the setup and teardown methods of the fixture will be used. The setup method of the fixture needs to be a template and must accept inputs from the dataset as arguments.
+The body of this function will be used as a test function.
+For the fixture test case the first argument is the name of the test case (which has to be unique within the enclosing test suite), the second argument is the class name of the fixture, the third argument is the dataset mode in which the test will be active (PRECOMMIT or NIGHTLY) and the fourth argument is the dataset.
+For example:
+
+@snippet tests/validation/CL/ActivationLayer.cpp CLActivationLayerFixture snippet
+
+@code{.cpp}
+TEST_SUITE(CL)
+TEST_SUITE(ActivationLayer)
+TEST_SUITE(Float)
+TEST_SUITE(FP16)
+@endcode
+@snippet tests/validation/CL/ActivationLayer.cpp CLActivationLayer Test snippet
+@code{.cpp}
+TEST_SUITE_END()
+TEST_SUITE_END()
+TEST_SUITE_END()
+TEST_SUITE_END()
+@endcode
+
+This will produce a set of tests that can be filtered with "CL/ActivationLayer/Float/FP16/RunSmall". Each test produced from the cartesian product of the dataset is associated with a number and can be filtered by specifying all the parameters.
+*/
+} // namespace arm_compute
diff --git a/docs/contributor_guide/contribution_guidelines.dox b/docs/contributor_guide/contribution_guidelines.dox
new file mode 100644
index 0000000000..9d854136bd
--- /dev/null
+++ b/docs/contributor_guide/contribution_guidelines.dox
@@ -0,0 +1,452 @@
+///
+/// Copyright (c) 2019 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page contribution_guidelines Contribution Guidelines
+
+@tableofcontents
+
+If you want to contribute to Arm Compute Library, be sure to review the following guidelines.
+
+The development is structured in the following way:
+- Release repository: https://github.com/arm-software/ComputeLibrary
+- Development repository: https://review.mlplatform.org/#/admin/projects/ml/ComputeLibrary
+- Please report issues here: https://github.com/ARM-software/ComputeLibrary/issues
+
+@section S5_1_coding_standards Coding standards and guidelines
+
+Best practices (as suggested by clang-tidy):
+
+- No uninitialised values
+
+Helps to prevent undefined behaviour and allows variables to be declared const if they are not changed after initialisation. See http://clang.llvm.org/extra/clang-tidy/checks/cppcoreguidelines-pro-type-member-init.html
+
+@code{.cpp}
+const float32x4_t foo = vdupq_n_f32(0.f);
+const float32x4_t bar = foo;
+
+const int32x4x2_t i_foo = {{
+    vcvtq_s32_f32(foo),
+    vcvtq_s32_f32(foo)
+}};
+const int32x4x2_t i_bar = i_foo;
+@endcode
+
+- No C-style casts (in C++ source code)
+
+Only use static_cast, dynamic_cast, and (if required) reinterpret_cast and const_cast. See http://en.cppreference.com/w/cpp/language/explicit_cast for more information on when to use which type of cast. C-style casts do not differentiate between the different cast types and thus make it easy to violate type safety. Also, due to the prefix notation it is less clear which part of an expression is going to be cast. See http://clang.llvm.org/extra/clang-tidy/checks/cppcoreguidelines-pro-type-cstyle-cast.html
+
+- No implicit casts to bool
+
+Helps to increase readability and might help to catch bugs during refactoring. See http://clang.llvm.org/extra/clang-tidy/checks/readability-implicit-bool-cast.html
+
+@code{.cpp}
+extern int *ptr;
+if(ptr){} // Bad
+if(ptr != nullptr) {} // Good
+
+extern int foo;
+if(foo) {} // Bad
+if(foo != 0) {} // Good
+@endcode
+
+- Use nullptr instead of NULL or 0
+
+The nullptr literal is type-checked and is therefore safer to use. See http://clang.llvm.org/extra/clang-tidy/checks/modernize-use-nullptr.html
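+
+As a small illustration of the type checking, consider an overload set that accepts either an integer or a pointer. The function which_overload() below is hypothetical and exists only for this sketch.
+
+@code{.cpp}
+// Hypothetical overload set used only for illustration
+inline int which_overload(int)
+{
+    return 0; // Integer overload selected
+}
+
+inline int which_overload(const char *)
+{
+    return 1; // Pointer overload selected
+}
+@endcode
+
+Here which_overload(0) selects the integer overload even if a null pointer was intended, whereas which_overload(nullptr) always selects the pointer overload. Depending on how the implementation defines NULL, which_overload(NULL) may select the integer overload or be rejected as ambiguous.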
+
+- No need to explicitly initialise std::string with an empty string
+
+The default constructor of std::string creates an empty string. In general it is therefore not necessary to specify it explicitly. See http://clang.llvm.org/extra/clang-tidy/checks/readability-redundant-string-init.html
+
+@code{.cpp}
+// Instead of
+std::string foo("");
+std::string bar = "";
+
+// The following has the same effect
+std::string foo;
+std::string bar;
+@endcode
+
+- Braces for all control blocks and loops (which have a body)
+
+To increase readability and protect against refactoring errors the body of control blocks and loops must be wrapped in braces. See http://clang.llvm.org/extra/clang-tidy/checks/readability-braces-around-statements.html
+
+For now, loops with an empty body do not need empty braces. This exception might be revoked in the future. In any case, situations in which this exception applies should be rare.
+
+@code{.cpp}
+Iterator it;
+while(it.next()); // No need for braces here
+
+// Make more use of it
+@endcode
+
+- Only one declaration per line
+
+Increases readability and thus helps to prevent errors.
+
+@code{.cpp}
+int a, b; // BAD
+int c, *d; // EVEN WORSE
+
+int e = 0; // GOOD
+int *p = nullptr; // GOOD
+@endcode
+
+- Pass primitive types (and those that are cheap to copy or move) by value
+
+For primitive types it is more efficient to pass them by value instead of by const reference because:
+
+ - the data type might be smaller than the "reference type"
+ - pass by value avoids aliasing and thus allows for better optimisations
+ - pass by value is likely to avoid one level of indirection (references are often implemented as auto dereferenced pointers)
+
+This advice also applies to non-primitive types that have cheap copy or move operations and the function needs a local copy of the argument anyway.
+
+More information:
+
+ - http://stackoverflow.com/a/14013189
+ - http://stackoverflow.com/a/270435
+ - http://web.archive.org/web/20140113221447/http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/
+
+@code{.cpp}
+void foo(int i, long l, float32x4_t f); // Pass-by-value for builtin types
+void bar(const float32x4x4_t &f); // As this is a struct pass-by-const-reference is probably better
+void foobar(const MyLargeCustomTypeClass &m); // Definitely better as const-reference except if a copy has to be made anyway.
+@endcode
+
+- Don't use unions
+
+Unions cannot be used to convert values between different types because (in C++) it is undefined behaviour to read from a member other than the last one that has been assigned to. This limits the use of unions to a few corner cases and therefore the general advice is not to use unions. See http://releases.llvm.org/3.8.0/tools/clang/tools/extra/docs/clang-tidy/checks/cppcoreguidelines-pro-type-union-access.html
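+
+When the bit pattern of a value really has to be reinterpreted as another type, copying the bytes with std::memcpy is a well-defined alternative to a union. The sketch below assumes that float is 32 bits wide; float_bits() is a hypothetical helper and not part of the library.
+
+@code{.cpp}
+#include <cstdint>
+#include <cstring>
+
+// Hypothetical helper: returns the bit pattern of a float without using a union
+inline std::uint32_t float_bits(float value)
+{
+    static_assert(sizeof(std::uint32_t) == sizeof(float), "This sketch assumes a 32-bit float");
+    std::uint32_t bits = 0;
+    std::memcpy(&bits, &value, sizeof(bits)); // Well-defined byte-wise copy
+    return bits;
+}
+@endcode
+
+Compilers typically optimise the copy away, so this form usually costs nothing compared to the union version.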
+
+- Use pre-increment/pre-decrement whenever possible
+
+In contrast to the pre-increment, the post-increment has to make a copy of the incremented object. This might not be a problem for primitive types like int, but for class-like objects that overload the operators, such as iterators, it can have a huge impact on performance. See http://stackoverflow.com/a/9205011
+
+To be consistent across the different cases the general advice is to use the pre-increment operator unless post-increment is explicitly required. The same rules apply for the decrement operator.
+
+@code{.cpp}
+for(size_t i = 0; i < 9; i++); // BAD
+for(size_t i = 0; i < 9; ++i); // GOOD
+@endcode
+
+- Don't use uint in C/C++
+
+The C and C++ standards don't define a uint type. Though some compilers seem to support it by default, it requires including the header sys/types.h. Instead we use the slightly more verbose unsigned int type.
+
+- Don't use unsigned int in function signatures
+
+Unsigned integers are good for representing bitfields and modular arithmetic. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. Mixing signedness of integer types is responsible for an equally large class of problems.
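+
+The wrap-around behaviour can be shown with a short sketch. Both functions below are hypothetical and exist only to contrast the two signatures.
+
+@code{.cpp}
+// BAD: if used > total the result wraps around to a very large value
+inline unsigned int remaining_unsigned(unsigned int total, unsigned int used)
+{
+    return total - used;
+}
+
+// GOOD: the result is negative, which the caller can detect and report
+inline int remaining_signed(int total, int used)
+{
+    return total - used;
+}
+@endcode
+
+With total = 2 and used = 3 the unsigned version returns the maximum value of unsigned int and looks like a valid size, while the signed version returns -1.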
+
+- No "Yoda-style" comparisons
+
+As compilers are now able to warn about accidental assignments if it is likely that the intention has been to compare values it is no longer required to place literals on the left-hand side of the comparison operator. Sticking to the natural order increases the readability and thus prevents logical errors (which cannot be spotted by the compiler). In the rare case that the desired result is to assign a value and check it the expression has to be surrounded by parentheses.
+
+@code{.cpp}
+if(nullptr == ptr || false == cond) // BAD
+{
+    //...
+}
+
+if(ptr == nullptr || cond == false) // GOOD
+{
+    //...
+}
+
+if(ptr = nullptr) // Most likely a mistake. Will cause a compiler warning
+{
+    //...
+}
+
+if((ptr = nullptr) || (cond = false)) // Trust me, I know what I'm doing. No warning.
+{
+    //...
+}
+@endcode
+
+@subsection S5_1_1_rules Rules
+
+ - Use spaces for indentation and alignment. No tabs! Indentation should be done with 4 spaces.
+ - Unix line returns in all the files.
+ - Pointers and reference symbols attached to the variable name, not the type (i.e. char \&foo;, and not char& foo).
+ - No trailing spaces or tabs at the end of lines.
+ - No spaces or tabs on empty lines.
+ - Put { and } on a new line and increase the indentation level for code inside the scope (except for namespaces).
+ - Single space before and after comparison operators ==, <, >, !=.
+ - No space around parentheses.
+ - No space before, one space after ; (unless it is at the end of a line).
+
+@code{.cpp}
+for(int i = 0; i < width * height; ++i)
+{
+    void *d = foo(ptr, i, &addr);
+    static_cast<uint8_t *>(data)[i] = static_cast<uint8_t *>(d)[0];
+}
+@endcode
+
+ - Put a comment after \#else, \#endif, and namespace closing brace indicating the related name
+
+@code{.cpp}
+namespace mali
+{
+#ifdef MALI_DEBUG
+    ...
+#else // MALI_DEBUG
+    ...
+#endif // MALI_DEBUG
+} // namespace mali
+@endcode
+
+- CamelCase for class names only and lower case words separated with _ (snake_case) for all the functions / methods / variables / arguments / attributes.
+
+@code{.cpp}
+class ClassName
+{
+    public:
+        void my_function();
+        int my_attribute() const; // Accessor = attribute name minus '_', const if it's a simple type
+    private:
+        int _my_attribute; // '_' in front of name
+};
+@endcode
+
+- Use quotes instead of angular brackets to include local headers. Use angular brackets for system headers.
+- Also include the module header first, then local headers, and lastly system headers. All groups should be separated by a blank line and sorted lexicographically within each group.
+- Where applicable the C++ version of system headers has to be included, e.g. cstddef instead of stddef.h.
+- See http://llvm.org/docs/CodingStandards.html#include-style
+
+@code{.cpp}
+#include "MyClass.h"
+
+#include "arm_cv/core/Helpers.h"
+#include "arm_cv/core/Types.h"
+
+#include <cstddef>
+#include <numeric>
+@endcode
+
+- Only use "auto" when the type can be explicitly deduced from the assignment.
+
+@code{.cpp}
+auto a = static_cast<float*>(bar); // OK: there is an explicit cast
+auto b = std::make_unique<Image>(foo); // OK: we can see it's going to be an std::unique_ptr<Image>
+auto c = img.ptr(); // NO: Can't tell what the type is without knowing the API.
+auto d = vdup_n_u8(0); // NO: It's not obvious what type this function returns.
+@endcode
+
+- OpenCL:
+    - Use __ in front of the memory types qualifiers and kernel: __kernel, __constant, __private, __global, __local.
+    - Indicate how the global workgroup size / offset / local workgroup size are being calculated.
+
+- Doxygen:
+
+    - No '*' in front of argument names
+    - [in], [out] or [in,out] *in front* of arguments
+    - Skip a line between the description and params and between params and \@return (if there is a return)
+    - Align params names and params descriptions (using spaces), with a single space between the widest column and the next one.
+    - Use an upper case at the beginning of the description
+
+@snippet arm_compute/runtime/NEON/functions/NEActivationLayer.h NEActivationLayer snippet
+
+@subsection S5_1_2_how_to_check_the_rules How to check the rules
+
+astyle (http://astyle.sourceforge.net/) and clang-format (https://clang.llvm.org/docs/ClangFormat.html) can check and help you apply some of these rules.
+
+@subsection S5_1_3_library_size_guidelines Library size: best practices and guidelines
+
+@subsubsection S5_1_3_1_template_suggestions Template suggestions
+
+When writing a new patch we should also keep in mind the effect it will have on the final library size. We can try some of the following things:
+
+ - Place non-dependent template code in a different non-templated class/method
+
+@code{.cpp}
+template<typename T>
+class Foo
+{
+public:
+    enum { v1, v2 };
+    // ...
+};
+@endcode
+
+    can be converted to:
+
+@code{.cpp}
+struct Foo_base
+{
+    enum { v1, v2 };
+    // ...
+};
+
+template<typename T>
+class Foo : public Foo_base
+{
+public:
+    // ...
+};
+@endcode
+
+ - In some cases it's preferable to use runtime switches instead of template parameters
+
+ - Sometimes we can rewrite the code without templates and without any (significant) performance loss. Let's say that we've written a function where the templated argument is only used for casting:
+
+@code{.cpp}
+template <typename T>
+void NETemplatedKernel::run(const Window &window)
+{
+...
+ *(reinterpret_cast<T *>(out.ptr())) = *(reinterpret_cast<const T *>(in.ptr()));
+...
+}
+@endcode
+
+The above snippet can be transformed to:
+
+@code{.cpp}
+void NENonTemplatedKernel::run(const Window &window)
+{
+...
+std::memcpy(out.ptr(), in.ptr(), element_size);
+...
+}
+@endcode
+
+@subsection S5_1_4_secure_coding_practices Secure coding practices
+
+@subsubsection S5_1_4_1_general_coding_practices General Coding Practices
+
+- **Use tested and approved managed code** rather than creating new unmanaged code for common tasks.
+- **Utilize locking to prevent multiple simultaneous requests** or use a synchronization mechanism to prevent race conditions.
+- **Protect shared variables and resources** from inappropriate concurrent access.
+- **Explicitly initialize all your variables and other data stores**, either during declaration or just before the first usage.
+- **In cases where the application must run with elevated privileges, raise privileges as late as possible, and drop them as soon as possible**.
+- **Avoid calculation errors** by understanding your programming language's underlying representation and how it interacts with numeric calculation. Pay close attention to byte size discrepancies, precision, signed/unsigned distinctions, truncation, conversion and casting between types, "not-a-number" calculations, and how your language handles numbers that are too large or too small for its underlying representation.
+- **Restrict users from generating new code** or altering existing code.
+
+
+@subsubsection S5_1_4_2_secure_coding_best_practices Secure Coding Best Practices
+
+- **Validate input**. Validate input from all untrusted data sources. Proper input validation can eliminate the vast majority of software vulnerabilities. Be suspicious of most external data sources, including command line arguments, network interfaces, environmental variables, and user controlled files.
+- **Heed compiler warnings**. Compile code using the default compiler flags that exist in the SConstruct file.
+- Use **static analysis tools** to detect and eliminate additional security flaws.
+- **Keep it simple**. Keep the design as simple and small as possible. Complex designs increase the likelihood that errors will be made in their implementation, configuration, and use. Additionally, the effort required to achieve an appropriate level of assurance increases dramatically as security mechanisms become more complex.
+- **Default deny**. Base access decisions on permission rather than exclusion. This means that, by default, access is denied and the protection scheme identifies conditions under which access is permitted.
+- **Adhere to the principle of least privilege**. Every process should execute with the least set of privileges necessary to complete the job. Any elevated permission should only be accessed for the least amount of time required to complete the privileged task. This approach reduces the opportunities an attacker has to execute arbitrary code with elevated privileges.
+- **Sanitize data sent to other systems**. Sanitize all data passed to complex subsystems such as command shells, relational databases, and commercial off-the-shelf (COTS) components. Attackers may be able to invoke unused functionality in these components through the use of various injection attacks. This is not necessarily an input validation problem because the complex subsystem being invoked does not understand the context in which the call is made. Because the calling process understands the context, it is responsible for sanitizing the data before invoking the subsystem.
+- **Practice defense in depth**. Manage risk with multiple defensive strategies, so that if one layer of defense turns out to be inadequate, another layer of defense can prevent a security flaw from becoming an exploitable vulnerability and/or limit the consequences of a successful exploit. For example, combining secure programming techniques with secure runtime environments should reduce the likelihood that vulnerabilities remaining in the code at deployment time can be exploited in the operational environment.
+
+@subsection S5_1_5_guidelines_for_stable_api_abi Guidelines for stable API/ABI
+
+The Application Programming Interface (API) and Application Binary Interface (ABI) are the interfaces exposed
+to users so their programs can interact with the library efficiently and effectively. Even though changing the API/ABI
+in a backward-incompatible way is not necessarily bad if it improves the library and the experience of other users,
+contributions should be made with an awareness of API/ABI stability. If you'd like to make changes that affect
+the library's API/ABI, please review and follow the guidelines shown in this section. Also, please note that
+these guidelines are not an exhaustive list; they discuss things that might be easily overlooked.
+
+@subsubsection S5_1_5_1_guidelines_for_api Guidelines for API
+
+- When adding new arguments, consider grouping arguments (including the old ones) into a struct rather than adding arguments with default values.
+Introducing a new struct might break the API/ABI once, but it will help to maintain stability afterwards.
+- When new member variables are added, please make sure they are initialized.
+- Avoid adding enum elements in the middle.
+- When removing arguments, follow the deprecation process described in the following section.
+- When changing behavior affecting API contracts, follow the deprecation process described in the following section.
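+
+The guideline about enum elements can be illustrated with a short sketch. All the enum names below are hypothetical; they only show how the underlying values shift when an element is inserted in the middle.
+
+@code{.cpp}
+// Original version: NCHW == 0, NHWC == 1
+enum class LayoutV1
+{
+    NCHW,
+    NHWC
+};
+
+// BAD: inserting in the middle silently changes NHWC from 1 to 2
+enum class LayoutV2Bad
+{
+    NCHW,
+    NDHWC,
+    NHWC
+};
+
+// GOOD: appending at the end keeps the existing values unchanged
+enum class LayoutV2Good
+{
+    NCHW,
+    NHWC,
+    NDHWC
+};
+@endcode
+
+A client binary compiled against the first version keeps passing the value 1 for NHWC, which the second version would interpret as NDHWC.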
+
+@subsubsection S5_1_5_2_guidelines_for_abi Guidelines for ABI
+
+We recommend reading through <a href="https://community.kde.org/Policies/Binary_Compatibility_Issues_With_C%2B%2B">this page</a>
+and double check your contributions to see if they include the changes listed.
+
+Also, for classes that require strong ABI stability, consider using the <a href="https://en.cppreference.com/w/cpp/language/pimpl">pImpl idiom</a>.
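+
+A minimal sketch of the idiom is shown below, assuming C++14. The class Operator and its members are hypothetical. The class definition would normally live in the public header, and everything from Operator::Impl onwards in the source file.
+
+@code{.cpp}
+#include <memory>
+
+// Public header: the layout of Operator does not depend on the private members
+class Operator
+{
+public:
+    Operator();
+    ~Operator(); // Defined in the source file, where Impl is a complete type
+    void set_scale(int scale);
+    int apply(int value) const;
+
+private:
+    struct Impl; // Members can be added here without changing sizeof(Operator)
+    std::unique_ptr<Impl> _impl;
+};
+
+// Source file: the private state is only visible here
+struct Operator::Impl
+{
+    int scale{ 1 };
+};
+
+Operator::Operator()
+    : _impl(std::make_unique<Impl>())
+{
+}
+
+Operator::~Operator() = default;
+
+void Operator::set_scale(int scale)
+{
+    _impl->scale = scale;
+}
+
+int Operator::apply(int value) const
+{
+    return value * _impl->scale;
+}
+@endcode
+
+The trade-off is one extra allocation and one level of indirection per object, in exchange for being able to change the private members without breaking binaries built against an older header.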
+
+@subsubsection S5_1_5_3_api_deprecation_process API deprecation process
+
+In order to deprecate an existing API, these rules should be followed.
+
+- Removal of a deprecated API should wait at least for one official release.
+- Deprecation of runtime APIs should strictly follow the aforementioned period, whereas core APIs can have more flexibility as they are mostly used internally rather than user-facing.
+- Any API changes (update, addition and deprecation) in all components should be well documented by the contribution itself.
+
+Also, it is recommended to use the following utility macros, which are designed to work with both clang and gcc using C++14 and later.
+
+- ARM_COMPUTE_DEPRECATED: Just deprecate the wrapped function
+- ARM_COMPUTE_DEPRECATED_REL: Deprecate the wrapped function and also capture the release in which it was deprecated
+- ARM_COMPUTE_DEPRECATED_REL_REPLACE: Deprecate the wrapped function and also capture the release in which it was deprecated along with a possible replacement candidate
+
+@code{.cpp}
+ARM_COMPUTE_DEPRECATED_REL_REPLACE(20.08, DoNewThing)
+void DoOldThing();
+
+void DoNewThing();
+@endcode
+
+@section S5_2_how_to_submit_a_patch How to submit a patch
+
+To be able to submit a patch to our development repository you need to have a GitHub account. With that, you will be able to sign in to Gerrit where your patch will be reviewed.
+
+The next step is to clone the Compute Library repository:
+
+	git clone "ssh://<your-github-id>@review.mlplatform.org:29418/ml/ComputeLibrary"
+
+If you have cloned from GitHub or through HTTP, make sure you add a new git remote using SSH:
+
+	git remote add acl-gerrit "ssh://<your-github-id>@review.mlplatform.org:29418/ml/ComputeLibrary"
+
+After that, you will need to upload an SSH key to https://review.mlplatform.org/#/settings/ssh-keys
+
+Then, make sure to install the commit-msg Git hook in order to add a change-ID to the commit message of your patch:
+
+	cd "ComputeLibrary" && mkdir -p .git/hooks && curl -Lo `git rev-parse --git-dir`/hooks/commit-msg https://review.mlplatform.org/tools/hooks/commit-msg; chmod +x `git rev-parse --git-dir`/hooks/commit-msg
+
+When your patch is ready, remember to sign off your contribution by adding a line with your name and e-mail address to every git commit message:
+
+	Signed-off-by: John Doe <john.doe@example.org>
+
+You must use your real name; pseudonyms and anonymous contributions are not accepted.
+
+You can add this to your patch with:
+
+	git commit -s --amend
+
+You are now ready to submit your patch for review:
+
+	git push acl-gerrit HEAD:refs/for/master
+
+@section S5_3_code_review Patch acceptance and code review
+
+Once a patch is uploaded for review, a pre-commit test runs on a Jenkins server for continuous integration. In order to be merged, a patch needs to:
+
+- get a "+1 Verified" from the pre-commit job
+- get a "+1 Comments-Addressed": if reviewers leave comments, the committer has to address them all. A comment is considered addressed when the first line of the reply contains the word "Done"
+- get a "+2" from a reviewer, which means the patch has final approval
+
+At the moment, the Jenkins server is not publicly accessible and for security reasons patches submitted by non-whitelisted committers do not trigger the pre-commit tests. For this reason, one of the maintainers has to manually trigger the job.
+
+If the pre-commit test fails, the Jenkins job will post a comment on Gerrit with the details about the failure so that the committer will be able to reproduce the error and fix the issue, if any (sometimes there can be infrastructure issues, a test platform disconnecting for example, where the job needs to be retriggered).
+
+*/
+} // namespace arm_compute
diff --git a/docs/contributor_guide/implementation_topics.dox b/docs/contributor_guide/implementation_topics.dox
new file mode 100644
index 0000000000..4afaa6d6a1
--- /dev/null
+++ b/docs/contributor_guide/implementation_topics.dox
@@ -0,0 +1,143 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page implementation_topic Implementation Topics
+
+@section implementation_topic_windows Windows
+
+A @ref Window represents a workload to execute. It can handle up to @ref Coordinates::num_max_dimensions dimensions.
+Each dimension is defined by a start, end and step.
+
+It can be split into subwindows as long as *all* the following rules hold for every dimension n, where max is the full window and sub is a subwindow:
+
+- max[n].start() <= sub[n].start() < max[n].end()
+- sub[n].start() < sub[n].end() <= max[n].end()
+- max[n].step() == sub[n].step()
+- (sub[n].start() - max[n].start()) % max[n].step() == 0
+- (sub[n].end() - sub[n].start()) % max[n].step() == 0
+
+@section implementation_topic_kernels Kernels
+
+Each implementation of the @ref IKernel interface (base class of all the kernels in the core library) works in the same way:
+
+OpenCL kernels:
+
+@code{.cpp}
+// Initialize the CLScheduler with the default context and default command queue
+// Implicitly initializes the CLKernelLibrary to use ./cl_kernels as location for OpenCL kernels files and sets a default device for which OpenCL programs are built.
+CLScheduler::get().default_init();
+
+cl::CommandQueue q = CLScheduler::get().queue();
+// Create a kernel object:
+MyKernel kernel;
+// Initialize the kernel with the input/output and options you want to use:
+kernel.configure(input, output, option0, option1);
+// Retrieve the execution window of the kernel:
+const Window& max_window = kernel.window();
+// Run the whole kernel in the current thread:
+kernel.run( q, max_window ); // Enqueue the kernel to process the full window on the default queue
+
+// Wait for the processing to complete:
+q.finish();
+@endcode
+
+Neon / CPP kernels:
+
+@code{.cpp}
+// Create a kernel object:
+MyKernel kernel;
+// Initialize the kernel with the input/output and options you want to use:
+kernel.configure(input, output, option0, option1);
+// Retrieve the execution window of the kernel:
+const Window& max_window = kernel.window();
+// Run the whole kernel in the current thread:
+kernel.run( max_window ); // Run the kernel on the full window
+@endcode
+
+@section implementation_topic_multithreading Multi-threading
+
+The previous section shows how to run an Arm® Neon™ / CPP kernel in the current thread. However, if your system has several CPU cores, you will probably want the kernel to use several of them. Here is how this can be done:
+
+@code{.cpp}
+    ThreadInfo info;
+    info.cpu_info = &_cpu_info;
+
+    const Window      &max_window     = kernel->window();
+    const unsigned int num_iterations = max_window.num_iterations(split_dimension);
+    info.num_threads                  = std::min(num_iterations, _num_threads);
+
+    if(num_iterations == 0)
+    {
+        return;
+    }
+
+    if(!kernel->is_parallelisable() || info.num_threads == 1)
+    {
+        kernel->run(max_window, info);
+    }
+    else
+    {
+        int  t         = 0;
+        auto thread_it = _threads.begin();
+
+        for(; t < info.num_threads - 1; ++t, ++thread_it)
+        {
+            Window win     = max_window.split_window(split_dimension, t, info.num_threads);
+            info.thread_id = t;
+            thread_it->start(kernel, win, info);
+        }
+
+        // Run last part on main thread
+        Window win     = max_window.split_window(split_dimension, t, info.num_threads);
+        info.thread_id = t;
+        kernel->run(win, info);
+
+        try
+        {
+            for(auto &thread : _threads)
+            {
+                thread.wait();
+            }
+        }
+        catch(const std::system_error &e)
+        {
+            std::cerr << "Caught system_error with code " << e.code() << " meaning " << e.what() << '\n';
+        }
+    }
+@endcode
+
+This is a very basic implementation which was originally used in the Arm® Neon™ runtime library by all the Arm® Neon™ functions.
+
+@sa CPPScheduler
+
+@note Some kernels need a local temporary buffer to perform their calculations. In order to avoid memory corruption between threads, the local buffer must be of size ```memory_needed_per_thread * num_threads```, and a unique thread_id in the range [0, num_threads - 1] must be assigned to the @ref ThreadInfo object passed to the ```run``` function.
+
+
+@section implementation_topic_cl_scheduler OpenCL kernel library
+
+All OpenCL kernels used by the library are built and stored in @ref CLKernelLibrary.
+If the library is compiled with embed_kernels=0, the application can set the path to the OpenCL kernels by calling @ref CLKernelLibrary::init(). By default, the path is set to "./cl_kernels".
+*/
+} // namespace arm_compute
\ No newline at end of file
diff --git a/docs/user_guide/advanced.dox b/docs/user_guide/advanced.dox
new file mode 100644
index 0000000000..86ee2ce756
--- /dev/null
+++ b/docs/user_guide/advanced.dox
@@ -0,0 +1,114 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page advanced Advanced
+
+@tableofcontents
+
+@section S1_8_cl_tuner OpenCL Tuner
+
+The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS).
+The optimal LWS for each unique OpenCL kernel configuration is stored in a table. This table can be imported from or exported to a file.
+The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run to use in subsequent calls to the kernel. It supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
+In order for the performance numbers to be meaningful, you must disable the GPU power management and set the GPU to a fixed frequency for the entire duration of the tuning phase.
+
+If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms", available at the following link:
+
+https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice
+
+Tuning a network from scratch can take a long time and considerably affect the execution time of the first run of your network. For this reason, it is recommended to store the CLTuner's results in a file to amortize this time when you re-use either the same network or functions with the same configurations. The tuning is performed only once for each OpenCL kernel.
+
+CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (e.g. Convolution Layer, Pooling Layer, Fully Connected Layer ...) can be called multiple times with different parameters, we associate an "id" (called "config_id") with each kernel to distinguish the unique configurations.
+
+    #Example: 2 unique Matrix Multiply configurations
+@code{.cpp}
+    TensorShape a0 = TensorShape(32,32);
+    TensorShape b0 = TensorShape(32,32);
+    TensorShape c0 = TensorShape(32,32);
+    TensorShape a1 = TensorShape(64,64);
+    TensorShape b1 = TensorShape(64,64);
+    TensorShape c1 = TensorShape(64,64);
+
+    CLTensor a0_tensor;
+    CLTensor b0_tensor;
+    CLTensor c0_tensor;
+    CLTensor a1_tensor;
+    CLTensor b1_tensor;
+    CLTensor c1_tensor;
+
+    a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
+    b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
+    c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
+    a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
+    b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
+    c1_tensor.allocator()->init(TensorInfo(c1, 1, DataType::F32));
+
+    CLGEMM gemm0;
+    CLGEMM gemm1;
+
+    // Configuration 0
+    gemm0.configure(&a0_tensor, &b0_tensor, nullptr, &c0_tensor, 1.0f, 0.0f);
+
+    // Configuration 1
+    gemm1.configure(&a1_tensor, &b1_tensor, nullptr, &c1_tensor, 1.0f, 0.0f);
+@endcode
+
+@subsection S1_8_1_cl_tuner_how_to How to use it
+
+All the graph examples in the Compute Library's "examples" folder and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file:
+
+    #Enable CL tuner
+    ./graph_mobilenet --enable-tuner --target=CL
+    ./arm_compute_benchmark --enable-tuner
+
+    #Export/Import to/from a file
+    ./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
+    ./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv
+
+If you are importing the CLTuner's results from a file, the newly tuned LWS values will be appended to it.
+
+Whether you are benchmarking the graph examples or the test cases in arm_compute_benchmark, remember to:
+
+    -# Disable the power management
+    -# Keep the GPU frequency constant
+    -# Run the network multiple times (e.g. 10).
+
+If you are not using the graph API or the benchmark infrastructure, you will need to manually pass a CLTuner object to CLScheduler before configuring any function.
+
+@code{.cpp}
+CLTuner tuner;
+
+// Setup Scheduler
+CLScheduler::get().default_init(&tuner);
+@endcode
+
+After the first run, the CLTuner's results can be exported to a file using the method "save_to_file()".
+- tuner.save_to_file("results.csv");
+
+This file can also be imported using the method "load_from_file()".
+- tuner.load_from_file("results.csv");
+
+*/
+} // namespace
\ No newline at end of file
diff --git a/docs/user_guide/api.dox b/docs/user_guide/api.dox
new file mode 100644
index 0000000000..39282046a9
--- /dev/null
+++ b/docs/user_guide/api.dox
@@ -0,0 +1,135 @@
+///
+/// Copyright (c) 2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page api Application Programming Interface
+
+@tableofcontents
+
+@section api_overview Overview
+
+In this section we present Compute Library's application programming interface (API) architecture along with
+a detailed explanation of its components. Compute Library's API consists of multiple high-level operators and
+even more internally distinct computational blocks that can be executed on a command queue.
+Operators can be bound to multiple Tensor objects and executed concurrently or asynchronously if needed.
+All operators and associated objects are encapsulated in a Context-based mechanism, which provides all related
+construction services.
+
+@section api_objects Fundamental objects
+
+Compute Library consists of a list of fundamental objects that are responsible for creating and orchestrating operator execution.
+Below we present these objects in more detail.
+
+@subsection api_objects_context AclContext or Context
+
+AclContext or Context acts as a central creational aggregate service. All other objects are bound to or created from a context.
+It provides, internally, common facilities such as
+- allocators for object creation or backing memory allocation
+- serialization interfaces
+- any other modules that affect the construction of objects (e.g., program cache for OpenCL).
+
+The following sections describe the parameters that can be given when creating a Context.
+
+@subsubsection api_object_context_target AclTarget
+Context is initialized with a backend target (AclTarget) as different backends might have a different subset of services.
+Currently the following targets are supported:
+- #AclCpu: a generic CPU target that accelerates primitives through SIMD technologies
+- #AclGpuOcl: a target for GPU acceleration using OpenCL
+
+@subsubsection api_object_context_execution_mode AclExecutionMode
+An execution mode (AclExecutionMode) can be passed as an argument that affects the operator creation.
+At the moment the following execution modes are supported:
+- #AclPreferFastRerun: Provides faster re-run. It can be used when the operators are expected to be executed multiple
+times under the same execution context
+- #AclPreferFastStart: Provides faster single execution. It can be used when the operators will be executed only once,
+thus reducing their latency is important (Currently, it is not implemented)
+
+@subsubsection api_object_context_capabilitys AclTargetCapabilities
+Context creation can also take a list of hardware capabilities as one of its parameters. This is currently
+available only for the CPU backend. A list of architecture capabilities can be passed to influence the selection
+of the underlying kernels. Such capabilities can be, for example, the explicit enablement of SVE or of the
+dot product instruction.
+@note The underlying hardware should support the given capability list.
+
+@subsubsection api_object_context_allocator Allocator
+An allocator object that implements @ref AclAllocator can be passed to the Context upon its creation.
+This user-provided allocator will be used for allocation of any internal backing memory.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClContext) or set (@ref AclSetClContext) the internal OpenCL context.
+
+@subsection api_objects_tensor AclTensor or Tensor
+
+A tensor is a mathematical object that can describe physical properties like matrices.
+It can also be considered a generalization of matrices that can represent arbitrary
+dimensionalities. AclTensor is an abstracted interface that represents a tensor.
+
+In addition to the elements of the physical properties it represents, an AclTensor
+also contains information such as shape, data type, data layout and strides. This not only
+fully describes the characteristics of the physical properties but also provides information
+on how the object stored in memory should be traversed. @ref AclTensorDescriptor is a dedicated
+object to represent such metadata.
+
+@note The allocation of an AclTensor can be deferred until external memory is imported
+as backing memory to accomplish a zero-copy context.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClMem) the internal OpenCL memory object.
+
+As Tensors can reside in different memory spaces, @ref AclMapTensor and @ref AclUnmapTensor entrypoints
+are provided to map Tensors in and out of the host memory system, respectively.
+
+@subsection api_objects_queue AclQueue or Queue
+
+AclQueue acts as a runtime aggregate service. It provides facilities to schedule
+and execute operators using underlying hardware. It also contains services like
+tuning mechanisms (e.g., Local workgroup size tuning for OpenCL) that can be specified
+during operator execution.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClQueue) or set (@ref AclSetClQueue) the internal OpenCL queue.
+
+@section api_internal Internal
+@subsection api_internal_operator_vs_kernels Operators vs Kernels
+
+Internally, Compute Library separates the executable primitives into two categories, kernels and operators,
+which operate in a hierarchical way.
+
+A kernel is the lowest-level computation block whose responsibility is performing a task on a given group of data.
+For design simplicity, kernel computation does NOT involve the following:
+
+- Memory allocation: All the memory manipulation should be handled by the caller.
+- Multi-threading: The information on how the workload can be split is provided by kernels,
+so the caller can effectively distribute the workload to multiple threads.
+
+On the other hand, operators combine one or multiple kernels to achieve more complex calculations.
+The responsibilities of the operators can be summarized as follows:
+
+- Defining the scheduling policy and dispatching of the underlying kernels to the hardware backend
+- Providing information to the caller required by the computation (e.g., memory requirements)
+- Allocation of any required auxiliary memory if it isn't given by its caller explicitly
+
+*/
+} // namespace arm_compute
diff --git a/docs/user_guide/data_layout.dox b/docs/user_guide/data_layout.dox
new file mode 100644
index 0000000000..48f15acd63
--- /dev/null
+++ b/docs/user_guide/data_layout.dox
@@ -0,0 +1,41 @@
+///
+/// Copyright (c) 2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+
+namespace arm_compute
+{
+/**
+@page data_layout_support Data Layout Support
+
+@section data_layout_support_supported_data_layout Supported Data Layouts
+
+Compute Library supports the following data layouts, where
+the right-most letter represents the fastest changing dimension:
+
+- NHWC: The native layout of Compute Library that delivers the best performance, where channels are in the fastest changing dimension
+- NCHW: Legacy layout where width is in the fastest changing dimension
+
+Here, N = batch, C = channel, H = height and W = width.
+
+*/
+} // namespace
diff --git a/docs/user_guide/data_type.dox b/docs/user_guide/data_type.dox
new file mode 100644
index 0000000000..7083270a07
--- /dev/null
+++ b/docs/user_guide/data_type.dox
@@ -0,0 +1,47 @@
+///
+/// Copyright (c) 2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page data_type_support Data Type Support
+
+@tableofcontents
+
+@section data_type_support_supported_data_type Supported Data Types
+
+Compute Library supports the following data types. More detailed information
+can be found in the documentation of each operator, since the data types supported
+by each operator vary.
+
+- BFLOAT16: 16-bit non-standard brain floating point
+- QASYMM8: 8-bit unsigned asymmetric quantized
+- QASYMM8_SIGNED: 8-bit signed asymmetric quantized
+- QSYMM8_PER_CHANNEL: 8-bit signed symmetric quantized (Used for the weights)
+- QSYMM8: 8-bit signed symmetric quantized
+- QSYMM16: 16-bit signed symmetric quantized
+- F32: 32-bit single precision floating point
+- F16: 16-bit half precision floating point
+- S32: 32-bit signed integer
+*/
+} // namespace
diff --git a/docs/user_guide/errata.dox b/docs/user_guide/errata.dox
new file mode 100644
index 0000000000..0c8d684017
--- /dev/null
+++ b/docs/user_guide/errata.dox
@@ -0,0 +1,76 @@
+///
+/// Copyright (c) 2019-2020 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page errata Errata
+
+@tableofcontents
+
+@section S7_1_errata Errata
+
+- Under certain conditions, CLFullyConnectedLayer quantized tests may fail due to an issue in the test framework.
+    - Versions Affected: 21.02
+    - OSs Affected: Linux
+    - Conditions:
+        - armv7a architecture
+        - release mode
+        - asserts enabled
+
+- A wrong test configuration has been found in CLGEMMMatrixMultiplyReshapedOnlyRHS set of tests.
+    - Versions Affected: >= 20.11
+    - Conditions:
+        - Data type input: F32/F16
+        - Fused bounded relu activation with coefficient 'a' being negative
+
+- Under certain conditions, the validation test case 'CL/DirectConvolutionLayer/Float/FP32/RunSmall9x9\@InputShape=32x37x3x4:StrideX=1:StrideY=1:PadX=0:PadY=0:KernelSize=9:NumKernels=1:DataType=F32:ActivationInfo=LU_BOUNDED_RELU:DataLayout=NHWC' may fail.
+    - Versions Affected: >= v20.08
+    - Conditions:
+        - The validation suite has to run in nightly mode and execute 40k+ test cases before the test mentioned above
+
+- Under certain conditions, benchmark examples can hang when OpenCL profiling queues are enabled.
+    - Versions Affected: >= v19.11
+    - OSs Affected: Linux
+    - Conditions:
+        - Arm® Mali™ DDK r1p0 - r8p0, and
+        - Linux kernel >= 4.4
+
+- On Android with arm64-v8a/arm64-v8.2-a architecture, Arm® Neon™ validation tests can fail when compiled using Android NDK
+  >= r18b in debug mode (https://github.com/android/ndk/issues/1135).
+    - Versions Affected: >= v19.11
+    - OSs Affected: Android
+    - Conditions:
+        - arm64-v8a/arm64-v8.2-a architecture, and
+        - Compiled using Android NDK >= r18b in debug mode.
+
+- An issue has been identified with CLCast.
+    - Versions Affected: >= 18.11
+    - Conditions:
+        - Data type input: F32
+        - Data type output: All integer types
+        - Conversion policy: SATURATE
+    - Result: OpenCL backend will always wrap around instead of saturating for out-of-range inputs
+
+*/
+} // namespace
diff --git a/docs/user_guide/how_to_build_and_run_examples.dox b/docs/user_guide/how_to_build_and_run_examples.dox
new file mode 100644
index 0000000000..e57183e891
--- /dev/null
+++ b/docs/user_guide/how_to_build_and_run_examples.dox
@@ -0,0 +1,541 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page how_to_build How to Build and Run Examples
+
+@tableofcontents
+
+@section S1_1_build_options Build options
+
+scons 2.3 or above is required to build the library.
+To see the build options available simply run ```scons -h```:
+
+        debug: Debug (yes|no)
+            default: False
+
+        asserts: Enable asserts (this flag is forced to 1 for debug=1) (yes|no)
+            default: False
+
+        logging: Logging (this flag is forced to 1 for debug=1) (yes|no)
+            default: False
+
+        arch: Target Architecture (armv7a|arm64-v8a|arm64-v8.2-a|arm64-v8.2-a-sve|arm64-v8.2-a-sve2|x86_32|x86_64|armv8a|armv8.2-a|armv8.2-a-sve|armv8.6-a|armv8.6-a-sve|armv8.6-a-sve2|armv8r64|x86)
+            default: armv7a
+
+        estate: Execution State (auto|32|64)
+            default: auto
+
+        os: Target OS (linux|android|macos|tizen|bare_metal)
+            default: linux
+
+        build: Build type (native|cross_compile|embed_only)
+            default: cross_compile
+
+        examples: Build example programs (yes|no)
+            default: True
+
+        gemm_tuner: Build gemm_tuner programs (yes|no)
+            default: True
+
+        Werror: Enable/disable the -Werror compilation flag (yes|no)
+            default: True
+
+        standalone: Builds the tests as standalone executables, links statically with libgcc, libstdc++ and libarm_compute (yes|no)
+            default: False
+
+        opencl: Enable OpenCL support (yes|no)
+            default: True
+
+        neon: Enable Arm® Neon™ support (yes|no)
+            default: False
+
+        embed_kernels: Embed OpenCL kernels in library binary (yes|no)
+            default: True
+
+        compress_kernels: Compress embedded OpenCL kernels in library binary. Note embed_kernels should be enabled as well (yes|no)
+            default: False
+
+        set_soname: Set the library's soname and shlibversion (requires SCons 2.4 or above) (yes|no)
+            default: False
+
+        openmp: Enable OpenMP backend (yes|no)
+            default: False
+
+        cppthreads: Enable C++11 threads backend (yes|no)
+            default: True
+
+        build_dir: Specify sub-folder for the build ( /path/to/build_dir )
+            default: .
+
+        install_dir: Specify sub-folder for the install ( /path/to/install_dir )
+            default:
+
+        exceptions: Enable/disable C++ exception support (yes|no)
+            default: True
+
+        linker_script: Use an external linker script ( /path/to/linker_script )
+            default:
+
+        custom_options: Custom options that can be used to turn on/off features
+            (all|none|comma-separated list of names)
+            allowed names: disable_mmla_fp
+            default: none
+
+        data_type_support: Enable a list of data types to support
+            (all|none|comma-separated list of names)
+            allowed names: qasymm8 qasymm8_signed qsymm16 fp16 fp32
+            default: all
+
+        toolchain_prefix: Override the toolchain prefix
+            default:
+
+        compiler_prefix: Override the compiler prefix
+            default:
+
+        extra_cxx_flags: Extra CXX flags to be appended to the build command
+            default:
+
+        extra_link_flags: Extra LD flags to be appended to the build command
+            default:
+
+        compiler_cache: Command to prefix to the C and C++ compiler (e.g ccache)
+            default:
+
+        specs_file: Specs file to use
+            default: rdimon.specs
+
+        benchmark_examples: Build benchmark examples programs (yes|no)
+            default: False
+
+        validate_examples: Build validate examples programs (yes|no)
+            default: False
+
+        reference_openmp: Build reference validation with openmp (yes|no)
+            default: True
+
+        validation_tests: Build validation test programs (yes|no)
+            default: False
+
+        benchmark_tests: Build benchmark test programs (yes|no)
+            default: False
+
+        test_filter: Pattern to specify the tests' filenames to be compiled
+            default: *.cpp
+
+        pmu: Enable PMU counters (yes|no)
+            default: False
+
+        mali: Enable Arm® Mali™ hardware counters (yes|no)
+            default: False
+
+        external_tests_dir: Add examples, benchmarks and tests to the tests suite from an external path ( /path/to/external_tests_dir )
+            default:
+
+@b debug / @b asserts:
+ - With debug=1 asserts are enabled, and the library is built with symbols and no optimisations enabled.
+ - With debug=0 and asserts=1: Optimisations are enabled and symbols are removed, however all the asserts are still present (This is about 20% slower than the release build)
+ - With debug=0 and asserts=0: All optimisations are enabled and no validation is performed. If the application misuses the library it is likely to result in a crash. (Only use this mode once you are sure your application is working as expected.)
+
+@b arch: The x86_32 and x86_64 targets can only be used with neon=0 and opencl=1.
+
+@b os: Choose the operating system you are targeting: Linux, Android or bare metal.
+@note bare metal can only be used for Arm® Neon™ (not OpenCL). Only static libraries get built and Neon's multi-threading support is disabled.
+
+@b build: You can either build directly on your device (native) or cross compile from your desktop machine (cross_compile). In both cases make sure the compiler is available in your path.
+
+@note If you want to natively compile for 32bit on a 64bit Arm device running a 64bit OS then you will have to use build=cross_compile too.
+
+There is also an 'embed_only' option which will generate all the .embed files for the OpenCL kernels. This might be useful if using a different build system to compile the library.
+
+In addition, the option 'compress_kernels' will compress the embedded OpenCL kernel files using zlib and inject them in the library. This is useful for reducing the binary size. Note that this option is only available for Android when 'embed_kernels' is enabled.
+
+@b Werror: If you are compiling using the same toolchains as the ones used in this guide then there shouldn't be any warnings and therefore you should be able to keep Werror=1. If the library fails to build with a different compiler version because of warnings interpreted as errors, and you are sure the warnings are not important, you might want to try to build with Werror=0 (but please do report the issue on GitHub).
+
+@b opencl / @b neon: Choose which SIMD technology you want to target. (Neon for Arm Cortex-A CPUs or OpenCL for Arm® Mali™ GPUs)
+
+@b embed_kernels: For OpenCL only: set embed_kernels=1 if you want the OpenCL kernels to be built in the library's binaries instead of being read from separate ".cl" / ".cs" files. If embed_kernels is set to 0 then the application can set the path to the folder containing the OpenCL kernel files by calling CLKernelLibrary::init(). By default the path is set to "./cl_kernels".
+
+@b set_soname: Enable this to build a versioned variant of the library.
+
+If enabled the library will contain a SONAME and SHLIBVERSION and some symlinks will automatically be created between the objects.
+Example:
+  libarm_compute_core.so -> libarm_compute_core.so.1.0.0
+  libarm_compute_core.so.1 -> libarm_compute_core.so.1.0.0
+  libarm_compute_core.so.1.0.0
+
+@note This option is disabled by default as it requires SCons version 2.4 or above.
+
+@b extra_cxx_flags: Custom CXX flags which will be appended to the end of the build command.
+
+@b build_dir: Build the library in a subfolder of the "build" folder. (This allows several configurations to be built in parallel.)
+
+@b examples: Enable or disable the build of the examples.
+
+@b validation_tests: Enable the build of the validation suite.
+
+@b benchmark_tests: Enable the build of the benchmark tests
+
+@b pmu: Enable the PMU cycle counter to measure execution time in benchmark tests. (Your device needs to support it)
+
+@b mali: Enable the collection of Arm® Mali™ hardware counters to measure execution time in benchmark tests. (Your device needs to have an Arm® Mali™ driver that supports it)
+
+@b openmp: Build in the OpenMP scheduler for Neon.
+
+@note Only works when building with g++ not clang++
+
+@b cppthreads: Build in the C++11 scheduler for Neon.
+
+@sa Scheduler::set
+
+@b external_tests_dir: Add examples, benchmarks and tests to the tests suite from an external path ( /path/to/external_tests_dir )
+
+In order to use this option, the external tests directory must have the following structure:
+
+    EXTERNAL_TESTS_DIR:
+    └── tests
+        ├── benchmark
+        │   ├── CL
+        │   ├── datasets
+        │   ├── fixtures
+        │   └── Neon
+        └── validation
+            ├── CL
+            ├── datasets
+            ├── fixtures
+            └── Neon
+
+Then, build the library with `external_tests_dir=<PATH_TO_EXTERNAL_TESTS_DIR>`.
+
+@section S1_2_linux Building for Linux
+
+@subsection S1_2_1_library How to build the library ?
+
+For Linux, the library was successfully built and tested using the following Linaro GCC toolchains:
+
+ - gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
+ - gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu
+
+To cross-compile the library in debug mode, with Arm® Neon™ only support, for Linux 32bit:
+
+	scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=linux arch=armv7a
+
+To cross-compile the library in asserts mode, with OpenCL only support, for Linux 64bit:
+
+	scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a
+
+You can also compile the library natively on an Arm device by using <b>build=native</b>:
+
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=arm64-v8a build=native
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=native
+
+@note g++ for Arm is mono-arch, therefore if you want to compile for Linux 32bit on a Linux 64bit platform you will have to use a cross compiler.
+
+For example on a 64bit Debian based system you would have to install <b>g++-arm-linux-gnueabihf</b>
+
+	apt-get install g++-arm-linux-gnueabihf
+
+Then run
+
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=cross_compile
+
+or simply remove the build parameter as build=cross_compile is the default value:
+
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a
+
+@subsection S1_2_2_examples How to manually build the examples ?
+
+The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
+
+@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
+
+To cross compile an Arm® Neon™ example for Linux 32bit:
+
+	arm-linux-gnueabihf-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute -larm_compute_core -o neon_convolution
+
+To cross compile an Arm® Neon™ example for Linux 64bit:
+
+	aarch64-linux-gnu-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -L. -larm_compute -larm_compute_core -o neon_convolution
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+To cross compile an OpenCL example for Linux 32bit:
+
+	arm-linux-gnueabihf-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
+
+To cross compile an OpenCL example for Linux 64bit:
+
+	aarch64-linux-gnu-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+To cross compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
+
+For example, to cross compile the "graph_lenet" example for Linux 32bit:
+
+	arm-linux-gnueabihf-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
+
+For example, to cross compile the "graph_lenet" example for Linux 64bit:
+
+	aarch64-linux-gnu-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core
+
+To compile natively (i.e. directly on an Arm device) for Arm® Neon™ for Linux 32bit:
+
+	g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -larm_compute -larm_compute_core -o neon_convolution
+
+To compile natively (i.e. directly on an Arm device) for Arm® Neon™ for Linux 64bit:
+
+	g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute -larm_compute_core -o neon_convolution
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option)
+
+To compile natively (i.e. directly on an Arm device) for OpenCL for Linux 32bit or Linux 64bit:
+
+	g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
+
+To compile natively the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
+
+For example, to natively compile the "graph_lenet" example for Linux 32bit:
+
+	g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
+
+For example, to natively compile the "graph_lenet" example for Linux 64bit:
+
+	g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option)
+
+@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core
+
+@note These two commands assume libarm_compute.so is available in your library path. If it is not, add the path to it using -L (e.g. -Llib/linux-arm64-v8a-neon-cl-asserts/)
+@note You might need to export the path to OpenCL library as well in your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.
+
+To run the built executable simply run:
+
+	LD_LIBRARY_PATH=build ./neon_convolution
+
+or
+
+	LD_LIBRARY_PATH=build ./cl_convolution
+
+@note Examples accept different types of arguments. To find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
+
+For example:
+
+	LD_LIBRARY_PATH=. ./graph_lenet --help
+
+Below is a list of the common parameters among the graph examples :
+@snippet utils/CommonGraphOptions.h Common graph examples parameters
+
+@subsection S1_2_3_sve Build for SVE or SVE2
+
+In order to build for SVE or SVE2 you need a compiler that supports them. You can find more information in the following links:
+    -# GCC: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/sve-support
+    -# LLVM: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/llvm-toolchain/sve-support
+
+@note You then need to indicate the toolchain using the scons "toolchain_prefix" parameter.
+
+An example build command with SVE is:
+
+        scons arch=arm64-v8.2-a-sve os=linux build_dir=arm64 -j55 standalone=0 opencl=0 openmp=0 validation_tests=1 neon=1 cppthreads=1 toolchain_prefix=aarch64-none-linux-gnu-
+
+@section S1_3_android Building for Android
+
+For Android, the library was successfully built and tested using Google's standalone toolchains:
+ - clang++ from NDK r18b for armv7a
+ - clang++ from NDK r20b for arm64-v8a
+ - clang++ from NDK r20b for arm64-v8.2-a with FP16 support
+
+For NDK r18 or older, here is a guide to <a href="https://developer.android.com/ndk/guides/standalone_toolchain.html">create your Android standalone toolchains from the NDK</a>:
+- Download the NDK r18b from here: https://developer.android.com/ndk/downloads/index.html to directory $NDK
+- Make sure you have Python 2.7 installed on your machine.
+- Generate the 32 and/or 64 bit toolchains in your toolchain directory $MY_TOOLCHAINS by running the following commands:
+
+	$NDK/build/tools/make_standalone_toolchain.py --arch arm64 --install-dir $MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b --stl libc++ --api 21
+	$NDK/build/tools/make_standalone_toolchain.py --arch arm --install-dir $MY_TOOLCHAINS/arm-linux-android-ndk-r18b --stl libc++ --api 21
+
+For NDK r19 or newer, you can directly <a href="https://developer.android.com/ndk/downloads">Download</a> the NDK package for your development platform, without the need to launch the make_standalone_toolchain.py script. You can find all the prebuilt binaries inside $NDK/toolchains/llvm/prebuilt/$OS_ARCH/bin/.
+@attention The build script will look for a binary named "aarch64-linux-android-clang++", while the prebuilt binaries will have their API version as a suffix to their filename (e.g. "aarch64-linux-android21-clang++"). You should copy or rename the binary to remove this suffix or, alternatively, create an alias for it.
+
+@attention We used to use gnustl but as of NDK r17 it is deprecated so we switched to libc++
+
+@note Make sure to add the toolchains to your PATH:
+
+	export PATH=$PATH:$MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b/bin:$MY_TOOLCHAINS/arm-linux-android-ndk-r18b/bin
+
+@subsection S1_3_1_library How to build the library ?
+
+To cross-compile the library in debug mode, with Arm® Neon™ only support, for Android 32bit:
+
+	CXX=clang++ CC=clang scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=android arch=armv7a
+
+To cross-compile the library in asserts mode, with OpenCL only support, for Android 64bit:
+
+	CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=arm64-v8a
+
+@subsection S1_3_2_examples How to manually build the examples ?
+
+The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
+
+@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
+
+Once you've got your Android standalone toolchain built and added to your path you can do the following:
+
+To cross compile an Arm® Neon™ example:
+
+	#32 bit:
+	arm-linux-androideabi-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_arm -static-libstdc++ -pie
+	#64 bit:
+	aarch64-linux-android-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_aarch64 -static-libstdc++ -pie
+
+To cross compile an OpenCL example:
+
+	#32 bit:
+	arm-linux-androideabi-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
+	#64 bit:
+	aarch64-linux-android-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
+
+To cross compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the library arm_compute_graph also.
+
+	#32 bit:
+	arm-linux-androideabi-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
+	#64 bit:
+	aarch64-linux-android-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
+
+@note Due to some issues in older versions of the Arm® Mali™ OpenCL DDK (<= r13p0), we recommend linking arm_compute statically on Android.
+@note When linked statically the arm_compute_graph library currently needs the --whole-archive linker flag in order to work properly
+
+Then all you need to do is upload the executable and the shared library to the device using ADB:
+
+	adb push neon_convolution_arm /data/local/tmp/
+	adb push cl_convolution_arm /data/local/tmp/
+	adb push gc_absdiff_arm /data/local/tmp/
+	adb shell chmod 777 -R /data/local/tmp/
+
+And finally to run the example:
+
+	adb shell /data/local/tmp/neon_convolution_arm
+	adb shell /data/local/tmp/cl_convolution_arm
+	adb shell /data/local/tmp/gc_absdiff_arm
+
+For 64bit:
+
+	adb push neon_convolution_aarch64 /data/local/tmp/
+	adb push cl_convolution_aarch64 /data/local/tmp/
+	adb push gc_absdiff_aarch64 /data/local/tmp/
+	adb shell chmod 777 -R /data/local/tmp/
+
+And finally to run the example:
+
+	adb shell /data/local/tmp/neon_convolution_aarch64
+	adb shell /data/local/tmp/cl_convolution_aarch64
+	adb shell /data/local/tmp/gc_absdiff_aarch64
+
+@note Examples accept different types of arguments. To find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
+
+For example:
+	adb shell /data/local/tmp/graph_lenet --help
+
+In this case the first argument of LeNet (like all the graph examples) is the target (i.e. 0 to run on Neon, 1 to run on OpenCL if available, 2 to run on OpenCL using the CLTuner), the second argument is the path to the folder containing the npy files for the weights, and the third argument is the number of batches to run.
+
+@section S1_4_macos Building for macOS
+
+The library was successfully natively built for Apple Silicon under macOS 11.1 using clang v12.0.0.
+
+To natively compile the library with accelerated CPU support:
+
+	scons Werror=1 -j8 neon=1 opencl=0 os=macos arch=arm64-v8a build=native
+
+@note Initial support disables feature discovery through HWCAPS and thread scheduling affinity controls
+
+@section S1_5_bare_metal Building for bare metal
+
+For bare metal, the library was successfully built using Linaro's latest (gcc-linaro-6.3.1-2017.05) bare metal toolchains:
+ - arm-eabi for armv7a
+ - aarch64-elf for arm64-v8a
+
+Download the Linaro toolchain for <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/arm-eabi/">armv7a</a> and <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/aarch64-elf/">arm64-v8a</a>.
+
+@note Make sure to add the toolchains to your PATH: export PATH=$PATH:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-elf/bin:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_arm-eabi/bin
+
+@subsection S1_5_1_library How to build the library ?
+
+To cross-compile the library with Arm® Neon™ support for bare metal arm64-v8a:
+
+	scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=bare_metal arch=arm64-v8a build=cross_compile cppthreads=0 openmp=0 standalone=1
+
+@subsection S1_5_2_examples How to manually build the examples ?
+
+Examples are disabled when building for bare metal. If you want to build the examples you need to provide a custom bootcode depending on the target architecture and link against the compute library. More information about bare metal bootcode can be found <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a/index.html">here</a>.
+
+@section S1_6_windows_host Building on a Windows host system
+
+Using `scons` directly from the Windows command line is known to cause
+problems. The reason seems to be that if `scons` is set up for cross-compilation
+it gets confused about Windows style paths (using backslashes). Thus it is
+recommended to follow one of the options outlined below.
+
+@subsection S1_6_1_ubuntu_on_windows Bash on Ubuntu on Windows
+
+The best and easiest option is to use
+<a href="https://msdn.microsoft.com/en-gb/commandline/wsl/about">Ubuntu on Windows</a>.
+This feature is still marked as *beta* and thus might not be available.
+However, if it is, building the library is as simple as opening a *Bash on
+Ubuntu on Windows* shell and following the general guidelines given above.
+
+@subsection S1_6_2_cygwin Cygwin
+
+If the Windows subsystem for Linux is not available, <a href="https://www.cygwin.com/">Cygwin</a>
+can be used to install and run `scons`. Cygwin 3.0.7 or later is required. In addition
+to the default packages installed by Cygwin, `scons` has to be selected in the installer. (`git` might
+also be useful but is not strictly required if you already have got the source
+code of the library.) Linaro provides pre-built versions of
+<a href="http://releases.linaro.org/components/toolchain/binaries/">GCC cross-compilers</a>
+that can be used from the Cygwin terminal. When building for Android the
+compiler is included in the Android standalone toolchain. After everything has
+been set up in the Cygwin terminal the general guide on building the library
+can be followed.
+
+@section S1_7_cl_requirements OpenCL DDK Requirements
+
+@subsection S1_7_1_cl_hard_requirements Hard Requirements
+
+Compute Library requires OpenCL 1.1 or above with support for non-uniform workgroup sizes, which is officially supported in the Arm® Mali™ OpenCL DDK r8p0 and above as an extension (the respective extension flag is \a -cl-arm-non-uniform-work-group-size).
+
+Enabling 16-bit floating point calculations requires the \a cl_khr_fp16 extension to be supported. All Arm® Mali™ GPUs with compute capabilities have native support for half precision floating point.
+
+@subsection S1_7_2_cl_performance_requirements Performance improvements
+
+Integer dot product built-in function extensions (and therefore optimized kernels) are available with Arm® Mali™ OpenCL DDK r22p0 and above for the following GPUs : G71, G76. The relevant extensions are \a cl_arm_integer_dot_product_int8, \a cl_arm_integer_dot_product_accumulate_int8 and \a cl_arm_integer_dot_product_accumulate_int16.
+
+OpenCL kernel level debugging can be simplified with the use of printf. This requires the \a cl_arm_printf extension to be supported.
+
+SVM allocations are supported for all the underlying allocations in Compute Library. This requires OpenCL 2.0 or above.
+
+*/
+} // namespace arm_compute
diff --git a/docs/user_guide/introduction.dox b/docs/user_guide/introduction.dox
new file mode 100644
index 0000000000..a956d7dd52
--- /dev/null
+++ b/docs/user_guide/introduction.dox
@@ -0,0 +1,74 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page introduction Introduction
+
+@tableofcontents
+
+The Compute Library is a collection of low-level machine learning functions optimized for both Arm CPUs and GPUs using SIMD technologies.
+
+Several builds of the library are available using various configurations:
+ - OS: Linux, Android, macOS or bare metal.
+ - Architecture: armv7a (32bit) or arm64-v8a (64bit).
+ - Technology: Arm® Neon™ / OpenCL / Arm® Neon™ and OpenCL.
+ - Debug / Asserts / Release: Use a build with asserts enabled to debug your application and enable extra validation. Once you are sure your application works as expected you can switch to a release build of the library for maximum performance.
+
+@section S0_1_contact Contact / Support
+
+Please create an issue on <a href="https://github.com/ARM-software/ComputeLibrary/issues">GitHub</a>.
+
+In order to facilitate the work of the support team please provide the build information of the library you are using. To get the version of the library you are using simply run:
+
+    $ strings android-armv7a-cl-asserts/libarm_compute.so | grep arm_compute_version
+    arm_compute_version=v16.12 Build options: {'embed_kernels': '1', 'opencl': '1', 'arch': 'armv7a', 'neon': '0', 'asserts': '1', 'debug': '0', 'os': 'android', 'Werror': '1'} Git hash=f51a545d4ea12a9059fe4e598a092f1fd06dc858
+
+@section S0_2_prebuilt_binaries Pre-built binaries
+
+For each release we provide some pre-built binaries of the library [here](https://github.com/ARM-software/ComputeLibrary/releases).
+
+These binaries have been built using the following toolchains:
+            - Linux armv7a: gcc-linaro-7.2.1-2017.11-x86_64_arm-linux-gnueabihf
+            - Linux arm64-v8a: gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu
+            - Android armv7a: clang++ / libc++ NDK r18b
+            - Android arm64-v8a: clang++ / libc++ NDK r20b
+
+@warning Make sure to use a compatible toolchain to build your application or you will get some std::bad_alloc errors at runtime.
+
+@section S0_3_file_organisation File organisation
+
+This archive contains:
+ - The arm_compute header and source files
+ - The latest Khronos OpenCL 1.2 C headers from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a>
+ - The latest Khronos cl2.hpp from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a> (API version 2.1 when this document was written)
+ - The latest Khronos EGL 1.5 C headers from the <a href="https://www.khronos.org/registry/gles/">Khronos EGL registry</a>
+ - The sources for a stub version of libOpenCL.so, libGLESv1_CM.so, libGLESv2.so and libEGL.so to help you build your application.
+ - An examples folder containing a few examples to compile and link against the library.
+ - A @ref utils folder containing headers with some boilerplate code used by the examples.
+ - This documentation.
+
+ For detailed information about file organization, please refer to the Files -> File List section of this documentation.
+
+*/
+} // namespace arm_compute
diff --git a/docs/user_guide/library.dox b/docs/user_guide/library.dox
new file mode 100644
index 0000000000..2e3cc967ea
--- /dev/null
+++ b/docs/user_guide/library.dox
@@ -0,0 +1,402 @@
+///
+/// Copyright (c) 2017-2020 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page architecture Library Architecture
+
+@tableofcontents
+
+@section S4_1_1 Core vs Runtime libraries
+
+The Core library is a low-level collection of algorithm implementations. It is designed to be embedded in existing projects and applications:
+
+- It doesn't allocate any memory (All the memory allocations/mappings have to be handled by the caller).
+- It doesn't perform any kind of multi-threading (but provides information to the caller about how the workload can be split).
+
+The Runtime library is a very basic wrapper around the Core library which can be used for quick prototyping. It is basic in the sense that:
+
+- It allocates images and tensors by using standard malloc().
+- It multi-threads Arm® Neon™ code in a very basic way using a very simple pool of threads.
+- For OpenCL it uses the default CLScheduler command queue for all mapping operations and kernels.
+
+For maximum performance, users are expected to re-implement an equivalent of the Runtime library that better suits their needs (with a more clever multi-threading strategy, load-balancing between Arm® Neon™ and OpenCL, etc.).
+
+@section S4_1_3 Fast-math support
+
+Compute Library supports different types of convolution methods; the fast-math flag is only used for the Winograd algorithm.
+When the fast-math flag is enabled, both Arm® Neon™ and CL convolution layers will try to dispatch the fastest implementation available, which may introduce a drop in accuracy as well. The different scenarios involving the fast-math flag are presented below:
+- For FP32:
+    - no-fast-math: Only supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7
+    - fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
+- For FP16:
+    - no-fast-math: No Winograd support
+    - fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
+
+@section S4_1_4 Thread-safety
+
+Although the library supports multi-threading during workload dispatch, parallelizing the execution of a workload across multiple threads, the current runtime module implementation is not thread-safe in the sense of executing different functions from separate threads.
+This is because the provided scheduling mechanism wasn't designed with thread-safety in mind.
+As with the rest of the runtime library, a custom scheduling mechanism can be implemented to account for thread-safety if needed, and injected as the library's default scheduler.
+
+@section S4_5_algorithms Algorithms
+
+All computer vision algorithms in this library have been implemented following the [OpenVX 1.1 specifications](https://www.khronos.org/registry/vx/specs/1.1/html/). Please refer to the Khronos documentation for more information.
+
+@section S4_6_images_tensors Images, padding, border modes and tensors
+
+Most kernels and functions in the library process images. However, to be future-proof, most of the kernels actually accept tensors. See below for more information about how they are related.
+
+@attention Each memory object can be written by only one kernel, however it can be read by several kernels. Writing to the same object from several kernels will result in undefined behavior. The kernel writing to an object must be configured before the kernel(s) reading from it.
+
+@subsection S4_6_1_padding_and_border Padding and border modes
+
+Several algorithms require a neighborhood around the current pixel to compute its value. This means the algorithm will not be able to process the borders of the image unless you give it more information about how those border pixels should be processed. The @ref BorderMode enum is used for this purpose.
+
+You have 3 types of @ref BorderMode :
+
+- @ref BorderMode::UNDEFINED : Neighbor pixels outside of the image are treated as undefined. As a result all the pixels which are on the border will have a value which is undefined.
+- @ref BorderMode::REPLICATE : Neighbor pixels outside of the image are treated as having the same value as the closest valid pixel.
+- @ref BorderMode::CONSTANT : Neighbor pixels outside of the image are treated as having the same constant value. (The user can choose what this value should be).
+
+Moreover, both OpenCL and Arm® Neon™ use vector load and store instructions to access the data in buffers, so to avoid special-casing the borders, all the images and tensors used in this library must be padded.
+
+@subsubsection padding Padding
+
+There are different ways padding can be calculated:
+
+- Accurate padding:
+
+@note It's important to call allocate @b after the function is configured: if the image / tensor is already allocated then the function will shrink its execution window instead of increasing the padding. (See below for more details).
+
+- Manual padding / no padding / auto padding: You can allocate your images / tensors up front (before configuring your functions). In that case the function will use whatever padding is available and will shrink its execution window if there isn't enough padding available (which translates into a smaller valid region for the output). See also @ref valid_region.
+If you don't want to manually set the padding but still want to allocate your objects upfront then you can use auto_padding. It guarantees that the allocation will have enough padding to run any of the provided functions.
+
+@code{.cpp}
+Image     src, dst;
+
+// Use auto padding for the input:
+src.info()->init_auto_padding(TensorShape(640u,480u), Format::U8);
+
+// Use manual padding for the destination image
+dst.info()->init(src.info()->tensor_shape(), Format::U8, strides_in_bytes, offset_first_element_in_bytes, total_size_in_bytes);
+
+// Allocate all the images
+src.allocator()->allocate();
+dst.allocator()->allocate();
+// Fill the input image with the content of the PPM image if a filename was provided:
+fill_image(src);
+
+NEGaussian3x3 gauss;
+
+// Apply a Gaussian 3x3 filter to the source image (Note: if the padding provided is not enough then the execution window and valid region of the output will be shrunk)
+gauss.configure(&src, &dst, BorderMode::UNDEFINED);
+
+//Execute the functions:
+gauss.run();
+@endcode
+
+@warning Some kernels need up to 3 neighbor values to calculate the value of a given pixel. Therefore, to be safe, we use a 4-pixel padding all around the image. In addition, some kernels read and write up to 32 pixels at the same time. To cover that case as well we add an extra 32 pixels of padding at the end of each row. As a result auto padded buffers waste a lot of memory and are less cache friendly. It is therefore recommended to use accurate padding or manual padding wherever possible.
+
+@subsubsection valid_region Valid regions
+
+Some kernels (like edge detectors for example) need to read values of neighboring pixels to calculate the value of a given pixel. It is therefore not possible to calculate the values of the pixels on the edges.
+
+Another case is: if a kernel processes 8 pixels per iteration and the image's dimensions are not a multiple of 8 and not enough padding is available then the kernel will not be able to process the pixels near the right edge. As a result these pixels will be left undefined.
+
+In order to know which pixels have been calculated, each kernel sets a valid region for each output image or tensor. See also @ref TensorInfo::valid_region(), @ref ValidRegion.
+
+@subsection S4_6_2_tensors Tensors
+
+Tensors are multi-dimensional arrays with a maximum of @ref Coordinates::num_max_dimensions dimensions.
+
+Depending on the number of dimensions, tensors can be interpreted as various objects. A scalar can be represented as a zero-dimensional tensor and a vector of numbers can be represented as a one-dimensional tensor. Further, an image is actually just a 2D tensor, a 3D tensor can be seen as an array of images and a 4D tensor as a 2D array of images, etc.
+
+@note Most algorithms process images (i.e. a 2D slice of the tensor), therefore only padding along the X and Y axes is required (2D slices can be stored contiguously in memory).
+
+@subsection S4_6_3_description_conventions Images and Tensors description conventions
+
+Image objects are defined by a @ref Format and dimensions expressed as [width, height, batch].
+
+Tensors are defined by a @ref DataType plus a number of channels (Always expected to be 1 for now) and their dimensions are expressed as [width, height, feature_maps, batch].
+
+In other words, the lower three dimensions of a tensor specify a single input in [width, height, feature_maps], while any other specified dimension represents a batch in the appropriate dimension space.
+For example, a tensor with dimensions [128, 128, 64, 16] represents a 1D batch space with 16 batches of 128 elements in width and height and 64 feature maps each.
+Each kernel specifies the expected layout of each of its tensors in its documentation.
+
+@note Unless specified otherwise in the kernel's or function's documentation, all tensor and image parameters passed must have identical dimensions.
+
+@note Unless specified otherwise in the kernel's or function's documentation the number of channels for tensors is expected to be 1 (For images, the number of channels is inferred from the @ref Format).
+
+@attention Regardless of the @ref DataType used by a tensor the @ref ITensor::buffer() method will always return a uint8_t pointer, and all the metadata in @ref TensorInfo will be expressed in bytes. It is the user's responsibility to cast the pointer to the correct type.
+
+For example, to read the element located at the coordinates (x,y) of a float tensor:
+
+@code{.cpp}
+float value = *reinterpret_cast<float*>(input.buffer() + input.info()->offset_element_in_bytes(Coordinates(x,y)));
+@endcode
+
+@subsection S4_6_4_working_with_objects Working with Images and Tensors using iterators
+
+The library provides some iterators to access objects' data.
+Iterators are created by associating a data object (An image or a tensor for example) with an iteration window.
+
+Iteration windows are defined by an array of dimensions, each of which consists of a start, end and step.
+
+The @ref execute_window_loop function takes an execution window, a lambda function and one or more iterators.
+It will iterate through every element of the execution window and for each element it will update the iterators accordingly and call the lambda function.
+
+Here are a couple of examples of how to use the iterators to fill / read tensors:
+
+@snippet examples/neon_copy_objects.cpp Copy objects example
+
+@subsection S4_6_5_sub_tensors Sub-tensors
+
+Sub-tensors are aliases to existing Tensors; as a result, creating a sub-tensor does not result in any underlying memory allocation.
+
+Sub-tensors can be used to access a sub-set of the parent tensor, something that can be useful in case different operations need to be performed on different parts of a tensor.
+
+Moreover, sub-tensors can be used to perform zero copy tensor concatenation.
+
+The API for creating a sub-tensor is the following:
+@code{.cpp}
+SubTensor(ITensor *parent, const TensorShape &tensor_shape, const Coordinates &coords)
+@endcode
+
+Where \a parent is the parent tensor which we want to create an alias for, \a tensor_shape is the shape of the sub-tensor and \a coords are the starting indexing coordinates of the sub-tensor within the parent tensor.
+
+@note Two sub-tensor concrete classes for different targets are currently supported: @ref CLSubTensor and @ref SubTensor
+
+@warning A limitation of sub-tensors is that they cannot be extracted spatially, meaning sub-tensors should have the same width and height as the parent tensor. The main reason for this is that individual kernels might need to operate with a step size that is not a multiple of the sub-tensor spatial dimension. This could lead to elements being overwritten by different kernels operating on different sub-tensors of the same underlying tensor.
+
+@section S4_7_memory_manager MemoryManager
+
+@ref IMemoryManager is a memory managing interface that can be used to reduce the memory requirements of a given pipeline by recycling temporary buffers.
+
+@subsection S4_7_1_memory_manager_components MemoryGroup, MemoryPool and MemoryManager Components
+
+@subsubsection S4_7_1_1_memory_group MemoryGroup
+
+@ref IMemoryGroup defines the memory managing granularity.
+
+MemoryGroup binds a number of objects to a bucket of memory requirements that need to be fulfilled in order for an operation or list of operations to be executed.
+
+Requesting backing memory for a specific group can be done using @ref IMemoryGroup::acquire and releasing the memory back using @ref IMemoryGroup::release.
+
+@subsubsection S4_7_1_2_memory_pool MemoryPool
+
+@ref IMemoryPool defines a pool of memory that can be used to provide backing memory to a memory group.
+
+@note @ref BlobMemoryPool is currently implemented which models the memory requirements as a vector of distinct memory blobs.
+
+@subsubsection S4_7_1_2_memory_manager_components MemoryManager Components
+
+@ref IMemoryManager consists of two components:
+- @ref ILifetimeManager that keeps track of the lifetime of the registered objects of the memory groups and given an @ref IAllocator creates an appropriate memory pool that fulfils the memory requirements of all the registered memory groups.
+- @ref IPoolManager that safely manages the registered memory pools.
+
+@note @ref BlobLifetimeManager is currently implemented which models the memory requirements as a vector of distinct memory blobs.
+
+@subsection S4_7_2_working_with_memory_manager Working with the Memory Manager
+Using a memory manager to reduce the memory requirements of a pipeline can be summed up in the following steps:
+
+Initially a memory manager must be set up:
+@code{.cpp}
+Allocator  allocator{};                                                               // Create an allocator to use for the backing memory allocation
+auto lifetime_mgr  = std::make_shared<BlobLifetimeManager>();                         // Create Lifetime Manager
+auto pool_mgr      = std::make_shared<PoolManager>();                                 // Create Pool Manager
+auto mm            = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr); // Create Memory Manager
+@endcode
+
+Once done, memory groups can be registered to use the memory manager:
+@code{.cpp}
+MemoryGroup memory_group(mm); // Create a memory group and set the memory manager to use
+@endcode
+
+@note If a memory manager is not specified then all allocation will be immediate instead of deferred through the memory manager.
+
+The next step is to set the objects to be managed by the memory group. Note that the lifetime of an object is tracked between the @ref MemoryGroup::manage() and the @ref TensorAllocator::allocate calls.
+@ref MemoryGroup::manage flags that the object will be needed starting now, and calling @ref TensorAllocator::allocate signals the end of the object's lifetime.
+@code{.cpp}
+Tensor tmp1, tmp2, tmp3;            // Create example tensors
+memory_group.manage(&tmp1);         // Start managing object tmp1 and start its lifetime
+memory_group.manage(&tmp2);         // Start managing object tmp2 and start its lifetime
+
+operation1.configure(&tmp1, &tmp2); // Configure a function/kernel using tmp1 and tmp2
+
+tmp1.allocator()->allocate();       // Flag that the lifetime of object tmp1 has ended
+
+memory_group.manage(&tmp3);         // Start managing object tmp3 and start its lifetime
+
+operation2.configure(&tmp2, &tmp3); // Configure a function/kernel using tmp2 and tmp3
+
+tmp2.allocator()->allocate();       // Flag that the lifetime of object tmp2 has ended
+tmp3.allocator()->allocate();       // Flag that the lifetime of object tmp3 has ended
+@endcode
+
+@warning The configuration step should be done sequentially by a single thread so that all the lifetimes are captured correctly.
+
+When configuration of all the operations is finished, the memory manager has to be populated:
+@code{.cpp}
+mm->populate(allocator, 2 /* num_pools */); // Populate memory manager pools
+@endcode
+
+Finally, during execution of the pipeline the memory of the appropriate memory group should be requested before running:
+@code{.cpp}
+memory_group.acquire(); // Request memory for the group
+
+operation1.run();       // Run operation1
+operation2.run();       // Run operation2
+
+memory_group.release(); // Release memory so that it can be reused
+@endcode
+@note Execution of a pipeline can be done in a multi-threading environment as memory acquisition/release are thread safe.
+@note If you are handling sensitive data and it's required to zero out the memory buffers before freeing, make sure to also zero out the intermediate buffers. You can access the buffers through the memory group's mappings.
+
+@subsection S4_7_3_memory_manager_function_support Function support
+
+Most of the library's functions have been ported to use @ref IMemoryManager for their internal temporary buffers.
+
+If that is the case, a memory manager can be passed to them during construction to reuse memory among these functions.
+@code{.cpp}
+// Setup Memory Manager
+CLBufferAllocator  allocator{};                                                       // Create an allocator to use for the backing memory allocation
+auto lifetime_mgr  = std::make_shared<BlobLifetimeManager>();                         // Create Lifetime Manager
+auto pool_mgr      = std::make_shared<PoolManager>();                                 // Create Pool Manager
+auto mm            = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr); // Create Memory Manager
+
+// Create two convolution layers and use the memory manager to manage their internal temporary buffers
+CLConvolutionLayer conv1(mm), conv2(mm);
+
+// Configure layers
+conv1.configure(...);
+conv2.configure(...);
+
+// Populate memory manager
+mm->populate(allocator, 1 /* num_pools */); // Populate memory manager pools
+
+// Run layers (memory will be recycled for the internal buffers of conv1 and conv2)
+conv1.run();
+conv2.run();
+@endcode
+
+@section S4_8_import_memory Import Memory Interface
+
+The implemented @ref TensorAllocator and @ref CLTensorAllocator objects provide an interface capable of importing existing memory to a tensor as backing memory.
+
+A simple Arm® Neon™ example can be the following:
+@code{.cpp}
+// External backing memory
+void* external_ptr = ...;
+
+// Create and initialize tensor
+Tensor tensor;
+tensor.allocator()->init(tensor_info);
+
+// Import existing pointer as backing memory
+tensor.allocator()->import_memory(external_ptr);
+@endcode
+
+It is important to note the following:
+- Ownership of the backing memory is not transferred to the tensor itself.
+- The tensor mustn't be memory managed.
+- Padding requirements should be accounted by the client code. In other words, if padding is required by the tensor after the function configuration step, then the imported backing memory should account for it. Padding can be checked through the @ref TensorInfo::padding() interface.
+
+@section S4_9_opencl_tuner OpenCL Tuner
+
+OpenCL kernels when dispatched to the GPU take two arguments:
+- The Global Workgroup Size (GWS): That's the number of times to run an OpenCL kernel to process all the elements we want to process.
+- The Local Workgroup Size (LWS): That's the number of elements we want to run in parallel on a GPU core at a given point in time.
+
+The LWS can be required by an algorithm (For example if it contains memory barriers or uses local memory) but it can also be used for performance reasons to tweak the performance of a kernel: the execution time of the overall kernel might vary significantly depending on how the GWS is broken down.
+
+However, there is no universal rule regarding which LWS is best for a given kernel, so instead we created the @ref CLTuner.
+
+When the @ref CLTuner is enabled ( Target = 2 for the graph examples), the first time an OpenCL kernel is executed the Compute Library will try to run it with a variety of LWS values and will remember which one performed best for subsequent runs. At the end of the run the @ref graph::Graph will try to save these tuning parameters to a file.
+
+However, this process takes quite a lot of time, which is why it cannot be enabled all the time. @ref CLTuner supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
+
+However, when the @ref CLTuner is disabled (Target = 1 for the graph examples), the @ref graph::Graph will try to reload the file containing the tuning parameters. Then, for each executed kernel, the Compute Library will use the fine-tuned LWS if it is present in the file, or a default LWS value otherwise.
+
+@section S4_10_cl_queue_prioritites OpenCL Queue Priorities
+
+OpenCL 2.1 exposes the `cl_khr_priority_hints` extension which, if supported by the underlying implementation, allows the user to specify priority hints for the created command queues.
+It is important to note that this does not specify guarantees or the explicit scheduling behavior; this is something that each implementation needs to expose.
+
+In some cases, priority queues can be used when there is an implicit internal priority between graphics and compute queues and thus allow some level of priority control between them.
+At the moment three priority levels can be specified:
+- CL_QUEUE_PRIORITY_HIGH_KHR
+- CL_QUEUE_PRIORITY_MED_KHR
+- CL_QUEUE_PRIORITY_LOW_KHR
+
+Compute Library allows extraction of the internal OpenCL queue or the ability to inject directly a user-defined queue to the @ref CLScheduler.
+This way the user can utilize this extension to define priorities between the queues and setup the OpenCL scheduler mechanism to utilize them.
+
+@code{.cpp}
+cl_queue_properties queue_properties[] = {CL_QUEUE_PRIORITY_KHR, CL_QUEUE_PRIORITY_HIGH_KHR, 0};
+cl_command_queue priority_queue = clCreateCommandQueueWithProperties(ctx, dev, queue_properties, &error);
+CLScheduler::get().set_queue(::cl::CommandQueue(priority_queue));
+@endcode
+
+@section S4_11_weights_manager Weights Manager
+
+@ref IWeightsManager is a weights managing interface that can be used to reduce the memory requirements of a given pipeline by reusing transformed weights across multiple function executions.
+@ref IWeightsManager is responsible for managing weight tensors along with their transformations.
+@ref ITransformWeights provides an interface for running the desired transform function. This interface is used by the weights manager.
+
+@subsection S4_10_1_working_with_weights_manager Working with the Weights Manager
+Following is a simple example that uses the weights manager:
+
+Initially a weights manager must be set-up:
+@code{.cpp}
+auto  wm = std::make_shared<IWeightsManager>(); // Create a weights manager
+@endcode
+
+Once done, weights can be managed, configured and run:
+@code{.cpp}
+wm->manage(weights); // Manage the weights
+wm->acquire(weights, &_reshape_weights_managed_function); // Acquire the address of the transformed weights based on the transform function
+wm->run(weights, &_reshape_weights_managed_function);     // Run the transpose function
+@endcode
+
+@section S5_0_experimental Experimental Features
+
+@subsection S5_1_run_time_context Run-time Context
+
+Some of the Compute Library components are modelled as singletons thus posing limitations to supporting some use-cases and ensuring a more client-controlled API.
+Thus, we are introducing an aggregate service interface @ref IRuntimeContext which will encapsulate the services that the singletons were providing and allow better control of these by the client code.
+Run-time context encapsulates a list of mechanisms, some of them are: scheduling, memory management, kernel caching and others.
+Consequently, this will allow finer control of these services among pipelines when Compute Library is integrated in higher level frameworks.
+
+This feature introduces some changes to our API.
+All the kernels/functions will now accept a Runtime Context object which will allow the function to use the mentioned services.
+
+Finally, we will try to adapt our code-base progressively to use the new mechanism but will continue supporting the legacy mechanism to allow a smooth transition. Changes will apply to all three of our backends: Neon, OpenCL and OpenGL ES.
+
+@subsection S5_2_clvk CLVK
+
+Compute Library offers experimental support for [CLVK](https://github.com/kpet/clvk). If CLVK is installed in the system, users can select the backend when running a graph example with --target=clvk.
+If no target is specified and more than one OpenCL implementation is present, Compute Library will pick the first available.
+*/
+} // namespace arm_compute
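
The byte-addressed element access described in the "Images and Tensors description conventions" subsection of library.dox above can be illustrated without the library. What follows is a minimal, self-contained sketch and not Compute Library code: `PaddedTensor2D`, `make_tensor`, `read_element` and `write_element` are hypothetical names, and the layout (row-major float data with symmetric padding) is an assumption made for the example. Only the idea comes from the text: storage is a `uint8_t` buffer, strides and offsets are expressed in bytes, and the caller converts to the element type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a padded 2D float tensor (not a Compute Library type).
struct PaddedTensor2D
{
    std::size_t               width{};        // valid elements per row, padding excluded
    std::size_t               height{};       // valid rows, padding excluded
    std::size_t               stride_x{};     // bytes between horizontally adjacent elements
    std::size_t               stride_y{};     // bytes between vertically adjacent elements
    std::size_t               offset_first{}; // byte offset of element (0, 0)
    std::vector<std::uint8_t> bytes;          // raw storage, always addressed in bytes
};

// Allocate storage with pad_x / pad_y padding elements on every side (assumed layout).
PaddedTensor2D make_tensor(std::size_t width, std::size_t height, std::size_t pad_x, std::size_t pad_y)
{
    PaddedTensor2D t;
    t.width        = width;
    t.height       = height;
    t.stride_x     = sizeof(float);
    t.stride_y     = (width + 2 * pad_x) * sizeof(float);
    t.offset_first = pad_y * t.stride_y + pad_x * t.stride_x;
    t.bytes.assign((height + 2 * pad_y) * t.stride_y, 0);
    return t;
}

// Byte offset of the element at (x, y), measured from the start of the buffer.
std::size_t offset_element_in_bytes(const PaddedTensor2D &t, std::size_t x, std::size_t y)
{
    assert(x < t.width && y < t.height);
    return t.offset_first + y * t.stride_y + x * t.stride_x;
}

// The caller is responsible for interpreting the bytes as float.
float read_element(const PaddedTensor2D &t, std::size_t x, std::size_t y)
{
    float value = 0.f;
    std::memcpy(&value, t.bytes.data() + offset_element_in_bytes(t, x, y), sizeof(float));
    return value;
}

void write_element(PaddedTensor2D &t, std::size_t x, std::size_t y, float value)
{
    std::memcpy(t.bytes.data() + offset_element_in_bytes(t, x, y), &value, sizeof(float));
}
```

The sketch copies through `std::memcpy` rather than casting the pointer, which sidesteps alignment and aliasing questions in a standalone example. The documentation's own snippet uses `reinterpret_cast` on the buffer returned by the library.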
diff --git a/docs/user_guide/programming_model.dox b/docs/user_guide/programming_model.dox
new file mode 100644
index 0000000000..7990231ba9
--- /dev/null
+++ b/docs/user_guide/programming_model.dox
@@ -0,0 +1,70 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page programming_model Programming Model
+
+@tableofcontents
+
+@section programming_model_functions Functions
+
+Functions will automatically allocate the temporary buffers mentioned above, and will automatically multi-thread kernels' executions using the very basic scheduler described in the previous section.
+
+Simple functions only call a single kernel (e.g. NEConvolution3x3), while more complex ones consist of several kernels pipelined together (e.g. @ref NEFullyConnectedLayer ). Check their documentation to find out which kernels are used by each function.
+
+@code{.cpp}
+//Create a function object:
+MyFunction function;
+// Initialize the function with the input/output and options you want to use:
+function.configure( input, output, option0, option1);
+// Execute the function:
+function.run();
+@endcode
+
+@warning The Compute Library requires Arm® Mali™ OpenCL DDK r8p0 or higher (OpenCL kernels are compiled using the -cl-arm-non-uniform-work-group-size flag)
+
+@note All OpenCL functions and objects in the runtime library use the command queue associated with CLScheduler for all operations. A real implementation would be expected to use different queues for mapping operations and kernels in order to reach a better GPU utilization.
+
+@section programming_model_scheduler OpenCL Scheduler
+
+The Compute Library runtime uses a single command queue and context for all the operations.
+
+The user can get / set this context and command queue through CLScheduler's interface.
+
+The user can get / set the target GPU device through the CLScheduler's interface.
+
+@attention Make sure the application is using the same context as the library as in OpenCL it is forbidden to share objects across contexts. This is done by calling @ref CLScheduler::init() or @ref CLScheduler::default_init() at the beginning of your application.
+
+@attention Make sure the scheduler's target is not changed after function classes are created.
+
+@section programming_model__events_sync OpenCL events and synchronization
+
+In order to block until all the jobs in the CLScheduler's command queue are done executing, the user can call @ref CLScheduler::sync() or create a sync event using @ref CLScheduler::enqueue_sync_event().
+
+@section programming_model_cl_neon OpenCL / Arm® Neon™ interoperability
+
+You can mix OpenCL and Arm® Neon™ kernels and functions. However it is the user's responsibility to handle the mapping/unmapping of OpenCL objects.
+*/
+} // namespace arm_compute
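
The configure-then-run usage described in the "Functions" section of programming_model.dox above can be mimicked in plain C++. This is a minimal sketch under stated assumptions: `ScaleFunction` is a hypothetical class that is not part of the library, and it uses `std::vector<float>` in place of tensors. It only illustrates the split the text describes, where `configure()` records inputs, outputs and options once, and `run()` does the work and can be called repeatedly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical function object; it only mimics the configure()/run() split.
class ScaleFunction
{
public:
    // Record the input, the output and the option. No computation happens here.
    void configure(const std::vector<float> *input, std::vector<float> *output, float factor)
    {
        assert(input != nullptr && output != nullptr);
        _input      = input;
        _output     = output;
        _factor     = factor;
        _configured = true;
    }

    // Execute the work on whatever the input currently holds.
    void run()
    {
        assert(_configured);
        _output->resize(_input->size());
        for(std::size_t i = 0; i < _input->size(); ++i)
        {
            (*_output)[i] = (*_input)[i] * _factor;
        }
    }

    bool is_configured() const
    {
        return _configured;
    }

private:
    const std::vector<float> *_input{ nullptr };
    std::vector<float>       *_output{ nullptr };
    float                     _factor{ 1.f };
    bool                      _configured{ false };
};
```

Because the object keeps pointers to its input and output, the caller can refill the input and call `run()` again without reconfiguring, which is the usual reason for separating the two steps.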
diff --git a/docs/user_guide/release_version_and_change_log.dox b/docs/user_guide/release_version_and_change_log.dox
new file mode 100644
index 0000000000..b9e3b37263
--- /dev/null
+++ b/docs/user_guide/release_version_and_change_log.dox
@@ -0,0 +1,1389 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page versions_changelogs Release Versions and Changelog
+
+@tableofcontents
+
+@section S2_1_versions Release versions
+
+All releases are numbered vYY.MM, where YY is the last two digits of the year and MM is the month number.
+If there is more than one release in a month, an extra sequential number is appended at the end:
+
+	v17.03 (First release of March 2017)
+	v17.03.1 (Second release of March 2017)
+	v17.04 (First release of April 2017)
+
+@note We aim to publish one major public release with new features per quarter. All releases in between will only contain bug fixes.
+
+@section S2_2_changelog Changelog
+
+v21.05 Public major release
+ - Removed computer vision support from Arm® Neon™ backend
+ - Removed the following functions:
+   - NEAbsoluteDifference
+   - NEAccumulate
+   - NEBox3x3
+   - NECannyEdge
+   - NEChannelCombine
+   - NEChannelExtract
+   - NEColorConvert
+   - NEConvolution
+   - NEDerivative
+   - NEDilate
+   - NEEqualizeHistogram
+   - NEErode
+   - NEFastCorners
+   - NEGaussian3x3
+   - NEGaussian5x5
+   - NEGaussianPyramid
+   - NEHOGDescriptor
+   - NEHOGDetector
+   - NEHOGGradient
+   - NEHOGMultiDetection
+   - NEHarrisCorners
+   - NEHistogram
+   - NEIntegralImage
+   - NELaplacianPyramid
+   - NELaplacianReconstruct
+   - NEMagnitude
+   - NEMeanStdDev
+   - NEMedian3x3
+   - NEMinMaxLocation
+   - NENonLinearFilter
+   - NEOpticalFlow
+   - NEPhase
+   - NEScharr3x3
+   - NESobel3x3
+   - NESobel5x5
+   - NESobel7x7
+   - NETableLookup
+   - NEThreshold
+   - NEWarpAffine
+   - NEWarpPerspective
+
+ - Removed all GLES kernels / functions / tests / examples
+ - Removed computer vision support from CL backend
+ - Removed the following functions:
+   - CLAbsoluteDifference
+   - CLAccumulate
+   - CLBox3x3
+   - CLCannyEdge
+   - CLChannelCombine
+   - CLChannelExtract
+   - CLColorConvert
+   - CLConvolution
+   - CLDerivative
+   - CLDilate
+   - CLEqualizeHistogram
+   - CLErode
+   - CLFastCorners
+   - CLGaussian3x3
+   - CLGaussian5x5
+   - CLGaussianPyramid
+   - CLHOGDescriptor
+   - CLHOGDetector
+   - CLHOGGradient
+   - CLHOGMultiDetection
+   - CLHarrisCorners
+   - CLHistogram
+   - CLIntegralImage
+   - CLLaplacianPyramid
+   - CLLaplacianReconstruct
+   - CLMagnitude
+   - CLMeanStdDev
+   - CLMedian3x3
+   - CLMinMaxLocation
+   - CLNonLinearFilter
+   - CLOpticalFlow
+   - CLPhase
+   - CLScharr3x3
+   - CLSobel3x3
+   - CLSobel5x5
+   - CLSobel7x7
+   - CLTableLookup
+   - CLThreshold
+   - CLWarpAffine
+   - CLWarpPerspective
+ 
+v21.02 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Upgrade C++ standard to C++14
+ - Add macOS support
+ - Add Armv8-R AArch64 architecture support
+ - Add SVE/SVE2 support for:
+   - NEScaleKernel
+   - @ref NEActivationLayer
+   - @ref NEArithmeticAddition
+   - @ref NEBatchNormalizationLayerKernel
+   - @ref cpu::kernels::CpuLogits1DSoftmaxKernel
+   - @ref cpu::kernels::CpuLogits1DMaxKernel
+   - @ref cpu::kernels::CpuElementwiseUnaryKernel
+ - Remove padding from OpenCL kernels:
+   - CLDirectConvolutionLayerKernel
+   - @ref CLArgMinMaxLayerKernel
+   - @ref CLPadLayerKernel
+   - @ref CLROIAlignLayerKernel
+   - @ref CLRangeKernel
+   - CLScaleKernel
+   - @ref CLSelectKernel
+   - @ref CLBitwiseKernel
+   - @ref opencl::kernels::ClFloorKernel
+   - CLTransposeKernel
+ - Deprecate functions in CLTuner:
+    - add_lws_to_table
+    - import_lws_table
+    - lws_table
+ - Remove functions:
+   - NELocallyConnectedLayer / CLLocallyConnectedLayer
+   - NEIm2Col
+   - NECol2Im
+   - NEGEMMInterleave4x4
+   - NEGEMMTranspose1xW
+   - NEComputeAllAnchors / CLComputeAllAnchors
+   - NEGEMMAssemblyDispatch
+   - NEUpsampleLayer / CLUpsampleLayer
+ - Remove kernels:
+   - NEGEMMMatrixVectorMultiplyKernel
+   - NELocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedMatrixMultiplyKernel
+   - NEUpsampleLayerKernel / CLUpsampleLayerKernel
+ - Extend OpenCL tuner with workgroup batch size support
+   - Experimental extension for the OpenCL tuner to tune the batches of work groups distributed to compute units
+ - Add functionality to load the OpenCL GEMM heuristics at runtime
+   - The GEMM heuristic file (MLGO) can be used to update the default GEMM heuristics available for OpenCL
+ - Note: there might be performance regressions against v20.08 in Inception v3 using int8 data types on Arm Mali-G77 GPUs. Currently under investigation
+ - Note: data-type decoupling is in progress and experimental. Warnings about unused symbols might be raised
+
+v20.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Performance regressions can be noted when executing Depthwise Convolution on Arm® Neon™ with a depth multiplier > 1 for quantized data types.
+   This is planned to be resolved in the 21.02 release.
+ - Added new data type QASYMM8_SIGNED support for @ref NEROIAlignLayer.
+ - Added new data type S32 support for:
+   - NEArithmeticSubtraction
+   - NEArithmeticSubtractionKernel
+   - @ref NEPixelWiseMultiplication
+   - NEPixelWiseMultiplicationKernel
+   - NEElementwiseDivision
+   - NEDivisionOperationKernel
+ - Interface change
+   - Properly support softmax axis to have the same meaning as other major frameworks. That is, axis now defines the dimension
+     on which Softmax/Logsoftmax is performed. E.g. for input of shape 4x5x6 and axis=1, softmax will be applied to 4x6=24 vectors of size 5.
+     The supported value range of axis is [-rank, rank).
+     This change applies to the following functions:
+      - @ref NESoftmaxLayer
+      - @ref NELogSoftmaxLayer
+      - @ref CLSoftmaxLayer
+      - @ref CLLogSoftmaxLayer
+      - GCSoftmaxLayer
+ - New OpenCL kernels / functions:
+   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
+   - @ref CLLogicalNot
+   - @ref CLLogicalAnd
+   - @ref CLLogicalOr
+ - New Arm® Neon™ kernels / functions:
+   - @ref NELogicalNot
+   - @ref NELogicalAnd
+   - @ref NELogicalOr
+ - Removed padding from Arm® Neon™ kernels:
+   - NEComplexPixelWiseMultiplicationKernel
+   - NENonMaximaSuppression3x3Kernel
+   - @ref NERemapKernel
+   - @ref NEGEMMInterleave4x4Kernel
+   - NEDirectConvolutionLayerKernel
+   - NEScaleKernel
+   - NELocallyConnectedMatrixMultiplyKernel
+   - @ref NEGEMMLowpOffsetContributionKernel
+   - @ref NEGEMMTranspose1xWKernel
+   - NEPoolingLayerKernel
+   - NEConvolutionKernel
+   - NEDepthwiseConvolutionLayerNativeKernel
+   - @ref NEGEMMLowpMatrixMultiplyKernel
+   - @ref NEGEMMMatrixMultiplyKernel
+   - NEDirectConvolutionLayerOutputStageKernel
+   - @ref NEReductionOperationKernel
+   - @ref NEGEMMLowpMatrixAReductionKernel
+   - @ref NEGEMMLowpMatrixBReductionKernel
+ - Removed padding from OpenCL kernels:
+   - CLBatchConcatenateLayerKernel
+   - CLElementwiseOperationKernel
+   - @ref CLBatchNormalizationLayerKernel
+   - CLPoolingLayerKernel
+   - @ref CLWinogradInputTransformKernel
+   - @ref CLGEMMLowpMatrixMultiplyNativeKernel
+   - @ref CLGEMMLowpMatrixAReductionKernel
+   - @ref CLGEMMLowpMatrixBReductionKernel
+   - @ref CLGEMMLowpOffsetContributionOutputStageKernel
+   - @ref CLGEMMLowpOffsetContributionKernel
+   - @ref CLWinogradOutputTransformKernel
+   - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
+   - @ref CLFuseBatchNormalizationKernel
+   - @ref CLDepthwiseConvolutionLayerNativeKernel
+   - @ref CLDepthConvertLayerKernel
+   - CLCopyKernel
+   - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
+   - CLActivationLayerKernel
+   - @ref CLWinogradFilterTransformKernel
+   - CLWidthConcatenateLayerKernel
+   - CLWidthConcatenate4TensorsKernel
+   - CLWidthConcatenate2TensorsKernel
+   - CLLogits1DMaxShiftExpSumKernel
+   - CLLogits1DNormKernel
+   - CLHeightConcatenateLayerKernel
+   - @ref CLGEMMMatrixMultiplyKernel
+   - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
+   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
+   - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+   - CLDepthConcatenateLayerKernel
+   - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
+ - Removed OpenCL kernels / functions:
+   - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+   - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel
+   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel
+ - Deprecated OpenCL kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
+     - CLLocallyConnectedLayer
+     - CLLocallyConnectedMatrixMultiplyKernel
+     - CLAbsoluteDifference
+     - CLAbsoluteDifferenceKernel
+     - CLAccumulate
+     - CLAccumulateKernel
+     - CLAccumulateSquared
+     - CLAccumulateSquaredKernel
+     - CLAccumulateWeighted
+     - CLAccumulateWeightedKernel
+     - CLAccumulateWeightedFP16Kernel
+     - CLBox3x3
+     - CLBox3x3Kernel
+     - CLBox3x3FP16Kernel
+     - CLCannyEdge
+     - CLChannelCombine
+     - CLChannelCombineKernel
+     - CLChannelExtract
+     - CLChannelExtractKernel
+     - CLColorConvert
+     - CLColorConvertKernel
+     - CLConvolution3x3
+     - CLConvolutionRectangle
+     - CLConvolutionRectangleKernel
+     - CLConvolutionSquare
+     - CLConvolutionKernel
+     - CLDerivative
+     - CLDerivativeKernel
+     - CLDilate
+     - CLDilateKernel
+     - CLEqualizeHistogram
+     - CLErode
+     - CLErodeKernel
+     - CLFastCorners
+     - CLFastCornersKernel
+     - CLGaussian3x3
+     - CLGaussian3x3Kernel
+     - CLGaussian5x5
+     - CLGaussian5x5HorKernel
+     - CLGaussian5x5VertKernel
+     - CLGaussianPyramid
+     - CLGaussianPyramidHalf
+     - CLGaussianPyramidOrb
+     - CLHarrisCorners
+     - CLHarrisScoreKernel
+     - CLHarrisScoreFP16Kernel
+     - CLHistogram
+     - CLHistogramKernel
+     - CLHOGOrientationBinningKernel
+     - CLHOGBlockNormalizationKernel
+     - CLHOGDetectorKernel
+     - CLHOGNonMaximaSuppressionKernel
+     - CLHOGDescriptor
+     - CLHOGDetector
+     - CLHOGGradient
+     - CLHOGMultiDetection
+     - CLHOGOrientationBinningKernel
+     - CLHOGBlockNormalizationKernel
+     - CLHOGDetectorKernel
+     - CLIntegralImage
+     - CLIntegralImageKernel
+     - CLLaplacianReconstruct
+     - CLLaplacianPyramid
+     - CLMagnitude
+     - CLMagnitudePhaseKernel
+     - CLMedian3x3
+     - CLMedian3x3Kernel
+     - CLMinMaxLocation
+     - CLMinMaxLocationKernel
+     - CLNonLinearFilter
+     - CLNonLinearFilterKernel
+     - CLNonMaximaSuppression3x3
+     - CLNonMaximaSuppression3x3FP16Kernel
+     - CLNonMaximaSuppression3x3Kernel
+     - CLOpticalFlow
+     - CLPhase
+     - CLRemap
+     - CLRemapKernel
+     - CLScharr3x3
+     - CLScharr3x3Kernel
+     - CLSobel3x3
+     - CLSobel3x3Kernel
+     - CLSobel5x5
+     - CLSobel5x5HorKernel
+     - CLSobel5x5VertKernel
+     - CLSobel7x7
+     - CLSobel7x7HorKernel
+     - CLSobel7x7VertKernel
+     - CLThreshold
+     - CLThresholdKernel
+     - CLWarpAffine
+     - CLWarpAffineKernel
+     - CLWarpPerspective
+     - CLWarpPerspectiveKernel
+ - Deprecated Arm® Neon™ kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
+     - NELocallyConnectedLayer
+     - NELocallyConnectedMatrixMultiplyKernel
+     - NEAbsoluteDifference
+     - NEAbsoluteDifferenceKernel
+     - NEAccumulate
+     - NEAccumulateKernel
+     - NEAccumulateSquared
+     - NEAccumulateSquaredKernel
+     - NEAccumulateWeighted
+     - NEAccumulateWeightedKernel
+     - NEAccumulateWeightedFP16Kernel
+     - NEBox3x3
+     - NEBox3x3Kernel
+     - NEBox3x3FP16Kernel
+     - NECannyEdge
+     - NEChannelCombine
+     - NEChannelCombineKernel
+     - NEChannelExtract
+     - NEChannelExtractKernel
+     - NEColorConvert
+     - NEColorConvertKernel
+     - NEConvolution3x3
+     - NEConvolutionRectangle
+     - NEConvolutionRectangleKernel
+     - NEConvolutionSquare
+     - NEConvolutionKernel
+     - NEDerivative
+     - NEDerivativeKernel
+     - NEDilate
+     - NEDilateKernel
+     - NEEqualizeHistogram
+     - NEErode
+     - NEErodeKernel
+     - NEFastCorners
+     - NEFastCornersKernel
+     - NEGaussian3x3
+     - NEGaussian3x3Kernel
+     - NEGaussian5x5
+     - NEGaussian5x5HorKernel
+     - NEGaussian5x5VertKernel
+     - NEGaussianPyramid
+     - NEGaussianPyramidHalf
+     - NEGaussianPyramidOrb
+     - NEHarrisCorners
+     - NEHarrisScoreKernel
+     - NEHarrisScoreFP16Kernel
+     - NEHistogram
+     - NEHistogramKernel
+     - NEHOGOrientationBinningKernel
+     - NEHOGBlockNormalizationKernel
+     - NEHOGDetectorKernel
+     - NEHOGNonMaximaSuppressionKernel
+     - NEHOGDescriptor
+     - NEHOGDetector
+     - NEHOGGradient
+     - NEHOGMultiDetection
+     - NEHOGOrientationBinningKernel
+     - NEHOGBlockNormalizationKernel
+     - NEHOGDetectorKernel
+     - NEIntegralImage
+     - NEIntegralImageKernel
+     - NELaplacianReconstruct
+     - NELaplacianPyramid
+     - NEMagnitude
+     - NEMagnitudePhaseKernel
+     - NEMedian3x3
+     - NEMedian3x3Kernel
+     - NEMinMaxLocation
+     - NEMinMaxLocationKernel
+     - NENonLinearFilter
+     - NENonLinearFilterKernel
+     - NENonMaximaSuppression3x3
+     - NENonMaximaSuppression3x3FP16Kernel
+     - NENonMaximaSuppression3x3Kernel
+     - NEOpticalFlow
+     - NEPhase
+     - NERemap
+     - NERemapKernel
+     - NEScharr3x3
+     - NEScharr3x3Kernel
+     - NESobel3x3
+     - NESobel3x3Kernel
+     - NESobel5x5
+     - NESobel5x5HorKernel
+     - NESobel5x5VertKernel
+     - NESobel7x7
+     - NESobel7x7HorKernel
+     - NESobel7x7VertKernel
+     - NEThreshold
+     - NEThresholdKernel
+     - NEWarpAffine
+     - NEWarpAffineKernel
+     - NEWarpPerspective
+     - NEWarpPerspectiveKernel
+ - Deprecated GLES kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
+     - GCAbsoluteDifference
+     - GCActivationLayer
+     - GCArithmeticAddition
+     - GCBatchNormalizationLayer
+     - GCConcatenateLayer
+     - GCConvolutionLayer
+     - GCDepthwiseConvolutionLayer
+     - GCDirectConvolutionLayer
+     - GCDropoutLayer
+     - GCFillBorder
+     - GCFullyConnectedLayer
+     - GCGEMM
+     - GCGEMMInterleave4x4
+     - GCGEMMTranspose1xW
+     - GCNormalizationLayer
+     - GCNormalizePlanarYUVLayer
+     - GCPixelWiseMultiplication
+     - GCPoolingLayer
+     - GCScale
+     - GCSoftmaxLayer
+     - GCTensorShift
+     - GCTranspose
+
+
+v20.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Added new data type QASYMM8_SIGNED support for:
+   - @ref CLArgMinMaxLayer
+   - @ref CLArgMinMaxLayerKernel
+ - Added new data type U8 support for:
+   - @ref NECropKernel
+   - CLCropKernel
+ - Added align_corner support for nearest neighbor interpolation in:
+   - NEScaleKernel
+   - CLScaleKernel
+ - New OpenCL kernels / functions:
+   - @ref CLMaxUnpoolingLayerKernel
+ - New Arm® Neon™ kernels / functions:
+   - @ref NEMaxUnpoolingLayerKernel
+ - New graph example:
+   - graph_yolov3_output_detector
+ - GEMMTuner improvements:
+   - Added fp16 support
+   - Output json files for easier integration
+   - Enabled tuning for export_to_cl_image_rhs option for RHS tensors
+   - More robust script for running benchmarks
+ - Removed padding from:
+   - NEPixelWiseMultiplicationKernel
+   - NEHeightConcatenateLayerKernel
+   - NEThresholdKernel
+   - NEBatchConcatenateLayerKernel
+   - NETransposeKernel
+   - @ref NEBatchNormalizationLayerKernel
+   - NEArithmeticSubtractionKernel
+   - @ref NEBoundingBoxTransformKernel
+   - NELogits1DMaxKernel
+   - NELogits1DSoftmaxKernel
+   - @ref NEROIPoolingLayerKernel
+   - @ref NEROIAlignLayerKernel
+   - NEYOLOLayerKernel
+   - NEUpsampleLayerKernel
+   - NEFloorKernel
+   - NEWidthConcatenateLayerKernel
+   - NEDepthConcatenateLayerKernel
+   - @ref NENormalizationLayerKernel
+   - @ref NEL2NormalizeLayerKernel
+   - NEFillArrayKernel
+   - @ref NEDepthConvertLayerKernel
+   - @ref NERangeKernel
+   - @ref NEPriorBoxLayer
+ - Removed OpenCL kernels / functions:
+   - CLGEMMLowpQuantizeDownInt32ToUint8Scale
+   - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
+ - Removed Arm® Neon™ kernels / functions:
+   - NEGEMMLowpQuantizeDownInt32ToUint8Scale
+   - NEGEMMMatrixAccumulateBiasesKernel
+ - Deprecated functions / interfaces:
+   - Non-descriptor based interfaces for NEThreshold, CLThreshold
+   - Non-descriptor based interfaces for @ref NEScale, @ref CLScale and GCScale
+   - In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and GCSoftmaxLayer :
+      The default "axis" value for @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and GCSoftmaxLayer is changed from 1 to 0.
+      Only axis 0 is supported.
+      The default "axis" value for @ref NESoftmaxLayer, @ref NELogSoftmaxLayer is changed from 1 to 0.
+      Only axis 0 is supported.
+ - The support for quantized data types has been removed from @ref CLLogSoftmaxLayer due to implementation complexity.
+ - Removed padding requirement for the input (e.g. LHS of GEMM) and output in @ref CLGEMMMatrixMultiplyNativeKernel, @ref CLGEMMMatrixMultiplyReshapedKernel, @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and @ref CLIm2ColKernel (NHWC only)
+   - This change allows @ref CLGEMMConvolutionLayer to be used without extra padding for the input and output.
+   - Only the weights/bias of @ref CLGEMMConvolutionLayer could require padding for the computation.
+   - Only on Arm® Mali™ Midgard GPUs, @ref CLGEMMConvolutionLayer could require padding since @ref CLGEMMMatrixMultiplyKernel is called and currently requires padding.
+ - Added support for exporting the OpenCL buffer object to the OpenCL image object in @ref CLGEMMMatrixMultiplyReshapedKernel and @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
+   - This support allows the OpenCL buffer used for the reshaped RHS matrix to be exported to the OpenCL image object.
+   - The padding requirement for the OpenCL image object is taken into account in @ref CLGEMMReshapeRHSMatrixKernel.
+   - The reshaped RHS matrix stores the weights when GEMM is used to accelerate @ref CLGEMMConvolutionLayer.
+
+v20.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r18b.
+ - Updated recommended gcc version to Linaro 6.3.1.
+ - Added Bfloat16 type support
+ - Added Bfloat16 support in:
+     - @ref NEWeightsReshapeKernel
+     - @ref NEConvolutionLayerReshapeWeights
+     - @ref NEIm2ColKernel
+     - NEIm2Col
+     - @ref NEDepthConvertLayerKernel
+     - @ref NEDepthConvertLayer
+     - @ref NEGEMMConvolutionLayer
+     - NEGEMMAssemblyDispatch
+ - Added new data type QASYMM8_SIGNED support for:
+     - @ref CLDirectConvolutionLayer
+     - @ref CLDeconvolutionLayer
+     - @ref CLDirectDeconvolutionLayer
+     - @ref CLGEMMDeconvolutionLayer
+     - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
+     - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
+     - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
+     - @ref CLReductionOperation
+     - @ref CLReduceMean
+     - @ref NEScale
+     - NEScaleKernel
+     - NEUpsampleLayer
+     - @ref NECast
+     - @ref NEReductionOperation
+     - @ref NEReduceMean
+     - @ref NEArgMinMaxLayer
+     - @ref NEDeconvolutionLayer
+     - @ref NEGEMMLowpQuantizeDownInt32ScaleKernel
+     - @ref CPPBoxWithNonMaximaSuppressionLimit
+     - @ref CPPDetectionPostProcessLayer
+     - @ref CPPPermuteKernel
+     - @ref CPPPermute
+     - @ref CPPTopKVKernel
+     - @ref CPPTopKV
+     - @ref CPPUpsample
+     - @ref CPPUpsampleKernel
+ - New OpenCL kernels / functions:
+     - @ref CLQLSTMLayer
+     - @ref CLQLSTMLayerNormalizationKernel
+ - New Arm® Neon™ kernels / functions:
+     - @ref NEQLSTMLayer
+     - @ref NEQLSTMLayerNormalizationKernel
+ - Added HARD_SWISH support in:
+     - CLActivationLayerKernel
+     - NEActivationLayerKernel
+ - Deprecated OpenCL kernels / functions:
+     - CLGEMMLowpQuantizeDownInt32ToUint8Scale
+     - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
+ - Deprecated Arm® Neon™ kernels / functions:
+     - NEGEMMLowpQuantizeDownInt32ToUint8Scale
+ - Removed CPP kernels / functions:
+     - CPPFlipWeightsKernel
+ - Removed PoolingLayerInfo constructors without Data Layout.
+ - Removed CLDepthwiseConvolutionLayer3x3
+ - Removed NEDepthwiseConvolutionLayerOptimized
+ - Added support for Winograd 3x3,4x4 on Arm® Neon™ FP16:
+     - @ref NEWinogradConvolutionLayer
+     - @ref NEWinogradLayerTransformInputKernel
+     - @ref NEWinogradLayerTransformOutputKernel
+     - @ref NEWinogradLayerTransformWeightsKernel
+ - Added CLCompileContext
+ - Added Arm® Neon™ GEMM kernel with 2D window support
+
+v20.02.1 Maintenance release
+ - Added Android-NN build script.
+
+v20.02 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Added new data type QASYMM8_SIGNED support for:
+     - @ref CLDepthwiseConvolutionLayer
+     - CLDepthwiseConvolutionLayer3x3
+     - @ref CLGEMMConvolutionLayer
+     - @ref CLGEMMLowpMatrixMultiplyCore
+     - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+     - @ref CLGEMMLowpMatrixMultiplyNativeKernel
+     - @ref NEActivationLayer
+     - NEComparisonOperationKernel
+     - @ref NEConvolutionLayer
+     - @ref NEDepthwiseConvolutionLayer
+     - NEDepthwiseConvolutionLayer3x3Kernel
+     - NEDirectConvolutionLayerOutputStageKernel
+     - @ref NEElementwiseComparison
+     - @ref NEElementwiseMax
+     - @ref NEElementwiseMin
+     - @ref NEElementwiseSquaredDiff
+     - @ref NEFullyConnectedLayer
+     - NEGEMMMatrixVectorMultiplyKernel
+     - @ref NEPixelWiseMultiplication
+     - @ref NEPoolingLayer
+     - @ref NEPReluLayer
+ - Added support for QSYMM8_PER_CHANNEL in:
+     - NEDepthwiseConvolutionLayer3x3Kernel
+ - Added support for split sizes in:
+     - @ref CLSplit
+     - @ref NESplit
+ - New OpenCL kernels / functions:
+     - @ref CLFill
+     - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
+ - New Arm® Neon™ kernels / functions:
+     - @ref NEFill
+     - @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
+ - Deprecated functions / interfaces:
+     - CLDepthwiseConvolutionLayer3x3
+     - NEDepthwiseConvolutionLayerOptimized
+     - PoolingLayerInfo constructors without Data Layout.
+ - Added support for quantization with multiplier greater than 1 on Arm® Neon™ and CL.
+ - Added support for quantized inputs of type QASYMM8_SIGNED and QASYMM8 to @ref CLQuantizationLayer.
+ - Added the ability to build bootcode for bare metal.
+ - Added support for generating synthetic QASYMM8 graphs.
+ - Added support for F16 datatype in VGG16.
+ - Removed pre-built binaries for GLES.
+
+v19.11.1 Public maintenance release
+ - Fix offset calculation in NEReductionOperationKernel.
+ - Fix data layout in NEScaleKernel for nhwc.
+ - Retain configuration step data layout to avoid side-effects.
+ - Perform sqrt in double domain for L2 pooling.
+ - Fix output shape calculation for Reduce Mean
+ - Restrict cases where optimized NEPadLayer runs.
+
+v19.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r17c.
+ - Deprecated OpenCL kernels / functions:
+    - CLDepthwiseConvolutionLayerReshapeWeightsGenericKernel
+    - CLDepthwiseIm2ColKernel
+    - CLDepthwiseSeparableConvolutionLayer
+    - CLDepthwiseVectorToTensorKernel
+    - CLDirectConvolutionLayerOutputStageKernel
+ - Deprecated Arm® Neon™ kernels / functions:
+    - NEDepthwiseWeightsReshapeKernel
+    - NEDepthwiseIm2ColKernel
+    - NEDepthwiseSeparableConvolutionLayer
+    - NEDepthwiseVectorToTensorKernel
+    - NEDepthwiseConvolutionLayer3x3
+ - New OpenCL kernels / functions:
+    - @ref CLInstanceNormalizationLayerKernel / @ref CLInstanceNormalizationLayer
+    - @ref CLDepthwiseConvolutionLayerNativeKernel to replace the old generic depthwise convolution (see Deprecated
+      OpenCL kernels / functions)
+    - @ref CLLogSoftmaxLayer
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEBoundingBoxTransformKernel / @ref NEBoundingBoxTransform
+    - @ref NEComputeAllAnchorsKernel / NEComputeAllAnchors
+    - @ref NEDetectionPostProcessLayer
+    - @ref NEGenerateProposalsLayer
+    - @ref NEInstanceNormalizationLayerKernel / @ref NEInstanceNormalizationLayer
+    - @ref NELogSoftmaxLayer
+    - @ref NEROIAlignLayerKernel / @ref NEROIAlignLayer
+ - Added QASYMM8 support for:
+    - @ref CLGenerateProposalsLayer
+    - @ref CLROIAlignLayer
+    - @ref CPPBoxWithNonMaximaSuppressionLimit
+ - Added QASYMM16 support for:
+    - @ref CLBoundingBoxTransform
+ - Added FP16 support for:
+    - @ref CLGEMMMatrixMultiplyReshapedKernel
+ - Added new data type QASYMM8_PER_CHANNEL support for:
+    - CLDequantizationLayer
+    - @ref NEDequantizationLayer
+ - Added new data type QSYMM8_PER_CHANNEL support for:
+    - @ref CLConvolutionLayer
+    - @ref NEConvolutionLayer
+    - @ref CLDepthwiseConvolutionLayer
+    - @ref NEDepthwiseConvolutionLayer
+ - Added FP16 mixed-precision support for:
+    - @ref CLGEMMMatrixMultiplyReshapedKernel
+    - CLPoolingLayerKernel
+ - Added FP32 and FP16 ELU activation for:
+    - @ref CLActivationLayer
+    - @ref NEActivationLayer
+ - Added asymmetric padding support for:
+    - @ref CLDirectDeconvolutionLayer
+    - @ref CLGEMMDeconvolutionLayer
+    - @ref NEDeconvolutionLayer
+ - Added SYMMETRIC and REFLECT modes for @ref CLPadLayerKernel / @ref CLPadLayer.
+ - Replaced the calls to NECopyKernel and NEMemsetKernel with @ref NEPadLayer in @ref NEGenerateProposalsLayer.
+ - Replaced the calls to CLCopyKernel and CLMemsetKernel with @ref CLPadLayer in @ref CLGenerateProposalsLayer.
+ - Improved performance for CL Inception V3 - FP16.
+ - Improved accuracy for CL Inception V3 - FP16 by enabling FP32 accumulator (mixed-precision).
+ - Improved Arm® Neon™ performance by enabling fusing batch normalization with convolution and depth-wise convolution layer.
+ - Improved Arm® Neon™ performance for MobileNet-SSD by improving the output detection performance.
+ - Optimized @ref CLPadLayer.
+ - Optimized CL generic depthwise convolution layer by introducing @ref CLDepthwiseConvolutionLayerNativeKernel.
+ - Reduced memory consumption by implementing weights sharing.
+
+v19.08.1 Public maintenance release
+ - Fix offset calculation in NEReductionOperationKernel.
+ - Fix data layout in NEScaleKernel for nhwc.
+ - Retain configuration step data layout to avoid side-effects.
+ - Perform sqrt in double domain for L2 pooling.
+ - Fix output shape calculation for Reduce Mean
+ - Fix broadcast CLPixelwiseMultiplication with 5D tensors
+
+v19.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Deprecated Arm® Neon™ functions
+    - NEDepthConcatenateLayer
+    - NEWidthConcatenateLayer
+ - Deprecated OpenCL kernels / functions
+    - CLDepthConcatenateLayer
+    - CLGEMMInterleave4x4Kernel / CLGEMMInterleave4x4
+    - CLGEMMTranspose1xWKernel / CLGEMMTranspose1xW
+    - CLWidthConcatenateLayer
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEAbsLayer
+    - @ref NECast
+    - @ref NEElementwisePower
+    - @ref NELogLayer
+    - @ref NELSTMLayerQuantized
+    - @ref NENegLayer
+    - @ref NEPReluLayer
+    - @ref NESinLayer
+    - NEBatchConcatenateLayerKernel
+    - @ref NEDepthToSpaceLayerKernel / @ref NEDepthToSpaceLayer
+    - NEDepthwiseConvolutionLayerNativeKernel
+    - @ref NEGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+    - @ref NEMeanStdDevNormalizationKernel / @ref NEMeanStdDevNormalizationLayer
+    - @ref NESpaceToDepthLayerKernel / @ref NESpaceToDepthLayer
+ - New OpenCL kernels / functions:
+    - @ref CLAbsLayer
+    - @ref CLElementwisePower
+    - @ref CLLogLayer
+    - @ref CLLSTMLayerQuantized
+    - @ref CLNegLayer
+    - @ref CLPReluLayer
+    - @ref CLSinLayer
+    - CLBatchConcatenateLayerKernel
+    - @ref CLDepthToSpaceLayerKernel / @ref CLDepthToSpaceLayer
+    - @ref CLGEMMLowpMatrixMultiplyNativeKernel
+    - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+    - @ref CLGEMMMatrixMultiplyNativeKernel
+    - CLMeanStdDevNormalizationKernel / CLMeanStdDevNormalizationLayer
+    - @ref CLSpaceToDepthLayerKernel / @ref CLSpaceToDepthLayer
+ - New examples:
+    - neon_opticalflow
+    - cl_cache
+    - neon_permute
+ - Added support for FP16 in @ref NEDeconvolutionLayer
+ - Added support for FP16 in @ref CLDeconvolutionLayer
+ - Added support for REDUCE_MIN and REDUCE_MAX in @ref ReductionOperation
+ - Enable the fusion of batch normalization with convolution and depthwise convolution layer for FP32 in the graph API (OpenCL only)
+ - Added support for fusing activation function and broadcast addition with the matrix multiplication for FP32 (OpenCL only)
+ - Re-factored the depthwise convolution layer kernel on Arm® Neon™ for generic cases
+ - Added an optimized depthwise convolution layer kernel for 5x5 filters (Neon only)
+ - Added support to enable OpenCL kernel cache. Added example showing how to load the prebuilt OpenCL kernels from a binary cache file
+ - Altered @ref QuantizationInfo interface to support per-channel quantization.
+ - The CLDepthwiseConvolutionLayer3x3 will be included by @ref CLDepthwiseConvolutionLayer to accommodate future optimizations.
+ - The NEDepthwiseConvolutionLayerOptimized will be included by @ref NEDepthwiseConvolutionLayer to accommodate future optimizations.
+ - Removed inner_border_right and inner_border_top parameters from @ref CLDeconvolutionLayer interface
+ - Removed inner_border_right and inner_border_top parameters from @ref NEDeconvolutionLayer interface
+ - Optimized the Arm® Neon™ assembly kernel for GEMMLowp. The new implementation fuses the output stage and quantization with the matrix multiplication kernel
+
+v19.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEBatchToSpaceLayerKernel / @ref NEBatchToSpaceLayer
+    - NEComplexPixelWiseMultiplicationKernel / @ref NEComplexPixelWiseMultiplication
+    - @ref NECropKernel / @ref NECropResize
+    - NEDepthwiseConvolutionAssemblyDispatch
+    - @ref NEFFTDigitReverseKernel
+    - @ref NEFFTRadixStageKernel
+    - @ref NEFFTScaleKernel
+    - @ref NEGEMMLowpOffsetContributionOutputStageKernel
+    - NEHeightConcatenateLayerKernel
+    - @ref NESpaceToBatchLayerKernel / @ref NESpaceToBatchLayer
+    - @ref NEFFT1D
+    - @ref NEFFT2D
+    - @ref NEFFTConvolutionLayer
+ - New OpenCL kernels / functions:
+    - CLComplexPixelWiseMultiplicationKernel / @ref CLComplexPixelWiseMultiplication
+    - CLCropKernel / @ref CLCropResize
+    - @ref CLDeconvolutionReshapeOutputKernel
+    - @ref CLFFTDigitReverseKernel
+    - @ref CLFFTRadixStageKernel
+    - @ref CLFFTScaleKernel
+    - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+    - @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
+    - CLHeightConcatenateLayerKernel
+    - @ref CLDirectDeconvolutionLayer
+    - @ref CLFFT1D
+    - @ref CLFFT2D
+    - @ref CLFFTConvolutionLayer
+    - @ref CLGEMMDeconvolutionLayer
+ - New OpenGLES kernels / functions:
+    - GCConcatenateLayer
+ - Deprecated functions/interfaces
+    - GCDepthConcatenateLayer
+    - NEWidthConcatenateLayer
+    - NEDepthConcatenateLayer
+    - CLWidthConcatenateLayer
+    - CLDepthConcatenateLayer
+    - CLGEMMInterleave4x4
+    - CLGEMMTranspose1xW
+ - Support different quantization info in CLConcatLayer.
+ - Add checks for cases where different input/output quantization info is not supported.
+ - Tensors have different quantization information.
+ - Add FP16 support checks.
+ - Fix output quantization in CLDepthwiseConv3x3 when activation is fused.
+ - New graph examples:
+     - graph_convolution
+     - graph_fully_connected
+     - graph_depthwise_convolution
+     - Deepspeech v0.4.1
+ - Add support for QASYMM8 in NEArithmeticSubtractionKernel.
+ - Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
+ - Add support for QASYMM8 NEDeconvolution.
+ - Add support for DequantizationLayer for Neon/CL.
+ - Add support for dilation in CLDepthwiseConvolution.
+ - Fuse offset contribution with the output stage when we use NEGEMMLowpMatrixMultiplyCore.
+ - Optimize CLDeconvolution.
+ - Add StackLayer to the graph API.
+ - Add support for "reflect" padding mode in NEPad.
+ - Winograd 7x7 NHWC on OpenCL.
+ - Rework CL ML layers to run exclusively on CL.
+ - Support different quantization info in PoolingLayer.
+ - Implement and test import memory interfaces.
+ - Added new tests and removed old ones.
+ - Various clang-tidy fixes.
+
+v19.02 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+    - @ref NETileKernel / @ref NETile
+    - @ref NEFuseBatchNormalizationKernel / @ref NEFuseBatchNormalization
+    - NEElementwiseOperationKernel
+    - @ref NEElementwiseMax
+    - @ref NEElementwiseMin
+    - @ref NEElementwiseSquaredDiff
+    - @ref NESelectKernel / @ref NESelect
+    - @ref NESplit
+    - @ref NESlice
+    - @ref NEUnstack
+    - @ref NEStridedSliceKernel / @ref NEStridedSlice
+    - NEElementwiseUnaryKernel
+    - @ref NERsqrtLayer
+    - @ref NEExpLayer
+    - @ref NEReverseKernel / @ref NEReverse
+    - @ref NEArgMinMaxLayer
+    - @ref NEStackLayerKernel / @ref NEStackLayer
+    - @ref NERangeKernel / @ref NERange
+    - @ref NEPadLayer
+    - NEMemsetKernel
+    - @ref NEGatherKernel / @ref NEGather
+    - @ref NEElementwiseComparison
+    - @ref NEElementwiseComparisonStatic
+    - NEComparisonOperationKernel
+    - @ref NEElementwiseDivision
+ - New OpenCL kernels / functions:
+    - @ref CLSelectKernel / @ref CLSelect
+    - @ref CLTileKernel / @ref CLTile
+    - @ref CLComparisonKernel / @ref CLComparison
+    - @ref CLArgMinMaxLayer
+    - @ref CLElementwiseMax
+    - @ref CLElementwiseMin
+    - @ref CLElementwiseSquaredDiff
+    - @ref CLStackLayerKernel / @ref CLStackLayer
+    - @ref CLReverse / @ref CLReverseKernel
+    - @ref CLRsqrtLayer
+    - @ref CLExpLayer
+    - CLElementWiseUnaryLayerKernel
+    - @ref CLGEMMReshapeLHSMatrixKernel
+    - @ref CLGEMMReshapeRHSMatrixKernel
+    - @ref CLGEMMMatrixMultiplyReshapedKernel
+    - @ref CLRangeKernel / @ref CLRange
+    - @ref CLUnstack
+    - @ref CLGatherKernel / @ref CLGather
+    - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
+ - New CPP kernels / functions:
+    - @ref CPPDetectionOutputLayer
+    - @ref CPPTopKV / @ref CPPTopKVKernel
+ - Added new examples:
+    - graph_ssd_mobilenet.cpp
+    - graph_mobilenet_v2.cpp
+    - graph_resnet12.cpp
+    - graph_srcnn955.cpp
+    - graph_vgg_vdsr.cpp
+    - graph_inception_resnet_v1.cpp
+ - Add 4D tensors support to
+    - @ref NESoftmaxLayer
+ - Fused activation in @ref CLWinogradConvolutionLayer
+ - Extended @ref NEPermute to support more cases
+ - Added Neon/SVE GEMM Hybrid kernels
+ - Added u8 and s8 hybrid assembly kernels
+ - Introduced GEMM strategy name in NEGEMMAssemblyWrapper
+ - Improved @ref CLTuner
+ - Fused the bias addition within @ref CLGEMM
+ - Added support for QASYMM8 LOGISTIC activation in @ref NEActivationLayer
+ - Added NHWC data layout support to:
+    - @ref NEScale for F16
+    - @ref CLNormalizationLayer IN_MAP_2D for FP32/FP16
+    - @ref NEL2NormalizeLayer for FP32/FP16
+    - @ref NENormalizationLayer IN_MAP_2D for FP32/FP16
+    - @ref CLROIAlignLayer
+    - @ref CLGenerateProposalsLayer
+ - Added QASYMM8 support to the following kernels:
+    - NEArithmeticAdditionKernel
+    - @ref NEScale
+ - Added new tests and improved validation and benchmarking suites.
+ - Deprecated functions/interfaces
+    - Usage of inner_border_right and inner_border_top has been deprecated in @ref CLDeconvolutionLayer and @ref NEDeconvolutionLayer
+
+v18.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEChannelShuffleLayer / @ref NEChannelShuffleLayerKernel
+    - @ref NEReduceMean
+    - @ref NEReorgLayer / @ref NEReorgLayerKernel
+    - @ref NEPriorBoxLayer / @ref NEPriorBoxLayerKernel
+    - NEUpsampleLayer / NEUpsampleLayerKernel
+    - NEYOLOLayer / NEYOLOLayerKernel
+ - New OpenCL kernels / functions:
+    - @ref CLBatchToSpaceLayer / @ref CLBatchToSpaceLayerKernel
+    - @ref CLBoundingBoxTransform / @ref CLBoundingBoxTransformKernel
+    - @ref CLComputeAllAnchorsKernel
+    - @ref CLGenerateProposalsLayer
+    - @ref CLNormalizePlanarYUVLayer / @ref CLNormalizePlanarYUVLayerKernel
+    - @ref CLReorgLayer / @ref CLReorgLayerKernel
+    - @ref CLSpaceToBatchLayer / @ref CLSpaceToBatchLayerKernel
+    - @ref CLPadLayer
+    - @ref CLReduceMean
+    - @ref CLPriorBoxLayer / @ref CLPriorBoxLayerKernel
+    - @ref CLROIAlignLayer / @ref CLROIAlignLayerKernel
+    - @ref CLSlice
+    - @ref CLSplit
+    - @ref CLStridedSlice / @ref CLStridedSliceKernel
+    - CLUpsampleLayer / CLUpsampleLayerKernel
+    - CLYOLOLayer / CLYOLOLayerKernel
+ - New CPP kernels / functions:
+    - @ref CPPBoxWithNonMaximaSuppressionLimit / @ref CPPBoxWithNonMaximaSuppressionLimitKernel
+ - Added the validate method in:
+    - @ref NEDepthConvertLayer
+    - @ref NEFloor / @ref CLFloor
+    - @ref NEGEMMMatrixAdditionKernel
+    - @ref NEReshapeLayer / @ref CLReshapeLayer
+    - @ref CLScale
+ - Added new examples:
+    - graph_shufflenet.cpp
+    - graph_yolov3.cpp
+ - Added documentation for adding a new function or kernel.
+ - Improved doxygen documentation by adding a list of the existing functions.
+ - Add 4D tensors support to
+    - CLWidthConcatenateLayer
+    - CLFlattenLayer
+    - @ref CLSoftmaxLayer
+ - Add dot product support for @ref CLDepthwiseConvolutionLayer3x3NHWCKernel non-unit stride
+ - Add SVE support
+ - Fused batch normalization into convolution layer weights in @ref CLFuseBatchNormalization
+ - Fused activation in @ref CLDepthwiseConvolutionLayer3x3NCHWKernel, @ref CLDepthwiseConvolutionLayer3x3NHWCKernel and @ref NEGEMMConvolutionLayer
+ - Added NHWC data layout support to:
+    - @ref CLChannelShuffleLayer
+    - @ref CLDeconvolutionLayer
+    - @ref CLL2NormalizeLayer
+ - Added QASYMM8 support to the following kernels:
+    - CLScaleKernel
+    - NEDepthwiseConvolutionLayer3x3Kernel
+    - CLPixelWiseMultiplicationKernel
+ - Added FP16 support to the following kernels:
+    - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
+    - NEDepthwiseConvolutionLayer3x3Kernel
+    - @ref CLNormalizePlanarYUVLayerKernel
+    - @ref CLWinogradConvolutionLayer (5x5 kernel)
+ - More tests added to both validation and benchmarking suites.
+
+v18.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r17b.
+ - Removed support for QS8/QS16 data types.
+ - Added support for grouped convolution in @ref CLConvolutionLayer.
+ - Added NHWC data layout support to:
+    - NEDepthConcatenateLayer / CLDepthConcatenateLayer
+    - @ref NEWinogradConvolutionLayer / @ref CLWinogradConvolutionLayer
+    - @ref CLDepthwiseConvolutionLayer
+    - @ref CLDirectConvolutionLayer
+    - @ref CLConvolutionLayer
+    - @ref CLScale
+    - @ref CLIm2ColKernel
+ - New Arm® Neon™ kernels / functions:
+    - @ref NERNNLayer
+ - New OpenCL kernels / functions:
+    - @ref CLArithmeticDivision
+ - Introduced prepare() stage support in the graph API for GLES.
+ - Added support for memory reuse when trying to allocate smaller CLTensors.
+ - Enabled NHWC execution on graph examples.
+ - Added JPEG accessor for validation purposes.
+ - Added validate methods to some kernels / functions.
+
+v18.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Major redesign of the interface for the Neon kernels implemented in assembly.
+ - Removed arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore / arm_compute::NEHGEMMAArch64FP16Kernel
+ - Added NEGEMMAssemblyWrapper and AssemblyKernelGlue, which are used to execute assembly kernels in Neon functions.
+ - Minor changes to the CPUInfo type to make it compatible with the new assembly gemm interface.
+ - Moved Neon assembly kernels to the folder src/core/Neon/kernels/arm_gemm.
+ - Improved doxygen documentation.
+ - Improved memory management for layer transitions.
+ - Added support for NHWC data layout in tensors.
+ - Added NHWC data layout support to:
+    - @ref NEGEMMConvolutionLayer
+    - @ref NEDirectConvolutionLayer
+    - @ref NEPoolingLayer / @ref CLPoolingLayer
+    - @ref NEBatchNormalizationLayer / @ref CLBatchNormalizationLayer
+    - @ref NEDepthwiseConvolutionLayer
+    - @ref NEScale
+    - NEIm2Col
+ - Added support for dilated convolutions in @ref NEConvolutionLayer and @ref CLConvolutionLayer.
+ - New OpenCL kernels / functions:
+    - @ref CLChannelShuffleLayer / @ref CLChannelShuffleLayerKernel
+    - CLConvertFullyConnectedWeightsKernel / @ref CLConvertFullyConnectedWeights
+    - @ref CLCopy / CLCopyKernel
+    - @ref CLLSTMLayer
+    - @ref CLRNNLayer
+    - CLWidthConcatenateLayer / CLWidthConcatenateLayerKernel
+    - @ref CLWinogradFilterTransformKernel / @ref CLWinogradInputTransformKernel / @ref CLWinogradConvolutionLayer
+    - @ref CLWinogradInputTransformKernel / @ref CLWinogradInputTransform
+ - New Arm® Neon™ kernels / functions:
+    - NEConvertFullyConnectedWeightsKernel / @ref NEConvertFullyConnectedWeights.
+ - Created the validate method in @ref CLDepthwiseConvolutionLayer.
+ - Beta and gamma are no longer mandatory arguments in @ref NEBatchNormalizationLayer and @ref CLBatchNormalizationLayer.
+ - Added depth multiplier support in @ref NEDepthwiseConvolutionLayer and @ref CLDepthwiseConvolutionLayer.
+ - Added broadcast multiply support in @ref NEPixelWiseMultiplication / NEPixelWiseMultiplicationKernel.
+ - Port mobilenet example to NHWC data layout.
+ - Enabled Winograd method in @ref CLConvolutionLayer.
+ - Renamed NEWinogradLayer to @ref NEWinogradConvolutionLayer.
+ - Updated @ref NEWinogradConvolutionLayer to use highly optimised assembly kernels in src/core/Neon/kernels/arm_gemm.
+ - Added memory manager support in GLES functions.
+ - Major refactoring of the graph API.
+ - Added GLES backend in the graph API.
+ - Added support for the memory manager in the graph API.
+ - Enabled Winograd Convolution method in the graph API.
+ - Added support for grouped convolutions in the graph API.
+ - Replaced NEDeconvolutionLayerUpsampleKernel with NEScaleKernel in @ref NEDeconvolutionLayer.
+ - Added fast maths flag in @ref CLConvolutionLayer.
+ - Added new tests and benchmarks in validation and benchmark frameworks
+ - Merge Activation layer with Convolution Layer (Neon, CL, GLES)
+ - Added support for OpenCL 2.0 SVM
+ - Added support to import memory in OpenCL tensors.
+ - Added the prepare() method to perform any one-off pre-processing before running the function.
+ - Added new examples:
+    - graph_inception_v4.cpp
+    - graph_resnext50.cpp
+ - Added memory measurement instrument for CL.
+
+v18.03 Public maintenance release
+ - Various bug fixes.
+ - Fixed bug in @ref NEActivationLayer
+ - Fix in @ref CLTuner when using batches.
+ - Updated recommended NDK version to r16b (And fixed warnings).
+ - Fixed bug in validation code.
+ - Added Inception v4 graph example.
+ - Renamed NEWinogradLayer.cpp to @ref NEWinogradConvolutionLayer
+
+v18.02 Public major release
+ - Various Arm® Neon™ / OpenCL / GLES optimisations.
+ - Various bug fixes.
+ - Changed default number of threads on big.LITTLE systems.
+ - Refactored examples and added:
+    - graph_mobilenet_qasymm8
+    - graph_resnet
+    - graph_squeezenet_v1_1
+ - Renamed @ref CLConvolutionLayer into @ref CLGEMMConvolutionLayer and created a new @ref CLConvolutionLayer to select the fastest convolution method.
+ - Renamed @ref NEConvolutionLayer into @ref NEGEMMConvolutionLayer and created a new @ref NEConvolutionLayer to select the fastest convolution method.
+ - Added in place support to:
+    - @ref CLActivationLayer
+    - @ref CLBatchNormalizationLayer
+ - Added QASYMM8 support to:
+    - @ref CLActivationLayer
+    - @ref CLDepthwiseConvolutionLayer
+    - @ref NEDepthwiseConvolutionLayer
+    - @ref NESoftmaxLayer
+ - Added FP16 support to:
+    - CLDepthwiseConvolutionLayer3x3
+    - @ref CLDepthwiseConvolutionLayer
+ - Added broadcasting support to NEArithmeticAddition / @ref CLArithmeticAddition / @ref CLPixelWiseMultiplication
+ - Added fused batch normalization and activation to @ref CLBatchNormalizationLayer and @ref NEBatchNormalizationLayer
+ - Added support for non-square pooling to @ref NEPoolingLayer and @ref CLPoolingLayer
+ - New OpenCL kernels / functions:
+    - CLDirectConvolutionLayerOutputStageKernel
+ - New Arm® Neon™ kernels / functions
+    - Added name() method to all kernels.
+    - Added support for Winograd 5x5.
+    - NEPermuteKernel / @ref NEPermute
+    - @ref NEWinogradLayerTransformInputKernel / NEWinogradLayer
+    - @ref NEWinogradLayerTransformOutputKernel / NEWinogradLayer
+    - @ref NEWinogradLayerTransformWeightsKernel / NEWinogradLayer
+    - Renamed NEWinogradLayerKernel into NEWinogradLayerBatchedGEMMKernel
+ - New GLES kernels / functions:
+    - GCTensorShiftKernel / GCTensorShift
+
+v18.01 Public maintenance release
+ - Various bug fixes
+ - Added some of the missing validate() methods
+ - Added @ref CLDeconvolutionLayerUpsampleKernel / @ref CLDeconvolutionLayer @ref CLDeconvolutionLayerUpsample
+ - Added CLPermuteKernel / @ref CLPermute
+ - Added method to clean the programs cache in the CL Kernel library.
+ - Added GCArithmeticAdditionKernel / GCArithmeticAddition
+ - Added GCDepthwiseConvolutionLayer3x3Kernel / GCDepthwiseConvolutionLayer3x3
+ - Added GCNormalizePlanarYUVLayerKernel / GCNormalizePlanarYUVLayer
+ - Added GCScaleKernel / GCScale
+ - Added GCWeightsReshapeKernel / GCConvolutionLayer
+ - Added FP16 support to the following GLES compute kernels:
+    - GCCol2ImKernel
+    - GCGEMMInterleave4x4Kernel
+    - GCGEMMTranspose1xWKernel
+    - GCIm2ColKernel
+ - Refactored Arm® Neon™ Winograd (NEWinogradLayerKernel)
+ - Added NEDirectConvolutionLayerOutputStageKernel
+ - Added QASYMM8 support to the following Arm® Neon™ kernels:
+    - NEDepthwiseConvolutionLayer3x3Kernel
+    - @ref NEFillBorderKernel
+    - NEPoolingLayerKernel
+ - Added new examples:
+    - graph_cl_mobilenet_qasymm8.cpp
+    - graph_inception_v3.cpp
+    - gc_dc.cpp
+ - More tests added to both validation and benchmarking suites.
+
+v17.12 Public major release
+ - Most machine learning functions on OpenCL support the new data type QASYMM8
+ - Introduced logging interface
+ - Introduced OpenCL timer
+ - Reworked GEMMLowp interface
+ - Added new Arm® Neon™ assembly kernels for GEMMLowp, SGEMM and HGEMM
+ - Added validation method for most Machine Learning kernels / functions
+ - Added new graph examples such as googlenet, mobilenet, squeezenet, vgg16 and vgg19
+ - Added sgemm example for OpenCL
+ - Added absolute difference example for GLES compute
+ - Added new tests and benchmarks in validation and benchmark frameworks
+ - Added new kernels / functions for GLES compute
+
+ - New OpenGL ES kernels / functions
+    - GCAbsoluteDifferenceKernel / GCAbsoluteDifference
+    - GCActivationLayerKernel / GCActivationLayer
+    - GCBatchNormalizationLayerKernel / GCBatchNormalizationLayer
+    - GCCol2ImKernel
+    - GCDepthConcatenateLayerKernel / GCDepthConcatenateLayer
+    - GCDirectConvolutionLayerKernel / GCDirectConvolutionLayer
+    - GCDropoutLayerKernel / GCDropoutLayer
+    - GCFillBorderKernel / GCFillBorder
+    - GCGEMMInterleave4x4Kernel / GCGEMMInterleave4x4
+    - GCGEMMMatrixAccumulateBiasesKernel / GCGEMMMatrixAdditionKernel / GCGEMMMatrixMultiplyKernel / GCGEMM
+    - GCGEMMTranspose1xWKernel / GCGEMMTranspose1xW
+    - GCIm2ColKernel
+    - GCNormalizationLayerKernel / GCNormalizationLayer
+    - GCPixelWiseMultiplicationKernel / GCPixelWiseMultiplication
+    - GCPoolingLayerKernel / GCPoolingLayer
+    - GCLogits1DMaxKernel / GCLogits1DShiftExpSumKernel / GCLogits1DNormKernel / GCSoftmaxLayer
+    - GCTransposeKernel / GCTranspose
+
+ - New Arm® Neon™ kernels / functions
+    - arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore
+    - arm_compute::NEHGEMMAArch64FP16Kernel
+    - NEDepthwiseConvolutionLayer3x3Kernel / NEDepthwiseIm2ColKernel / NEGEMMMatrixVectorMultiplyKernel / NEDepthwiseVectorToTensorKernel / @ref NEDepthwiseConvolutionLayer
+    - @ref NEGEMMLowpOffsetContributionKernel / @ref NEGEMMLowpMatrixAReductionKernel / @ref NEGEMMLowpMatrixBReductionKernel / @ref NEGEMMLowpMatrixMultiplyCore
+    - @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
+    - NEWinogradLayer / NEWinogradLayerKernel
+
+ - New OpenCL kernels / functions
+    - @ref CLGEMMLowpOffsetContributionKernel / @ref CLGEMMLowpMatrixAReductionKernel / @ref CLGEMMLowpMatrixBReductionKernel / @ref CLGEMMLowpMatrixMultiplyCore
+    - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
+
+ - New graph nodes for Arm® Neon™ and OpenCL
+    - graph::BranchLayer
+    - graph::DepthConvertLayer
+    - graph::DepthwiseConvolutionLayer
+    - graph::DequantizationLayer
+    - graph::FlattenLayer
+    - graph::QuantizationLayer
+    - graph::ReshapeLayer
+
+v17.10 Public maintenance release
+ - Bug fixes:
+    - Check the maximum local workgroup size supported by OpenCL devices
+    - Minor documentation updates (Fixed instructions to build the examples)
+    - Introduced a graph::GraphContext
+    - Added a few new Graph nodes, support for branches and grouping.
+    - Automatically enable cl_printf in debug builds
+    - Fixed bare metal builds for armv7a
+    - Added AlexNet and cartoon effect examples
+    - Fixed library builds: libraries are no longer built as supersets of each other. (This means applications using the Runtime part of the library now need to link against both libarm_compute_core and libarm_compute.)
+
+v17.09 Public major release
+ - Experimental Graph support: initial implementation of a simple stream API to easily chain machine learning layers.
+ - Memory Manager (@ref BlobLifetimeManager, @ref BlobMemoryPool, @ref ILifetimeManager, @ref IMemoryGroup, @ref IMemoryManager, @ref IMemoryPool, @ref IPoolManager, @ref MemoryManagerOnDemand, @ref PoolManager)
+ - New validation and benchmark frameworks (Boost and Google frameworks replaced by homemade framework).
+ - Most machine learning functions support both fixed point 8 and 16 bit (QS8, QS16) for both Arm® Neon™ and OpenCL.
+ - New Arm® Neon™ kernels / functions:
+    - arm_compute::NEGEMMAssemblyBaseKernel arm_compute::NEGEMMAArch64Kernel
+    - NEDequantizationLayerKernel / @ref NEDequantizationLayer
+    - NEFloorKernel / @ref NEFloor
+    - @ref NEL2NormalizeLayerKernel / @ref NEL2NormalizeLayer
+    - NEQuantizationLayerKernel @ref NEMinMaxLayerKernel / @ref NEQuantizationLayer
+    - @ref NEROIPoolingLayerKernel / @ref NEROIPoolingLayer
+    - @ref NEReductionOperationKernel / @ref NEReductionOperation
+    - NEReshapeLayerKernel / @ref NEReshapeLayer
+
+ - New OpenCL kernels / functions:
+    - @ref CLDepthwiseConvolutionLayer3x3NCHWKernel @ref CLDepthwiseConvolutionLayer3x3NHWCKernel CLDepthwiseIm2ColKernel CLDepthwiseVectorToTensorKernel CLDepthwiseWeightsReshapeKernel / CLDepthwiseConvolutionLayer3x3 @ref CLDepthwiseConvolutionLayer CLDepthwiseSeparableConvolutionLayer
+    - CLDequantizationLayerKernel / CLDequantizationLayer
+    - CLDirectConvolutionLayerKernel / @ref CLDirectConvolutionLayer
+    - CLFlattenLayer
+    - CLFloorKernel / @ref CLFloor
+    - CLGEMMTranspose1xW
+    - CLGEMMMatrixVectorMultiplyKernel
+    - @ref CLL2NormalizeLayerKernel / @ref CLL2NormalizeLayer
+    - CLQuantizationLayerKernel @ref CLMinMaxLayerKernel / @ref CLQuantizationLayer
+    - @ref CLROIPoolingLayerKernel / @ref CLROIPoolingLayer
+    - @ref CLReductionOperationKernel / @ref CLReductionOperation
+    - CLReshapeLayerKernel / @ref CLReshapeLayer
+
+v17.06 Public major release
+ - Various bug fixes
+ - Added support for fixed point 8 bit (QS8) to the various Arm® Neon™ machine learning kernels.
+ - Added unit tests and benchmarks (AlexNet, LeNet)
+ - Added support for sub tensors.
+ - Added infrastructure to provide GPU specific optimisation for some OpenCL kernels.
+ - Added @ref OMPScheduler (OpenMP) scheduler for Neon
+ - Added @ref SingleThreadScheduler scheduler for Arm® Neon™ (For bare metal)
+ - Users can specify their own scheduler by implementing the @ref IScheduler interface.
+ - New OpenCL kernels / functions:
+    - @ref CLBatchNormalizationLayerKernel / @ref CLBatchNormalizationLayer
+    - CLDepthConcatenateLayerKernel / CLDepthConcatenateLayer
+    - CLHOGOrientationBinningKernel CLHOGBlockNormalizationKernel, CLHOGDetectorKernel / CLHOGDescriptor CLHOGDetector CLHOGGradient CLHOGMultiDetection
+    - CLLocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedLayer
+    - @ref CLWeightsReshapeKernel / @ref CLConvolutionLayerReshapeWeights
+ - New C++ kernels:
+    - CPPDetectionWindowNonMaximaSuppressionKernel
+ - New Arm® Neon™ kernels / functions:
+    - @ref NEBatchNormalizationLayerKernel / @ref NEBatchNormalizationLayer
+    - NEDepthConcatenateLayerKernel / NEDepthConcatenateLayer
+    - NEDirectConvolutionLayerKernel / @ref NEDirectConvolutionLayer
+    - NELocallyConnectedMatrixMultiplyKernel / NELocallyConnectedLayer
+    - @ref NEWeightsReshapeKernel / @ref NEConvolutionLayerReshapeWeights
+
+v17.05 Public bug fixes release
+ - Various bug fixes
+ - Remaining functions ported to use accurate padding.
+ - Library does not link against OpenCL anymore (It uses dlopen / dlsym at runtime instead to determine whether or not OpenCL is available).
+ - Added "free" method to allocator.
+ - Minimum version of g++ required for armv7 Linux changed from 4.8 to 4.9
+
+v17.04 Public bug fixes release
+
+ The following functions have been ported to use the new accurate padding:
+ -  CLColorConvertKernel
+ -  CLEdgeNonMaxSuppressionKernel
+ -  CLEdgeTraceKernel
+ -  CLGaussianPyramidHorKernel
+ -  CLGaussianPyramidVertKernel
+ -  CLGradientKernel
+ -  NEChannelCombineKernel
+ -  NEFillArrayKernel
+ -  NEGaussianPyramidHorKernel
+ -  NEGaussianPyramidVertKernel
+ -  NEHarrisScoreFP16Kernel
+ -  NEHarrisScoreKernel
+ -  NEHOGDetectorKernel
+ -  NELogits1DMaxKernel
+ -  NELogits1DShiftExpSumKernel
+ -  NELogits1DNormKernel
+ -  NENonMaximaSuppression3x3FP16Kernel
+ -  NENonMaximaSuppression3x3Kernel
+
+v17.03.1 First Major public release of the sources
+ - Renamed the library to arm_compute
+ - New CPP target introduced for C++ kernels shared between Arm® Neon™ and CL functions.
+ - New padding calculation interface introduced and ported most kernels / functions to use it.
+ - New OpenCL kernels / functions:
+   - CLGEMMLowpMatrixMultiplyKernel / CLGEMMLowp
+ - New Arm® Neon™ kernels / functions:
+   - @ref NENormalizationLayerKernel / @ref NENormalizationLayer
+   - NETransposeKernel / @ref NETranspose
+   - NELogits1DMaxKernel, NELogits1DShiftExpSumKernel, NELogits1DNormKernel / @ref NESoftmaxLayer
+   - @ref NEIm2ColKernel, @ref NECol2ImKernel, NEConvolutionLayerWeightsReshapeKernel / @ref NEConvolutionLayer
+   - NEGEMMMatrixAccumulateBiasesKernel / @ref NEFullyConnectedLayer
+   - @ref NEGEMMLowpMatrixMultiplyKernel / NEGEMMLowp
+
+v17.03 Sources preview
+ - New OpenCL kernels / functions:
+   - CLGradientKernel, CLEdgeNonMaxSuppressionKernel, CLEdgeTraceKernel / CLCannyEdge
+   - GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, @ref CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / @ref CLGEMM
+   - CLGEMMMatrixAccumulateBiasesKernel / @ref CLFullyConnectedLayer
+   - CLTransposeKernel / @ref CLTranspose
+   - CLLKTrackerInitKernel, CLLKTrackerStage0Kernel, CLLKTrackerStage1Kernel, CLLKTrackerFinalizeKernel / CLOpticalFlow
+   - @ref CLNormalizationLayerKernel / @ref CLNormalizationLayer
+   - CLLaplacianPyramid, CLLaplacianReconstruct
+ - New Arm® Neon™ kernels / functions:
+   - NEActivationLayerKernel / @ref NEActivationLayer
+   - GEMM refactoring + FP16 support (Requires armv8.2 CPU): @ref NEGEMMInterleave4x4Kernel, @ref NEGEMMTranspose1xWKernel, @ref NEGEMMMatrixMultiplyKernel, @ref NEGEMMMatrixAdditionKernel / @ref NEGEMM
+   - NEPoolingLayerKernel / @ref NEPoolingLayer
+
+v17.02.1 Sources preview
+ - New OpenCL kernels / functions:
+   - CLLogits1DMaxKernel, CLLogits1DShiftExpSumKernel, CLLogits1DNormKernel / @ref CLSoftmaxLayer
+   - CLPoolingLayerKernel / @ref CLPoolingLayer
+   - @ref CLIm2ColKernel, @ref CLCol2ImKernel, CLConvolutionLayerWeightsReshapeKernel / CLConvolutionLayer
+   - @ref CLRemapKernel / @ref CLRemap
+   - CLGaussianPyramidHorKernel, CLGaussianPyramidVertKernel / CLGaussianPyramid, CLGaussianPyramidHalf, CLGaussianPyramidOrb
+   - CLMinMaxKernel, CLMinMaxLocationKernel / CLMinMaxLocation
+   - CLNonLinearFilterKernel / CLNonLinearFilter
+ - New Arm® Neon™ FP16 kernels (Requires armv8.2 CPU)
+   - NEAccumulateWeightedFP16Kernel
+   - NEBox3x3FP16Kernel
+   - NENonMaximaSuppression3x3FP16Kernel
+
+v17.02 Sources preview
+ - New OpenCL kernels / functions:
+   - CLActivationLayerKernel / @ref CLActivationLayer
+   - CLChannelCombineKernel / CLChannelCombine
+   - CLDerivativeKernel / CLChannelExtract
+   - CLFastCornersKernel / CLFastCorners
+   - CLMeanStdDevKernel / CLMeanStdDev
+ - New Arm® Neon™ kernels / functions:
+   - HOG / SVM: NEHOGOrientationBinningKernel, NEHOGBlockNormalizationKernel, NEHOGDetectorKernel, NEHOGNonMaximaSuppressionKernel / NEHOGDescriptor, NEHOGDetector, NEHOGGradient, NEHOGMultiDetection
+   - NENonLinearFilterKernel / NENonLinearFilter
+ - Introduced a CLScheduler to manage the default context and command queue used by the runtime library and create synchronisation events.
+ - Switched all the kernels / functions to use tensors instead of images.
+ - Updated documentation to include instructions to build the library from sources.
+
+v16.12 Binary preview release
+ - Original release
+
+ */
+} // namespace arm_compute
\ No newline at end of file
diff --git a/docs/user_guide/tests.dox b/docs/user_guide/tests.dox
new file mode 100644
index 0000000000..0d166b9693
--- /dev/null
+++ b/docs/user_guide/tests.dox
@@ -0,0 +1,385 @@
+///
+/// Copyright (c) 2017-2020 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+namespace test
+{
+/**
+@page tests Validation and Benchmarks
+
+@tableofcontents
+
+@section tests_overview Overview
+
+Benchmark and validation tests are based on the same framework to set up and run
+the tests. In addition to running simple, self-contained test functions, the
+framework supports fixtures and data test cases. The former allow common setup
+routines to be shared between various backends, thus reducing the amount of
+duplicated code. The latter can be used to parameterize tests or fixtures with
+different inputs, e.g. different tensor shapes. One limitation is that
+tests/fixtures cannot be parameterized based on the data type if static type
+information is needed within the test (e.g. to validate the results).
+
+@note By default tests are not built. To enable them you need to add validation_tests=1 and / or benchmark_tests=1 to your SCons line.
+
+@note Tests are not included in the pre-built binary archive; you have to build them from sources.
+
+@subsection tests_overview_fixtures Fixtures
+
+Fixtures can be used to share common setup, teardown or even run tasks among
+multiple test cases. For that purpose a fixture can define a `setup`,
+`teardown` and `run` method. Additionally the constructor and destructor might
+also be customized.
+
+An instance of the fixture is created immediately before the actual test is
+executed. After construction the @ref framework::Fixture::setup method is called. Then the test
+function or the fixture's `run` method is invoked. After test execution the
+@ref framework::Fixture::teardown method is called and lastly the fixture is destructed.
+
+@subsubsection tests_overview_fixtures_fixture Fixture
+
+Fixtures for non-parameterized tests are straightforward. The custom fixture
+class has to inherit from @ref framework::Fixture and choose to implement any of the
+`setup`, `teardown` or `run` methods. None of the methods takes any arguments
+or returns anything.
+
+    class CustomFixture : public framework::Fixture
+    {
+        void setup()
+        {
+            _ptr = malloc(4000);
+        }
+
+        void run()
+        {
+            ARM_COMPUTE_ASSERT(_ptr != nullptr);
+        }
+
+        void teardown()
+        {
+            free(_ptr);
+        }
+
+        void *_ptr;
+    };
+
+@subsubsection tests_overview_fixtures_data_fixture Data fixture
+
+The advantage of a parameterized fixture is that arguments can be passed to the setup method at runtime. To make this possible the setup method has to be a template with a type parameter for every argument (though the template parameter doesn't have to be used). All other methods remain the same.
+
+    class CustomFixture : public framework::Fixture
+    {
+    #ifdef ALTERNATIVE_DECLARATION
+        template <typename ...>
+        void setup(size_t size)
+        {
+            _ptr = malloc(size);
+        }
+    #else
+        template <typename T>
+        void setup(T size)
+        {
+            _ptr = malloc(size);
+        }
+    #endif
+
+        void run()
+        {
+            ARM_COMPUTE_ASSERT(_ptr != nullptr);
+        }
+
+        void teardown()
+        {
+            free(_ptr);
+        }
+
+        void *_ptr;
+    };
+
+@subsection tests_overview_test_cases Test cases
+
+All following commands can be optionally prefixed with `EXPECTED_FAILURE_` or
+`DISABLED_`.
+
+@subsubsection tests_overview_test_cases_test_case Test case
+
+A simple test case function taking no inputs and having no (shared) state.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the dataset mode in which the test will be active.
+
+
+    TEST_CASE(TestCaseName, DatasetMode::PRECOMMIT)
+    {
+        ARM_COMPUTE_ASSERT_EQUAL(1 + 1, 2);
+    }
+
+@subsubsection tests_overview_test_cases_fixture_fixture_test_case Fixture test case
+
+A simple test case function taking no inputs that inherits from a fixture. The
+test case will have access to all public and protected members of the fixture.
+Only the setup and teardown methods of the fixture will be used. The body of
+this function will be used as the test function.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the class name of the fixture.
+- Third argument is the dataset mode in which the test will be active.
+
+
+    class FixtureName : public framework::Fixture
+    {
+        public:
+            void setup() override
+            {
+                _one = 1;
+            }
+
+        protected:
+            int _one;
+    };
+
+    FIXTURE_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT)
+    {
+        ARM_COMPUTE_ASSERT_EQUAL(_one + 1, 2);
+    }
+
+@subsubsection tests_overview_test_cases_fixture_register_fixture_test_case Registering a fixture as test case
+
+Allows a fixture to be used directly as a test case. Instead of defining a new test
+function, the run method of the fixture will be executed.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the class name of the fixture.
+- Third argument is the dataset mode in which the test will be active.
+
+
+    class FixtureName : public framework::Fixture
+    {
+        public:
+            void setup() override
+            {
+                _one = 1;
+            }
+
+            void run() override
+            {
+                ARM_COMPUTE_ASSERT_EQUAL(_one + 1, 2);
+            }
+
+        protected:
+            int _one;
+    };
+
+    REGISTER_FIXTURE_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT);
+
+
+@subsubsection tests_overview_test_cases_data_test_case Data test case
+
+A parameterized test case function that has no (shared) state. The dataset will
+be used to generate versions of the test case with different inputs.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the dataset mode in which the test will be active.
+- Third argument is the dataset.
+- Further arguments specify names of the arguments to the test function. The
+  number must match the arity of the dataset.
+
+
+    DATA_TEST_CASE(TestCaseName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}), num)
+    {
+        ARM_COMPUTE_ASSERT(num < 4);
+    }
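+
+Conceptually, the dataset expands into one test invocation per value. A minimal sketch of that expansion, assuming hypothetical `NamedValues`, `make_named` and `run_generated_cases` helpers that only imitate the shape of `framework::make` and are not the framework's implementation:
+
```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a named dataset; not the framework's own type.
template <typename T>
struct NamedValues
{
    std::string    name;
    std::vector<T> values;
};

// Hypothetical helper that loosely mirrors the call shape make("Numbers", {1, 2, 3}).
template <typename T>
NamedValues<T> make_named(std::string name, std::initializer_list<T> values)
{
    return NamedValues<T>{ std::move(name), std::vector<T>(values) };
}

// Calls `body` once per dataset value and returns how many generated cases passed.
template <typename T, typename Body>
std::size_t run_generated_cases(const NamedValues<T> &dataset, Body body)
{
    std::size_t passed = 0;
    for(const T &value : dataset.values)
    {
        if(body(value))
        {
            ++passed;
        }
    }
    return passed;
}
```
+
+In this sketch the lambda passed as `body` plays the role of the test function body, and its single parameter corresponds to the `num` argument named in the macro.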
+
+@subsubsection tests_overview_test_cases_fixture_data_test_case Fixture data test case
+
+A parameterized test case that inherits from a fixture. The test case will have
+access to all public and protected members of the fixture. Only the setup and
+teardown methods of the fixture will be used. The setup method of the fixture
+needs to be a template and has to accept inputs from the dataset as arguments.
+The body of this function will be used as test function. The dataset will be
+used to generate versions of the test case with different inputs.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the class name of the fixture.
+- Third argument is the dataset mode in which the test will be active.
+- Fourth argument is the dataset.
+
+
+    class FixtureName : public framework::Fixture
+    {
+        public:
+            template <typename T>
+            void setup(T num)
+            {
+                _num = num;
+            }
+
+        protected:
+            int _num;
+    };
+
+    FIXTURE_DATA_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}))
+    {
+        ARM_COMPUTE_ASSERT(_num < 4);
+    }
+
+@subsubsection tests_overview_test_cases_register_fixture_data_test_case Registering a fixture as data test case
+
+Allows a fixture to be used directly as a parameterized test case. Instead of
+defining a new test function, the run method of the fixture will be executed.
+The setup method of the fixture needs to be a template and has to accept inputs
+from the dataset as arguments. The dataset will be used to generate versions of
+the test case with different inputs.
+
+- First argument is the name of the test case (has to be unique within the
+  enclosing test suite).
+- Second argument is the class name of the fixture.
+- Third argument is the dataset mode in which the test will be active.
+- Fourth argument is the dataset.
+
+
+    class FixtureName : public framework::Fixture
+    {
+        public:
+            template <typename T>
+            void setup(T num)
+            {
+                _num = num;
+            }
+
+            void run() override
+            {
+                ARM_COMPUTE_ASSERT(_num < 4);
+            }
+
+        protected:
+            int _num;
+    };
+
+    REGISTER_FIXTURE_DATA_TEST_CASE(TestCaseName, FixtureName, DatasetMode::PRECOMMIT, framework::make("Numbers", {1, 2, 3}));
+
+@section writing_tests Writing validation tests
+
+Before starting a new test case, have a look at the existing ones. They should
+provide a good overview of how test cases are structured.
+
+- The C++ reference needs to be added to `tests/validation/CPP/`. The
+  reference function is typically a template parameterized by the underlying
+  value type of the `SimpleTensor`. This makes it easy to specialise for
+  different data types.
+- If all backends have a common interface it makes sense to share the setup
+  code. This can be done by adding a fixture in
+  `tests/validation/fixtures/`. Inside of the `setup` method of a fixture
+  the tensors can be created and initialised and the function can be configured
+  and run. The actual test will only have to validate the results. To be shared
+  among multiple backends the fixture class is usually a template that accepts
+  the specific types (data, tensor class, function class etc.) as parameters.
+- The actual test cases need to be added for each backend individually.
+  Typically there will be multiple tests for different data types and for
+  different execution modes, e.g. precommit and nightly.
+
+@section tests_running_tests Running tests
+@subsection tests_running_tests_benchmark_and_validation Benchmarking and validation suites
+@subsubsection tests_running_tests_benchmarking_filter Filter tests
+All tests can be run by invoking
+
+    ./arm_compute_benchmark ./data
+
+where `./data` contains the assets needed by the tests.
+
+If only a subset of the tests has to be executed the `--filter` option takes a
+regular expression to select matching tests.
+
+    ./arm_compute_benchmark --filter='^NEON/.*AlexNet' ./data
+
+@note Filtering will be much faster if the regular expression is anchored to the start ("^") or end ("$") of the line.
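+
+To preview which names an expression would select, the matching can be approximated outside the test binary. The sketch below assumes that the filter behaves like a `std::regex_search` over the full test name; the `select_tests` helper and the test names used with it are hypothetical:
+
```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Returns the names matched by the filter expression.
// Assumption: ECMAScript syntax (the std::regex default) and search semantics.
std::vector<std::string> select_tests(const std::vector<std::string> &names, const std::string &filter)
{
    const std::regex         pattern(filter);
    std::vector<std::string> selected;
    for(const auto &name : names)
    {
        if(std::regex_search(name, pattern))
        {
            selected.push_back(name);
        }
    }
    return selected;
}
```
+
+Under that assumption, `select_tests({"NEON/AlexNet/RunSmall", "CL/AlexNet/RunSmall"}, "^NEON/.*AlexNet")` returns only the first name, because the `^` anchor rejects every name that does not begin with `NEON/`.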
+
+Additionally, each test has a test id which can also be used as a filter.
+However, the test id is not guaranteed to be stable when new tests are added.
+A test keeps the same id only within a specific build.
+
+    ./arm_compute_benchmark --filter-id=10 ./data
+
+All available tests can be displayed with the `--list-tests` switch.
+
+    ./arm_compute_benchmark --list-tests
+
+More options can be found in the `--help` message.
+
+@subsubsection tests_running_tests_benchmarking_runtime Runtime
+By default every test is run once on a single thread. The number of iterations
+can be controlled via the `--iterations` option and the number of threads via
+`--threads`.
+
+@subsubsection tests_running_tests_benchmarking_output Output
+By default the benchmarking results are printed in a human readable format on
+the command line. The colored output can be disabled via `--no-color-output`.
+As an alternative output format JSON is supported and can be selected via
+`--log-format=json`. To write the output to a file instead of stdout the
+`--log-file` option can be used.
+
+@subsubsection tests_running_tests_benchmarking_mode Mode
+Tests contain different datasets of different sizes, some of which will take several hours to run.
+You can select which datasets to use with the `--mode` option. We recommend starting with `--mode=precommit`.
+
+@subsubsection tests_running_tests_benchmarking_instruments Instruments
+You can use the `--instruments` option to select one or more instruments to measure the execution time of the benchmark tests.
+
+`PMU` will try to read the CPU PMU events from the kernel. (They need to be enabled on your platform.)
+
+`MALI` will try to collect Arm® Mali™ hardware performance counters. (You need to have a recent enough Arm® Mali™ driver.)
+
+`WALL_CLOCK_TIMER` will measure time using `gettimeofday`: this should work on all platforms.
+
+You can pass a combination of these instruments: `--instruments=PMU,MALI,WALL_CLOCK_TIMER`
+
+@note You need to make sure the instruments have been selected at compile time using the `pmu=1` or `mali=1` scons options.
+
+@subsubsection tests_running_examples Examples
+
+To run all the precommit validation tests:
+
+	LD_LIBRARY_PATH=. ./arm_compute_validation --mode=precommit
+
+To run the OpenCL precommit validation tests:
+
+	LD_LIBRARY_PATH=. ./arm_compute_validation --mode=precommit --filter="^CL.*"
+
+To run the Arm® Neon™ precommit benchmark tests with the PMU and wall clock timer (in milliseconds) instruments enabled:
+
+	LD_LIBRARY_PATH=. ./arm_compute_benchmark --mode=precommit --filter="^NEON.*" --instruments="pmu,wall_clock_timer_ms" --iterations=10
+
+To run the OpenCL precommit benchmark tests with OpenCL kernel timers in milliseconds enabled:
+
+	LD_LIBRARY_PATH=. ./arm_compute_benchmark --mode=precommit --filter="^CL.*" --instruments="opencl_timer_ms" --iterations=10
+
+@note You might also need to add the path to the OpenCL library to your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.
+*/
+} // namespace test
+} // namespace arm_compute
-- 
cgit v1.2.1