path: root/docs
Diffstat (limited to 'docs')
-rw-r--r--  docs/00_introduction.dox  1836
-rw-r--r--  docs/03_scripts.dox  178
-rw-r--r--  docs/06_functions_list.dox  384
-rw-r--r--  docs/ComputeLibrary.dir  101
-rw-r--r--  docs/Doxyfile  56
-rw-r--r--  docs/DoxygenLayout.xml  211
-rw-r--r--  docs/contributor_guide/adding_operator.dox (renamed from docs/04_adding_operator.dox)  50
-rw-r--r--  docs/contributor_guide/contribution_guidelines.dox (renamed from docs/05_contribution_guidelines.dox)  99
-rw-r--r--  docs/contributor_guide/implementation_topics.dox  189
-rw-r--r--  docs/contributor_guide/non_inclusive_language_examples.dox  4
-rw-r--r--  docs/user_guide/advanced.dox  139
-rw-r--r--  docs/user_guide/conv2d_heuristic.dox  89
-rw-r--r--  docs/user_guide/data_layout.dox  64
-rw-r--r--  docs/user_guide/data_type.dox (renamed from docs/07_errata.dox)  32
-rw-r--r--  docs/user_guide/errata.dox  136
-rw-r--r--  docs/user_guide/how_to_build_and_run_examples.dox  569
-rw-r--r--  docs/user_guide/introduction.dox  120
-rw-r--r--  docs/user_guide/library.dox (renamed from docs/01_library.dox)  522
-rw-r--r--  docs/user_guide/operator_list.dox  3258
-rw-r--r--  docs/user_guide/release_version_and_change_log.dox  1723
-rw-r--r--  docs/user_guide/tests.dox (renamed from docs/02_tests.dox)  8
21 files changed, 6981 insertions, 2787 deletions
diff --git a/docs/00_introduction.dox b/docs/00_introduction.dox
deleted file mode 100644
index 8eb0762f9f..0000000000
--- a/docs/00_introduction.dox
+++ /dev/null
@@ -1,1836 +0,0 @@
-///
-/// Copyright (c) 2017-2020 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-/** @mainpage Introduction
-
-@tableofcontents
-
-The Computer Vision and Machine Learning library is a set of functions optimised for both ARM CPUs and GPUs using SIMD technologies.
-
-Several builds of the library are available using various configurations:
- - OS: Linux, Android or bare metal.
- - Architecture: armv7a (32bit) or arm64-v8a (64bit)
- - Technology: NEON / OpenCL / GLES_COMPUTE / NEON and OpenCL and GLES_COMPUTE
- - Debug / Asserts / Release: Use a build with asserts enabled to debug your application and enable extra validation. Once you are sure your application works as expected you can switch to a release build of the library for maximum performance.
-
-@section S0_1_contact Contact / Support
-
-Please email developer@arm.com
-
-To help the support team, please provide the build information of the library you are using. To get the version of the library, simply run:
-
- $ strings android-armv7a-cl-asserts/libarm_compute.so | grep arm_compute_version
- arm_compute_version=v16.12 Build options: {'embed_kernels': '1', 'opencl': '1', 'arch': 'armv7a', 'neon': '0', 'asserts': '1', 'debug': '0', 'os': 'android', 'Werror': '1'} Git hash=f51a545d4ea12a9059fe4e598a092f1fd06dc858
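
A minimal Python sketch (not part of the library) of splitting the version line shown above into its fields, which can be handy when filing a support request; the build-options dictionary here is abbreviated for illustration:

```python
import re

# Example output of `strings libarm_compute.so | grep arm_compute_version`,
# based on the line above; the build options shown are a shortened subset.
line = ("arm_compute_version=v16.12 Build options: "
        "{'embed_kernels': '1', 'opencl': '1'} "
        "Git hash=f51a545d4ea12a9059fe4e598a092f1fd06dc858")

m = re.match(r"arm_compute_version=(\S+) Build options: (\{.*\}) Git hash=(\w+)", line)
version, build_options, git_hash = m.groups()
print(version)   # v16.12
print(git_hash)  # f51a545d4ea12a9059fe4e598a092f1fd06dc858
```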
-
-@section S0_2_prebuilt_binaries Pre-built binaries
-
-For each release we provide pre-built binaries of the library [here](https://github.com/ARM-software/ComputeLibrary/releases).
-
-These binaries have been built using the following toolchains:
- - Linux armv7a: gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
- - Linux arm64-v8a: gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu
- - Android armv7a: clang++ / libc++ NDK r18b
- Android arm64-v8a: clang++ / libc++ NDK r18b
-
-@warning Make sure to use a compatible toolchain to build your application or you will get some std::bad_alloc errors at runtime.
-
-@section S1_file_organisation File organisation
-
-This archive contains:
- - The arm_compute header and source files
- - The latest Khronos OpenCL 1.2 C headers from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a>
- - The latest Khronos cl2.hpp from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a> (API version 2.1 when this document was written)
- - The latest Khronos OpenGL ES 3.1 C headers from the <a href="https://www.khronos.org/registry/gles/">Khronos OpenGL ES registry</a>
- - The latest Khronos EGL 1.5 C headers from the <a href="https://www.khronos.org/registry/gles/">Khronos EGL registry</a>
- - The sources for a stub version of libOpenCL.so, libGLESv1_CM.so, libGLESv2.so and libEGL.so to help you build your application.
- - An examples folder containing a few examples to compile and link against the library.
- A @ref utils folder containing headers with some boilerplate code used by the examples.
- - This documentation.
-
 For detailed information about the file organisation, please refer to the Files -> File List section of this documentation.
-
-@section S2_versions_changelog Release versions and changelog
-
-@subsection S2_1_versions Release versions
-
-All releases are numbered vYY.MM, where YY is the last two digits of the year and MM is the month number.
-If there is more than one release in a month, an extra sequential number is appended:
-
- v17.03 (First release of March 2017)
- v17.03.1 (Second release of March 2017)
- v17.04 (First release of April 2017)
-
-@note We aim to publish one major public release with new features per quarter. All releases in between will only contain bug fixes.
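
The vYY.MM[.N] scheme above sorts chronologically once parsed; a minimal sketch (the helper name is illustrative, not part of the library) ordering the example tags:

```python
# Parse a vYY.MM or vYY.MM.N release tag into a sortable tuple.
def parse_version(tag):
    parts = tag.lstrip("v").split(".")
    year, month = int(parts[0]), int(parts[1])
    seq = int(parts[2]) if len(parts) > 2 else 0  # extra same-month release number
    return (year, month, seq)

tags = ["v17.04", "v17.03.1", "v17.03"]
print(sorted(tags, key=parse_version))  # ['v17.03', 'v17.03.1', 'v17.04']
```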
-
-@subsection S2_2_changelog Changelog
-
-v20.11 Public major release
- - Various bug fixes.
- - Various optimisations.
- A performance regression may be observed when executing Depthwise Convolution on Neon with a depth multiplier > 1 for quantized data types.
  This is planned to be resolved in the 21.02 release.
- - Added new data type QASYMM8_SIGNED support for @ref NEROIAlignLayer.
- - Added new data type S32 support for:
- - @ref NEArithmeticSubtraction
- - @ref NEArithmeticSubtractionKernel
- - @ref NEPixelWiseMultiplication
- - @ref NEPixelWiseMultiplicationKernel
- - @ref NEElementwiseDivision
- - @ref NEDivisionOperationKernel
- - Interface change
- - Properly support softmax axis to have the same meaning as other major frameworks. That is, axis now defines the dimension
- on which Softmax/Logsoftmax is performed. E.g. for input of shape 4x5x6 and axis=1, softmax will be applied to 4x6=24 vectors of size 5.
- The supported value range of axis is [-rank, rank).
- This change applies to the following functions:
- - @ref NESoftmaxLayer
- - @ref NELogSoftmaxLayer
- - @ref CLSoftmaxLayer
- - @ref CLLogSoftmaxLayer
- - @ref GCSoftmaxLayer
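
The axis semantics described above can be sketched as follows; this is a minimal NumPy illustration of the convention, not the library implementation:

```python
import numpy as np

# For an input of shape (4, 5, 6) and axis=1, softmax is applied
# independently to 4*6 = 24 vectors of size 5. Negative axes in
# [-rank, rank) count from the end, as in other major frameworks.
def softmax(x, axis):
    axis = axis % x.ndim                              # map negative axis into [0, rank)
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stabilised exponentials
    return e / e.sum(axis=axis, keepdims=True)

x = np.arange(4 * 5 * 6, dtype=np.float64).reshape(4, 5, 6)
y = softmax(x, axis=1)
print(y.shape)                              # (4, 5, 6)
print(np.allclose(y.sum(axis=1), 1.0))      # True: each size-5 vector sums to 1
print(np.allclose(softmax(x, axis=-2), y))  # True: axis -2 == axis 1 for rank 3
```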
- - New OpenCL kernels / functions:
- - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
- - @ref CLLogicalNot
- - @ref CLLogicalAnd
- - @ref CLLogicalOr
- - New NEON kernels / functions:
- - @ref NELogicalNot
- - @ref NELogicalAnd
- - @ref NELogicalOr
- - Removed padding from NEON kernels:
- - @ref NEComplexPixelWiseMultiplicationKernel
- - @ref NENonMaximaSuppression3x3Kernel
- - @ref NERemapKernel
- - @ref NEGEMMInterleave4x4Kernel
- - @ref NEDirectConvolutionLayerKernel
- - @ref NEScaleKernel
- - @ref NELocallyConnectedMatrixMultiplyKernel
- - @ref NEGEMMLowpOffsetContributionKernel
- - @ref NEGEMMTranspose1xWKernel
- - @ref NEPoolingLayerKernel
- - @ref NEConvolutionKernel
- - @ref NEDepthwiseConvolutionLayerNativeKernel
- - @ref NEGEMMLowpMatrixMultiplyKernel
- - @ref NEGEMMMatrixMultiplyKernel
- - @ref NEDirectConvolutionLayerOutputStageKernel
- - @ref NEReductionOperationKernel
- - @ref NEGEMMLowpMatrixAReductionKernel
- - @ref NEGEMMLowpMatrixBReductionKernel
- - Removed padding from OpenCL kernels:
- - @ref CLBatchConcatenateLayerKernel
- - @ref CLElementwiseOperationKernel
- - @ref CLBatchNormalizationLayerKernel
- - @ref CLPoolingLayerKernel
- - @ref CLWinogradInputTransformKernel
- - @ref CLGEMMLowpMatrixMultiplyNativeKernel
- - @ref CLGEMMLowpMatrixAReductionKernel
- - @ref CLGEMMLowpMatrixBReductionKernel
- - @ref CLGEMMLowpOffsetContributionOutputStageKernel
- - @ref CLGEMMLowpOffsetContributionKernel
- - @ref CLWinogradOutputTransformKernel
- - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
- - @ref CLFuseBatchNormalizationKernel
- - @ref CLDepthwiseConvolutionLayerNativeKernel
- - @ref CLDepthConvertLayerKernel
- - @ref CLCopyKernel
- - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
- - @ref CLActivationLayerKernel
- - @ref CLWinogradFilterTransformKernel
- - @ref CLWidthConcatenateLayerKernel
- - @ref CLWidthConcatenate4TensorsKernel
- - @ref CLWidthConcatenate2TensorsKernel
- - @ref CLLogits1DMaxShiftExpSumKernel
- - @ref CLLogits1DNormKernel
- - @ref CLHeightConcatenateLayerKernel
- - @ref CLGEMMMatrixMultiplyKernel
- - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
- - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
- - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
- - @ref CLDepthConcatenateLayerKernel
- - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
- - Removed OpenCL kernels / functions:
- - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
- - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel
- - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel
- - Deprecated OpenCL kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- - CLLocallyConnectedLayer
- - CLLocallyConnectedMatrixMultiplyKernel
- - CLAbsoluteDifference
- - CLAbsoluteDifferenceKernel
- - CLAccumulate
- - CLAccumulateKernel
- - CLAccumulateSquared
- - CLAccumulateSquaredKernel
- - CLAccumulateWeighted
- - CLAccumulateWeightedKernel
- - CLAccumulateWeightedFP16Kernel
- - CLBox3x3
- - CLBox3x3Kernel
- - CLBox3x3FP16Kernel
- - CLCannyEdge
- - CLChannelCombine
- - CLChannelCombineKernel
- - CLChannelExtract
- - CLChannelExtractKernel
- - CLColorConvert
- - CLColorConvertKernel
- - CLConvolution3x3
- - CLConvolutionRectangle
- - CLConvolutionRectangleKernel
- - CLConvolutionSquare
- - CLConvolutionKernel
- - CLDerivative
- - CLDerivativeKernel
- - CLDilate
- - CLDilateKernel
- - CLEqualizeHistogram
- - CLErode
- - CLErodeKernel
- - CLFastCorners
- - CLFastCornersKernel
- - CLGaussian3x3
- - CLGaussian3x3Kernel
- - CLGaussian5x5
- - CLGaussian5x5HorKernel
- - CLGaussian5x5VertKernel
- - CLGaussianPyramid
- - CLGaussianPyramidHalf
- - CLGaussianPyramidOrb
- - CLHarrisCorners
- - CLHarrisScoreKernel
- - CLHarrisScoreFP16Kernel
- - CLHistogram
- - CLHistogramKernel
- - CLHOGOrientationBinningKernel
- - CLHOGBlockNormalizationKernel
- - CLHOGDetectorKernel
- - CLHOGNonMaximaSuppressionKernel
- - CLHOGDescriptor
- - CLHOGDetector
- - CLHOGGradient
- - CLHOGMultiDetection
- - CLHOGOrientationBinningKernel
- - CLHOGBlockNormalizationKernel
- - CLHOGDetectorKernel
- - CLIntegralImage
- - CLIntegralImageKernel
- - CLLaplacianReconstruct
- - CLLaplacianPyramid
- - CLMagnitude
- - CLMagnitudePhaseKernel
- - CLMedian3x3
- - CLMedian3x3Kernel
- - CLMinMaxLocation
- - CLMinMaxLocationKernel
- - CLNonLinearFilter
- - CLNonLinearFilterKernel
- - CLNonMaximaSuppression3x3
- - CLNonMaximaSuppression3x3FP16Kernel
- - CLNonMaximaSuppression3x3Kernel
- - CLOpticalFlow
- - CLPhase
- - CLRemap
- - CLRemapKernel
- - CLScharr3x3
- - CLScharr3x3Kernel
- - CLSobel3x3
- - CLSobel3x3Kernel
- - CLSobel5x5
- - CLSobel5x5HorKernel
- - CLSobel5x5VertKernel
- - CLSobel7x7
- - CLSobel7x7HorKernel
- - CLSobel7x7VertKernel
- - CLThreshold
- - CLThresholdKernel
- - CLWarpAffine
- - CLWarpAffineKernel
- - CLWarpPerspective
- - CLWarpPerspectiveKernel
- - Deprecated NEON kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- - NELocallyConnectedLayer
- - NELocallyConnectedMatrixMultiplyKernel
- - NEAbsoluteDifference
- - NEAbsoluteDifferenceKernel
- - NEAccumulate
- - NEAccumulateKernel
- - NEAccumulateSquared
- - NEAccumulateSquaredKernel
- - NEAccumulateWeighted
- - NEAccumulateWeightedKernel
- - NEAccumulateWeightedFP16Kernel
- - NEBox3x3
- - NEBox3x3Kernel
- - NEBox3x3FP16Kernel
- - NECannyEdge
- - NEChannelCombine
- - NEChannelCombineKernel
- - NEChannelExtract
- - NEChannelExtractKernel
- - NEColorConvert
- - NEColorConvertKernel
- - NEConvolution3x3
- - NEConvolutionRectangle
- - NEConvolutionRectangleKernel
- - NEConvolutionSquare
- - NEConvolutionKernel
- - NEDerivative
- - NEDerivativeKernel
- - NEDilate
- - NEDilateKernel
- - NEEqualizeHistogram
- - NEErode
- - NEErodeKernel
- - NEFastCorners
- - NEFastCornersKernel
- - NEGaussian3x3
- - NEGaussian3x3Kernel
- - NEGaussian5x5
- - NEGaussian5x5HorKernel
- - NEGaussian5x5VertKernel
- - NEGaussianPyramid
- - NEGaussianPyramidHalf
- - NEGaussianPyramidOrb
- - NEHarrisCorners
- - NEHarrisScoreKernel
- - NEHarrisScoreFP16Kernel
- - NEHistogram
- - NEHistogramKernel
- - NEHOGOrientationBinningKernel
- - NEHOGBlockNormalizationKernel
- - NEHOGDetectorKernel
- - NEHOGNonMaximaSuppressionKernel
- - NEHOGDescriptor
- - NEHOGDetector
- - NEHOGGradient
- - NEHOGMultiDetection
- - NEHOGOrientationBinningKernel
- - NEHOGBlockNormalizationKernel
- - NEHOGDetectorKernel
- - NEIntegralImage
- - NEIntegralImageKernel
- - NELaplacianReconstruct
- - NELaplacianPyramid
- - NEMagnitude
- - NEMagnitudePhaseKernel
- - NEMedian3x3
- - NEMedian3x3Kernel
- - NEMinMaxLocation
- - NEMinMaxLocationKernel
- - NENonLinearFilter
- - NENonLinearFilterKernel
- - NENonMaximaSuppression3x3
- - NENonMaximaSuppression3x3FP16Kernel
- - NENonMaximaSuppression3x3Kernel
- - NEOpticalFlow
- - NEPhase
- - NERemap
- - NERemapKernel
- - NEScharr3x3
- - NEScharr3x3Kernel
- - NESobel3x3
- - NESobel3x3Kernel
- - NESobel5x5
- - NESobel5x5HorKernel
- - NESobel5x5VertKernel
- - NESobel7x7
- - NESobel7x7HorKernel
- - NESobel7x7VertKernel
- - NEThreshold
- - NEThresholdKernel
- - NEWarpAffine
- - NEWarpAffineKernel
- - NEWarpPerspective
- - NEWarpPerspectiveKernel
- - Deprecated GLES kernels / functions (If a kernel is used only by the function that is being deprecated, the kernel is deprecated together):
- - GCAbsoluteDifference
- - GCActivationLayer
- - GCArithmeticAddition
- - GCBatchNormalizationLayer
- - GCConcatenateLayer
- - GCConvolutionLayer
- - GCDepthwiseConvolutionLayer
- - GCDirectConvolutionLayer
- - GCDropoutLayer
- - GCFillBorder
- - GCFullyConnectedLayer
- - GCGEMM
- - GCGEMMInterleave4x4
- - GCGEMMTranspose1xW
- - GCNormalizationLayer
- - GCNormalizePlanarYUVLayer
- - GCPixelWiseMultiplication
- - GCPoolingLayer
- - GCScale
- - GCSoftmaxLayer
- - GCTensorShift
- - GCTranspose
-
-
-v20.08 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Added new data type QASYMM8_SIGNED support for:
- - @ref CLArgMinMaxLayer
- - @ref CLArgMinMaxLayerKernel
- - Added new data type U8 support for:
- - @ref NECropKernel
- - @ref CLCropKernel
- Added align_corners support for nearest neighbor interpolation in:
- - @ref NEScaleKernel
- - @ref CLScaleKernel
- - New OpenCL kernels / functions:
- - @ref CLMaxUnpoolingLayerKernel
- - New NEON kernels / functions:
- - @ref NEMaxUnpoolingLayerKernel
- - New graph example:
- - graph_yolov3_output_detector
- - GEMMTuner improvements:
- - Added fp16 support
- - Output json files for easier integration
- - Enabled tuning for export_to_cl_image_rhs option for RHS tensors
- - More robust script for running benchmarks
- - Removed padding from:
- - @ref NEPixelWiseMultiplicationKernel
- - @ref NEHeightConcatenateLayerKernel
- - @ref NEThresholdKernel
- - @ref NEBatchConcatenateLayerKernel
- - @ref NETransposeKernel
- - @ref NEBatchNormalizationLayerKernel
- - @ref NEArithmeticSubtractionKernel
- - @ref NEBoundingBoxTransformKernel
- - @ref NELogits1DMaxKernel
- - @ref NELogits1DSoftmaxKernel
- - @ref NEROIPoolingLayerKernel
- - @ref NEROIAlignLayerKernel
- - @ref NEYOLOLayerKernel
- - @ref NEUpsampleLayerKernel
- - @ref NEFloorKernel
- - @ref NEWidthConcatenateLayerKernel
- - @ref NEDepthConcatenateLayerKernel
- - @ref NENormalizationLayerKernel
- - @ref NEL2NormalizeLayerKernel
- - @ref NEFillArrayKernel
- - @ref NEDepthConvertLayerKernel
- - @ref NERangeKernel
- - @ref NEPriorBoxLayer
- - Removed OpenCL kernels / functions:
- - CLGEMMLowpQuantizeDownInt32ToUint8Scale
- - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
- - Removed NEON kernels / functions:
- - NEGEMMLowpQuantizeDownInt32ToUint8Scale
- - NEGEMMMatrixAccumulateBiasesKernel
- - Deprecated functions / interfaces:
- - Non-descriptor based interfaces for @ref NEThreshold, @ref CLThreshold
- - Non-descriptor based interfaces for @ref NEScale, @ref CLScale and @ref GCScale
- - In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and @ref GCSoftmaxLayer :
- The default "axis" value for @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and @ref GCSoftmaxLayer is changed from 1 to 0.
- Only axis 0 is supported.
- The default "axis" value for @ref NESoftmaxLayer, @ref NELogSoftmaxLayer is changed from 1 to 0.
- Only axis 0 is supported.
- - The support for quantized data types has been removed from @ref CLLogSoftmaxLayer due to implementation complexity.
- - Removed padding requirement for the input (e.g. LHS of GEMM) and output in @ref CLGEMMMatrixMultiplyNativeKernel, @ref CLGEMMMatrixMultiplyReshapedKernel, @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and @ref CLIm2ColKernel (NHWC only)
- This change makes it possible to use @ref CLGEMMConvolutionLayer without extra padding for the input and output.
- Only the weights/bias of @ref CLGEMMConvolutionLayer may require padding for the computation.
- Only on Arm Mali Midgard GPUs may @ref CLGEMMConvolutionLayer require padding, since @ref CLGEMMMatrixMultiplyKernel is called and currently requires padding.
- - Added support for exporting the OpenCL buffer object to the OpenCL image object in @ref CLGEMMMatrixMultiplyReshapedKernel and @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
- This makes it possible to export the OpenCL buffer used for the reshaped RHS matrix to an OpenCL image object.
- The padding requirement for the OpenCL image object is taken into account in @ref CLGEMMReshapeRHSMatrixKernel.
- - The reshaped RHS matrix stores the weights when GEMM is used to accelerate @ref CLGEMMConvolutionLayer.
-
-v20.05 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Updated recommended NDK version to r18b.
- - Updated recommended gcc version to Linaro 6.3.1.
- - Added Bfloat16 type support
- - Added Bfloat16 support in:
- - @ref NEWeightsReshapeKernel
- - @ref NEConvolutionLayerReshapeWeights
- - @ref NEIm2ColKernel
- - @ref NEIm2Col
- - @ref NEDepthConvertLayerKernel
- - @ref NEDepthConvertLayer
- - @ref NEGEMMConvolutionLayer
- - @ref NEGEMMAssemblyDispatch
- - Added new data type QASYMM8_SIGNED support for:
- - @ref CLDirectConvolutionLayer
- - @ref CLDeconvolutionLayer
- - @ref CLDirectDeconvolutionLayer
- - @ref CLGEMMDeconvolutionLayer
- - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
- - @ref CLGEMMLowpQuantizeDownInt32ScaleKernel
- - @ref CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
- - @ref CLReductionOperation
- - @ref CLReduceMean
- - @ref NEScale
- - @ref NEScaleKernel
- - @ref NEUpsampleLayer
- - @ref NECast
- - @ref NEReductionOperation
- - @ref NEReduceMean
- - @ref NEArgMinMaxLayer
- - @ref NEDeconvolutionLayer
- - @ref NEGEMMLowpQuantizeDownInt32ScaleKernel
- - @ref CPPBoxWithNonMaximaSuppressionLimit
- - @ref CPPDetectionPostProcessLayer
- - @ref CPPPermuteKernel
- - @ref CPPPermute
- - @ref CPPTopKVKernel
- - @ref CPPTopKV
- - @ref CPPUpsample
- - @ref CPPUpsampleKernel
- - New OpenCL kernels / functions:
- - @ref CLQLSTMLayer
- - @ref CLQLSTMLayerNormalizationKernel
- - New NEON kernels / functions:
- - @ref NEQLSTMLayer
- - @ref NEQLSTMLayerNormalizationKernel
- - Added HARD_SWISH support in:
- - @ref CLActivationLayerKernel
- - @ref NEActivationLayerKernel
- - Deprecated OpenCL kernels / functions:
- - CLGEMMLowpQuantizeDownInt32ToUint8Scale
- - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
- - Deprecated NEON kernels / functions:
- - NEGEMMLowpQuantizeDownInt32ToUint8Scale
- - Removed CPP kernels / functions:
- - CPPFlipWeightsKernel
- - Removed PoolingLayerInfo constructors without Data Layout.
- - Removed CLDepthwiseConvolutionLayer3x3
- - Removed NEDepthwiseConvolutionLayerOptimized
- - Added support for Winograd 3x3,4x4 on NEON FP16:
- - @ref NEWinogradConvolutionLayer
- - @ref NEWinogradLayerTransformInputKernel
- - @ref NEWinogradLayerTransformOutputKernel
- - @ref NEWinogradLayerTransformWeightsKernel
- - Added CLCompileContext
- - Added NEON GEMM kernel with 2D window support
-
-v20.02.1 Maintenance release
- - Added Android-NN build script.
-
-v20.02 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Added new data type QASYMM8_SIGNED support for:
- - @ref CLDepthwiseConvolutionLayer
- - CLDepthwiseConvolutionLayer3x3
- - @ref CLGEMMConvolutionLayer
- - @ref CLGEMMLowpMatrixMultiplyCore
- - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
- - @ref CLGEMMLowpMatrixMultiplyNativeKernel
- - @ref NEActivationLayer
- - @ref NEComparisonOperationKernel
- - @ref NEConvolutionLayer
- - @ref NEDepthwiseConvolutionLayer
- - NEDepthwiseConvolutionLayer3x3Kernel
- - @ref NEDirectConvolutionLayerOutputStageKernel
- - @ref NEElementwiseComparison
- - @ref NEElementwiseMax
- - @ref NEElementwiseMin
- - @ref NEElementwiseSquaredDiff
- - @ref NEFullyConnectedLayer
- - NEGEMMMatrixVectorMultiplyKernel
- - @ref NEPixelWiseMultiplication
- - @ref NEPoolingLayer
- - @ref NEPReluLayer
- - Added support for QSYMM8_PER_CHANNEL in:
- - NEDepthwiseConvolutionLayer3x3Kernel
- - Added support for split sizes in:
- - @ref CLSplit
- - @ref NESplit
- - New OpenCL kernels / functions:
- - @ref CLFill
- - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
- - New NEON kernels / functions:
- - @ref NEFill
- - @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
- - Deprecated NEON functions / interfaces:
- - CLDepthwiseConvolutionLayer3x3
- - NEDepthwiseConvolutionLayerOptimized
- - PoolingLayerInfo constructors without Data Layout.
- - Added support for quantization with multiplier greater than 1 on NEON and CL.
- - Added support for quantized inputs of type QASYMM8_SIGNED and QASYMM8 to @ref CLQuantizationLayer.
- - Added the ability to build bootcode for bare metal.
- - Added support for generating synthetic QASYMM8 graphs.
- - Added support for F16 datatype in VGG16.
- - Removed pre-built binaries for GLES.
-
-v19.11.1 Public maintenance release
- - Fix offset calculation in NEReductionOperationKernel.
- - Fix data layout in NEScaleKernel for nhwc.
- - Retain configuration step data layout to avoid side-effects.
- - Perform sqrt in double domain for L2 pooling.
- Fix output shape calculation for Reduce Mean.
- - Restrict cases where optimized NEPadLayer runs.
-
-v19.11 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Updated recommended NDK version to r17c.
- - Deprecated OpenCL kernels / functions:
- - CLDepthwiseConvolutionLayerReshapeWeightsGenericKernel
- - CLDepthwiseIm2ColKernel
- - CLDepthwiseSeparableConvolutionLayer
- - CLDepthwiseVectorToTensorKernel
- - CLDirectConvolutionLayerOutputStageKernel
- - Deprecated NEON kernels / functions:
- - NEDepthwiseWeightsReshapeKernel
- - NEDepthwiseIm2ColKernel
- - NEDepthwiseSeparableConvolutionLayer
- - NEDepthwiseVectorToTensorKernel
- - NEDepthwiseConvolutionLayer3x3
- - New OpenCL kernels / functions:
- - @ref CLInstanceNormalizationLayerKernel / @ref CLInstanceNormalizationLayer
- - @ref CLDepthwiseConvolutionLayerNativeKernel to replace the old generic depthwise convolution (see Deprecated
- OpenCL kernels / functions)
- - @ref CLLogSoftmaxLayer
- - New NEON kernels / functions:
- - @ref NEBoundingBoxTransformKernel / @ref NEBoundingBoxTransform
- - @ref NEComputeAllAnchorsKernel / @ref NEComputeAllAnchors
- - @ref NEDetectionPostProcessLayer
- - @ref NEGenerateProposalsLayer
- - @ref NEInstanceNormalizationLayerKernel / @ref NEInstanceNormalizationLayer
- - @ref NELogSoftmaxLayer
- - @ref NEROIAlignLayerKernel / @ref NEROIAlignLayer
- - Added QASYMM8 support for:
- - @ref CLGenerateProposalsLayer
- - @ref CLROIAlignLayer
- - @ref CPPBoxWithNonMaximaSuppressionLimit
- - Added QASYMM16 support for:
- - @ref CLBoundingBoxTransform
- - Added FP16 support for:
- - @ref CLGEMMMatrixMultiplyReshapedKernel
- - Added new data type QASYMM8_PER_CHANNEL support for:
- - @ref CLDequantizationLayer
- - @ref NEDequantizationLayer
- - Added new data type QSYMM8_PER_CHANNEL support for:
- - @ref CLConvolutionLayer
- - @ref NEConvolutionLayer
- - @ref CLDepthwiseConvolutionLayer
- - @ref NEDepthwiseConvolutionLayer
- - Added FP16 mixed-precision support for:
- - @ref CLGEMMMatrixMultiplyReshapedKernel
- - @ref CLPoolingLayerKernel
- - Added FP32 and FP16 ELU activation for:
- - @ref CLActivationLayer
- - @ref NEActivationLayer
- - Added asymmetric padding support for:
- - @ref CLDirectDeconvolutionLayer
- - @ref CLGEMMDeconvolutionLayer
- - @ref NEDeconvolutionLayer
- - Added SYMMETRIC and REFLECT modes for @ref CLPadLayerKernel / @ref CLPadLayer.
- - Replaced the calls to @ref NECopyKernel and @ref NEMemsetKernel with @ref NEPadLayer in @ref NEGenerateProposalsLayer.
- - Replaced the calls to @ref CLCopyKernel and @ref CLMemsetKernel with @ref CLPadLayer in @ref CLGenerateProposalsLayer.
- - Improved performance for CL Inception V3 - FP16.
- - Improved accuracy for CL Inception V3 - FP16 by enabling FP32 accumulator (mixed-precision).
- - Improved NEON performance by enabling fusing batch normalization with convolution and depth-wise convolution layer.
- - Improved NEON performance for MobileNet-SSD by improving the output detection performance.
- - Optimized @ref CLPadLayer.
- - Optimized CL generic depthwise convolution layer by introducing @ref CLDepthwiseConvolutionLayerNativeKernel.
- - Reduced memory consumption by implementing weights sharing.
-
-v19.08.1 Public maintenance release
- - Fix offset calculation in NEReductionOperationKernel.
- - Fix data layout in NEScaleKernel for nhwc.
- - Retain configuration step data layout to avoid side-effects.
- - Perform sqrt in double domain for L2 pooling.
- Fix output shape calculation for Reduce Mean.
- Fix broadcast CLPixelwiseMultiplication with 5D tensors.
-
-v19.08 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Deprecated NEON functions
- - NEDepthConcatenateLayer
- - NEWidthConcatenateLayer
- - Deprecated OpenCL kernels / functions
- - CLDepthConcatenateLayer
- - CLGEMMInterleave4x4Kernel / CLGEMMInterleave4x4
- - CLGEMMTranspose1xWKernel / CLGEMMTranspose1xW
- - CLWidthConcatenateLayer
- - New NEON kernels / functions:
- - @ref NEAbsLayer
- - @ref NECast
- - @ref NEElementwisePower
- - @ref NELogLayer
- - @ref NELSTMLayerQuantized
- - @ref NENegLayer
- - @ref NEPReluLayer
- - @ref NESinLayer
- - @ref NEBatchConcatenateLayerKernel
- - @ref NEDepthToSpaceLayerKernel / @ref NEDepthToSpaceLayer
- - @ref NEDepthwiseConvolutionLayerNativeKernel
- - @ref NEGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
- - @ref NEMeanStdDevNormalizationKernel / @ref NEMeanStdDevNormalizationLayer
- - @ref NESpaceToDepthLayerKernel / @ref NESpaceToDepthLayer
- - New OpenCL kernels / functions:
- - @ref CLAbsLayer
- - @ref CLElementwisePower
- - @ref CLLogLayer
- - @ref CLLSTMLayerQuantized
- - @ref CLNegLayer
- - @ref CLPReluLayer
- - @ref CLSinLayer
- - @ref CLBatchConcatenateLayerKernel
- - @ref CLDepthToSpaceLayerKernel / @ref CLDepthToSpaceLayer
- - @ref CLGEMMLowpMatrixMultiplyNativeKernel
- - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
- - @ref CLGEMMMatrixMultiplyNativeKernel
- - @ref CLMeanStdDevNormalizationKernel / @ref CLMeanStdDevNormalizationLayer
- - @ref CLSpaceToDepthLayerKernel / @ref CLSpaceToDepthLayer
- - New examples:
- - neon_opticalflow
- - cl_cache
- - neon_permute
- - Added support for FP16 in @ref NEDeconvolutionLayer
- - Added support for FP16 in @ref CLDeconvolutionLayer
- - Added support for REDUCE_MIN and REDUCE_MAX in @ref ReductionOperation
- - Enable the fusion of batch normalization with convolution and depthwise convolution layer for FP32 in the graph API (OpenCL only)
- - Added support for fusing activation function and broadcast addition with the matrix multiplication for FP32 (OpenCL only)
- - Re-factored the depthwise convolution layer kernel on NEON for generic cases
- - Added an optimized depthwise convolution layer kernel for 5x5 filters (NEON only)
- - Added support to enable OpenCL kernel cache. Added example showing how to load the prebuilt OpenCL kernels from a binary cache file
- - Altered @ref QuantizationInfo interface to support per-channel quantization.
- The CLDepthwiseConvolutionLayer3x3 will be included by @ref CLDepthwiseConvolutionLayer to accommodate future optimizations.
- The NEDepthwiseConvolutionLayerOptimized will be included by @ref NEDepthwiseConvolutionLayer to accommodate future optimizations.
- - Removed inner_border_right and inner_border_top parameters from @ref CLDeconvolutionLayer interface
- - Removed inner_border_right and inner_border_top parameters from @ref NEDeconvolutionLayer interface
- - Optimized the NEON assembly kernel for GEMMLowp. The new implementation fuses the output stage and quantization with the matrix multiplication kernel
-
-v19.05 Public major release
- - Various bug fixes.
- - Various optimisations.
- - New Neon kernels / functions:
- - @ref NEBatchToSpaceLayerKernel / @ref NEBatchToSpaceLayer
- - @ref NEComplexPixelWiseMultiplicationKernel / @ref NEComplexPixelWiseMultiplication
- - @ref NECropKernel / @ref NECropResize
- - @ref NEDepthwiseConvolutionAssemblyDispatch
- - @ref NEFFTDigitReverseKernel
- - @ref NEFFTRadixStageKernel
- - @ref NEFFTScaleKernel
- - @ref NEGEMMLowpOffsetContributionOutputStageKernel
- - @ref NEHeightConcatenateLayerKernel
- - @ref NESpaceToBatchLayerKernel / @ref NESpaceToBatchLayer
- - @ref NEFFT1D
- - @ref NEFFT2D
- - @ref NEFFTConvolutionLayer
- - New OpenCL kernels / functions:
- - @ref CLComplexPixelWiseMultiplicationKernel / @ref CLComplexPixelWiseMultiplication
- - @ref CLCropKernel / @ref CLCropResize
- - @ref CLDeconvolutionReshapeOutputKernel
- - @ref CLFFTDigitReverseKernel
- - @ref CLFFTRadixStageKernel
- - @ref CLFFTScaleKernel
- - @ref CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
- - @ref CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
- - @ref CLHeightConcatenateLayerKernel
- - @ref CLDirectDeconvolutionLayer
- - @ref CLFFT1D
- - @ref CLFFT2D
- - @ref CLFFTConvolutionLayer
- - @ref CLGEMMDeconvolutionLayer
- - New OpenGLES kernels / functions:
- - @ref GCConcatenateLayer
- - Deprecated functions/interfaces
- - GCDepthConcatenateLayer
- - NEWidthConcatenateLayer
- - NEDepthConcatenateLayer
- - CLWidthConcatenateLayer
- - CLDepthConcatenateLayer
- - CLGEMMInterleave4x4
- - CLGEMMTranspose1xW
- - Support different quantization info in CLConcatLayer.
- Add checks for cases where different input/output quantization info is not supported.
- - Tensors have different quantization information.
- - Add FP16 support checks.
- - Fixed output quantization in CLDepthwiseConvolutionLayer3x3 when activation is fused.
- - New graph examples:
- - graph_convolution
- - graph_fully_connected
- - graph_depthwise_convolution
- - Deepspeech v0.4.1
- - Add support for QASYMM8 in NEArithmeticSubtractionKernel.
- - Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
- - Add support for QASYMM8 in NEDeconvolution.
- - Add support for DequantizationLayer for NEON/CL.
- - Add support for dilation in CLDepthwiseConvolution.
- - Fused the offset contribution with the output stage when using NEGEMMLowpMatrixMultiplyCore.
- - Optimize CLDeconvolution.
- - Add StackLayer to the graph API.
- - Add support for "reflect" padding mode in NEPad.
- - Winograd 7x7 NHWC on OpenCL.
- - Rework CL ML layers to run exclusively on CL.
- - Support different quantization info in PoolingLayer.
- - Implement and test import memory interfaces.
- - Added new tests and removed old ones.
- - Various clang-tidy fixes.
-
-v19.02 Public major release
- - Various bug fixes.
- - Various optimisations.
- - New Neon kernels / functions:
- - @ref NETileKernel / @ref NETile
- - @ref NEFuseBatchNormalizationKernel / @ref NEFuseBatchNormalization
- - @ref NEElementwiseOperationKernel
- - @ref NEElementwiseMax
- - @ref NEElementwiseMin
- - @ref NEElementwiseSquaredDiff
- - @ref NESelectKernel / @ref NESelect
- - @ref NESplit
- - @ref NESlice
- - @ref NEUnstack
- - @ref NEStridedSliceKernel / @ref NEStridedSlice
- - @ref NEElementwiseUnaryKernel
- - @ref NERsqrtLayer
- - @ref NEExpLayer
- - @ref NEReverseKernel / @ref NEReverse
- - @ref NEArgMinMaxLayer
- - @ref NEStackLayerKernel / @ref NEStackLayer
- - @ref NERangeKernel / @ref NERange
- - @ref NEPadLayer
- - @ref NEMemsetKernel
- - @ref NEGatherKernel / @ref NEGather
- - @ref NEElementwiseComparison
- - @ref NEElementwiseComparisonStatic
- - @ref NEComparisonOperationKernel
- - @ref NEElementwiseDivision
- - New OpenCL kernels / functions:
- - @ref CLSelectKernel / @ref CLSelect
- - @ref CLTileKernel / @ref CLTile
- - @ref CLComparisonKernel / @ref CLComparison
- - @ref CLArgMinMaxLayer
- - @ref CLElementwiseMax
- - @ref CLElementwiseMin
- - @ref CLElementwiseSquaredDiff
- - @ref CLStackLayerKernel / @ref CLStackLayer
- - @ref CLReverse / @ref CLReverseKernel
- - @ref CLRsqrtLayer
- - @ref CLExpLayer
- - @ref CLElementWiseUnaryLayerKernel
- - @ref CLGEMMReshapeLHSMatrixKernel
- - @ref CLGEMMReshapeRHSMatrixKernel
- - @ref CLGEMMMatrixMultiplyReshapedKernel
- - @ref CLRangeKernel / @ref CLRange
- - @ref CLUnstack
- - @ref CLGatherKernel / @ref CLGather
- - @ref CLGEMMLowpMatrixMultiplyReshapedKernel
- - New CPP kernels / functions:
- - @ref CPPDetectionOutputLayer
- - @ref CPPTopKV / @ref CPPTopKVKernel
- - Added new examples:
- - graph_ssd_mobilenet.cpp
- - graph_mobilenet_v2.cpp
- - graph_resnet12.cpp
- - graph_srcnn955.cpp
- - graph_vgg_vdsr.cpp
- - graph_inception_resnet_v1.cpp
- - Added 4D tensor support to:
- - @ref NESoftmaxLayer
- - Fused activation in @ref CLWinogradConvolutionLayer
- - Extended @ref NEPermute to support more cases
- - Added NEON/SVE GEMM Hybrid kernels
- - Added u8 and s8 hybrid assembly kernels
- - Introduced GEMM strategy name in NEGEMMAssemblyWrapper
- - Improved @ref CLTuner
- - Fused the bias addition within @ref CLGEMM
- - Added support for QASYMM8 LOGISTIC activation in @ref NEActivationLayer
- - Added NHWC data layout support to:
- - @ref NEScale for F16
- - @ref CLNormalizationLayer IN_MAP_2D for FP32/FP16
- - @ref NEL2NormalizeLayer for FP32/FP16
- - @ref NENormalizationLayer IN_MAP_2D for FP32/FP16
- - @ref CLROIAlignLayer
- - @ref CLGenerateProposalsLayer
- - Added QASYMM8 support to the following kernels:
- - @ref NEArithmeticAdditionKernel
- - @ref NEScale
- - Added new tests and improved validation and benchmarking suites.
- - Deprecated functions/interfaces
- - Usage of inner_border_right and inner_border_top has been deprecated in @ref CLDeconvolutionLayer and @ref NEDeconvolutionLayer
-
-v18.11 Public major release
- - Various bug fixes.
- - Various optimisations.
- - New Neon kernels / functions:
- - @ref NEChannelShuffleLayer / @ref NEChannelShuffleLayerKernel
- - @ref NEReduceMean
- - @ref NEReorgLayer / @ref NEReorgLayerKernel
- - @ref NEPriorBoxLayer / @ref NEPriorBoxLayerKernel
- - @ref NEUpsampleLayer / @ref NEUpsampleLayerKernel
- - @ref NEYOLOLayer / @ref NEYOLOLayerKernel
- - New OpenCL kernels / functions:
- - @ref CLBatchToSpaceLayer / @ref CLBatchToSpaceLayerKernel
- - @ref CLBoundingBoxTransform / @ref CLBoundingBoxTransformKernel
- - @ref CLComputeAllAnchorsKernel
- - @ref CLGenerateProposalsLayer
- - @ref CLNormalizePlanarYUVLayer / @ref CLNormalizePlanarYUVLayerKernel
- - @ref CLReorgLayer / @ref CLReorgLayerKernel
- - @ref CLSpaceToBatchLayer / @ref CLSpaceToBatchLayerKernel
- - @ref CLPadLayer
- - @ref CLReduceMean
- - @ref CLPriorBoxLayer / @ref CLPriorBoxLayerKernel
- - @ref CLROIAlignLayer / @ref CLROIAlignLayerKernel
- - @ref CLSlice
- - @ref CLSplit
- - @ref CLStridedSlice / @ref CLStridedSliceKernel
- - @ref CLUpsampleLayer / @ref CLUpsampleLayerKernel
- - @ref CLYOLOLayer / @ref CLYOLOLayerKernel
- - New CPP kernels / functions:
- - @ref CPPBoxWithNonMaximaSuppressionLimit / @ref CPPBoxWithNonMaximaSuppressionLimitKernel
- - Added the validate method in:
- - @ref NEDepthConvertLayer
- - @ref NEFloor / @ref CLFloor
- - @ref NEGEMMMatrixAdditionKernel
- - @ref NEReshapeLayer / @ref CLReshapeLayer
- - @ref CLScale
- - Added new examples:
- - graph_shufflenet.cpp
- - graph_yolov3.cpp
- - Added documentation on how to add a new function or kernel.
- - Improved Doxygen documentation by adding a list of the existing functions.
- - Added 4D tensor support to:
- - CLWidthConcatenateLayer
- - @ref CLFlattenLayer
- - @ref CLSoftmaxLayer
- - Added dot product support to @ref CLDepthwiseConvolutionLayer3x3NHWCKernel for non-unit strides
- - Add SVE support
- - Fused batch normalization into convolution layer weights in @ref CLFuseBatchNormalization
- - Fuses activation in @ref CLDepthwiseConvolutionLayer3x3NCHWKernel, @ref CLDepthwiseConvolutionLayer3x3NHWCKernel and @ref NEGEMMConvolutionLayer
- - Added NHWC data layout support to:
- - @ref CLChannelShuffleLayer
- - @ref CLDeconvolutionLayer
- - @ref CLL2NormalizeLayer
- - Added QASYMM8 support to the following kernels:
- - @ref CLScaleKernel
- - NEDepthwiseConvolutionLayer3x3Kernel
- - @ref CLPixelWiseMultiplicationKernel
- - Added FP16 support to the following kernels:
- - @ref CLDepthwiseConvolutionLayer3x3NHWCKernel
- - NEDepthwiseConvolutionLayer3x3Kernel
- - @ref CLNormalizePlanarYUVLayerKernel
- - @ref CLWinogradConvolutionLayer (5x5 kernel)
- - More tests added to both validation and benchmarking suites.
-
-v18.08 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Updated recommended NDK version to r17b.
- - Removed support for QS8/QS16 data types.
- - Added support for grouped convolution in @ref CLConvolutionLayer.
- - Added NHWC data layout support to:
- - NEDepthConcatenateLayer / CLDepthConcatenateLayer
- - @ref NEWinogradConvolutionLayer / @ref CLWinogradConvolutionLayer
- - @ref CLDepthwiseConvolutionLayer
- - @ref CLDirectConvolutionLayer
- - @ref CLConvolutionLayer
- - @ref CLScale
- - @ref CLIm2ColKernel
- - New Neon kernels / functions:
- - @ref NERNNLayer
- - New OpenCL kernels / functions:
- - @ref CLArithmeticDivision
- - Introduced prepare() stage support in the graph API for GLES.
- - Added support for memory reuse when trying to allocate smaller CLTensors.
- - Enabled NHWC execution on graph examples.
- - Added JPEG accessor for validation purposes.
- - Added validate methods to some kernels / functions.
-
-v18.05 Public major release
- - Various bug fixes.
- - Various optimisations.
- - Major redesign in the interface for the neon kernels implemented in assembly.
- - Removed arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore / arm_compute::NEHGEMMAArch64FP16Kernel
- - Added NEGEMMAssemblyWrapper and AssemblyKernelGlue which are used to execute assembly kernels in neon functions.
- - Minor changes to the CPUInfo type to make it compatible with the new assembly gemm interface.
- - Moved neon assembly kernels to the folder src/core/NEON/kernels/arm_gemm.
- - Improved doxygen documentation.
- - Improved memory management for layer transitions.
- - Added support for NHWC data layout in tensors.
- - Added NHWC data layout support to:
- - @ref NEGEMMConvolutionLayer
- - @ref NEDirectConvolutionLayer
- - @ref NEPoolingLayer / @ref CLPoolingLayer
- - @ref NEBatchNormalizationLayer / @ref CLBatchNormalizationLayer
- - @ref NEDepthwiseConvolutionLayer
- - @ref NEScale
- - @ref NEIm2Col
- - Added support for dilated convolutions in @ref NEConvolutionLayer and @ref CLConvolutionLayer.
- - New OpenCL kernels / functions:
- - @ref CLChannelShuffleLayer / @ref CLChannelShuffleLayerKernel
- - @ref CLConvertFullyConnectedWeightsKernel / @ref CLConvertFullyConnectedWeights
- - @ref CLCopy / @ref CLCopyKernel
- - @ref CLLSTMLayer
- - @ref CLRNNLayer
- - CLWidthConcatenateLayer / @ref CLWidthConcatenateLayerKernel
- - @ref CLWinogradFilterTransformKernel / @ref CLWinogradInputTransformKernel / @ref CLWinogradConvolutionLayer
- - @ref CLWinogradInputTransformKernel / @ref CLWinogradInputTransform
- - New Neon kernels / functions:
- - @ref NEConvertFullyConnectedWeightsKernel / @ref NEConvertFullyConnectedWeights.
- - Created the validate method in @ref CLDepthwiseConvolutionLayer.
- - Beta and gamma are no longer mandatory arguments in @ref NEBatchNormalizationLayer and @ref CLBatchNormalizationLayer.
- - Added depth multiplier support in @ref NEDepthwiseConvolutionLayer and @ref CLDepthwiseConvolutionLayer.
- - Added broadcast multiply support in @ref NEPixelWiseMultiplication / @ref NEPixelWiseMultiplicationKernel.
- - Port mobilenet example to NHWC data layout.
- - Enabled Winograd method in @ref CLConvolutionLayer.
- - Renamed NEWinogradLayer to @ref NEWinogradConvolutionLayer.
- - Updated @ref NEWinogradConvolutionLayer to use highly optimised assembly kernels in src/core/NEON/kernels/arm_gemm.
- - Added memory manager support in GLES functions.
- - Major refactoring of the graph API.
- - Added GLES backend in the graph API.
- - Added support for the memory manager in the graph API.
- - Enabled Winograd Convolution method in the graph API.
- - Added support for grouped convolutions in the graph API.
- - Replaced NEDeconvolutionLayerUpsampleKernel with @ref NEScaleKernel in @ref NEDeconvolutionLayer.
- - Added fast maths flag in @ref CLConvolutionLayer.
- - Added new tests and benchmarks in validation and benchmark frameworks
- - Merged Activation Layer with Convolution Layer (NEON, CL, GLES)
- - Added support for OpenCL 2.0 SVM
- - Added support for importing memory into OpenCL tensors.
- - Added the prepare() method to perform any one-off pre-processing before running the function.
- - Added new examples:
- - graph_inception_v4.cpp
- - graph_resnext50.cpp
- - Added memory measurement instrument for CL.
-
-v18.03 Public maintenance release
- - Various bug fixes.
- - Fixed bug in @ref NEActivationLayer
- - Fix in @ref CLTuner when using batches.
- - Updated recommended NDK version to r16b (And fixed warnings).
- - Fixed bug in validation code.
- - Added Inception v4 graph example.
- - Renamed NEWinogradLayer to @ref NEWinogradConvolutionLayer
-
-v18.02 Public major release
- - Various NEON / OpenCL / GLES optimisations.
- - Various bug fixes.
- - Changed default number of threads on big.LITTLE systems.
- - Refactored examples and added:
- - graph_mobilenet_qasymm8
- - graph_resnet
- - graph_squeezenet_v1_1
- - Renamed @ref CLConvolutionLayer into @ref CLGEMMConvolutionLayer and created a new @ref CLConvolutionLayer to select the fastest convolution method.
- - Renamed @ref NEConvolutionLayer into @ref NEGEMMConvolutionLayer and created a new @ref NEConvolutionLayer to select the fastest convolution method.
- - Added in place support to:
- - @ref CLActivationLayer
- - @ref CLBatchNormalizationLayer
- - Added QASYMM8 support to:
- - @ref CLActivationLayer
- - @ref CLDepthwiseConvolutionLayer
- - @ref NEDepthwiseConvolutionLayer
- - @ref NESoftmaxLayer
- - Added FP16 support to:
- - CLDepthwiseConvolutionLayer3x3
- - @ref CLDepthwiseConvolutionLayer
- - Added broadcasting support to @ref NEArithmeticAddition / @ref CLArithmeticAddition / @ref CLPixelWiseMultiplication
- - Added fused batched normalization and activation to @ref CLBatchNormalizationLayer and @ref NEBatchNormalizationLayer
- - Added support for non-square pooling to @ref NEPoolingLayer and @ref CLPoolingLayer
- - New OpenCL kernels / functions:
- - CLDirectConvolutionLayerOutputStageKernel
- - New NEON kernels / functions
- - Added name() method to all kernels.
- - Added support for Winograd 5x5.
- - @ref NEPermuteKernel / @ref NEPermute
- - @ref NEWinogradLayerTransformInputKernel / NEWinogradLayer
- - @ref NEWinogradLayerTransformOutputKernel / NEWinogradLayer
- - @ref NEWinogradLayerTransformWeightsKernel / NEWinogradLayer
- - Renamed NEWinogradLayerKernel into NEWinogradLayerBatchedGEMMKernel
- - New GLES kernels / functions:
- - @ref GCTensorShiftKernel / @ref GCTensorShift
-
-v18.01 Public maintenance release
- - Various bug fixes
- - Added some of the missing validate() methods
- - Added @ref CLDeconvolutionLayerUpsampleKernel / @ref CLDeconvolutionLayer / @ref CLDeconvolutionLayerUpsample
- - Added @ref CLPermuteKernel / @ref CLPermute
- - Added method to clean the programs cache in the CL Kernel library.
- - Added @ref GCArithmeticAdditionKernel / @ref GCArithmeticAddition
- - Added @ref GCDepthwiseConvolutionLayer3x3Kernel / @ref GCDepthwiseConvolutionLayer3x3
- - Added @ref GCNormalizePlanarYUVLayerKernel / @ref GCNormalizePlanarYUVLayer
- - Added @ref GCScaleKernel / @ref GCScale
- - Added @ref GCWeightsReshapeKernel / @ref GCConvolutionLayer
- - Added FP16 support to the following GLES compute kernels:
- - @ref GCCol2ImKernel
- - @ref GCGEMMInterleave4x4Kernel
- - @ref GCGEMMTranspose1xWKernel
- - @ref GCIm2ColKernel
- - Refactored NEON Winograd (NEWinogradLayerKernel)
- - Added @ref NEDirectConvolutionLayerOutputStageKernel
- - Added QASYMM8 support to the following NEON kernels:
- - NEDepthwiseConvolutionLayer3x3Kernel
- - @ref NEFillBorderKernel
- - @ref NEPoolingLayerKernel
- - Added new examples:
- - graph_cl_mobilenet_qasymm8.cpp
- - graph_inception_v3.cpp
- - gc_dc.cpp
- - More tests added to both validation and benchmarking suites.
-
-v17.12 Public major release
- - Most machine learning functions on OpenCL support the new data type QASYMM8
- - Introduced logging interface
- - Introduced opencl timer
- - Reworked GEMMLowp interface
- - Added new NEON assembly kernels for GEMMLowp, SGEMM and HGEMM
- - Added validation method for most Machine Learning kernels / functions
- - Added new graph examples such as googlenet, mobilenet, squeezenet, vgg16 and vgg19
- - Added sgemm example for OpenCL
- - Added absolute difference example for GLES compute
- - Added new tests and benchmarks in validation and benchmark frameworks
- - Added new kernels / functions for GLES compute
-
- - New OpenGL ES kernels / functions
- - @ref GCAbsoluteDifferenceKernel / @ref GCAbsoluteDifference
- - @ref GCActivationLayerKernel / @ref GCActivationLayer
- - @ref GCBatchNormalizationLayerKernel / @ref GCBatchNormalizationLayer
- - @ref GCCol2ImKernel
- - @ref GCDepthConcatenateLayerKernel / GCDepthConcatenateLayer
- - @ref GCDirectConvolutionLayerKernel / @ref GCDirectConvolutionLayer
- - @ref GCDropoutLayerKernel / @ref GCDropoutLayer
- - @ref GCFillBorderKernel / @ref GCFillBorder
- - @ref GCGEMMInterleave4x4Kernel / @ref GCGEMMInterleave4x4
- - @ref GCGEMMMatrixAccumulateBiasesKernel / @ref GCGEMMMatrixAdditionKernel / @ref GCGEMMMatrixMultiplyKernel / @ref GCGEMM
- - @ref GCGEMMTranspose1xWKernel / @ref GCGEMMTranspose1xW
- - @ref GCIm2ColKernel
- - @ref GCNormalizationLayerKernel / @ref GCNormalizationLayer
- - @ref GCPixelWiseMultiplicationKernel / @ref GCPixelWiseMultiplication
- - @ref GCPoolingLayerKernel / @ref GCPoolingLayer
- - @ref GCLogits1DMaxKernel / @ref GCLogits1DShiftExpSumKernel / @ref GCLogits1DNormKernel / @ref GCSoftmaxLayer
- - @ref GCTransposeKernel / @ref GCTranspose
-
- - New NEON kernels / functions
- - arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore
- - arm_compute::NEHGEMMAArch64FP16Kernel
- - NEDepthwiseConvolutionLayer3x3Kernel / NEDepthwiseIm2ColKernel / NEGEMMMatrixVectorMultiplyKernel / NEDepthwiseVectorToTensorKernel / @ref NEDepthwiseConvolutionLayer
- - @ref NEGEMMLowpOffsetContributionKernel / @ref NEGEMMLowpMatrixAReductionKernel / @ref NEGEMMLowpMatrixBReductionKernel / @ref NEGEMMLowpMatrixMultiplyCore
- - @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
- - NEWinogradLayer / NEWinogradLayerKernel
-
- - New OpenCL kernels / functions
- - @ref CLGEMMLowpOffsetContributionKernel / @ref CLGEMMLowpMatrixAReductionKernel / @ref CLGEMMLowpMatrixBReductionKernel / @ref CLGEMMLowpMatrixMultiplyCore
- - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / @ref CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
-
- - New graph nodes for NEON and OpenCL
- - graph::BranchLayer
- - graph::DepthConvertLayer
- - graph::DepthwiseConvolutionLayer
- - graph::DequantizationLayer
- - graph::FlattenLayer
- - graph::QuantizationLayer
- - graph::ReshapeLayer
-
-v17.10 Public maintenance release
- - Bug fixes:
- - Check the maximum local workgroup size supported by OpenCL devices
- - Minor documentation updates (Fixed instructions to build the examples)
- - Introduced a graph::GraphContext
- - Added a few new Graph nodes, support for branches and grouping.
- - Automatically enable cl_printf in debug builds
- - Fixed bare metal builds for armv7a
- - Added AlexNet and cartoon effect examples
- - Fixed library builds: libraries are no longer built as supersets of each other. (This means applications using the Runtime part of the library now need to link against both libarm_compute_core and libarm_compute.)
-
-v17.09 Public major release
- - Experimental Graph support: initial implementation of a simple stream API to easily chain machine learning layers.
- - Memory Manager (@ref BlobLifetimeManager, @ref BlobMemoryPool, @ref ILifetimeManager, @ref IMemoryGroup, @ref IMemoryManager, @ref IMemoryPool, @ref IPoolManager, @ref MemoryManagerOnDemand, @ref PoolManager)
- - New validation and benchmark frameworks (Boost and Google frameworks replaced by an in-house framework).
- - Most machine learning functions support both fixed point 8 and 16 bit (QS8, QS16) for both NEON and OpenCL.
- - New NEON kernels / functions:
- - arm_compute::NEGEMMAssemblyBaseKernel / arm_compute::NEGEMMAArch64Kernel
- - @ref NEDequantizationLayerKernel / @ref NEDequantizationLayer
- - @ref NEFloorKernel / @ref NEFloor
- - @ref NEL2NormalizeLayerKernel / @ref NEL2NormalizeLayer
- - @ref NEQuantizationLayerKernel @ref NEMinMaxLayerKernel / @ref NEQuantizationLayer
- - @ref NEROIPoolingLayerKernel / @ref NEROIPoolingLayer
- - @ref NEReductionOperationKernel / @ref NEReductionOperation
- - @ref NEReshapeLayerKernel / @ref NEReshapeLayer
-
- - New OpenCL kernels / functions:
- - @ref CLDepthwiseConvolutionLayer3x3NCHWKernel @ref CLDepthwiseConvolutionLayer3x3NHWCKernel CLDepthwiseIm2ColKernel CLDepthwiseVectorToTensorKernel CLDepthwiseWeightsReshapeKernel / CLDepthwiseConvolutionLayer3x3 @ref CLDepthwiseConvolutionLayer CLDepthwiseSeparableConvolutionLayer
- - @ref CLDequantizationLayerKernel / @ref CLDequantizationLayer
- - @ref CLDirectConvolutionLayerKernel / @ref CLDirectConvolutionLayer
- - @ref CLFlattenLayer
- - @ref CLFloorKernel / @ref CLFloor
- - CLGEMMTranspose1xW
- - @ref CLGEMMMatrixVectorMultiplyKernel
- - @ref CLL2NormalizeLayerKernel / @ref CLL2NormalizeLayer
- - @ref CLQuantizationLayerKernel @ref CLMinMaxLayerKernel / @ref CLQuantizationLayer
- - @ref CLROIPoolingLayerKernel / @ref CLROIPoolingLayer
- - @ref CLReductionOperationKernel / @ref CLReductionOperation
- - @ref CLReshapeLayerKernel / @ref CLReshapeLayer
-
-v17.06 Public major release
- - Various bug fixes
- - Added support for fixed point 8 bit (QS8) to the various NEON machine learning kernels.
- - Added unit tests and benchmarks (AlexNet, LeNet)
- - Added support for sub tensors.
- - Added infrastructure to provide GPU specific optimisation for some OpenCL kernels.
- - Added @ref OMPScheduler (OpenMP) scheduler for NEON
- - Added @ref SingleThreadScheduler scheduler for NEON (For bare metal)
- - Users can specify their own scheduler by implementing the @ref IScheduler interface.
- - New OpenCL kernels / functions:
- - @ref CLBatchNormalizationLayerKernel / @ref CLBatchNormalizationLayer
- - @ref CLDepthConcatenateLayerKernel / CLDepthConcatenateLayer
- - @ref CLHOGOrientationBinningKernel @ref CLHOGBlockNormalizationKernel, @ref CLHOGDetectorKernel / @ref CLHOGDescriptor @ref CLHOGDetector @ref CLHOGGradient @ref CLHOGMultiDetection
- - @ref CLLocallyConnectedMatrixMultiplyKernel / @ref CLLocallyConnectedLayer
- - @ref CLWeightsReshapeKernel / @ref CLConvolutionLayerReshapeWeights
- - New C++ kernels:
- - @ref CPPDetectionWindowNonMaximaSuppressionKernel
- - New NEON kernels / functions:
- - @ref NEBatchNormalizationLayerKernel / @ref NEBatchNormalizationLayer
- - @ref NEDepthConcatenateLayerKernel / NEDepthConcatenateLayer
- - @ref NEDirectConvolutionLayerKernel / @ref NEDirectConvolutionLayer
- - @ref NELocallyConnectedMatrixMultiplyKernel / @ref NELocallyConnectedLayer
- - @ref NEWeightsReshapeKernel / @ref NEConvolutionLayerReshapeWeights
-
-v17.05 Public bug fixes release
- - Various bug fixes
- - Remaining functions ported to use accurate padding.
- - The library no longer links against OpenCL (it uses dlopen / dlsym at runtime instead to determine whether or not OpenCL is available).
- - Added "free" method to allocator.
- - Minimum version of g++ required for armv7 Linux changed from 4.8 to 4.9
-
-v17.04 Public bug fixes release
-
- The following functions have been ported to use the new accurate padding:
- - @ref CLColorConvertKernel
- - @ref CLEdgeNonMaxSuppressionKernel
- - @ref CLEdgeTraceKernel
- - @ref CLGaussianPyramidHorKernel
- - @ref CLGaussianPyramidVertKernel
- - @ref CLGradientKernel
- - @ref NEChannelCombineKernel
- - @ref NEFillArrayKernel
- - @ref NEGaussianPyramidHorKernel
- - @ref NEGaussianPyramidVertKernel
- - NEHarrisScoreFP16Kernel
- - @ref NEHarrisScoreKernel
- - @ref NEHOGDetectorKernel
- - @ref NELogits1DMaxKernel
- - NELogits1DShiftExpSumKernel
- - NELogits1DNormKernel
- - @ref NENonMaximaSuppression3x3FP16Kernel
- - @ref NENonMaximaSuppression3x3Kernel
-
-v17.03.1 First Major public release of the sources
- - Renamed the library to arm_compute
- - New CPP target introduced for C++ kernels shared between NEON and CL functions.
- - New padding calculation interface introduced and ported most kernels / functions to use it.
- - New OpenCL kernels / functions:
- - CLGEMMLowpMatrixMultiplyKernel / CLGEMMLowp
- - New NEON kernels / functions:
- - @ref NENormalizationLayerKernel / @ref NENormalizationLayer
- - @ref NETransposeKernel / @ref NETranspose
- - @ref NELogits1DMaxKernel, NELogits1DShiftExpSumKernel, NELogits1DNormKernel / @ref NESoftmaxLayer
- - @ref NEIm2ColKernel, @ref NECol2ImKernel, NEConvolutionLayerWeightsReshapeKernel / @ref NEConvolutionLayer
- - NEGEMMMatrixAccumulateBiasesKernel / @ref NEFullyConnectedLayer
- - @ref NEGEMMLowpMatrixMultiplyKernel / NEGEMMLowp
-
-v17.03 Sources preview
- - New OpenCL kernels / functions:
- - @ref CLGradientKernel, @ref CLEdgeNonMaxSuppressionKernel, @ref CLEdgeTraceKernel / @ref CLCannyEdge
- - GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, @ref CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / @ref CLGEMM
- - CLGEMMMatrixAccumulateBiasesKernel / @ref CLFullyConnectedLayer
- - @ref CLTransposeKernel / @ref CLTranspose
- - @ref CLLKTrackerInitKernel, @ref CLLKTrackerStage0Kernel, @ref CLLKTrackerStage1Kernel, @ref CLLKTrackerFinalizeKernel / @ref CLOpticalFlow
- - @ref CLNormalizationLayerKernel / @ref CLNormalizationLayer
- - @ref CLLaplacianPyramid, @ref CLLaplacianReconstruct
- - New NEON kernels / functions:
- - @ref NEActivationLayerKernel / @ref NEActivationLayer
- - GEMM refactoring + FP16 support (Requires armv8.2 CPU): @ref NEGEMMInterleave4x4Kernel, @ref NEGEMMTranspose1xWKernel, @ref NEGEMMMatrixMultiplyKernel, @ref NEGEMMMatrixAdditionKernel / @ref NEGEMM
- - @ref NEPoolingLayerKernel / @ref NEPoolingLayer
-
-v17.02.1 Sources preview
- - New OpenCL kernels / functions:
- - CLLogits1DMaxKernel, CLLogits1DShiftExpSumKernel, @ref CLLogits1DNormKernel / @ref CLSoftmaxLayer
- - @ref CLPoolingLayerKernel / @ref CLPoolingLayer
- - @ref CLIm2ColKernel, @ref CLCol2ImKernel, CLConvolutionLayerWeightsReshapeKernel / @ref CLConvolutionLayer
- - @ref CLRemapKernel / @ref CLRemap
- - @ref CLGaussianPyramidHorKernel, @ref CLGaussianPyramidVertKernel / @ref CLGaussianPyramid, @ref CLGaussianPyramidHalf, @ref CLGaussianPyramidOrb
- - @ref CLMinMaxKernel, @ref CLMinMaxLocationKernel / @ref CLMinMaxLocation
- - @ref CLNonLinearFilterKernel / @ref CLNonLinearFilter
- - New NEON FP16 kernels (Requires armv8.2 CPU)
- - @ref NEAccumulateWeightedFP16Kernel
- - @ref NEBox3x3FP16Kernel
- - @ref NENonMaximaSuppression3x3FP16Kernel
-
-v17.02 Sources preview
- - New OpenCL kernels / functions:
- - @ref CLActivationLayerKernel / @ref CLActivationLayer
- - @ref CLChannelCombineKernel / @ref CLChannelCombine
- - @ref CLDerivativeKernel / @ref CLChannelExtract
- - @ref CLFastCornersKernel / @ref CLFastCorners
- - @ref CLMeanStdDevKernel / @ref CLMeanStdDev
- - New NEON kernels / functions:
- - HOG / SVM: @ref NEHOGOrientationBinningKernel, @ref NEHOGBlockNormalizationKernel, @ref NEHOGDetectorKernel, NEHOGNonMaximaSuppressionKernel / @ref NEHOGDescriptor, @ref NEHOGDetector, @ref NEHOGGradient, @ref NEHOGMultiDetection
- - @ref NENonLinearFilterKernel / @ref NENonLinearFilter
- - Introduced a CLScheduler to manage the default context and command queue used by the runtime library and create synchronisation events.
- - Switched all the kernels / functions to use tensors instead of images.
- - Updated documentation to include instructions to build the library from sources.
-
-v16.12 Binary preview release
- - Original release
-
-@section S3_how_to_build How to build the library and the examples
-
-@subsection S3_1_build_options Build options
-
-SCons 2.3 or above is required to build the library.
-To see the build options available simply run ```scons -h```:
-
- debug: Debug (yes|no)
- default: False
- actual: False
-
- asserts: Enable asserts (this flag is forced to 1 for debug=1) (yes|no)
- default: False
- actual: False
-
- arch: Target Architecture (armv7a|arm64-v8a|arm64-v8.2-a|x86_32|x86_64)
- default: armv7a
- actual: armv7a
-
- os: Target OS (linux|android|bare_metal)
- default: linux
- actual: linux
-
- build: Build type (native|cross_compile|embed_only)
- default: cross_compile
- actual: cross_compile
-
- examples: Build example programs (yes|no)
- default: True
- actual: True
-
- Werror: Enable/disable the -Werror compilation flag (yes|no)
- default: True
- actual: True
-
- opencl: Enable OpenCL support (yes|no)
- default: True
- actual: True
-
- neon: Enable Neon support (yes|no)
- default: False
- actual: False
-
- gles_compute: Enable OpenGL ES Compute Shader support (yes|no)
- default: False
- actual: False
-
- embed_kernels: Embed OpenCL kernels and OpenGL ES compute shader in library binary (yes|no)
- default: True
- actual: True
-
- set_soname: Set the library's soname and shlibversion (requires SCons 2.4 or above) (yes|no)
- default: False
- actual: False
-
- openmp: Enable OpenMP backend (yes|no)
- default: False
- actual: False
-
- cppthreads: Enable C++11 threads backend (yes|no)
- default: True
- actual: True
-
- build_dir: Specify sub-folder for the build ( /path/to/build_dir )
- default: .
- actual: .
-
- extra_cxx_flags: Extra CXX flags to be appended to the build command
- default:
- actual:
-
- pmu: Enable PMU counters (yes|no)
- default: False
- actual: False
-
- mali: Enable Mali hardware counters (yes|no)
- default: False
- actual: False
-
- validation_tests: Build validation test programs (yes|no)
- default: False
- actual: False
-
- benchmark_tests: Build benchmark test programs (yes|no)
- default: False
- actual: False
-
-@b debug / @b asserts:
- - With debug=1 asserts are enabled, and the library is built with symbols and no optimisations enabled.
- - With debug=0 and asserts=1: optimisations are enabled and symbols are removed, however all the asserts are still present. (This is about 20% slower than the release build.)
- - With debug=0 and asserts=0: all optimisations are enabled and no validation is performed; if the application misuses the library it is likely to result in a crash. (Only use this mode once you are sure your application is working as expected.)
-
-@b arch: The x86_32 and x86_64 targets can only be used with neon=0 and opencl=1.
-
-@b os: Choose the operating system you are targeting: Linux, Android or bare metal.
-@note bare metal can only be used for NEON (not OpenCL), only static libraries get built and NEON's multi-threading support is disabled.
-
-@b build: You can either build directly on your device (native) or cross-compile from your desktop machine (cross_compile). In both cases make sure the compiler is available in your path.
-
-@note If you want to natively compile for 32bit on a 64bit ARM device running a 64bit OS then you will have to use cross-compile too.
-
-There is also an 'embed_only' option which will generate all the .embed files for the OpenCL kernels and / or OpenGLES compute shaders. This might be useful if you are using a different build system to compile the library.
-
-@b Werror: If you are compiling using the same toolchains as the ones used in this guide then there shouldn't be any warnings, and therefore you should be able to keep Werror=1. If the library fails to build because of warnings interpreted as errors with a different compiler version then, if you are sure the warnings are not important, you might want to try building with Werror=0 (but please report the issue either on GitHub or by email to developer@arm.com so that it can be addressed).
-
-@b opencl / @b neon / @b gles_compute: Choose which SIMD technology you want to target. (NEON for ARM Cortex-A CPUs or OpenCL / GLES_COMPUTE for ARM Mali GPUs)
-
-@b embed_kernels: For OpenCL / GLES_COMPUTE only: set embed_kernels=1 if you want the OpenCL / GLES_COMPUTE kernels to be built in the library's binaries instead of being read from separate ".cl" / ".cs" files. If embed_kernels is set to 0 then the application can set the path to the folder containing the OpenCL / GLES_COMPUTE kernel files by calling CLKernelLibrary::init() / GCKernelLibrary::init(). By default the path is set to "./cl_kernels" / "./cs_shaders".
-
-@b set_soname: Do you want to build the versioned library?
-
-If enabled the library will contain a SONAME and SHLIBVERSION and some symlinks will automatically be created between the objects.
-Example:
- libarm_compute_core.so -> libarm_compute_core.so.1.0.0
- libarm_compute_core.so.1 -> libarm_compute_core.so.1.0.0
- libarm_compute_core.so.1.0.0
-
-@note This option is disabled by default as it requires SCons version 2.4 or above.
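For reference, the symlink chain shown in the example above can be reproduced by hand; the sketch below does so in a scratch directory purely for illustration (scons performs the equivalent automatically when set_soname=1, and the 1.0.0 version number depends on the release you build).

```shell
# Recreate the soname symlink layout from the example above in a scratch
# directory. The versioned file is the real shared object; the two symlinks
# are what the dynamic linker and the link step resolve, respectively.
mkdir -p /tmp/acl_soname_demo
cd /tmp/acl_soname_demo
touch libarm_compute_core.so.1.0.0
ln -sf libarm_compute_core.so.1.0.0 libarm_compute_core.so.1
ln -sf libarm_compute_core.so.1.0.0 libarm_compute_core.so
readlink libarm_compute_core.so    # prints libarm_compute_core.so.1.0.0
```

At runtime the dynamic linker follows libarm_compute_core.so.1 (the SONAME recorded in dependent binaries), while the unversioned libarm_compute_core.so is what the link step (-larm_compute_core) resolves.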
-
-@b extra_cxx_flags: Custom CXX flags which will be appended to the end of the build command.
-
-@b build_dir: Build the library in a subfolder of the "build" folder. (This allows you to build several configurations in parallel.)
-
-@b examples: Whether or not to build the examples.
-
-@b validation_tests: Enable the build of the validation suite.
-
-@b benchmark_tests: Enable the build of the benchmark tests.
-
-@b pmu: Enable the PMU cycle counter to measure execution time in benchmark tests. (Your device needs to support it)
-
-@b mali: Enable the collection of Mali hardware counters to measure execution time in benchmark tests. (Your device needs to have a Mali driver that supports it)
-
-@b openmp: Build in the OpenMP scheduler for NEON.
-
-@note Only works when building with g++, not clang++.
-
-@b cppthreads: Build in the C++11 scheduler for NEON.
-
-@sa Scheduler::set
-
-@subsection S3_2_linux Building for Linux
-
-@subsubsection S3_2_1_library How to build the library?
-
-For Linux, the library was successfully built and tested using the following Linaro GCC toolchain:
-
- - gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
- - gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu
-
-To cross-compile the library in debug mode, with NEON only support, for Linux 32bit:
-
- scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=linux arch=armv7a
-
-To cross-compile the library in asserts mode, with OpenCL only support, for Linux 64bit:
-
- scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=arm64-v8a
-
-To cross-compile the library in asserts mode, with GLES_COMPUTE only support, for Linux 64bit:
-
- scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=0 gles_compute=1 embed_kernels=1 os=linux arch=arm64-v8a
-
-You can also compile the library natively on an ARM device by using <b>build=native</b>:
-
- scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=arm64-v8a build=native
- scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=native
-
-@note g++ for ARM is mono-arch, therefore if you want to compile for Linux 32bit on a Linux 64bit platform you will have to use a cross compiler.
-
-For example on a 64bit Debian based system you would have to install <b>g++-arm-linux-gnueabihf</b>
-
- apt-get install g++-arm-linux-gnueabihf
-
-Then run
-
- scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=cross_compile
-
-or simply remove the build parameter as build=cross_compile is the default value:
-
- scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a
-
-@subsubsection S3_2_2_examples How to manually build the examples?
-
-The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
-
-@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
-
-To cross compile a NEON example for Linux 32bit:
-
- arm-linux-gnueabihf-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute -larm_compute_core -o neon_convolution
-
-To cross compile a NEON example for Linux 64bit:
-
- aarch64-linux-gnu-g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -L. -larm_compute -larm_compute_core -o neon_convolution
-
-(notice the only difference from the 32-bit command is that we don't need the -mfpu option and the compiler's name is different)
-
-To cross compile an OpenCL example for Linux 32bit:
-
- arm-linux-gnueabihf-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
-
-To cross compile an OpenCL example for Linux 64bit:
-
- aarch64-linux-gnu-g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -L. -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
-
-To cross compile a GLES example for Linux 32bit:
-
- arm-linux-gnueabihf-g++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude/ -L. -larm_compute -larm_compute_core -std=c++11 -mfpu=neon -DARM_COMPUTE_GC -Iinclude/linux/ -o gc_absdiff
-
-To cross compile a GLES example for Linux 64bit:
-
- aarch64-linux-gnu-g++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude/ -L. -larm_compute -larm_compute_core -std=c++11 -DARM_COMPUTE_GC -Iinclude/linux/ -o gc_absdiff
-
-(notice the only difference from the 32-bit command is that we don't need the -mfpu option and the compiler's name is different)
-
-To cross compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
-
-i.e. to cross compile the "graph_lenet" example for Linux 32bit:
-
- arm-linux-gnueabihf-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
-
-i.e. to cross compile the "graph_lenet" example for Linux 64bit:
-
- aarch64-linux-gnu-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
-
-(notice the only difference from the 32-bit command is that we don't need the -mfpu option and the compiler's name is different)
-
-@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core
-
-To compile natively (i.e. directly on an ARM device) for NEON for Linux 32bit:
-
- g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -mfpu=neon -larm_compute -larm_compute_core -o neon_convolution
-
-To compile natively (i.e. directly on an ARM device) for NEON for Linux 64bit:
-
- g++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute -larm_compute_core -o neon_convolution
-
-(notice the only difference from the 32-bit command is that we don't need the -mfpu option)
-
-To compile natively (i.e. directly on an ARM device) for OpenCL for Linux 32bit or Linux 64bit:
-
- g++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute -larm_compute_core -o cl_convolution -DARM_COMPUTE_CL
-
-To compile natively (i.e. directly on an ARM device) for GLES for Linux 32bit or Linux 64bit:
-
- g++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude/ -L. -larm_compute -larm_compute_core -std=c++11 -DARM_COMPUTE_GC -Iinclude/linux/ -o gc_absdiff
-
-To natively compile the examples that use the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
-
-i.e. to natively compile the "graph_lenet" example for Linux 32bit:
-
- g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -mfpu=neon -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
-
-i.e. to natively compile the "graph_lenet" example for Linux 64bit:
-
- g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -L. -larm_compute_graph -larm_compute -larm_compute_core -Wl,--allow-shlib-undefined -o graph_lenet
-
-(notice the only difference from the 32-bit command is that we don't need the -mfpu option)
-
-@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute, arm_compute_core
-
-@note These two commands assume libarm_compute.so is available in your library path, if not add the path to it using -L (e.g. -Llib/linux-arm64-v8a-neon-cl-asserts/)
-@note You might need to export the path to OpenCL library as well in your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.
-
-To run the built executable simply run:
-
- LD_LIBRARY_PATH=build ./neon_convolution
-
-or
-
- LD_LIBRARY_PATH=build ./cl_convolution
-
-@note Examples accept different types of arguments; to find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
-
-For example:
-
- LD_LIBRARY_PATH=. ./graph_lenet --help
-
-Below is a list of the common parameters among the graph examples:
-@snippet utils/CommonGraphOptions.h Common graph examples parameters
-
-@subsection S3_3_android Building for Android
-
-For Android, the library was successfully built and tested using Google's standalone toolchains:
- - clang++ from NDK r18b for armv7a
- - clang++ from NDK r18b for arm64-v8a
- - clang++ from NDK r18b for arm64-v8.2-a with FP16 support
-
-Here is a guide to <a href="https://developer.android.com/ndk/guides/standalone_toolchain.html">creating your Android standalone toolchains from the NDK</a>:
-
-- Download the NDK r18b from here: https://developer.android.com/ndk/downloads/index.html into directory $NDK
-- Make sure you have Python 2.7 installed on your machine.
-- Generate the 32-bit and/or 64-bit toolchains by running the following commands in your toolchain directory $MY_TOOLCHAINS:
-
-
- $NDK/build/tools/make_standalone_toolchain.py --arch arm64 --install-dir $MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b --stl libc++ --api 21
- $NDK/build/tools/make_standalone_toolchain.py --arch arm --install-dir $MY_TOOLCHAINS/arm-linux-android-ndk-r18b --stl libc++ --api 21
-
-@attention We used to use gnustl, but as of NDK r17 it is deprecated, so we switched to libc++.
-
-@note Make sure to add the toolchains to your PATH:
-
- export PATH=$PATH:$MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b/bin:$MY_TOOLCHAINS/arm-linux-android-ndk-r18b/bin
-
-@subsubsection S3_3_1_library How to build the library?
-
-To cross-compile the library in debug mode, with NEON only support, for Android 32bit:
-
- CXX=clang++ CC=clang scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=android arch=armv7a
-
-To cross-compile the library in asserts mode, with OpenCL only support, for Android 64bit:
-
- CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=arm64-v8a
-
-To cross-compile the library in asserts mode, with GLES_COMPUTE only support, for Android 64bit:
-
- CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=0 gles_compute=1 embed_kernels=1 os=android arch=arm64-v8a
-
-@subsubsection S3_3_2_examples How to manually build the examples?
-
-The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
-
-@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
-
-Once you've got your Android standalone toolchain built and added to your path you can do the following:
-
-To cross compile a NEON example:
-
- #32 bit:
- arm-linux-androideabi-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_arm -static-libstdc++ -pie
- #64 bit:
- aarch64-linux-android-clang++ examples/neon_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o neon_convolution_aarch64 -static-libstdc++ -pie
-
-To cross compile an OpenCL example:
-
- #32 bit:
- arm-linux-androideabi-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
- #64 bit:
- aarch64-linux-android-clang++ examples/cl_convolution.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o cl_convolution_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
-
-To cross compile a GLES example:
-
- #32 bit:
- arm-linux-androideabi-clang++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o gc_absdiff_arm -static-libstdc++ -pie -DARM_COMPUTE_GC
- #64 bit:
- aarch64-linux-android-clang++ examples/gc_absdiff.cpp utils/Utils.cpp -I. -Iinclude -std=c++11 -larm_compute-static -larm_compute_core-static -L. -o gc_absdiff_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_GC
-
-To cross compile the examples with the Graph API, such as graph_lenet.cpp, you also need to link against the arm_compute_graph library.
-
- #32 bit:
- arm-linux-androideabi-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
- #64 bit:
- aarch64-linux-android-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++11 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -larm_compute_core-static -L. -o graph_lenet_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
-
-@note Due to some issues in older versions of the Mali OpenCL DDK (<= r13p0), we recommend linking arm_compute statically on Android.
-@note When linked statically the arm_compute_graph library currently needs the --whole-archive linker flag in order to work properly
-
-Then all you need to do is upload the executable and the shared library to the device using ADB:
-
- adb push neon_convolution_arm /data/local/tmp/
- adb push cl_convolution_arm /data/local/tmp/
- adb push gc_absdiff_arm /data/local/tmp/
- adb shell chmod 777 -R /data/local/tmp/
-
-And finally to run the example:
-
- adb shell /data/local/tmp/neon_convolution_arm
- adb shell /data/local/tmp/cl_convolution_arm
- adb shell /data/local/tmp/gc_absdiff_arm
-
-For 64bit:
-
- adb push neon_convolution_aarch64 /data/local/tmp/
- adb push cl_convolution_aarch64 /data/local/tmp/
- adb push gc_absdiff_aarch64 /data/local/tmp/
- adb shell chmod 777 -R /data/local/tmp/
-
-And finally to run the example:
-
- adb shell /data/local/tmp/neon_convolution_aarch64
- adb shell /data/local/tmp/cl_convolution_aarch64
- adb shell /data/local/tmp/gc_absdiff_aarch64
-
-@note Examples accept different types of arguments; to find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
-
-For example:
- adb shell /data/local/tmp/graph_lenet --help
-
-In this case the first argument of LeNet (like all the graph examples) is the target (i.e. 0 to run on NEON, 1 to run on OpenCL if available, 2 to run on OpenCL using the CLTuner), the second argument is the path to the folder containing the npy files for the weights, and finally the third argument is the number of batches to run.
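The positional-argument convention just described can be sketched as follows. This is only our illustration of the documented convention (the default values and the helper's name are assumptions), not code from the examples:

```python
# Targets as documented: 0 -> NEON, 1 -> OpenCL, 2 -> OpenCL with CLTuner.
TARGETS = {0: "NEON", 1: "OpenCL", 2: "OpenCL + CLTuner"}

def parse_graph_args(argv):
    """Interpret the positional arguments of the graph examples:
    target, path to the folder of npy weight files, number of batches.
    Defaults for omitted arguments are assumptions for this sketch."""
    target = TARGETS[int(argv[0])] if len(argv) > 0 else "NEON"
    data = argv[1] if len(argv) > 1 else None
    batches = int(argv[2]) if len(argv) > 2 else 1
    return target, data, batches

print(parse_graph_args(["1", "/data/lenet", "4"]))  # ('OpenCL', '/data/lenet', 4)
```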
-
-@subsection S3_4_bare_metal Building for bare metal
-
-For bare metal, the library was successfully built using Linaro's latest (gcc-linaro-6.3.1-2017.05) bare metal toolchains:
- - arm-eabi for armv7a
- - aarch64-elf for arm64-v8a
-
-Download the Linaro toolchains for <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/arm-eabi/">armv7a</a> and <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/aarch64-elf/">arm64-v8a</a>.
-
-@note Make sure to add the toolchains to your PATH: export PATH=$PATH:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-elf/bin:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_arm-eabi/bin
-
-@subsubsection S3_4_1_library How to build the library?
-
-To cross-compile the library with NEON support for baremetal arm64-v8a:
-
- scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=bare_metal arch=arm64-v8a build=cross_compile cppthreads=0 openmp=0 standalone=1
-
-@subsubsection S3_4_2_examples How to manually build the examples?
-
-Examples are disabled when building for bare metal. If you want to build the examples you need to provide a custom bootcode depending on the target architecture and link against the compute library. More information about bare metal bootcode can be found <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a/index.html">here</a>.
-
-@subsection S3_5_windows_host Building on a Windows host system
-
-Using `scons` directly from the Windows command line is known to cause
-problems. The reason seems to be that if `scons` is set up for cross-compilation
-it gets confused about Windows-style paths (using backslashes). Thus it is
-recommended to follow one of the options outlined below.
-
-@subsubsection S3_5_1_ubuntu_on_windows Bash on Ubuntu on Windows
-
-The best and easiest option is to use
-<a href="https://msdn.microsoft.com/en-gb/commandline/wsl/about">Ubuntu on Windows</a>.
-This feature is still marked as *beta* and thus might not be available.
-However, if it is, building the library is as simple as opening a *Bash on
-Ubuntu on Windows* shell and following the general guidelines given above.
-
-@subsubsection S3_5_2_cygwin Cygwin
-
-If the Windows subsystem for Linux is not available, <a href="https://www.cygwin.com/">Cygwin</a>
-can be used to install and run `scons`; the minimum required Cygwin version is 3.0.7. In addition
-to the default packages installed by Cygwin, `scons` has to be selected in the installer. (`git` might
-also be useful but is not strictly required if you already have the source
-code of the library.) Linaro provides pre-built versions of
-<a href="http://releases.linaro.org/components/toolchain/binaries/">GCC cross-compilers</a>
-that can be used from the Cygwin terminal. When building for Android the
-compiler is included in the Android standalone toolchain. After everything has
-been set up in the Cygwin terminal the general guide on building the library
-can be followed.
-
-@subsection S3_6_cl_requirements OpenCL DDK Requirements
-
-@subsubsection S3_6_1_cl_hard_requirements Hard Requirements
-
-Compute Library requires OpenCL 1.1 or above with support for non-uniform work-group sizes, which is officially supported in the Mali OpenCL DDK r8p0 and above as an extension (the respective extension flag is \a -cl-arm-non-uniform-work-group-size).
-
-Enabling 16-bit floating point calculations requires the \a cl_khr_fp16 extension to be supported. All Mali GPUs with compute capabilities have native support for half-precision floating point.
-
-Use of the @ref CLMeanStdDev function requires 64-bit atomics support, thus the \a cl_khr_int64_base_atomics extension should be supported in order to use it.
-
-@subsubsection S3_6_2_cl_performance_requirements Performance improvements
-
-Integer dot product built-in function extensions (and therefore optimized kernels) are available with Mali OpenCL DDK r22p0 and above for the following GPUs: G71 and G76. The relevant extensions are \a cl_arm_integer_dot_product_int8, \a cl_arm_integer_dot_product_accumulate_int8 and \a cl_arm_integer_dot_product_accumulate_int16.
-
-OpenCL kernel level debugging can be simplified with the use of printf; this requires the \a cl_arm_printf extension to be supported.
-
-SVM allocations are supported for all the underlying allocations in Compute Library. To enable this, OpenCL 2.0 or above is required.
-
-@subsection S3_7_cl_tuner OpenCL Tuner
-
-The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS).
-The optimal LWS for each unique OpenCL kernel configuration is stored in a table. This table can be either imported from or exported to a file.
-The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run to use in subsequent calls to the kernel. It supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found:
-- In Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS.
-- Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode.
-- Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good as, or better than, the default LWS value.
-
-The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
-In order for the performance numbers to be meaningful you must disable the GPU power management and set it to a fixed frequency for the entire duration of the tuning phase.
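At its core, the exhaustive search is an argmin over candidate LWS values by measured run time. A minimal, generic sketch of that idea (the function and the fake timings are our illustration, not Compute Library code):

```python
def tune_exhaustive(run_kernel, candidate_lws):
    """Run the kernel once per candidate local workgroup size and
    keep the fastest one -- the essence of Exhaustive mode."""
    best_lws, best_time = None, float("inf")
    for lws in candidate_lws:
        t = run_kernel(lws)  # returns the measured execution time
        if t < best_time:
            best_lws, best_time = lws, t
    return best_lws

# Fake timings standing in for real kernel runs:
fake_times = {(4, 4): 2.0, (8, 8): 1.2, (16, 16): 1.5}
print(tune_exhaustive(fake_times.get, [(4, 4), (8, 8), (16, 16)]))  # (8, 8)
```

Normal and Rapid modes differ only in how the candidate list is pruned before this search.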
-
-If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms", available at the following link:
-
-https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice
-
-Tuning a network from scratch can take a long time and considerably affect the execution time of the first run of your network. It is therefore recommended to store the CLTuner's results in a file to amortize this time when you either re-use the same network or use functions with the same configurations. The tuning is performed only once for each OpenCL kernel.
-
-CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (e.g. Convolution Layer, Pooling Layer, Fully Connected Layer, ...) can be called multiple times but with different parameters, we associate an "id" (called "config_id") with each kernel to distinguish the unique configurations.
-
- #Example: 2 unique Matrix Multiply configurations
-@code{.cpp}
- TensorShape a0 = TensorShape(32,32);
- TensorShape b0 = TensorShape(32,32);
- TensorShape c0 = TensorShape(32,32);
- TensorShape a1 = TensorShape(64,64);
- TensorShape b1 = TensorShape(64,64);
- TensorShape c1 = TensorShape(64,64);
-
- Tensor a0_tensor;
- Tensor b0_tensor;
- Tensor c0_tensor;
- Tensor a1_tensor;
- Tensor b1_tensor;
- Tensor c1_tensor;
-
- a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
- b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
- c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
- a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
- b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
- c1_tensor.allocator()->init(TensorInfo(c1, 1, DataType::F32));
-
- CLGEMM gemm0;
- CLGEMM gemm1;
-
- // Configuration 0
- gemm0.configure(&a0_tensor, &b0_tensor, nullptr, &c0_tensor, 1.0f, 0.0f);
-
- // Configuration 1
- gemm1.configure(&a1_tensor, &b1_tensor, nullptr, &c1_tensor, 1.0f, 0.0f);
-@endcode
-
-@subsubsection S3_7_1_cl_tuner_how_to How to use it
-
-All the graph examples in the Compute Library's "examples" folder and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file.
-
- #Enable CL tuner
- ./graph_mobilenet --enable-tuner --target=CL
- ./arm_compute_benchmark --enable-tuner
-
- #Export/Import to/from a file
- ./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
- ./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv
-
-If you are importing the CLTuner's results from a file, the new tuned LWS values will be appended to it.
-
-Whether you are benchmarking the graph examples or the test cases in the arm_compute_benchmark, remember to:
-
- -# Disable the power management
- -# Keep the GPU frequency constant
- -# Run the network multiple times (e.g. 10).
-
-If you are not using the graph API or the benchmark infrastructure, you will need to manually pass a CLTuner object to CLScheduler before configuring any function.
-
-@code{.cpp}
-CLTuner tuner;
-
-// Setup Scheduler
-CLScheduler::get().default_init(&tuner);
-@endcode
-
-After the first run, the CLTuner's results can be exported to a file using the method "save_to_file()".
-- tuner.save_to_file("results.csv");
-
-This file can also be imported using the method "load_from_file()".
-- tuner.load_from_file("results.csv");
-*/
-} // namespace arm_compute
diff --git a/docs/03_scripts.dox b/docs/03_scripts.dox
deleted file mode 100644
index 7e16edfb0d..0000000000
--- a/docs/03_scripts.dox
+++ /dev/null
@@ -1,178 +0,0 @@
-///
-/// Copyright (c) 2017-2019 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-/**
-@page data_import Importing data from existing models
-
-@tableofcontents
-
-@section caffe_data_extractor Extract data from pre-trained caffe model
-
-One can find caffe <a href="https://github.com/BVLC/caffe/wiki/Model-Zoo">pre-trained models</a> on
-caffe's official github repository.
-
-The caffe_data_extractor.py provided in the scripts folder is an example script that shows how to
-extract the parameter values from a trained model.
-
-@note Complex networks might require altering the script for it to work properly.
-
-@subsection caffe_how_to How to use the script
-
-Install caffe following <a href="http://caffe.berkeleyvision.org/installation.html">caffe's installation documentation</a>.
-Make sure pycaffe has been added to your PYTHONPATH.
-
-Download the pre-trained caffe model.
-
-Run the caffe_data_extractor.py script with
-
- python caffe_data_extractor.py -m <caffe model> -n <caffe netlist>
-
-For example, to extract the data from pre-trained caffe Alex model to binary file:
-
- python caffe_data_extractor.py -m /path/to/bvlc_alexnet.caffemodel -n /path/to/caffe/models/bvlc_alexnet/deploy.prototxt
-
-The script has been tested under Python 2.7.
-
-@subsection caffe_result What is the expected output from the script
-
-If the script runs successfully, it prints the names and shapes of each layer onto the standard
-output and generates *.npy files containing the weights and biases of each layer.
-
-The arm_compute::utils::load_trained_data function shows how one could load
-the weights and biases into a tensor from the .npy file with the help of an Accessor.
-
-@section tensorflow_data_extractor Extract data from pre-trained tensorflow model
-
-The script tensorflow_data_extractor.py extracts trainable parameters (e.g. values of weights and biases) from a
-trained tensorflow model. A tensorflow model consists of the following two files:
-
-{model_name}.data-{step}-{global_step}: A binary file containing values of each variable.
-
-{model_name}.meta: A binary file containing a MetaGraph struct which defines the graph structure of the neural
-network.
-
-@note Since Tensorflow version 0.11 the binary checkpoint file which contains the values for each parameter has the format of:
- {model_name}.data-{step}-of-{max_step}
-instead of:
- {model_name}.ckpt
-When dealing with binary files with version >= 0.11, only pass {model_name} to the -m option;
-when dealing with binary files with version < 0.11, pass the whole file name {model_name}.ckpt to the -m option.
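The checkpoint-naming rule above can be captured in a small helper; a sketch (the helper's name is ours, not part of the extractor scripts):

```python
def checkpoint_arg(model_name, tf_version):
    """Return the value to pass to the extractor's -m option.

    Tensorflow >= 0.11 splits checkpoints into
    {model_name}.data-{step}-of-{max_step} files and only the bare model
    name is passed; older versions use a single {model_name}.ckpt file.
    """
    major, minor = (int(x) for x in tf_version.split(".")[:2])
    if (major, minor) >= (0, 11):
        return model_name
    return model_name + ".ckpt"

print(checkpoint_arg("/path/to/bvlc_alexnet", "0.10"))  # /path/to/bvlc_alexnet.ckpt
print(checkpoint_arg("/path/to/bvlc_alexnet", "1.2"))   # /path/to/bvlc_alexnet
```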
-
-@note This script relies on the parameters to be extracted being in the
-'trainable_variables' tensor collection. By default all variables are automatically added to this collection unless
-specified otherwise by the user. Thus should a user alter this default behavior and/or want to extract parameters from other
-collections, tf.GraphKeys.TRAINABLE_VARIABLES should be replaced accordingly.
-
-@subsection tensorflow_how_to How to use the script
-
-Install tensorflow and numpy.
-
-Download the pre-trained tensorflow model.
-
-Run tensorflow_data_extractor.py with
-
- python tensorflow_data_extractor.py -m <path_to_binary_checkpoint_file> -n <path_to_metagraph_file>
-
-For example, to extract the data from pre-trained tensorflow Alex model to binary files:
-
- python tensorflow_data_extractor.py -m /path/to/bvlc_alexnet -n /path/to/bvlc_alexnet.meta
-
-Or for binary checkpoint files before Tensorflow 0.11:
-
- python tensorflow_data_extractor.py -m /path/to/bvlc_alexnet.ckpt -n /path/to/bvlc_alexnet.meta
-
-@note With Tensorflow versions >= 0.11, only the model name is passed to the -m option.
-
-The script has been tested with Tensorflow 1.2 and 1.3, on Python 2.7.6 and Python 3.4.3.
-
-@subsection tensorflow_result What is the expected output from the script
-
-If the script runs successfully, it prints the names and shapes of each parameter onto the standard output and generates
- *.npy files containing the weights and biases of each layer.
-
-The arm_compute::utils::load_trained_data function shows how one could load
-the weights and biases into a tensor from the .npy file with the help of an Accessor.
-
-@section tf_frozen_model_extractor Extract data from pre-trained frozen tensorflow model
-
-The script tf_frozen_model_extractor.py extracts trainable parameters (e.g. values of weights and biases) from a
-frozen trained Tensorflow model.
-
-@subsection tensorflow_frozen_how_to How to use the script
-
-Install Tensorflow and NumPy.
-
-Download the pre-trained Tensorflow model and freeze the model using the architecture and the checkpoint file.
-
-Run tf_frozen_model_extractor.py with
-
- python tf_frozen_model_extractor.py -m <path_to_frozen_pb_model_file> -d <path_to_store_parameters>
-
-For example, to extract the data from pre-trained Tensorflow model to binary files:
-
- python tf_frozen_model_extractor.py -m /path/to/inceptionv3.pb -d ./data
-
-@subsection tensorflow_frozen_result What is the expected output from the script
-
-If the script runs successfully, it prints the names and shapes of each parameter onto the standard output and generates
- *.npy files containing the weights and biases of each layer.
-
-The arm_compute::utils::load_trained_data function shows how one could load
-the weights and biases into a tensor from the .npy file with the help of an Accessor.
-
-@section validate_examples Validating examples
-
-Compute Library provides a list of graph examples that are used in the context of integration and performance testing.
-The provenance of each model is part of its documentation and no structural or data alterations have been applied to any
-of them unless explicitly specified otherwise in the documentation.
-
-Using one of the provided scripts will generate files containing the trainable parameters.
-
-You can validate a given graph example on a list of inputs by running:
-
- LD_LIBRARY_PATH=lib ./<graph_example> --validation-range='<validation_range>' --validation-file='<validation_file>' --validation-path='/path/to/test/images/' --data='/path/to/weights/'
-
-e.g.:
-
- LD_LIBRARY_PATH=lib ./bin/graph_alexnet --target=CL --layout=NHWC --type=F32 --threads=4 --validation-range='16666,24998' --validation-file='val.txt' --validation-path='images/' --data='data/'
-
-where:
- --validation-file is a plain text file containing a list of images along with their expected label values.
- e.g.:
-
- val_00000001.JPEG 65
- val_00000002.JPEG 970
- val_00000003.JPEG 230
- val_00000004.JPEG 809
- val_00000005.JPEG 516
-
- --validation-range is the index range of the images within the validation file you want to check:
- e.g.:
-
- --validation-range='100,200' will validate the images at indices 100 through 200 in the validation file.
-
- This can be useful when the validation process needs to be parallelized.
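The per-worker index ranges for such a parallel run can be derived with a small helper. This is an illustrative sketch, not part of the library; the range arithmetic mirrors the inclusive semantics of the --validation-range option:

```python
def split_validation_range(start, end, num_workers):
    """Split the inclusive index range [start, end] into contiguous
    per-worker sub-ranges suitable for --validation-range."""
    total = end - start + 1
    base, extra = divmod(total, num_workers)
    ranges = []
    lo = start
    for i in range(num_workers):
        # The first `extra` workers take one additional image each.
        hi = lo + base - 1 + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# For example, validating images 16666..24998 with 4 workers:
for lo, hi in split_validation_range(16666, 24998, 4):
    print(f"--validation-range='{lo},{hi}'")
```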
-*/
-}
diff --git a/docs/06_functions_list.dox b/docs/06_functions_list.dox
deleted file mode 100644
index c8006c6c3d..0000000000
--- a/docs/06_functions_list.dox
+++ /dev/null
@@ -1,384 +0,0 @@
-///
-/// Copyright (c) 2018-2019 Arm Limited.
-///
-/// SPDX-License-Identifier: MIT
-///
-/// Permission is hereby granted, free of charge, to any person obtaining a copy
-/// of this software and associated documentation files (the "Software"), to
-/// deal in the Software without restriction, including without limitation the
-/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
-/// sell copies of the Software, and to permit persons to whom the Software is
-/// furnished to do so, subject to the following conditions:
-///
-/// The above copyright notice and this permission notice shall be included in all
-/// copies or substantial portions of the Software.
-///
-/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-/// SOFTWARE.
-///
-namespace arm_compute
-{
-/**
-
-@page functions_list List of functions
-
-@tableofcontents
-
-@section S6_1 NEON functions
-
-- @ref IFunction
- - @ref INESimpleFunction
- - @ref NEAbsoluteDifference
- - @ref NEArithmeticAddition
- - @ref NEArithmeticSubtraction
- - @ref NEBoundingBoxTransform
- - @ref NEBox3x3
- - @ref NECast
- - @ref NEComplexPixelWiseMultiplication
- - @ref NEComputeAllAnchors
- - @ref NEConvolution3x3
- - @ref NEConvolutionRectangle
- - @ref NEDilate
- - @ref NEElementwiseComparison
- - @ref NEElementwiseComparisonStatic
- - @ref NEElementwiseDivision
- - @ref NEElementwiseMax
- - @ref NEElementwiseMin
- - @ref NEElementwiseSquaredDiff
- - @ref NEErode
- - @ref NEExpLayer
- - @ref NEGaussian3x3
- - @ref NEIntegralImage
- - @ref NELogicalAnd
- - @ref NELogicalNot
- - @ref NELogicalOr
- - @ref NEMedian3x3
- - @ref NENonLinearFilter
- - @ref NENonMaximaSuppression3x3
- - @ref NEPixelWiseMultiplication
- - @ref NEPReluLayer
- - @ref NERemap
- - @ref NEROIAlignLayer
- - @ref NERoundLayer
- - @ref NERsqrtLayer
- - @ref NEScharr3x3
- - @ref NESelect
- - @ref NESobel3x3
- - @ref NEStridedSlice
- - @ref NEWarpAffine
- - @ref NEWarpPerspective
- - @ref INESimpleFunctionNoBorder
- - @ref NEAccumulate
- - @ref NEAccumulateSquared
- - @ref NEAccumulateWeighted
- - @ref NEActivationLayer
- - @ref NEBatchToSpaceLayer
- - @ref NEBitwiseAnd
- - @ref NEBitwiseNot
- - @ref NEBitwiseOr
- - @ref NEBitwiseXor
- - @ref NEChannelCombine
- - @ref NEChannelExtract
- - @ref NEChannelShuffleLayer
- - @ref NECol2Im
- - @ref NEColorConvert
- - @ref NECopy
- - @ref NEDepthConvertLayer
- - @ref NEFlattenLayer
- - @ref NEFloor
- - @ref NEFullyConnectedLayerReshapeWeights
- - @ref NEGather
- - @ref NEGEMMInterleave4x4
- - @ref NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
- - @ref NEGEMMTranspose1xW
- - @ref NEHOGDetector
- - @ref NEMagnitude
- - @ref NEMeanStdDevNormalizationLayer
- - @ref NEPermute
- - @ref NEPhase
- - @ref NEPriorBoxLayer
- - @ref NEReorgLayer
- - @ref NEReshapeLayer
- - @ref NEReverse
- - @ref NESlice
- - @ref NETableLookup
- - @ref NEThreshold
- - @ref NETile
- - @ref NETranspose
- - @ref NEYOLOLayer
- - @ref NEArgMinMaxLayer
- - @ref NEBatchNormalizationLayer
- - @ref NECannyEdge
- - @ref NEComplexPixelWiseMultiplication
- - @ref NEConcatenateLayer
- - @ref NEConvertFullyConnectedWeights
- - @ref NEConvolutionLayer
- - @ref NEConvolutionLayerReshapeWeights
- - @ref NEConvolutionSquare &lt;matrix_size&gt;
- - @ref NECropResize
- - @ref NEDeconvolutionLayer
- - @ref NEDepthwiseConvolutionAssemblyDispatch
- - @ref NEDepthwiseConvolutionLayer
- - @ref NEDequantizationLayer
- - @ref NEDerivative
- - @ref NEDetectionPostProcessLayer
- - @ref NEDirectConvolutionLayer
- - @ref NEEqualizeHistogram
- - @ref NEFastCorners
- - @ref NEFFT1D
- - @ref NEFFT2D
- - @ref NEFFTConvolutionLayer
- - @ref NEFill
- - @ref NEFillBorder
- - @ref NEFullyConnectedLayer
- - @ref NEFuseBatchNormalization
- - @ref NEGaussian5x5
- - @ref NEGaussianPyramid
- - @ref NEGaussianPyramidHalf
- - @ref NEGaussianPyramidOrb
- - @ref NEGEMM
- - @ref NEGEMMAssemblyDispatch
- - @ref NEGEMMConv2d
- - @ref NEGEMMConvolutionLayer
- - @ref NEGEMMLowpMatrixMultiplyCore
- - @ref NEGenerateProposalsLayer
- - @ref NEHarrisCorners
- - @ref NEHistogram
- - @ref NEHOGDescriptor
- - @ref NEHOGGradient
- - @ref NEHOGMultiDetection
- - @ref NEIm2Col
- - @ref NEInstanceNormalizationLayer
- - @ref NEL2NormalizeLayer
- - @ref NELaplacianPyramid
- - @ref NELaplacianReconstruct
- - @ref NELocallyConnectedLayer
- - @ref NELSTMLayer
- - @ref NELSTMLayerQuantized
- - @ref NEQLSTMLayer
- - @ref NEMaxUnpoolingLayer
- - @ref NEMeanStdDev
- - @ref NEMinMaxLocation
- - @ref NENormalizationLayer
- - @ref NEOpticalFlow
- - @ref NEPadLayer
- - @ref NEPoolingLayer
- - @ref NEQuantizationLayer
- - @ref NERange
- - @ref NEReduceMean
- - @ref NEReductionOperation
- - @ref NERNNLayer
- - @ref NEROIPoolingLayer
- - @ref NEScale
- - @ref NESobel5x5
- - @ref NESobel7x7
- - @ref NESoftmaxLayerGeneric &lt;IS_LOG&gt;
- - @ref NESpaceToBatchLayer
- - @ref NESpaceToDepthLayer
- - @ref NESplit
- - @ref NEStackLayer
- - @ref NEUnstack
- - @ref NEUpsampleLayer
- - @ref NEWinogradConvolutionLayer
-
-@section S6_2 OpenCL functions
-
-- @ref IFunction
- - @ref CLBatchNormalizationLayer
- - @ref CLBatchToSpaceLayer
- - @ref CLCannyEdge
- - @ref CLComplexPixelWiseMultiplication
- - @ref CLConcatenateLayer
- - @ref CLConvolutionLayer
- - @ref CLConvolutionLayerReshapeWeights
- - @ref CLConvolutionSquare &lt;matrix_size&gt;
- - @ref CLCropResize
- - @ref CLDeconvolutionLayer
- - @ref CLDeconvolutionLayerUpsample
- - @ref CLDepthToSpaceLayer
- - @ref CLDepthwiseConvolutionLayer
- - @ref CLDequantizationLayer
- - @ref CLDirectConvolutionLayer
- - @ref CLDirectDeconvolutionLayer
- - @ref CLEqualizeHistogram
- - @ref CLFastCorners
- - @ref CLFFT1D
- - @ref CLFFT2D
- - @ref CLFFTConvolutionLayer
- - @ref CLFullyConnectedLayer
- - @ref CLFuseBatchNormalization
- - @ref CLGaussian5x5
- - @ref CLGaussianPyramid
- - @ref CLGaussianPyramidHalf
- - @ref CLGaussianPyramidOrb
- - @ref CLGEMM
- - @ref CLGEMMConvolutionLayer
- - @ref CLGEMMDeconvolutionLayer
- - @ref CLGEMMLowpMatrixMultiplyCore
- - @ref CLGenerateProposalsLayer
- - @ref CLHarrisCorners
- - @ref CLHistogram
- - @ref CLHOGDescriptor
- - @ref CLHOGDetector
- - @ref CLHOGGradient
- - @ref CLHOGMultiDetection
- - @ref CLIntegralImage
- - @ref CLL2NormalizeLayer
- - @ref CLLaplacianPyramid
- - @ref CLLaplacianReconstruct
- - @ref CLLocallyConnectedLayer
- - @ref CLLogicalAnd
- - @ref CLLogicalNot
- - @ref CLLogicalOr
- - @ref CLLSTMLayer
- - @ref CLLSTMLayerQuantized
- - @ref CLQLSTMLayer
- - @ref CLMaxUnpoolingLayer
- - @ref CLMeanStdDev
- - @ref CLMinMaxLocation
- - @ref CLNormalizationLayer
- - @ref CLNormalizePlanarYUVLayer
- - @ref CLOpticalFlow
- - @ref CLPadLayer
- - @ref CLQuantizationLayer
- - @ref CLReduceMean
- - @ref CLReductionOperation
- - @ref CLRNNLayer
- - @ref CLSobel5x5
- - @ref CLSobel7x7
- - @ref CLSoftmaxLayerGeneric &lt;IS_LOG&gt;
- - @ref CLSpaceToBatchLayer
- - @ref CLSpaceToDepthLayer
- - @ref CLSplit
- - @ref CLStackLayer
- - @ref CLUnstack
- - @ref CLUpsampleLayer
- - @ref CLWinogradConvolutionLayer
- - @ref ICLSimpleFunction
- - @ref CLAbsoluteDifference
- - @ref CLAccumulate
- - @ref CLAccumulateSquared
- - @ref CLAccumulateWeighted
- - @ref CLActivationLayer
- - @ref CLArgMinMaxLayer
- - @ref CLArithmeticAddition
- - @ref CLArithmeticDivision
- - @ref CLArithmeticSubtraction
- - @ref CLBitwiseAnd
- - @ref CLBitwiseNot
- - @ref CLBitwiseOr
- - @ref CLBitwiseXor
- - @ref CLBoundingBoxTransform
- - @ref CLBox3x3
- - @ref CLCast
- - @ref CLChannelCombine
- - @ref CLChannelExtract
- - @ref CLChannelShuffleLayer
- - @ref CLColorConvert
- - @ref CLComparison
- - @ref CLComparisonStatic
- - @ref CLComputeAllAnchors
- - @ref CLConvertFullyConnectedWeights
- - @ref CLConvolution3x3
- - @ref CLConvolutionRectangle
- - @ref CLCopy
- - @ref CLDepthConvertLayer
- - @ref CLDerivative
- - @ref CLDilate
- - @ref CLElementwiseMax
- - @ref CLElementwiseMin
- - @ref CLElementwiseSquaredDiff
- - @ref CLErode
- - @ref CLExpLayer
- - @ref CLFill
- - @ref CLFillBorder
- - @ref CLFlattenLayer
- - @ref CLFloor
- - @ref CLFullyConnectedLayerReshapeWeights
- - @ref CLGather
- - @ref CLGaussian3x3
- - @ref CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
- - @ref CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
- - @ref CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
- - @ref CLMagnitude
- - @ref CLMeanStdDevNormalizationLayer
- - @ref CLMedian3x3
- - @ref CLNonLinearFilter
- - @ref CLNonMaximaSuppression3x3
- - @ref CLPermute
- - @ref CLPhase
- - @ref CLPixelWiseMultiplication
- - @ref CLPoolingLayer
- - @ref CLPReluLayer
- - @ref CLPriorBoxLayer
- - @ref CLRange
- - @ref CLRemap
- - @ref CLReorgLayer
- - @ref CLReshapeLayer
- - @ref CLReverse
- - @ref CLROIAlignLayer
- - @ref CLROIPoolingLayer
- - @ref CLRsqrtLayer
- - @ref CLScale
- - @ref CLScharr3x3
- - @ref CLSelect
- - @ref CLSlice
- - @ref CLSobel3x3
- - @ref CLStridedSlice
- - @ref CLTableLookup
- - @ref CLThreshold
- - @ref CLTile
- - @ref CLTranspose
- - @ref CLWarpAffine
- - @ref CLWarpPerspective
- - @ref CLWinogradInputTransform
- - @ref CLYOLOLayer
-
-@section S6_3 GLES Compute functions
-
-- @ref IFunction
- - @ref GCBatchNormalizationLayer
- - @ref GCConcatenateLayer
- - @ref GCConvolutionLayer
- - @ref GCConvolutionLayerReshapeWeights
- - @ref GCDepthwiseConvolutionLayer3x3
- - @ref GCDirectConvolutionLayer
- - @ref GCDropoutLayer
- - @ref GCFullyConnectedLayer
- - @ref GCGEMM
- - @ref GCNormalizationLayer
- - @ref GCNormalizePlanarYUVLayer
- - @ref GCPoolingLayer
- - @ref GCSoftmaxLayer
- - @ref IGCSimpleFunction
- - @ref GCAbsoluteDifference
- - @ref GCActivationLayer
- - @ref GCArithmeticAddition
- - @ref GCFillBorder
- - @ref GCFullyConnectedLayerReshapeWeights
- - @ref GCGEMMInterleave4x4
- - @ref GCGEMMTranspose1xW
- - @ref GCPixelWiseMultiplication
- - @ref GCScale
- - @ref GCTensorShift
- - @ref GCTranspose
-
-@section S6_4 CPP functions
-
- - @ref IFunction
- - @ref CPPDetectionOutputLayer
- - @ref CPPDetectionPostProcessLayer
- - @ref ICPPSimpleFunction
- - @ref CPPBoxWithNonMaximaSuppressionLimit
- - @ref CPPPermute
- - @ref CPPTopKV
- - @ref CPPUpsample
-
-*/
-} // namespace
diff --git a/docs/ComputeLibrary.dir b/docs/ComputeLibrary.dir
index 7733e531cd..ab9dfc1b93 100644
--- a/docs/ComputeLibrary.dir
+++ b/docs/ComputeLibrary.dir
@@ -1,8 +1,12 @@
//
-// Copyright © 2020 Arm Ltd. All rights reserved.
+// Copyright © 2020,2022 Arm Ltd. All rights reserved.
// SPDX-License-Identifier: MIT
//
+// The following files are omitted due to technical limitations:
+// Directories : data, include, python
+// Files : LICENSE, README.md, SConscript, SConstruct, Security.md, filelist.json, filedefs.json
+
/** @file Android.bp
* @brief Generation script for building AndroidNN driver.
*/
@@ -43,36 +47,16 @@
* @brief All experimental interfaces
*/
-/** @dir arm_compute/core/GLES_COMPUTE
- * @brief OpenGLES backend core: kernels and utilities.
- */
-
-/** @file arm_compute/core/GLES_COMPUTE/GCKernelLibrary.h
- * @brief Manages all the GLES kernels compilation and caching, provides accessors for the GLES Context.
- */
-
-/** @file arm_compute/core/GLES_COMPUTE/GCKernels.h
- * @brief Includes all the GLES kernels at once
- */
-
-/** @file arm_compute/core/GLES_COMPUTE/OpenGLES.h
- * @brief Wrapper to configure the Khronos EGL and OpenGL ES C header
- */
-
-/** @dir arm_compute/core/GLES_COMPUTE/kernels
- * @brief Folder containing all the GLES kernels
- */
-
/** @dir src/core/NEON
- * @brief NEON backend core: kernels and utilities.
+ * @brief Arm® Neon™ backend core: kernels and utilities.
*/
/** @file src/core/NEON/NEKernels.h
- * @brief Includes all the NEON kernels at once
+ * @brief Includes all the Arm® Neon™ kernels at once
*/
/** @dir src/core/NEON/kernels
- * @brief Folder containing all the NEON kernels
+ * @brief Folder containing all the Arm® Neon™ kernels
*/
/** @dir arm_compute/core/utils
@@ -95,12 +79,8 @@
* @brief OpenCL specific operations
*/
-/** @dir arm_compute/graph/backends/GLES
- * @brief OpenGLES specific operations
- */
-
/** @dir arm_compute/graph/backends/NEON
- * @brief NEON specific operations
+ * @brief Arm® Neon™ specific operations
*/
/** @dir arm_compute/graph/detail
@@ -160,7 +140,7 @@
*/
/** @file arm_compute/runtime/CPP/CPPScheduler.h
- * @brief Basic pool of threads to execute CPP/NEON code on several cores in parallel.
+ * @brief Basic pool of threads to execute CPP/Neon code on several cores in parallel.
*/
/** @dir arm_compute/runtime/CPP/functions
@@ -171,32 +151,16 @@
* @brief Experimental runtime interface.
*/
-/** @dir arm_compute/runtime/GLES_COMPUTE
- * @brief OpenGLES backend runtime interface.
- */
-
-/** @file arm_compute/runtime/GLES_COMPUTE/GCFunctions.h
- * @brief Includes all the OpenGLES functions at once
- */
-
-/** @file arm_compute/runtime/GLES_COMPUTE/GCScheduler.h
- * @brief Interface to enqueue GLES kernels and get/set the GLES CommandQueue.
- */
-
-/** @dir arm_compute/runtime/GLES_COMPUTE/functions
- * @brief Folder containing all the GLES functions.
- */
-
/** @dir arm_compute/runtime/NEON
- * @brief NEON backend runtime interface.
+ * @brief Arm® Neon™ backend runtime interface.
*/
/** @file arm_compute/runtime/NEON/NEFunctions.h
- * @brief Includes all the NEON functions at once.
+ * @brief Includes all the Arm® Neon™ functions at once.
*/
/** @dir arm_compute/runtime/NEON/functions
- * @brief Folder containing all the NEON functions.
+ * @brief Folder containing all the Arm® Neon™ functions.
*/
/** @dir arm_compute/runtime/OMP
@@ -221,10 +185,9 @@
* @details Examples have the following structure:
*
* -# cl_*.cpp --> OpenCL examples
- * -# gc_*.cpp --> GLES compute shaders examples
* -# graph_*.cpp --> Graph examples
- * -# neoncl_*.cpp --> NEON / OpenCL interoperability examples
- * -# neon_*.cpp --> NEON examples
+ * -# neoncl_*.cpp --> Arm® Neon™ / OpenCL interoperability examples
+ * -# neon_*.cpp --> Arm® Neon™ examples
*/
/** @dir examples/gemm_tuner
@@ -235,14 +198,6 @@
* @brief Utility scripts.
*/
-/** @file scripts/caffe_data_extractor.py
- * @brief Basic script to export weights from Caffe to npy files.
- */
-
-/** @file scripts/tensorflow_data_extractor.py
- * @brief Basic script to export weights from TensorFlow to npy files.
- */
-
/** @dir src
* @brief Source code implementing all the arm_compute headers.
*/
@@ -252,11 +207,11 @@
*/
/** @dir src/core/NEON/wrapper
- * @brief NEON wrapper used to simplify code
+ * @brief Arm® Neon™ wrapper used to simplify code
*/
/** @file src/core/NEON/wrapper/traits.h
- * @brief Traits defined on NEON vectors
+ * @brief Traits defined on Arm® Neon™ vectors
*/
/** @file src/core/NEON/wrapper/wrapper.h
@@ -264,14 +219,14 @@
*/
/** @dir src/core/NEON/wrapper/intrinsics
- * @brief NEON intrinsics wrappers
+ * @brief Arm® Neon™ intrinsics wrappers
*/
/** @dir src/core/NEON/wrapper/scalar
* @brief Scalar operations
*/
-/** @dir src/core/CL/gemm
+/** @dir src/gpu/cl/kernels/gemm
* @brief Folder containing all the configuration files for GEMM
*/
@@ -295,12 +250,8 @@
* @brief OpenCL accessors.
*/
-/** @dir tests/GLES_COMPUTE
- * @brief GLES accessors.
- */
-
/** @dir tests/NEON
- * @brief NEON accessors.
+ * @brief Arm® Neon™ accessors.
*/
/** @dir tests/benchmark
@@ -311,12 +262,8 @@
* @brief OpenCL benchmarking tests.
*/
-/** @dir tests/benchmark/GLES_COMPUTE
- * @brief GLES benchmarking tests.
- */
-
/** @dir tests/benchmark/NEON
- * @brief NEON benchmarking tests.
+ * @brief Arm® Neon™ benchmarking tests.
*/
/** @dir tests/benchmark_examples
@@ -347,12 +294,8 @@
* @brief C++ validation tests.
*/
-/** @dir tests/validation/GLES_COMPUTE
- * @brief GLES validation tests.
- */
-
/** @dir tests/validation/NEON
- * @brief NEON validation tests.
+ * @brief Arm® Neon™ validation tests.
*/
/** @dir tests/validation/reference
diff --git a/docs/Doxyfile b/docs/Doxyfile
index bdc4b776d3..186f66c086 100644
--- a/docs/Doxyfile
+++ b/docs/Doxyfile
@@ -1,4 +1,4 @@
-# Doxyfile 1.8.15
+# Doxyfile 1.8.17
# This file describes the settings to be used by the documentation system
# doxygen (www.doxygen.org) for a project.
@@ -364,7 +364,7 @@ SUBGROUPING = YES
# When the INLINE_GROUPED_CLASSES tag is set to YES, classes, structs and unions
# are shown inside the group in which they are included (e.g. using \ingroup)
-# instead of on a separate page (for HTML and Man pages) or section (for LaTeX
+# instead of on a separate page (for HTML and Manual pages) or section (for LaTeX
# and RTF).
#
# Note that this feature does not work in combination with
@@ -378,7 +378,7 @@ INLINE_GROUPED_CLASSES = NO
# the documentation of the scope in which they are defined (i.e. file,
# namespace, or group documentation), provided this scope is documented. If set
# to NO, structs, classes, and unions are shown on a separate page (for HTML and
-# Man pages) or section (for LaTeX and RTF).
+# Manual pages) or section (for LaTeX and RTF).
# The default value is: NO.
INLINE_SIMPLE_STRUCTS = NO
@@ -687,7 +687,7 @@ FILE_VERSION_FILTER =
# DoxygenLayout.xml, doxygen will parse it automatically even if the LAYOUT_FILE
# tag is left empty.
-LAYOUT_FILE =
+LAYOUT_FILE = ./docs/DoxygenLayout.xml
# The CITE_BIB_FILES tag can be used to specify one or more bib files containing
# the reference definitions. This must be a list of .bib files. The .bib
@@ -768,14 +768,20 @@ WARN_LOGFILE =
# spaces.
# Note: If this tag is empty the current directory is searched.
-INPUT = ./docs/00_introduction.dox \
- ./docs/01_library.dox \
- ./docs/02_tests.dox \
- ./docs/03_scripts.dox \
- ./docs/04_adding_operator.dox \
- ./docs/05_contribution_guidelines.dox \
- ./docs/06_functions_list.dox \
- ./docs/07_errata.dox \
+INPUT = ./docs/user_guide/introduction.dox \
+ ./docs/user_guide/how_to_build_and_run_examples.dox \
+ ./docs/user_guide/library.dox \
+ ./docs/user_guide/data_type.dox \
+ ./docs/user_guide/data_layout.dox \
+ ./docs/user_guide/conv2d_heuristic.dox \
+ ./docs/user_guide/operator_list.dox \
+ ./docs/user_guide/tests.dox \
+ ./docs/user_guide/advanced.dox \
+ ./docs/user_guide/release_version_and_change_log.dox \
+ ./docs/user_guide/errata.dox \
+ ./docs/contributor_guide/contribution_guidelines.dox \
+ ./docs/contributor_guide/adding_operator.dox \
+ ./docs/contributor_guide/implementation_topics.dox \
./docs/ComputeLibrary.dir \
./arm_compute/ \
./src/ \
@@ -869,9 +875,9 @@ EXCLUDE = ./arm_compute/core/NEON/kernels/assembly/ \
./src/core/NEON/kernels/convolution/ \
./src/core/NEON/kernels/NELKTrackerKernel.cpp \
./src/core/NEON/kernels/NEL2NormalizeLayerKernel.cpp \
- ./src/core/GLES_COMPUTE/cs_shaders/ \
./tests/datasets/ \
./tests/benchmark/fixtures/ \
+ ./tests/validation/CL/UNIT/dynamic_fusion/ClCompositeKernel.cpp \
./tests/validation/fixtures/
# The EXCLUDE_SYMLINKS tag can be used to select whether or not files or
@@ -1082,7 +1088,7 @@ CLANG_ASSISTED_PARSING = NO
# specified with INPUT and INCLUDE_PATH.
# This tag requires that the tag CLANG_ASSISTED_PARSING is set to YES.
-CLANG_OPTIONS = -std=c++11
+CLANG_OPTIONS = -std=c++14
#---------------------------------------------------------------------------
# Configuration options related to the alphabetical class index
@@ -1546,11 +1552,11 @@ MATHJAX_FORMAT = HTML-CSS
# MATHJAX_RELPATH should be ../mathjax. The default value points to the MathJax
# Content Delivery Network so you can quickly see the result without installing
# MathJax. However, it is strongly recommended to install a local copy of
-# MathJax from http://www.mathjax.org before deployment.
-# The default value is: http://cdn.mathjax.org/mathjax/latest.
+# MathJax from https://www.mathjax.org before deployment.
+# The default value is: https://cdn.mathjax.org/mathjax/latest.
# This tag requires that the tag USE_MATHJAX is set to YES.
-MATHJAX_RELPATH = http://cdn.mathjax.org/mathjax/latest
+MATHJAX_RELPATH = https://cdn.mathjax.org/mathjax/latest
# The MATHJAX_EXTENSIONS tag can be used to specify one or more MathJax
# extension names that should be enabled during MathJax rendering. For example
@@ -1878,16 +1884,16 @@ RTF_EXTENSIONS_FILE =
#RTF_SOURCE_CODE = NO
#---------------------------------------------------------------------------
-# Configuration options related to the man page output
+# Configuration options related to the manual page output
#---------------------------------------------------------------------------
-# If the GENERATE_MAN tag is set to YES, doxygen will generate man pages for
+# If the GENERATE_MAN tag is set to YES, doxygen will generate manual pages for
# classes and files.
# The default value is: NO.
GENERATE_MAN = NO
-# The MAN_OUTPUT tag is used to specify where the man pages will be put. If a
+# The MAN_OUTPUT tag is used to specify where the manual pages will be put. If a
# relative path is entered the value of OUTPUT_DIRECTORY will be put in front of
# it. A directory man3 will be created inside the directory specified by
# MAN_OUTPUT.
@@ -1897,7 +1903,7 @@ GENERATE_MAN = NO
MAN_OUTPUT = man
# The MAN_EXTENSION tag determines the extension that is added to the generated
-# man pages. In case the manual section does not start with a number, the number
+# manual pages. In case the manual section does not start with a number, the number
# 3 is prepended. The dot (.) at the beginning of the MAN_EXTENSION tag is
# optional.
# The default value is: .3.
@@ -1906,15 +1912,15 @@ MAN_OUTPUT = man
MAN_EXTENSION = .3
# The MAN_SUBDIR tag determines the name of the directory created within
-# MAN_OUTPUT in which the man pages are placed. If defaults to man followed by
+# MAN_OUTPUT in which the manual pages are placed. It defaults to man followed by
# MAN_EXTENSION with the initial . removed.
# This tag requires that the tag GENERATE_MAN is set to YES.
#MAN_SUBDIR =
# If the MAN_LINKS tag is set to YES and doxygen generates man output, then it
-# will generate one additional man file for each entity documented in the real
-# man page(s). These additional files only source the real man page, but without
+# will generate one additional manual file for each entity documented in the real
+# manual page(s). These additional files only source the real manual page, but without
# them the man command would be unable to find the correct page.
# The default value is: NO.
# This tag requires that the tag GENERATE_MAN is set to YES.
@@ -2064,7 +2070,7 @@ SEARCH_INCLUDES = YES
# preprocessor.
# This tag requires that the tag SEARCH_INCLUDES is set to YES.
-INCLUDE_PATH =
+INCLUDE_PATH = ./src/core/CL/cl_kernels/
# You can use the INCLUDE_FILE_PATTERNS tag to specify one or more wildcard
# patterns (like *.h and *.hpp) to filter out the header-files in the
diff --git a/docs/DoxygenLayout.xml b/docs/DoxygenLayout.xml
new file mode 100644
index 0000000000..4e09e20e3d
--- /dev/null
+++ b/docs/DoxygenLayout.xml
@@ -0,0 +1,211 @@
+<doxygenlayout version="1.0">
+ <!-- Generated by doxygen 1.8.15 -->
+ <!-- Navigation index tabs for HTML output -->
+ <navindex>
+ <tab type="mainpage" url="@ref introduction" title="Introduction"/>
+ <tab type="usergroup" title="User Guide">
+ <tab type="user" url="@ref how_to_build" title="How to Build and Run Examples"/>
+ <tab type="user" url="@ref architecture" title="Library Architecture"/>
+ <tab type="user" url="@ref data_type_support" title="Data Type Support"/>
+ <tab type="user" url="@ref data_layout_support" title="Data Layout Support"/>
+ <tab type="user" url="@ref conv2d_heuristic" title="Convolution 2D heuristic"/>
+ <tab type="user" url="@ref operators_list" title="Operator List"/>
+ <tab type="user" url="@ref tests" title="Validation and benchmarks"/>
+ <tab type="user" url="@ref advanced" title="Advanced"/>
+ <tab type="user" url="@ref versions_changelogs" title="Release Versions and Changelog"/>
+ <tab type="user" url="@ref errata" title="Errata"/>
+ </tab>
+ <tab type="usergroup" title="Contributor Guide">
+ <tab type="user" url="@ref contribution_guidelines" title="Contribution Guidelines"/>
+ <tab type="user" url="@ref adding_operator" title="How to Add a New Operator"/>
+ <tab type="user" url="@ref implementation_topic" title="Implementation Topics"/>
+ </tab>
+ <tab type="pages" visible="no" title="Pages" intro=""/>
+ <tab type="modules" visible="yes" title="" intro=""/>
+ <tab type="namespaces" visible="yes" title="">
+ <tab type="namespacelist" visible="yes" title="" intro=""/>
+ <tab type="namespacemembers" visible="yes" title="" intro=""/>
+ </tab>
+ <tab type="classes" visible="yes" title="">
+ <tab type="classlist" visible="yes" title="" intro=""/>
+ <tab type="classindex" visible="$ALPHABETICAL_INDEX" title=""/>
+ <tab type="hierarchy" visible="yes" title="" intro=""/>
+ <tab type="classmembers" visible="yes" title="" intro=""/>
+ </tab>
+ <tab type="files" visible="yes" title="">
+ <tab type="filelist" visible="yes" title="" intro="Below is a list of files with brief descriptions. Please note that the descriptions of some miscellaneous files and directories (e.g. data/ and include/) are omitted because of Doxygen limitations."/>
+ <tab type="globals" visible="yes" title="" intro=""/>
+ </tab>
+ <tab type="examples" visible="yes" title="" intro=""/>
+ </navindex>
+
+ <!-- Layout definition for a class page -->
+ <class>
+ <briefdescription visible="yes"/>
+ <includes visible="$SHOW_INCLUDE_FILES"/>
+ <inheritancegraph visible="$CLASS_GRAPH"/>
+ <collaborationgraph visible="$COLLABORATION_GRAPH"/>
+ <memberdecl>
+ <nestedclasses visible="yes" title=""/>
+ <publictypes title=""/>
+ <services title=""/>
+ <interfaces title=""/>
+ <publicslots title=""/>
+ <signals title=""/>
+ <publicmethods title=""/>
+ <publicstaticmethods title=""/>
+ <publicattributes title=""/>
+ <publicstaticattributes title=""/>
+ <protectedtypes title=""/>
+ <protectedslots title=""/>
+ <protectedmethods title=""/>
+ <protectedstaticmethods title=""/>
+ <protectedattributes title=""/>
+ <protectedstaticattributes title=""/>
+ <packagetypes title=""/>
+ <packagemethods title=""/>
+ <packagestaticmethods title=""/>
+ <packageattributes title=""/>
+ <packagestaticattributes title=""/>
+ <properties title=""/>
+ <events title=""/>
+ <privatetypes title=""/>
+ <privateslots title=""/>
+ <privatemethods title=""/>
+ <privatestaticmethods title=""/>
+ <privateattributes title=""/>
+ <privatestaticattributes title=""/>
+ <friends title=""/>
+ <related title="" subtitle=""/>
+ <membergroups visible="yes"/>
+ </memberdecl>
+ <detaileddescription title=""/>
+ <memberdef>
+ <inlineclasses title=""/>
+ <typedefs title=""/>
+ <enums title=""/>
+ <services title=""/>
+ <interfaces title=""/>
+ <constructors title=""/>
+ <functions title=""/>
+ <related title=""/>
+ <variables title=""/>
+ <properties title=""/>
+ <events title=""/>
+ </memberdef>
+ <allmemberslink visible="yes"/>
+ <usedfiles visible="$SHOW_USED_FILES"/>
+ <authorsection visible="yes"/>
+ </class>
+
+ <!-- Layout definition for a namespace page -->
+ <namespace>
+ <briefdescription visible="yes"/>
+ <memberdecl>
+ <nestednamespaces visible="yes" title=""/>
+ <constantgroups visible="yes" title=""/>
+ <classes visible="yes" title=""/>
+ <typedefs title=""/>
+ <enums title=""/>
+ <functions title=""/>
+ <variables title=""/>
+ <membergroups visible="yes"/>
+ </memberdecl>
+ <detaileddescription title=""/>
+ <memberdef>
+ <inlineclasses title=""/>
+ <typedefs title=""/>
+ <enums title=""/>
+ <functions title=""/>
+ <variables title=""/>
+ </memberdef>
+ <authorsection visible="yes"/>
+ </namespace>
+
+ <!-- Layout definition for a file page -->
+ <file>
+ <briefdescription visible="yes"/>
+ <includes visible="$SHOW_INCLUDE_FILES"/>
+ <includegraph visible="$INCLUDE_GRAPH"/>
+ <includedbygraph visible="$INCLUDED_BY_GRAPH"/>
+ <sourcelink visible="yes"/>
+ <memberdecl>
+ <classes visible="yes" title=""/>
+ <namespaces visible="yes" title=""/>
+ <constantgroups visible="yes" title=""/>
+ <defines title=""/>
+ <typedefs title=""/>
+ <enums title=""/>
+ <functions title=""/>
+ <variables title=""/>
+ <membergroups visible="yes"/>
+ </memberdecl>
+ <detaileddescription title=""/>
+ <memberdef>
+ <inlineclasses title=""/>
+ <defines title=""/>
+ <typedefs title=""/>
+ <enums title=""/>
+ <functions title=""/>
+ <variables title=""/>
+ </memberdef>
+ <authorsection/>
+ </file>
+
+ <!-- Layout definition for a group page -->
+ <group>
+ <briefdescription visible="yes"/>
+ <groupgraph visible="$GROUP_GRAPHS"/>
+ <memberdecl>
+ <nestedgroups visible="yes" title=""/>
+ <dirs visible="yes" title=""/>
+ <files visible="yes" title=""/>
+ <namespaces visible="yes" title=""/>
+ <classes visible="yes" title=""/>
+ <defines title=""/>
+ <typedefs title=""/>
+ <enums title=""/>
+ <enumvalues title=""/>
+ <functions title=""/>
+ <variables title=""/>
+ <signals title=""/>
+ <publicslots title=""/>
+ <protectedslots title=""/>
+ <privateslots title=""/>
+ <events title=""/>
+ <properties title=""/>
+ <friends title=""/>
+ <membergroups visible="yes"/>
+ </memberdecl>
+ <detaileddescription title=""/>
+ <memberdef>
+ <pagedocs/>
+ <inlineclasses title=""/>
+ <defines title=""/>
+ <typedefs title=""/>
+ <enums title=""/>
+ <enumvalues title=""/>
+ <functions title=""/>
+ <variables title=""/>
+ <signals title=""/>
+ <publicslots title=""/>
+ <protectedslots title=""/>
+ <privateslots title=""/>
+ <events title=""/>
+ <properties title=""/>
+ <friends title=""/>
+ </memberdef>
+ <authorsection visible="yes"/>
+ </group>
+
+ <!-- Layout definition for a directory page -->
+ <directory>
+ <briefdescription visible="yes"/>
+ <directorygraph visible="yes"/>
+ <memberdecl>
+ <dirs visible="yes"/>
+ <files visible="yes"/>
+ </memberdecl>
+ <detaileddescription title=""/>
+ </directory>
+</doxygenlayout>
diff --git a/docs/04_adding_operator.dox b/docs/contributor_guide/adding_operator.dox
index 13be712549..559e8e2e76 100644
--- a/docs/04_adding_operator.dox
+++ b/docs/contributor_guide/adding_operator.dox
@@ -1,5 +1,5 @@
///
-/// Copyright (c) 2018-2019 Arm Limited.
+/// Copyright (c) 2018-2022 Arm Limited.
///
/// SPDX-License-Identifier: MIT
///
@@ -25,10 +25,12 @@
namespace arm_compute
{
/**
-@page add_operator Adding new operators
+@page adding_operator How to Add a New Operator
@tableofcontents
+@section S4_0_introduction Adding new operators
+
@section S4_1_introduction Introduction
In Compute Library there are two main parts or modules:
- The core library consists of a low-level collection of algorithms implemented in C++ and optimized for Arm CPUs and GPUs. The core module is designed to be embedded in other projects and it doesn't perform any memory management or scheduling.
@@ -53,13 +55,13 @@ Following are the steps involved in adding support for a new operator in Compute
@subsection S4_1_1_add_datatypes Adding new data types
Compute Library declares a few new datatypes related to its domain; kernels and functions in the library process Tensors and Images (Computer Vision functions). Tensors are multi-dimensional arrays with a maximum of Coordinates::num_max_dimensions dimensions; depending on the number of dimensions, tensors can be interpreted as various objects. A scalar can be represented as a zero-dimensional tensor and a vector of numbers as a one-dimensional tensor. Furthermore, an image is just a 2D tensor, a 3D tensor can be seen as an array of images, and a 4D tensor as a 2D array of images, etc.
-All the datatype classes or structures are grouped in the core library folder arm_compute/core like the @ref ITensor, @ref ITensorInfo (all the information of a tensor), TensorShape and simpler types are in arm_compute/core/Types.h.
+All the datatype classes or structures are grouped in the core library folder arm_compute/core like the @ref ITensor, @ref ITensorInfo (all the information of a tensor), TensorShape and simpler types are in arm_compute/core/CoreTypes.h.
If an operator handles a new datatype, it must be added to the library. While adding a new data type to the library, it's necessary to implement the function to enable printing, the to_string() method and the output stream insertion (<<) operator. Every datatype implements these two functions in utils/TypePrinter.h
-A quick example, in <a href="https://github.com/ARM-software/ComputeLibrary/blob/master/arm_compute/core/Types.h">Types.h</a> we add:
+A quick example, in <a href="https://github.com/ARM-software/ComputeLibrary/blob/main/arm_compute/core/CoreTypes.h">CoreTypes.h</a> we add:
-@snippet arm_compute/core/Types.h DataLayout enum definition
+@snippet arm_compute/core/CoreTypes.h DataLayout enum definition
And for printing:
@@ -71,12 +73,12 @@ Similarly, all common functions that process shapes, like calculating output sha
@subsection S4_1_2_add_kernel Add a kernel
-As we mentioned at the beginning, the kernel is the implementation of the operator or algorithm partially using a specific programming language related to the backend we want to use. Adding a kernel in the library means implementing the algorithm in a SIMD technology like NEON or OpenCL. All kernels in Compute Library must implement a common interface IKernel or one of the specific subinterfaces.
+As we mentioned at the beginning, the kernel is the implementation of the operator or algorithm partially using a specific programming language related to the backend we want to use. Adding a kernel in the library means implementing the algorithm in a SIMD technology like Arm® Neon™ or OpenCL. All kernels in Compute Library must implement a common interface IKernel or one of the specific subinterfaces.
IKernel is the common interface for all the kernels in the core library; it contains the main methods to configure and run the kernel itself, such as window(), which returns the maximum window the kernel can be executed on, or is_parallelisable(), which indicates whether or not the kernel is parallelizable. If the kernel is parallelizable, the window returned by the window() method can be split into sub-windows which can then be run in parallel; otherwise, only the window returned by window() can be passed to the run method.
-There are specific interfaces for OpenCL and Neon: @ref ICLKernel, INEKernel (using INEKernel = @ref ICPPKernel).
+There are specific interfaces for OpenCL and Neon™: @ref ICLKernel, INEKernel (using INEKernel = @ref ICPPKernel).
- @ref ICLKernel is the common interface for all the OpenCL kernels. It implements the inherited methods and adds all the methods necessary to configure the CL kernel, such as set/return the Local-Workgroup-Size hint, add single, array or tensor argument, set the targeted GPU architecture according to the CL device. All these methods are used during the configuration and the run of the operator.
-- INEKernel inherits from @ref IKernel as well and it's the common interface for all kernels implemented in NEON, it adds just the run and the name methods.
+- INEKernel inherits from @ref IKernel as well and it's the common interface for all kernels implemented in Neon™, it adds just the run and the name methods.
There are two other implementations of @ref IKernel called @ref ICLSimpleKernel and INESimpleKernel; they are the interfaces for simple kernels that have just one input tensor and one output tensor.
Creating a new kernel implies adding new files:
@@ -85,7 +87,7 @@ Creating a new kernel implies adding new files:
- src/core/CL/kernels/CLReshapeLayerKernel.cpp
- src/core/CL/CLKernelLibrary.cpp
-Neon kernel
+Neon™ kernel
- arm_compute/core/NEON/kernels/NEReshapeLayerKernel.h
- src/core/NEON/kernels/NEReshapeLayerKernel.cpp
@@ -95,12 +97,12 @@ We must register the new layer in the respective libraries:
These files contain the list of all kernels available in the corresponding Compute Library's backend, for example CLKernels:
@code{.cpp}
-...
+...
#include "src/core/CL/kernels/CLMinMaxLayerKernel.h"
#include "src/core/CL/kernels/CLMinMaxLocationKernel.h"
-...
+...
#include "src/core/CL/kernels/CLReshapeLayerKernel.h"
-...
+...
@endcode
@@ -117,13 +119,13 @@ Each kernel will have to implement the method:
The structure of the kernel .cpp file should be similar to the next ones.
For OpenCL:
-@snippet src/core/CL/kernels/CLReshapeLayerKernel.cpp CLReshapeLayerKernel Kernel
+@snippet src/gpu/cl/kernels/ClReshapeKernel.cpp ClReshapeKernel Kernel
The run will call the function defined in the .cl file.
-For the NEON backend case:
-@snippet src/core/NEON/kernels/NEReshapeLayerKernel.cpp NEReshapeLayerKernel Kernel
+For the Arm® Neon™ backend case:
+@snippet src/cpu/kernels/CpuReshapeKernel.cpp NEReshapeLayerKernel Kernel
-In the NEON case, there is no need to add an extra file and we implement the kernel in the same NEReshapeLayerKernel.cpp file.
+In the Arm® Neon™ case, there is no need to add an extra file and we implement the kernel in the same NEReshapeLayerKernel.cpp file.
If the tests are already in place, the new kernel can be tested using the existing tests by adding the configure and run of the kernel to the compute_target() in the fixture.
@@ -137,13 +139,13 @@ If the tests are already in place, the new kernel can be tested using the existi
- (sub[n].start() - max[n].start()) % max[n].step() == 0
- (sub[n].end() - sub[n].start()) % max[n].step() == 0
-@ref CPPScheduler::schedule provides a sample implementation that is used for NEON kernels.
-%Memory management is the other aspect that the runtime layer is supposed to handle. %Memory management of the tensors is abstracted using TensorAllocator. Each tensor holds a pointer to a TensorAllocator object, which is used to allocate and free the memory at runtime. The implementation that is currently supported in Compute Library allows memory blocks, required to be fulfilled for a given operator, to be grouped together under a @ref MemoryGroup. Each group can be acquired and released. The underlying implementation of memory groups vary depending on whether NEON or CL is used. The memory group class uses memory pool to provide the required memory. It also uses the memory manager to manage the lifetime and a IPoolManager to manage the memory pools registered with the memory manager.
+@ref CPPScheduler::schedule provides a sample implementation that is used for Arm® Neon™ kernels.
+%Memory management is the other aspect that the runtime layer is supposed to handle. %Memory management of the tensors is abstracted using TensorAllocator. Each tensor holds a pointer to a TensorAllocator object, which is used to allocate and free the memory at runtime. The implementation that is currently supported in Compute Library allows memory blocks, required to be fulfilled for a given operator, to be grouped together under a @ref MemoryGroup. Each group can be acquired and released. The underlying implementation of memory groups varies depending on whether Arm® Neon™ or CL is used. The memory group class uses a memory pool to provide the required memory. It also uses the memory manager to manage the lifetime and an IPoolManager to manage the memory pools registered with the memory manager.
We have seen the various interfaces for a kernel in the core library; the same file structure design exists in the runtime module. IFunction is the base class for all the functions; it has two child interfaces, ICLSimpleFunction and INESimpleFunction, that are used as base classes for functions which call a single kernel.
-The new operator has to implement %validate(), configure() and run(), these methods will call the respective function in the kernel considering that the multi-threading is used for the kernels which are parallelizable, by default std::thread::hardware_concurrency() threads are used. For NEON function can be used CPPScheduler::set_num_threads() to manually set the number of threads, whereas for OpenCL kernels all the kernels are enqueued on the queue associated with CLScheduler and the queue is then flushed.
+The new operator has to implement %validate(), configure() and run(); these methods will call the respective function in the kernel. Multi-threading is used for the kernels which are parallelizable; by default, std::thread::hardware_concurrency() threads are used. For Arm® Neon™ functions, CPPScheduler::set_num_threads() can be used to manually set the number of threads, whereas for OpenCL all the kernels are enqueued on the queue associated with CLScheduler and the queue is then flushed.
For the runtime functions, there is an extra method implemented: prepare(). This method prepares the function for the run; it does all the heavy operations that are done only once (reshaping the weights, releasing memory no longer necessary after the reshape, etc.). The prepare method can be called standalone or, if not called before, it will run during the first run; the function is then marked as prepared.
The files we add are:
@@ -151,7 +153,7 @@ OpenCL function
- arm_compute/runtime/CL/functions/CLReshapeLayer.h
- src/runtime/CL/functions/CLReshapeLayer.cpp
-Neon function
+Neon™ function
- arm_compute/runtime/NEON/functions/NEReshapeLayer.h
- src/runtime/NEON/functions/NEReshapeLayer.cpp
@@ -214,7 +216,7 @@ void CLAddReshapeLayer::run()
@endcode
-For NEON:
+For Neon™:
@code{.cpp}
using namespace arm_compute;
@@ -262,7 +264,7 @@ At this point, everything is in place at the library level. If you are following
@subsubsection S4_1_4_1_add_reference Add the reference implementation and the tests
As mentioned in the introduction, the reference implementation is a pure C++ implementation without any optimization or backend specific instruction.
-The refence implementation consist of two files into the folder tests/validation/reference:
+The reference implementation consists of two files in the folder tests/validation/reference:
- tests/validation/reference/ReshapeLayer.h
- tests/validation/reference/ReshapeLayer.cpp
@@ -298,7 +300,7 @@ For example, dataset for ReshapeLayer:
Benchmark and validation tests are based on the same framework to setup and run the tests. In addition to running simple, self-contained test functions the framework supports fixtures and data test cases.
Fixtures can be used to share common setup, teardown or even run tasks among multiple test cases, for that purpose a fixture can define a "setup", "teardown" and "run" method.
-Adding tests for the new operator in the runtime library we need to implement at least the setup method, that is used to call two methods for configure, run and return the output respectively of the target (CL or Neon) and the reference (C++ implementation).
+To add tests for the new operator in the runtime library, we need to implement at least the setup method, which is used to configure, run and return the output of both the target (CL or Neon™) and the reference (C++ implementation).
For example let's have a look at Reshape Layer Fixture :
@@ -308,7 +310,7 @@ In the fixture class above we can see that the setup method computes the target
The compute_target method reflects the exact behavior expected when we call a function. The input and output tensor must be declared, function configured, tensors allocated, the input tensor filled with required data, and finally, the function must be run and the results returned.
This fixture is used in the test case, that is a parameterized test case that inherits from a fixture. The test case will have access to all public and protected members of the fixture. Only the setup and teardown methods of the fixture will be used. The setup method of the fixture needs to be a template and must accept inputs from the dataset as arguments.
The body of this function will be used as a test function.
-For the fixture test case the first argument is the name of the test case (has to be unique within the enclosing test suite), the second argument is the class name of the fixture, the third argument is the dataset mode in which the test will be active (PRECOMMIT or NIGTHLY) and the fourth argument is the dataset.
+For the fixture test case the first argument is the name of the test case (has to be unique within the enclosing test suite), the second argument is the class name of the fixture, the third argument is the dataset mode in which the test will be active (PRECOMMIT or NIGHTLY) and the fourth argument is the dataset.
For example:
@snippet tests/validation/CL/ActivationLayer.cpp CLActivationLayerFixture snippet
diff --git a/docs/05_contribution_guidelines.dox b/docs/contributor_guide/contribution_guidelines.dox
index 1cdd129733..cbaa502635 100644
--- a/docs/05_contribution_guidelines.dox
+++ b/docs/contributor_guide/contribution_guidelines.dox
@@ -1,5 +1,5 @@
///
-/// Copyright (c) 2019 Arm Limited.
+/// Copyright (c) 2019-2023 Arm Limited.
///
/// SPDX-License-Identifier: MIT
///
@@ -24,7 +24,7 @@
namespace arm_compute
{
/**
-@page contribution_guidelines Contribution guidelines
+@page contribution_guidelines Contribution Guidelines
@tableofcontents
@@ -35,6 +35,14 @@ The development is structured in the following way:
- Development repository: https://review.mlplatform.org/#/admin/projects/ml/ComputeLibrary
- Please report issues here: https://github.com/ARM-software/ComputeLibrary/issues
+@section S5_0_inc_lang Inclusive language guideline
+As part of the initiative to use inclusive language, certain phrases and words were removed or replaced by more inclusive ones. Examples include, but are not limited to:
+\includedoc non_inclusive_language_examples.dox
+
+Please also follow this guideline when committing changes to Compute Library.
+It is worth mentioning that the term "master" is still used in some comments but only in reference to external code links that Arm has no governance on.
+
+Furthermore, starting from release 22.05, the 'master' branch is no longer used; it has been replaced by 'main'. Please update your clone jobs accordingly.
@section S5_1_coding_standards Coding standards and guidelines
Best practices (as suggested by clang-tidy):
@@ -139,11 +147,11 @@ void foobar(const MyLargeCustomTypeClass &m); // Definitely better as const-refe
- Don't use unions
-Unions cannot be used to convert values between different types because (in C++) it is undefined behaviour to read from a member other than the last one that has been assigned to. This limits the use of unions to a few corner cases and therefor the general advice is not to use unions. See http://releases.llvm.org/3.8.0/tools/clang/tools/extra/docs/clang-tidy/checks/cppcoreguidelines-pro-type-union-access.html
+Unions cannot be used to convert values between different types because (in C++) it is undefined behaviour to read from a member other than the last one that has been assigned to. This limits the use of unions to a few corner cases and therefore the general advice is not to use unions. See http://releases.llvm.org/3.8.0/tools/clang/tools/extra/docs/clang-tidy/checks/cppcoreguidelines-pro-type-union-access.html
- Use pre-increment/pre-decrement whenever possible
-In contrast to the pre-incerement the post-increment has to make a copy of the incremented object. This might not be a problem for primitive types like int but for class like objects that overload the operators, like iterators, it can have a huge impact on the performance. See http://stackoverflow.com/a/9205011
+In contrast to the pre-increment the post-increment has to make a copy of the incremented object. This might not be a problem for primitive types like int but for class like objects that overload the operators, like iterators, it can have a huge impact on the performance. See http://stackoverflow.com/a/9205011
To be consistent across the different cases the general advice is to use the pre-increment operator unless post-increment is explicitly required. The same rules apply for the decrement operator.
@@ -232,6 +240,18 @@ class ClassName
};
@endcode
+- In header files, use header guards that use the full file path from the project root and prepend it with "ACL_"
+
+@code{cpp}
+// For File arm_compute/runtime/NEON/functions/NEBatchNormalizationLayer.h
+#ifndef ACL_ARM_COMPUTE_RUNTIME_NEON_FUNCTIONS_NEBATCHNORMALIZATIONLAYER
+#define ACL_ARM_COMPUTE_RUNTIME_NEON_FUNCTIONS_NEBATCHNORMALIZATIONLAYER
+.
+.
+.
+#endif /* ACL_ARM_COMPUTE_RUNTIME_NEON_FUNCTIONS_NEBATCHNORMALIZATIONLAYER */
+@endcode
+
- Use quotes instead of angular brackets to include local headers. Use angular brackets for system headers.
- Also include the module header first, then local headers, and lastly system headers. All groups should be separated by a blank line and sorted lexicographically within each group.
- Where applicable the C++ version of system headers has to be included, e.g. cstddef instead of stddef.h.
@@ -256,6 +276,47 @@ auto c = img.ptr(); // NO: Can't tell what the type is without knowing the API.
auto d = vdup_n_u8(0); // NO: It's not obvious what type this function returns.
@endcode
+- When to use const
+
+ - Local variables: Use const as much as possible. E.g. all read-only variables should be declared as const.
+
+ - Function parameters
+
+ - Top-level const must not be used in the function declaration or definition. (Note that this applies to all types, including non-primitive types)
+ This is because for function parameters, top-level const in function declaration is always ignored by the compiler (it is meaningless).
+ Therefore we should omit top-level const to reduce visual clutter. In addition, its omission can improve API/ABI
+ stability to some extent as there is one fewer varying factor in function signatures.
+
+ Note that we could in theory allow top-level const in only definition (which is not ignored by the compiler) but not declaration.
+ But certain toolchains are known to require the declaration and definition to match exactly.
+
+ - Use low-level const (of references and pointers) as much as possible.
+@code{.cpp}
+// Primitive types
+void foo(const int a); // NO: Top-level const must not be used in function declaration or definition
+void foo(int a); // OK
+// Pointer to primitive types
+void foo(int *const a); // NO: Top-level const
+void foo(const int *const a); // NO: Top-level const
+void foo(int *a); // OK. But only if foo needs to mutate the underlying object
+void foo(const int *a); // OK but not recommended: See section above about passing primitives by value
+// Reference to primitive types
+// There's no "top-level const" for references
+void foo(int &a); // OK. But only if foo needs to mutate the underlying object
+void foo(const int &a); // OK but not recommended: See section above about passing primitives by value
+
+// Custom types
+void foo(const Goo g); // NO: Top-level const
+void foo(Goo g); // OK
+// Pointer to custom types
+void foo(Goo *const g); // NO: Top-level const
+void foo(Goo *g); // OK. But only if foo needs to mutate the underlying object
+void foo(const Goo *g); // OK
+// Reference to custom types
+void foo(Goo &g); // OK. But only if foo needs to mutate the underlying object
+void foo(const Goo &g); // OK
+@endcode
+
- OpenCL:
- Use __ in front of the memory types qualifiers and kernel: __kernel, __constant, __private, __global, __local.
- Indicate how the global workgroup size / offset / local workgroup size are being calculated.
@@ -264,7 +325,7 @@ auto d = vdup_n_u8(0); // NO: It's not obvious what type this function returns.
- No '*' in front of argument names
- [in], [out] or [in,out] *in front* of arguments
- - Skip a line between the description and params and between params and @return (If there is a return)
+ - Skip a line between the description and params and between params and \@return (If there is a return)
- Align params names and params descriptions (Using spaces), and with a single space between the widest column and the next one.
- Use an upper case at the beginning of the description
@@ -274,6 +335,26 @@ auto d = vdup_n_u8(0); // NO: It's not obvious what type this function returns.
astyle (http://astyle.sourceforge.net/) and clang-format (https://clang.llvm.org/docs/ClangFormat.html) can check and help you apply some of these rules.
+We have also provided the Python scripts we use in our precommit pipeline inside the scripts directory.
+ - format_code.py: checks Android.bp, bad style, end of file, formats doxygen, runs astyle and clang-format (assuming necessary binaries are in the path). Example invocations:
+@code{.sh}
+ python format_code.py
+ python format_code.py --error-on-diff
+ python format_code.py --files=git-diff (Default behavior in pre-commit configuration, where it checks the staged files)
+@endcode
+ - generate_build_files.py: generates build files required for CMake and Bazel builds. Example invocations:
+@code{.sh}
+ python generate_build_files.py --cmake
+ python generate_build_files.py --bazel
+@endcode
+
+Another way of running the checks is using the `pre-commit` (https://pre-commit.com/) framework, which also has nice features like checking for trailing spaces, large committed files, etc.
+`pre-commit` can be installed via `pip`. After installing, run the following command in the root directory of the repository:
+
+ pre-commit install
+
+This will create the hooks that perform the formatting checks mentioned above and will automatically run just before committing to flag issues.
+
@subsection S5_1_3_library_size_guidelines Library size: best practices and guidelines
@subsubsection S5_1_3_1_template_suggestions Template suggestions
@@ -391,7 +472,7 @@ In order to deprecate an existing API, these rules should be followed.
- Deprecation of runtime APIs should strictly follow the aforementioned period, whereas core APIs can have more flexibility as they are mostly used internally rather than user-facing.
- Any API changes (update, addition and deprecation) in all components should be well documented by the contribution itself.
-Also, it is recommended to use the following utility macros which is designed to work with both clang and gcc using C++11 and later.
+Also, it is recommended to use the following utility macros which is designed to work with both clang and gcc using C++14 and later.
- ARM_COMPUTE_DEPRECATED: Just deprecate the wrapped function
- ARM_COMPUTE_DEPRECATED_REL: Deprecate the wrapped function and also capture the release that was deprecated
@@ -434,17 +515,17 @@ You can add this to your patch with:
You are now ready to submit your patch for review:
- git push acl-gerrit HEAD:refs/for/master
+ git push acl-gerrit HEAD:refs/for/main
@section S5_3_code_review Patch acceptance and code review
-Once a patch is uploaded for review, there is a pre-commit test that runs on a Jenkins server for continuos integration tests. In order to be merged a patch needs to:
+Once a patch is uploaded for review, there is a pre-commit test that runs on a Jenkins server for continuous integration tests. In order to be merged a patch needs to:
- get a "+1 Verified" from the pre-commit job
- get a "+1 Comments-Addressed", in case of comments from reviewers the committer has to address them all. A comment is considered addressed when the first line of the reply contains the word "Done"
- get a "+2" from a reviewer, that means the patch has the final approval
-At the moment, the Jenkins server is not publicly accessible and for security reasons patches submitted by non-whitelisted committers do not trigger the pre-commit tests. For this reason, one of the maintainers has to manually trigger the job.
+At the moment, the Jenkins server is not publicly accessible and for security reasons patches submitted by non-allowlisted committers do not trigger the pre-commit tests. For this reason, one of the maintainers has to manually trigger the job.
If the pre-commit test fails, the Jenkins job will post a comment on Gerrit with the details about the failure so that the committer will be able to reproduce the error and fix the issue, if any (sometimes there can be infrastructure issues, a test platform disconnecting for example, where the job needs to be retriggered).
diff --git a/docs/contributor_guide/implementation_topics.dox b/docs/contributor_guide/implementation_topics.dox
new file mode 100644
index 0000000000..6ca78f98e7
--- /dev/null
+++ b/docs/contributor_guide/implementation_topics.dox
@@ -0,0 +1,189 @@
+///
+/// Copyright (c) 2017-2021, 2024 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page implementation_topic Implementation Topics
+
+@section implementation_topic_assembly_kernels Assembly kernels
+
+Arm Compute Library contains a collection of highly optimized assembly kernels for Arm® A profile architecture. At runtime the
+library selects the best kernel based on the CPU detected. For example, if the CPU supports the dot product instruction,
+the library will choose a GEMM kernel that uses it. There are various kernels using Neon™ and
+architecture extensions like FP16, Dot product, SVE, SVE2 and SME.
+
+For example, some assembly kernels are located in the folders:
+- src/core/NEON/kernels/arm_gemm/kernels
+- src/core/NEON/kernels/arm_gemm/pooling
+- src/core/NEON/kernels/arm_conv/depthwise
+
+
+The assembly kernels are written using assembly mnemonics and the .inst directive, which inserts the machine code directly into the output.
+
+Below you can see a code block from one of the kernels in the library which uses the .inst directive to generate the sdot instruction.
+This code can be found in the kernel @ref src/core/NEON/kernels/arm_gemm/kernels/a64_hybrid_s8qa_dot_4x16/a55.cpp
+
+@code{.cpp}
+".inst 0x4f80eb10 // sdot v16.4s, v24.16b, v0.4b[2]\n"
+".inst 0x4f81eb14 // sdot v20.4s, v24.16b, v1.4b[2]\n"
+" ldr d24, [x12, #0xf0]\n"
+" ldr x20, [x12, #0xf8]\n"
+" .inst 0x4f80ebd1 // sdot v17.4s, v30.16b, v0.4b[2]\n"
+" .inst 0x4f81ebd5 // sdot v21.4s, v30.16b, v1.4b[2]\n"
+" mov v27.d[1], x23\n"
+" .inst 0x4f80ebb2 // sdot v18.4s, v29.16b, v0.4b[2]\n"
+" mov v26.d[1], x22\n"
+" .inst 0x4f81ebb6 // sdot v22.4s, v29.16b, v1.4b[2]\n"
+" mov v25.d[1], x21\n"
+" .inst 0x4f80eb93 // sdot v19.4s, v28.16b, v0.4b[2]\n"
+" mov v24.d[1], x20\n"
+" .inst 0x4f81eb97 // sdot v23.4s, v28.16b, v1.4b[2]\n"
+" add x9, x9, #0x10\n"
+" add x28, x28, #0x10\n"
+" add x12, x12, #0x100\n"
+" .inst 0x4fa0eb70 // sdot v16.4s, v27.16b, v0.4b[3]\n"
+" .inst 0x4fa1eb74 // sdot v20.4s, v27.16b, v1.4b[3]\n"
+" .inst 0x4fa0eb51 // sdot v17.4s, v26.16b, v0.4b[3]\n"
+" .inst 0x4fa1eb55 // sdot v21.4s, v26.16b, v1.4b[3]\n"
+@endcode
+
+Note that every occurrence of .inst is accompanied by a comment with the equivalent assembly mnemonic for readability purposes.
+
+The reason for using the opcodes instead of the mnemonics is that this approach works on any toolchain, including ones without support for the dot product mnemonic. The .inst directive is used to generate many other instructions as well, ensuring the code will compile on older toolchains that do not support them.
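The semantics of the indexed sdot instruction encoded above can be illustrated with a plain C++ scalar model (a hypothetical sketch for illustration only, not library code):

```cpp
#include <array>
#include <cstdint>

// Scalar model of the indexed dot product instruction:
//   sdot Vd.4s, Vn.16b, Vm.4b[idx]
// Each of the four 32-bit accumulator lanes d[i] is incremented by the dot
// product of bytes n[4*i .. 4*i+3] with the single four-byte group of m
// selected by idx (m[4*idx .. 4*idx+3]).
std::array<int32_t, 4> sdot_indexed(std::array<int32_t, 4> d,
                                    const std::array<int8_t, 16> &n,
                                    const std::array<int8_t, 16> &m,
                                    int idx)
{
    for (int i = 0; i < 4; ++i)
    {
        for (int j = 0; j < 4; ++j)
        {
            d[i] += static_cast<int32_t>(n[4 * i + j]) * static_cast<int32_t>(m[4 * idx + j]);
        }
    }
    return d;
}
```

This is why a single sdot can replace sixteen scalar multiply-accumulates, which is what makes these kernels attractive for quantized GEMM.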
+
+@section implementation_topic_windows Windows
+
+A @ref Window represents a workload to execute; it can handle up to @ref Coordinates::num_max_dimensions dimensions.
+Each dimension is defined by a start, end and step.
+
+It can be split into subwindows as long as *all* the following rules remain true for all the dimensions:
+
+- max[n].start() <= sub[n].start() < max[n].end()
+- sub[n].start() < sub[n].end() <= max[n].end()
+- max[n].step() == sub[n].step()
+- (sub[n].start() - max[n].start()) % max[n].step() == 0
+- (sub[n].end() - sub[n].start()) % max[n].step() == 0
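The five rules above can be checked per dimension with a small standalone sketch (a hypothetical Dim struct and helper for illustration; the library's own @ref Window class enforces these rules internally):

```cpp
// One dimension of a window: start, end and step, mirroring the rules above.
struct Dim
{
    int start;
    int end;
    int step; // assumed non-zero
};

// True if 'sub' is a valid sub-window of 'max' in one dimension,
// i.e. all five splitting rules listed above hold.
bool is_valid_subwindow(const Dim &max, const Dim &sub)
{
    return max.start <= sub.start && sub.start < max.end // rule 1
        && sub.start < sub.end && sub.end <= max.end     // rule 2
        && max.step == sub.step                          // rule 3
        && (sub.start - max.start) % max.step == 0       // rule 4
        && (sub.end - sub.start) % max.step == 0;        // rule 5
}
```

For example, with a maximum window of [0, 16) and step 4, the sub-window [4, 12) with step 4 is valid, while [5, 12) is not because its start is not aligned to the step.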
+
+@section implementation_topic_kernels Kernels
+
+Each implementation of the @ref IKernel interface (base class of all the kernels in the core library) works in the same way:
+
+OpenCL kernels:
+
+@code{.cpp}
+// Initialize the CLScheduler with the default context and default command queue
+// Implicitly initializes the CLKernelLibrary to use ./cl_kernels as location for OpenCL kernels files and sets a default device for which OpenCL programs are built.
+CLScheduler::get().default_init();
+
+cl::CommandQueue q = CLScheduler::get().queue();
+//Create a kernel object:
+MyKernel kernel;
+// Initialize the kernel with the input/output and options you want to use:
+kernel.configure( input, output, option0, option1);
+// Retrieve the execution window of the kernel:
+const Window& max_window = kernel.window();
+// Run the whole kernel in the current thread:
+kernel.run( q, max_window ); // Enqueue the kernel to process the full window on the default queue
+
+// Wait for the processing to complete:
+q.finish();
+@endcode
+
+Neon™ / CPP kernels:
+
+@code{.cpp}
+//Create a kernel object:
+MyKernel kernel;
+// Initialize the kernel with the input/output and options you want to use:
+kernel.configure( input, output, option0, option1);
+// Retrieve the execution window of the kernel:
+const Window& max_window = kernel.window();
+// Run the whole kernel in the current thread:
+kernel.run( max_window ); // Run the kernel on the full window
+@endcode
+
+@section implementation_topic_multithreading Multi-threading
+
+The previous section shows how to run an Arm® Neon™ / CPP kernel in the current thread; however, if your system has several CPU cores, you will probably want the kernel to use several cores. Here is how this can be done:
+
+@code{.cpp}
+ ThreadInfo info;
+ info.cpu_info = &_cpu_info;
+
+ const Window &max_window = kernel->window();
+ const unsigned int num_iterations = max_window.num_iterations(split_dimension);
+ info.num_threads = std::min(num_iterations, _num_threads);
+
+ if(num_iterations == 0)
+ {
+ return;
+ }
+
+ if(!kernel->is_parallelisable() || info.num_threads == 1)
+ {
+ kernel->run(max_window, info);
+ }
+ else
+ {
+ int t = 0;
+ auto thread_it = _threads.begin();
+
+ for(; t < info.num_threads - 1; ++t, ++thread_it)
+ {
+ Window win = max_window.split_window(split_dimension, t, info.num_threads);
+ info.thread_id = t;
+ thread_it->start(kernel, win, info);
+ }
+
+ // Run last part on main thread
+ Window win = max_window.split_window(split_dimension, t, info.num_threads);
+ info.thread_id = t;
+ kernel->run(win, info);
+
+ try
+ {
+ for(auto &thread : _threads)
+ {
+ thread.wait();
+ }
+ }
+ catch(const std::system_error &e)
+ {
+ std::cerr << "Caught system_error with code " << e.code() << " meaning " << e.what() << '\n';
+ }
+ }
+@endcode
+
+This is a very basic implementation which was originally used in the Arm® Neon™ runtime library by all the Arm® Neon™ functions.
+
+@sa CPPScheduler
+
+@note Some kernels need some local temporary buffer to perform their calculations. In order to avoid memory corruption between threads, the local buffer must be of size: ```memory_needed_per_thread * num_threads``` and a unique thread_id between 0 and num_threads must be assigned to the @ref ThreadInfo object passed to the ```run``` function.
+
+
+@section implementation_topic_cl_scheduler OpenCL kernel library
+
+All OpenCL kernels used by the library are built and stored in @ref CLKernelLibrary.
+If the library is compiled with embed_kernels=0, the application can set the path to the OpenCL kernels by calling @ref CLKernelLibrary::init(); by default the path is set to "./cl_kernels".
+*/
+} // namespace arm_compute
diff --git a/docs/contributor_guide/non_inclusive_language_examples.dox b/docs/contributor_guide/non_inclusive_language_examples.dox
new file mode 100644
index 0000000000..addfdd34dd
--- /dev/null
+++ b/docs/contributor_guide/non_inclusive_language_examples.dox
@@ -0,0 +1,4 @@
+ - master/slave
+ - black/white
+ - he/she, him/her, his/hers
+ - When referring to a person where gender is irrelevant or unknown, kindly use they, them, theirs, or a person’s preferred pronoun. \ No newline at end of file
diff --git a/docs/user_guide/advanced.dox b/docs/user_guide/advanced.dox
new file mode 100644
index 0000000000..2b9e0d02f7
--- /dev/null
+++ b/docs/user_guide/advanced.dox
@@ -0,0 +1,139 @@
+///
+/// Copyright (c) 2017-2021 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page advanced Advanced
+
+@tableofcontents
+
+@section S1_8_cl_tuner OpenCL Tuner
+
+The OpenCL tuner, a.k.a. CLTuner, is a module of Arm Compute Library that can improve the performance of the OpenCL kernels by tuning the Local-Workgroup-Size (LWS).
+The optimal LWS for each unique OpenCL kernel configuration is stored in a table. This table can be either imported or exported from/to a file.
+The OpenCL tuner runs the same OpenCL kernel for a range of local workgroup sizes and keeps the local workgroup size of the fastest run to use in subsequent calls to the kernel. It supports three modes of tuning with different trade-offs between the time taken to tune and the kernel execution time achieved using the best LWS found. In the Exhaustive mode, it searches all the supported values of LWS. This mode takes the longest time to tune and is the most likely to find the optimal LWS. Normal mode searches a subset of LWS values to yield a good approximation of the optimal LWS. It takes less time to tune than Exhaustive mode. Rapid mode takes the shortest time to tune and finds an LWS value that is at least as good or better than the default LWS value. The mode affects only the search for the optimal LWS and has no effect when the LWS value is imported from a file.
+In order for the performance numbers to be meaningful, you must disable GPU power management and set the GPU to a fixed frequency for the entire duration of the tuning phase.
+
+If you wish to know more about LWS and its important role in improving GPU cache utilization, we suggest having a look at the presentation "Even Faster CNNs: Exploring the New Class of Winograd Algorithms" available at the following link:
+
+https://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2018-embedded-vision-summit-iodice
+
+Tuning a network from scratch can take a long time and considerably affect the execution time of the first run of your network. For this reason, it is recommended to store the CLTuner's results in a file to amortize this cost when you re-use the same network or functions with the same configurations. The tuning is performed only once for each OpenCL kernel.
+
+CLTuner looks for the optimal LWS for each unique OpenCL kernel configuration. Since a function (i.e. Convolution Layer, Pooling Layer, Fully Connected Layer ...) can be called multiple times but with different parameters, we associate an "id" (called "config_id") with each kernel to distinguish the unique configurations.
+
+ #Example: 2 unique Matrix Multiply configurations
+@code{.cpp}
+ TensorShape a0 = TensorShape(32,32);
+ TensorShape b0 = TensorShape(32,32);
+ TensorShape c0 = TensorShape(32,32);
+ TensorShape a1 = TensorShape(64,64);
+ TensorShape b1 = TensorShape(64,64);
+ TensorShape c1 = TensorShape(64,64);
+
+ Tensor a0_tensor;
+ Tensor b0_tensor;
+ Tensor c0_tensor;
+ Tensor a1_tensor;
+ Tensor b1_tensor;
+ Tensor c1_tensor;
+
+ a0_tensor.allocator()->init(TensorInfo(a0, 1, DataType::F32));
+ b0_tensor.allocator()->init(TensorInfo(b0, 1, DataType::F32));
+ c0_tensor.allocator()->init(TensorInfo(c0, 1, DataType::F32));
+ a1_tensor.allocator()->init(TensorInfo(a1, 1, DataType::F32));
+ b1_tensor.allocator()->init(TensorInfo(b1, 1, DataType::F32));
+    c1_tensor.allocator()->init(TensorInfo(c1, 1, DataType::F32));
+
+ CLGEMM gemm0;
+ CLGEMM gemm1;
+
+ // Configuration 0
+    gemm0.configure(&a0_tensor, &b0_tensor, nullptr, &c0_tensor, 1.0f, 0.0f);
+
+ // Configuration 1
+    gemm1.configure(&a1_tensor, &b1_tensor, nullptr, &c1_tensor, 1.0f, 0.0f);
+@endcode
+
+@subsection S1_8_1_cl_tuner_how_to How to use it
+
+All the graph examples in the Compute Library's folder "examples" and the arm_compute_benchmark accept an argument to enable the OpenCL tuner and an argument to export/import the LWS values to/from a file.
+
+ #Enable CL tuner
+    ./graph_mobilenet --enable-tuner --target=CL
+ ./arm_compute_benchmark --enable-tuner
+
+ #Export/Import to/from a file
+ ./graph_mobilenet --enable-tuner --target=CL --tuner-file=acl_tuner.csv
+ ./arm_compute_benchmark --enable-tuner --tuner-file=acl_tuner.csv
+
+If you are importing the CLTuner's results from a file, the new tuned LWS values will be appended to it.
+
+Whether you are benchmarking the graph examples or the test cases in the arm_compute_benchmark, remember to:
+
+ -# Disable the power management
+ -# Keep the GPU frequency constant
+    -# Run the network multiple times (e.g. 10).
+
+If you are not using the graph API or the benchmark infrastructure you will need to manually pass a CLTuner object to CLScheduler before configuring any function.
+
+@code{.cpp}
+CLTuner tuner;
+
+// Setup Scheduler
+CLScheduler::get().default_init(&tuner);
+@endcode
+
+After the first run, the CLTuner's results can be exported to a file using the method "save_to_file()".
+- tuner.save_to_file("results.csv");
+
+This file can be also imported using the method "load_from_file("results.csv")".
+- tuner.load_from_file("results.csv");
+
+@section security_concerns Security Concerns
+Here are some security concerns that may affect Compute Library.
+
+@subsection security_concerns_same_uid A process running under the same uid could read another process' memory
+
+Processes running under the same user ID (UID) may be able to read each other's memory and running state. This can
+lead to information disclosure: sensitive data, such as the weights of the currently executing model, can be leaked.
+This mainly affects Linux systems, and it is the responsibility of the system owner to secure processes against
+this vulnerability. The YAMA security kernel module can be used to detect and stop such attempts; it can be
+selected at kernel compile time with CONFIG_SECURITY_YAMA and configured at runtime by changing
+ptrace_scope in /proc/sys/kernel/yama.
+
+Please refer to https://www.kernel.org/doc/html/v4.15/admin-guide/LSM/Yama.html for more information in this regard.
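As a concrete sketch of the runtime configuration mentioned above (the sysfs path is the one documented by the kernel; changing the value requires root):

```shell
# Read the current Yama ptrace scope
# (0 = classic, 1 = restricted to descendants, 2 = admin-only attach, 3 = no attach)
cat /proc/sys/kernel/yama/ptrace_scope

# Restrict ptrace to direct descendants (requires root)
echo 1 | sudo tee /proc/sys/kernel/yama/ptrace_scope
```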
+
+@subsection security_concerns_file_tampering Malicious users could alter Compute Library related files
+
+Extra care must be taken in order to reduce the possibility of a user altering sensitive files. CLTuner files
+should be protected from arbitrary writes, since tampering with them can cause Compute Library to crash or exhaust the system's resources.
+
+@subsection security_concerns_various Various concerns
+
+Sensitive applications that use Compute Library should consider possible attack vectors, such as shared library hooking,
+information leakage from the underlying OpenCL driver or from a previous execution, and running arbitrary networks that consume
+all the available resources on the system, leading to denial of service.
+
+*/
+} // namespace \ No newline at end of file
diff --git a/docs/user_guide/conv2d_heuristic.dox b/docs/user_guide/conv2d_heuristic.dox
new file mode 100644
index 0000000000..edd24a3d36
--- /dev/null
+++ b/docs/user_guide/conv2d_heuristic.dox
@@ -0,0 +1,89 @@
+///
+/// Copyright (c) 2023 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+
+namespace arm_compute
+{
+/**
+@page conv2d_heuristic Convolution 2D heuristic
+
+@section conv2d_heuristic_algorithms_used Convolution 2D heuristic: algorithm selection
+
+The convolution 2D (in short, conv2d) operator is certainly one of the most compute-intensive and performance-critical operators in ML workloads.
+This operator can be implemented with different algorithms, which differ in terms of accuracy, kernel size support, and additional memory required.
+Unfortunately, no single algorithm can be used in all scenarios to achieve the best performance.
+Therefore, the Arm Compute Library integrates a heuristic within the conv2d operators to select the most efficient algorithm, depending on the input and kernel shapes and the desired level of accuracy.
+The heuristic depends on the target backend (either NEON™ for Arm® CPUs or OpenCL for Arm® GPUs) and the following subsections will provide the main details behind the selection of the algorithm.
+
+⚠ Attention: The heuristics presented in the following subsections will only refer to the NHWC data layout, which is the optimal and recommended layout for the Arm Compute Library.
+
+@subsection conv2d_heuristic_on_cpu Convolution 2D heuristic: Arm® Cortex®-based CPUs
+
+The conv2d heuristic for Arm® Cortex®-based CPUs is inside the get_convolution_method() method in the CpuConv2d function.
+The algorithms used in the get_convolution_method() function are the following:
+- Direct-Conv2D
+- Im2Col+GeMM-based
+- Indirect-GeMM (a.k.a. GEMMCONV2D)
+- GeMM
+- Winograd
+
+⚠ Attention: Winograd only works with floating-point data types (F32, F16)
+
+The heuristic first checks the less frequent cases that we may have in ML workloads for edge devices. These cases are the following:
+-# Non-unit dilation: we call Im2Col+GeMM
+-# Large input and kernel shapes: we call Direct-Conv2D because it is the only algorithm that does not require extra temporary memory
+-# Small Input-Feature-Maps (IFM): in this scenario, we have found that the GeMM implementation is generally the most efficient algorithm compared to Winograd and Indirect-GeMM
+
+In the more frequent cases, such as unit dilation and larger IFM, we evaluate the following conditions instead:
+-# Unit kernel size (1x1): in this scenario, the conv2d operation corresponds to a matrix multiplication and we call GeMM.
+-# Winograd: Winograd only works with unit strides and supports a limited number of kernel sizes, such as 3x3, 3x1, 1x3, 5x1, 1x5 and 5x5
+-# Indirect-GeMM: it should be used in all cases except when the kernel size is 1x1 or when the IFM is small
+
+If none of the preceding cases are met, we fall back to the Im2Col+GeMM-based algorithm.
+
+@subsection conv2d_heuristic_on_gpu Convolution 2D heuristic: Arm® Mali™-based GPUs
+
+The conv2d heuristic for Arm® Mali™-based GPUs is inside the get_convolution_method() method in the ClConv2d function.
+
+The algorithms used in the get_convolution_method() function are the following:
+- Direct-Conv2D
+- Im2Col+GeMM-based
+- Indirect-GeMM
+- GeMM
+- Winograd
+
+⚠ Attention: Winograd only works with floating-point data types (F32, F16)
+
+The heuristic first checks the less frequent cases that we may have in ML workloads for edge devices. These cases are the following:
+-# Non-unit dilation: we call Im2Col+GeMM
+-# Large input and kernel shapes: we call Direct-Conv2D because it is the only algorithm that does not require extra temporary memory
+
+In all the other cases, the GPU heuristic evaluates the suitability of Winograd and Direct-Conv2D/Indirect-Conv2D.
+In particular, Winograd is adopted when the convolution parameters (kernel size and strides) are supported by the algorithm and when the IFM is not small (for example, greater than 8).
+The conditions for using the Direct-Conv2D algorithm are numerous, and we recommend you look at the heuristic directly.
+In general, the Direct-Conv2D operator is used in almost all cases where the kernel size is not 1x1.
+The Indirect-GeMM algorithm is used as an alternative to Direct-Conv2D only on the Arm® Mali™-G77 GPU.
+If neither Winograd nor Direct-Conv2D can be used, we fall back to either GeMM (when the kernel size is 1x1) or the Im2Col+GeMM-based algorithm.
+
+*/
+} // namespace
diff --git a/docs/user_guide/data_layout.dox b/docs/user_guide/data_layout.dox
new file mode 100644
index 0000000000..711b85f08c
--- /dev/null
+++ b/docs/user_guide/data_layout.dox
@@ -0,0 +1,64 @@
+///
+/// Copyright (c) 2021-2022 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+
+namespace arm_compute
+{
+/**
+@page data_layout_support Data Layout Support
+
+@section data_layout_support_supported_data_layout Supported Data Layouts
+
+With regard to convolution layers, Compute Library supports the following data layouts for input and output tensors:
+
+- NHWC: The native layout of Compute Library that delivers the best performance, where channels are in the fastest changing dimension
+- NCHW: Legacy layout where width is in the fastest changing dimension
+- NDHWC: New data layout for supporting 3D operators
+
+where N = batch, C = channel, H = height, W = width, D = depth.
+
+Note: The right-most letter represents the fastest changing dimension, which is the "lowest dimension".
+The corresponding @ref TensorShape for each of the data layouts would be initialized as:
+
+- NHWC: TensorShape(C, W, H, N)
+- NCHW: TensorShape(W, H, C, N)
+- NDHWC: TensorShape(C, W, H, D, N)
+
+For 2D convolution, the weight / filter tensors are arranged in 4 dimensions: Height (H), Width (W), Input channel (I), Output channel (O).
+For 3D convolution, the additional Depth dimension means exactly the same as the Depth in the input / output layout.
+
+The layout of the weight tensors changes with that of the input / output tensors, and the dimensions can be mapped as follows:
+
+- Weight Height -> Height
+- Weight Width -> Width
+- Weight Input channel -> Channel
+- Weight Output channel -> Batch
+
+Therefore, the corresponding weight layouts for each input / output layout are:
+
+- (input/output tensor) NHWC: (weight tensor) OHWI
+- (input/output tensor) NCHW: (weight tensor) OIHW
+- (input/output tensor) NDHWC: (weight tensor) ODHWI
+
+*/
+} // namespace
diff --git a/docs/07_errata.dox b/docs/user_guide/data_type.dox
index 2d35e67986..7083270a07 100644
--- a/docs/07_errata.dox
+++ b/docs/user_guide/data_type.dox
@@ -1,5 +1,5 @@
///
-/// Copyright (c) 2019 Arm Limited.
+/// Copyright (c) 2021 Arm Limited.
///
/// SPDX-License-Identifier: MIT
///
@@ -24,26 +24,24 @@
namespace arm_compute
{
/**
-@page errata Errata
+@page data_type_support Data Type Support
@tableofcontents
-@section S7_1_errata Errata
+@section data_type_support_supported_data_type Supported Data Types
-- Under certain conditions, benchmark examples can hang when OpenCL profiling queues are enabled.
- - Versions Affected: >= v19.11
- - OSs Affected: Linux
- - Conditions:
- - Mali DDK r1p0 - r8p0, and
- - Linux kernel >= 4.4
-
-- On Android with arm64-v8a/arm64-v8.2-a architecture, NEON validation tests can fail when compiled using Android Ndk
- >= r18b in debug mode (https://github.com/android/ndk/issues/1135).
- - Versions Affected: >= v19.11
- - OSs Affected: Android
- - Conditions:
- - arm64-v8a/arm64-v8.2-a architecture, and
- - Compiled using Android NDK >= r18b in debug mode.
+Compute Library supports the following list of data types. More detailed information
+can be found from the documentation of each operator since the data types supported
+by each operator vary.
+- BFLOAT16: 16-bit non-standard brain floating point
+- QASYMM8: 8-bit unsigned asymmetric quantized
+- QASYMM8_SIGNED: 8-bit signed asymmetric quantized
+- QSYMM8_PER_CHANNEL: 8-bit signed symmetric quantized (Used for the weights)
+- QSYMM8: 8-bit unsigned symmetric quantized
+- QSYMM16: 16-bit unsigned symmetric quantized
+- F32: 32-bit single precision floating point
+- F16: 16-bit half precision floating point
+- S32: 32-bit signed integer
*/
} // namespace
diff --git a/docs/user_guide/errata.dox b/docs/user_guide/errata.dox
new file mode 100644
index 0000000000..056e45a432
--- /dev/null
+++ b/docs/user_guide/errata.dox
@@ -0,0 +1,136 @@
+///
+/// Copyright (c) 2019-2023 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page errata Errata
+
+@tableofcontents
+
+@section S7_1_errata Errata
+
+- (COMPMID-6493) Crash when running Arm Compute Library compiled for SVE2 on a computer that supports SVE only.
+ - Versions: >= v21.02 && <=v23.08
+ - OSs: Linux, Android.
+ - Conditions:
+ - Compile the latest Arm Compute Library for SVE2 (arch=armv8.6-a-sve2).
+ - multi_isa = 0
+ - Device with SVE but without SVE2 support.
+ - Result:
+ - Crash due to illegal instruction.
+ - To run SVE only, build with arch="armv8.2-a-sve", arch="armv8.6-a-sve", or with multi_isa=1.
+
+- (COMPMID-6404) Under certain conditions, CLTile may produce incorrect results.
+ - Versions: >= v19.02 && < v23.08
+ - OSs: Linux, Android.
+ - Conditions:
+ - The size of the lowest dimension of the input tensor is greater than 16 bytes.
+ - The size of the lowest dimension of the input tensor is not a multiple of 16.
+ - Result:
+    - Incorrect results are produced.
+
+- (COMPMID-6271) Under certain conditions, CLArgMinMaxLayer validation tests may fail
+ - Versions Affected: >= v20.02 && < v23.08
+ - OSs Affected: Linux
+ - Conditions:
+ - Backend: OpenCL
+ - Axis == 0
+ - Result:
+ - Sporadic mismatches only on certain devices
+
+- (COMPMID-5324) Issue identified with direct and depthwise convolutions for certain Arm® Mali™ DDK versions.
+ - Versions Affected: < v22.08
+ - Conditions:
+ - Arm® Mali™ DDK Versions : >= r23p0 && <= r38p0
+ - Mali™ GPUs: Bifrost GPU family with the exception of G71
+ - Backend: OpenCL
+ - Build options Include : "cl-fast-relaxed-math"
+ - Result: Reduced accuracy issue, while using direct and depthwise convolutions fused with LU_BOUNDED_RELU activation.
+
+- (COMPMID-5134) An issue has been identified when running the graph_deepspeech_v0_4_1 graph example.
+ - Versions Affected: >= v21.08
+ - Conditions:
+ - Data type input: F32
+ - Backend: OpenCL
+  - Result: The execution of the graph_deepspeech_v0_4_1 example could fail on the OpenCL backend for systems with small RAM. The issue is due to the extra temporary memory required to reshape the network weights.
+
+- (COMPMID-4013) Performance regressions observed for some networks on OpenCL when using Arm® Mali™ DDK r8p0
+ - Versions Affected: v21.05
+ - OSs Affected: All
+ - Conditions:
+ - Arm® Mali™ DDK r8p0
+
+- (COMPMID-5146) Under certain conditions, CLFullyConnectedLayer quantized tests may fail due to an issue in the test framework.
+ - Versions Affected: v21.02
+ - OSs Affected: Linux
+ - Conditions:
+ - armv7a architecture
+ - release mode
+ - asserts enabled
+
+- (COMPMID-4367) Performance regression in Convolution Layer OpenCL backend on Mali™ G77 when QSYMM8_PER_CHANNEL is used as weights' data type.
+ - Versions Affected: >= v20.11 && < v21.08
+ - OSs Affected: All
+ - Conditions:
+ - Mali™ G77
+ - Convolution Layer in use
+ - OpenCL backend
+ - Convolution Layer uses QSYMM8_PER_CHANNEL as the data type of its weight
+
+- (COMPMID-4306) A wrong test configuration has been found in CLGEMMMatrixMultiplyReshapedOnlyRHS set of tests.
+ - Versions Affected: >= v20.11 && < v21.05
+ - Conditions:
+ - Data type input: F32/F16
+ - Fused bounded relu activation with coefficient 'a' being negative
+
+- (COMPMID-5135) Under certain conditions, the validation test case 'CL/DirectConvolutionLayer/Float/FP32/RunSmall9x9\@InputShape=32x37x3x4:StrideX=1:StrideY=1:PadX=0:PadY=0:KernelSize=9:NumKernels=1:DataType=F32:ActivationInfo=LU_BOUNDED_RELU:DataLayout=NHWC' may fail.
+ - Versions Affected: >= v20.08
+ - Conditions:
+ - The validation suite has to run in nightly mode and execute 40k+ test cases before the test mentioned above
+
+- (COMPMID-5136) Under certain conditions, benchmark examples can hang when OpenCL profiling queues are enabled.
+ - Versions Affected: >= v19.11
+ - OSs Affected: Linux
+ - Conditions:
+ - Arm® Mali™ DDK r1p0 - r8p0, and
+ - Linux kernel >= 4.4
+
+- (COMPMID-5137) On Android with armv8a/armv8.2-a architecture, Arm® Neon™ validation tests can fail when compiled using Android NDK
+ >= r18b in debug mode (https://github.com/android/ndk/issues/1135).
+ - Versions Affected: >= v19.11
+ - OSs Affected: Android
+ - Conditions:
+ - armv8a/armv8.2-a architecture, and
+ - Compiled using Android NDK >= r18b in debug mode.
+
+- (COMPMID-4288) An issue has been identified with CLCast.
+ - Versions Affected: >= v18.11 && < v21.05
+ - Conditions:
+ - Data type input: F32
+ - Data type output: All integer types
+ - Conversion policy: SATURATE
+ - Result: OpenCL backend will always wrap around instead of saturating for out-of-range inputs
+
+*/
+} // namespace
diff --git a/docs/user_guide/how_to_build_and_run_examples.dox b/docs/user_guide/how_to_build_and_run_examples.dox
new file mode 100644
index 0000000000..0b8a23b368
--- /dev/null
+++ b/docs/user_guide/how_to_build_and_run_examples.dox
@@ -0,0 +1,569 @@
+///
+/// Copyright (c) 2017-2024 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page how_to_build How to Build and Run Examples
+
+@tableofcontents
+
+@section S1_1_build_options Build options
+
+scons 2.3 or above is required to build the library.
+To see the build options available, simply run ```scons -h```
+
+@section S1_2_linux Building for Linux
+
+@subsection S1_2_1_library How to build the library?
+
+For Linux, the library was successfully built and tested using the following Linaro GCC toolchain:
+
+ - gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
+ - gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu
+
+To cross-compile the library in debug mode, with Arm® Neon™ only support, for Linux 32bit:
+
+ scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=linux arch=armv7a
+
+To cross-compile the library in asserts mode, with OpenCL only support, for Linux 64bit:
+
+ scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=linux arch=armv8a
+
+You can also compile the library natively on an Arm device by using <b>build=native</b>:
+
+ scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv8a build=native
+ scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=native
+
+@note g++ for Arm is mono-arch; therefore, if you want to compile for Linux 32bit on a Linux 64bit platform, you will have to use a cross compiler.
+
+For example on a 64bit Debian based system you would have to install <b>g++-arm-linux-gnueabihf</b>
+
+ apt-get install g++-arm-linux-gnueabihf
+
+Then run
+
+ scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a build=cross_compile
+
+or simply remove the build parameter as build=cross_compile is the default value:
+
+ scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=linux arch=armv7a
+
+@subsection S1_2_2_examples How to manually build the examples?
+
+The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
+
+@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
+
+To cross-compile an Arm® Neon™ example for Linux 32bit:
+
+ arm-linux-gnueabihf-g++ examples/neon_cnn.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute -o neon_cnn
+
+To cross-compile an Arm® Neon™ example for Linux 64bit:
+
+ aarch64-linux-gnu-g++ examples/neon_cnn.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -L. -larm_compute -o neon_cnn
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+To cross-compile an OpenCL example for Linux 32bit:
+
+ arm-linux-gnueabihf-g++ examples/cl_sgemm.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute -o cl_sgemm -DARM_COMPUTE_CL
+
+To cross-compile an OpenCL example for Linux 64bit:
+
+ aarch64-linux-gnu-g++ examples/cl_sgemm.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -L. -larm_compute -o cl_sgemm -DARM_COMPUTE_CL
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+To cross-compile the examples with the Graph API, such as graph_lenet.cpp, you need to link the examples against arm_compute_graph.so too.
+
+For example, to cross-compile the "graph_lenet" example for Linux 32bit:
+
+ arm-linux-gnueabihf-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute_graph -larm_compute -Wl,--allow-shlib-undefined -o graph_lenet
+
+For example, to cross-compile the "graph_lenet" example for Linux 64bit:
+
+ aarch64-linux-gnu-g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -L. -larm_compute_graph -larm_compute -Wl,--allow-shlib-undefined -o graph_lenet
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option and the compiler's name is different)
+
+@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute
+
+To compile natively (i.e. directly on an Arm device) for Arm® Neon™ for Linux 32bit:
+
+ g++ examples/neon_cnn.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -mfpu=neon -larm_compute -o neon_cnn
+
+To compile natively (i.e. directly on an Arm device) for Arm® Neon™ for Linux 64bit:
+
+ g++ examples/neon_cnn.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute -o neon_cnn
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option)
+
+To compile natively (i.e. directly on an Arm device) for OpenCL for Linux 32bit or Linux 64bit:
+
+ g++ examples/cl_sgemm.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute -o cl_sgemm -DARM_COMPUTE_CL
+
+To natively compile examples that use the Graph API, such as graph_lenet.cpp, you also need to link against arm_compute_graph.so.
+
+i.e. to natively compile the "graph_lenet" example for Linux 32bit:
+
+ g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -mfpu=neon -L. -larm_compute_graph -larm_compute -Wl,--allow-shlib-undefined -o graph_lenet
+
+i.e. to natively compile the "graph_lenet" example for Linux 64bit:
+
+ g++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -L. -larm_compute_graph -larm_compute -Wl,--allow-shlib-undefined -o graph_lenet
+
+(notice the only difference with the 32 bit command is that we don't need the -mfpu option)
+
+@note If compiling using static libraries, this order must be followed when linking: arm_compute_graph_static, arm_compute
+
+@note These two commands assume libarm_compute.so is available in your library path; if not, add the path to it using -L (e.g. -Llib/linux-armv8a-neon-cl-asserts/)
+@note You might need to export the path to OpenCL library as well in your LD_LIBRARY_PATH if Compute Library was built with OpenCL enabled.
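As a sketch of the two notes above (the directory names here are assumptions and depend on your build configuration), extending the loader path before running an example could look like:

```shell
# Hypothetical path: replace with your actual build output directory
# (and, for OpenCL builds, the directory containing the OpenCL driver).
# Prepending keeps any existing LD_LIBRARY_PATH entries intact.
export LD_LIBRARY_PATH="lib/linux-armv8a-neon-cl-asserts:${LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```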
+
+To run the built executable simply run:
+
+ LD_LIBRARY_PATH=build ./neon_cnn
+
+or
+
+ LD_LIBRARY_PATH=build ./cl_sgemm
+
+@note Examples accept different types of arguments; to find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
+
+For example:
+
+ LD_LIBRARY_PATH=. ./graph_lenet --help
+
+Below is a list of the common parameters among the graph examples :
+@snippet utils/CommonGraphOptions.h Common graph examples parameters
+
+@subsection S1_2_3_sve Build for SVE or SVE2
+
+In order to build for SVE or SVE2 you need a compiler that supports them. You can find more information at the following links:
+ -# GCC: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/sve-support
+ -# LLVM: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/llvm-toolchain/sve-support
+
+@note You then need to indicate the toolchain using the scons "toolchain_prefix" parameter.
+
+An example build command with SVE is:
+
+ scons arch=armv8.2-a-sve os=linux build_dir=arm64 -j55 standalone=0 opencl=0 openmp=0 validation_tests=1 neon=1 cppthreads=1 toolchain_prefix=aarch64-none-linux-gnu-
+
+@subsection S1_2_4_sme Build for SME2
+
+In order to build for SME2 you need to use a compiler that supports SVE2 and enable SVE2 in the build as well.
+
+@note You then need to indicate the toolchain using the scons "toolchain_prefix" parameter.
+
+An example build command with SME2 is:
+
+ scons arch=armv8.6-a-sve2-sme2 os=linux build_dir=arm64 -j55 standalone=0 opencl=0 openmp=0 validation_tests=1 neon=1 cppthreads=1 toolchain_prefix=aarch64-none-linux-gnu-
+
+@subsection S1_2_5_clang_build_linux Building with LLVM+Clang Natively on Linux
+
+The library can be built with LLVM+Clang by specifying CC and CXX environment variables appropriately as below. The **minimum** supported clang version is 11, as LLVM 11 introduces SVE/SVE2 VLA intrinsics: https://developer.arm.com/Tools%20and%20Software/LLVM%20Toolchain#Supported-Devices.
+
+ CC=clang CXX=clang++ <build command>
+
+Or, if the environment has multiple clang versions:
+
+ CC=clang-16 CXX=clang++-16 <build command>
+
+Examples for different build tools look like below.
+
+(experimental) CMake:
+
+ mkdir build
+ cd build
+ CC=clang CXX=clang++ cmake .. -DCMAKE_BUILD_TYPE=Release -DARM_COMPUTE_OPENMP=1 -DARM_COMPUTE_WERROR=0 -DARM_COMPUTE_BUILD_EXAMPLES=1 -DARM_COMPUTE_BUILD_TESTING=1 -DCMAKE_INSTALL_LIBDIR=.
+ CC=clang CXX=clang++ cmake --build . -j32
+
+(experimental) Bazel:
+
+ CC=clang CXX=clang++ bazel build //...
+
+Scons:
+
+ CC=clang CXX=clang++ scons -j32 Werror=1 debug=0 neon=1 openmp=1 cppthreads=1 os=linux arch=armv8a multi_isa=1 build=native validation_tests=1
+
+Supported configurations are limited to those supported by our CMake, Bazel and multi_isa scons builds. For more details on the CMake and Bazel builds, please see @ref S1_8_experimental_builds
+
+@section S1_3_android Building for Android
+
+For Android, the library was successfully built and tested using Google's standalone toolchains:
+ - clang++ from NDK r20b for armv8a
+ - clang++ from NDK r20b for armv8.2-a with FP16 support
+
+(From 23.02, NDK >= r20b is highly recommended) For NDK r18 or older, here is a guide to <a href="https://developer.android.com/ndk/guides/standalone_toolchain.html">create your Android standalone toolchains from the NDK</a>:
+- Download the NDK r18b from here: https://developer.android.com/ndk/downloads/index.html to directory $NDK
+- Make sure you have Python 2.7 installed on your machine.
+- Generate the 32 bit and/or 64 bit toolchains by running the following commands, installing them into your toolchain directory $MY_TOOLCHAINS:
+
+ $NDK/build/tools/make_standalone_toolchain.py --arch arm64 --install-dir $MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b --stl libc++ --api 21
+
+ $NDK/build/tools/make_standalone_toolchain.py --arch arm --install-dir $MY_TOOLCHAINS/arm-linux-android-ndk-r18b --stl libc++ --api 21
+
+For NDK r19 or newer, you can directly <a href="https://developer.android.com/ndk/downloads">Download</a> the NDK package for your development platform, without the need to launch the make_standalone_toolchain.py script. You can find all the prebuilt binaries inside $NDK/toolchains/llvm/prebuilt/$OS_ARCH/bin/.
+
+@parblock
+@attention The build script will look for a binary named "aarch64-linux-android-clang++", while the prebuilt binaries will have their API version as a suffix to their filename (e.g. "aarch64-linux-android21-clang++"). You can instruct scons to use the correct version by using a combination of toolchain_prefix and the "CC" and "CXX" environment variables.
+@attention For this particular example, you can specify:
+
+ CC=clang CXX=clang++ scons toolchain_prefix=aarch64-linux-android21-
+
+@attention or:
+
+ CC=aarch64-linux-android21-clang CXX=aarch64-linux-android21-clang++ scons toolchain_prefix=""
+
+@endparblock
+
+@parblock
+@attention We used to use gnustl, but as of NDK r17 it is deprecated, so we switched to libc++
+@endparblock
+
+@note Make sure to add the toolchains to your PATH:
+
+ export PATH=$PATH:$MY_TOOLCHAINS/aarch64-linux-android-ndk-r18b/bin:$MY_TOOLCHAINS/arm-linux-android-ndk-r18b/bin
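A quick sanity check before building (a sketch: the binary name below matches the NDK r18b standalone toolchain described above and is an assumption; adjust it for your NDK version) is to verify the cross compiler is reachable on PATH:

```shell
# Hypothetical toolchain binary name; for NDK r19+ prebuilts it carries the
# API level as a suffix, e.g. aarch64-linux-android21-clang++.
CROSS_CXX=aarch64-linux-android-clang++
if command -v "$CROSS_CXX" >/dev/null 2>&1; then
  echo "toolchain found: $CROSS_CXX"
else
  echo "toolchain missing: $CROSS_CXX (check PATH)"
fi
```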
+
+@subsection S1_3_1_library How to build the library ?
+
+To cross-compile the library in debug mode, with Arm® Neon™ only support, for Android 32bit:
+
+ CXX=clang++ CC=clang scons Werror=1 -j8 debug=1 neon=1 opencl=0 os=android arch=armv7a
+
+To cross-compile the library in asserts mode, with OpenCL only support, for Android 64bit:
+
+ CXX=clang++ CC=clang scons Werror=1 -j8 debug=0 asserts=1 neon=0 opencl=1 embed_kernels=1 os=android arch=armv8a
+
+@subsection S1_3_2_examples How to manually build the examples ?
+
+The examples get automatically built by scons as part of the build process of the library described above. This section just describes how you can build and link your own application against our library.
+
+@note The following command lines assume the arm_compute libraries are present in the current directory or in the system library path. If this is not the case you can specify the location of the pre-built libraries with the compiler option -L. When building the OpenCL example the commands below assume that the CL headers are located in the include folder where the command is executed.
+
+Once you've got your Android standalone toolchain built and added to your path you can do the following:
+
+To cross compile an Arm® Neon™ example:
+
+ #32 bit:
+ arm-linux-androideabi-clang++ examples/neon_cnn.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -L. -o neon_cnn_arm -static-libstdc++ -pie
+ #64 bit:
+ aarch64-linux-android-clang++ examples/neon_cnn.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -L. -o neon_cnn_aarch64 -static-libstdc++ -pie
+
+To cross compile an OpenCL example:
+
+ #32 bit:
+ arm-linux-androideabi-clang++ examples/cl_sgemm.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -L. -o cl_sgemm_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
+ #64 bit:
+ aarch64-linux-android-clang++ examples/cl_sgemm.cpp utils/Utils.cpp -I. -Iinclude -std=c++14 -larm_compute-static -L. -o cl_sgemm_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
+
+To cross compile examples that use the Graph API, such as graph_lenet.cpp, you also need to link against the arm_compute_graph library.
+
+ #32 bit:
+ arm-linux-androideabi-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -L. -o graph_lenet_arm -static-libstdc++ -pie -DARM_COMPUTE_CL
+ #64 bit:
+ aarch64-linux-android-clang++ examples/graph_lenet.cpp utils/Utils.cpp utils/GraphUtils.cpp utils/CommonGraphOptions.cpp -I. -Iinclude -std=c++14 -Wl,--whole-archive -larm_compute_graph-static -Wl,--no-whole-archive -larm_compute-static -L. -o graph_lenet_aarch64 -static-libstdc++ -pie -DARM_COMPUTE_CL
+
+@note Due to some issues in older versions of the Arm® Mali™ OpenCL DDK (<= r13p0), we recommend linking arm_compute statically on Android.
+@note When linked statically the arm_compute_graph library currently needs the --whole-archive linker flag in order to work properly
+
+Then all you need to do is upload the executable and the shared library to the device using ADB:
+
+ adb push neon_cnn_arm /data/local/tmp/
+ adb push cl_sgemm_arm /data/local/tmp/
+ adb push gc_absdiff_arm /data/local/tmp/
+ adb shell chmod 777 -R /data/local/tmp/
+
+And finally to run the example:
+
+ adb shell /data/local/tmp/neon_cnn_arm
+ adb shell /data/local/tmp/cl_sgemm_arm
+ adb shell /data/local/tmp/gc_absdiff_arm
+
+For 64bit:
+
+ adb push neon_cnn_aarch64 /data/local/tmp/
+ adb push cl_sgemm_aarch64 /data/local/tmp/
+ adb push gc_absdiff_aarch64 /data/local/tmp/
+ adb shell chmod 777 -R /data/local/tmp/
+
+And finally to run the example:
+
+ adb shell /data/local/tmp/neon_cnn_aarch64
+ adb shell /data/local/tmp/cl_sgemm_aarch64
+ adb shell /data/local/tmp/gc_absdiff_aarch64
+
+@note Examples accept different types of arguments; to find out what they are, run the example with \a --help as an argument. If no arguments are specified then random values will be used to execute the graph.
+
+For example:
+ adb shell /data/local/tmp/graph_lenet --help
+
+In this case the first argument of LeNet (like all the graph examples) is the target (i.e. 0 to run on Neon™, 1 to run on OpenCL if available, 2 to run on OpenCL using the CLTuner), the second argument is the path to the folder containing the npy files for the weights, and the third argument is the number of batches to run.
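For instance, the three positional arguments could be assembled as below. This is an illustrative sketch only: the weights directory and batch count are hypothetical, and the adb command is echoed rather than executed since it must run against a connected device.

```shell
# Hypothetical values: target 0 (Neon), an assumed weights folder, one batch.
target=0
weights_dir=/data/local/tmp/lenet
batches=1
echo adb shell /data/local/tmp/graph_lenet "$target" "$weights_dir" "$batches"
```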
+
+@section S1_4_macos Building for macOS
+
+The library was successfully natively built for Apple Silicon under macOS 11.1 using clang v12.0.0.
+
+To natively compile the library with accelerated CPU support:
+
+ scons Werror=1 -j8 neon=1 opencl=0 os=macos arch=armv8a build=native
+
+@note Initial support disables feature discovery through HWCAPS and thread scheduling affinity controls
+
+@section S1_5_bare_metal Building for bare metal
+
+For bare metal, the library was successfully built using linaro's latest (gcc-linaro-6.3.1-2017.05) bare metal toolchains:
+ - arm-eabi for armv7a
+ - aarch64-elf for armv8a
+
+Download linaro for <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/arm-eabi/">armv7a</a> and <a href="https://releases.linaro.org/components/toolchain/binaries/6.3-2017.05/aarch64-elf/">armv8a</a>.
+
+@note Make sure to add the toolchains to your PATH: export PATH=$PATH:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_aarch64-elf/bin:$MY_TOOLCHAINS/gcc-linaro-6.3.1-2017.05-x86_64_arm-eabi/bin
+
+@subsection S1_5_1_library How to build the library ?
+
+To cross-compile the library with Arm® Neon™ support for baremetal armv8a:
+
+ scons Werror=1 -j8 debug=0 neon=1 opencl=0 os=bare_metal arch=armv8a build=cross_compile cppthreads=0 openmp=0 standalone=1
+
+@subsection S1_5_2_examples How to manually build the examples ?
+
+Examples are disabled when building for bare metal. If you want to build the examples you need to provide a custom bootcode depending on the target architecture and link against the compute library. More information about bare metal bootcode can be found <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0527a/index.html">here</a>.
+
+@section S1_6_windows_host Building on a Windows® host system (cross-compile)
+
+Using `scons` directly from the Windows® command line is known to cause
+problems. The reason seems to be that if `scons` is setup for cross-compilation
+it gets confused about Windows® style paths (using backslashes). Thus it is
+recommended to follow one of the options outlined below.
+
+@subsection S1_6_1_ubuntu_on_windows Bash on Ubuntu on Windows® (cross-compile)
+
+The best and easiest option is to use
+<a href="https://msdn.microsoft.com/en-gb/commandline/wsl/about">Ubuntu on Windows®</a>.
+This feature is still marked as *beta* and thus might not be available.
+However, if it is available, building the library is as simple as opening a *Bash on
+Ubuntu on Windows®* shell and following the general guidelines given above.
+
+@subsection S1_6_2_cygwin Cygwin (cross-compile)
+
+If the Windows® subsystem for Linux is not available <a href="https://www.cygwin.com/">Cygwin</a>
+can be used to install and run `scons`; the minimum supported Cygwin version is 3.0.7. In addition
+to the default packages installed by Cygwin, `scons` has to be selected in the installer. (`git` might
+also be useful but is not strictly required if you already have the source
+code of the library.) Linaro provides pre-built versions of
+<a href="http://releases.linaro.org/components/toolchain/binaries/">GCC cross-compilers</a>
+that can be used from the Cygwin terminal. When building for Android the
+compiler is included in the Android standalone toolchain. After everything has
+been set up in the Cygwin terminal the general guide on building the library
+can be followed.
+
+@subsection S1_6_3_WoA Windows® on Arm™ (native build)
+
+@note Native builds on Windows® are experimental, and some features of the library that interact with the OS are missing.
+
+It's possible to build Compute Library natively on a Windows® system running on Arm™.
+
+Windows® on Arm™ (WoA) systems provide compatibility by emulating x86 binaries on aarch64. Unfortunately Visual Studio 2022 does not work on aarch64 systems because it is an x86_64 application, and such binaries cannot be executed on WoA yet.
+
+Because we cannot use Visual Studio to build Compute Library we have to set up a native standalone toolchain to compile C++ code for arm64 on Windows®.
+
+Native arm64 toolchain installation for WoA:
+- LLVM+Clang-12 which can be downloaded from: https://github.com/llvm/llvm-project/releases/download/llvmorg-12.0.0/LLVM-12.0.0-woa64.exe
+- Arm64 VC Runtime which can be downloaded from https://aka.ms/vs/17/release/vc_redist.arm64.exe
+
+- While full VS22 cannot be installed on WoA, we can install some components
+ -# Desktop development with C++ and all Arm64 components for Visual Studio, refer to: https://developer.arm.com/documentation/102528/0100/Install-Visual-Studio
+ -# VS22 build tools: https://visualstudio.microsoft.com/downloads/#build-tools-for-visual-studio-2022
+
+There are some additional tools we need to install to build Compute Library:
+
+- git https://git-scm.com/download/win
+- python 3 https://www.python.org/downloads/windows/
+- scons can be installed with pip install scons
+
+In order to use clang to build Windows® binaries natively we have to initialize the environment variables from VS22 correctly so that the compiler can find the arm64 C++ libraries. This can be done by pressing Windows + R and running the command:
+
+ cmd /k "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvarsx86_arm64.bat"
+
+To build Compute Library type:
+
+ scons opencl=0 neon=1 os=windows examples=0 validation_tests=1 benchmark_examples=0 build=native arch=armv8a Werror=0 exceptions=1 standalone=1
+
+@section S1_7_cl_requirements OpenCL DDK Requirements
+
+@subsection S1_7_1_cl_hard_requirements Hard Requirements
+
+Compute Library requires OpenCL 1.1 or above with support for non-uniform work-group sizes, which is officially supported in the Arm® Mali™ OpenCL DDK r8p0 and above as an extension (the respective extension flag is \a -cl-arm-non-uniform-work-group-size).
+
+Enabling 16-bit floating point calculations requires the \a cl_khr_fp16 extension to be supported. All Arm® Mali™ GPUs with compute capabilities have native support for half precision floating point.
+
+@subsection S1_7_2_cl_performance_requirements Performance improvements
+
+Integer dot product built-in function extensions (and therefore optimized kernels) are available with Arm® Mali™ OpenCL DDK r22p0 and above for the following GPUs : G71, G76. The relevant extensions are \a cl_arm_integer_dot_product_int8, \a cl_arm_integer_dot_product_accumulate_int8 and \a cl_arm_integer_dot_product_accumulate_int16.
+
+OpenCL kernel level debugging can be simplified with the use of printf; this requires the \a cl_arm_printf extension to be supported.
+
+SVM allocations are supported for all the underlying allocations in Compute Library. To enable this, OpenCL 2.0 or above is required.
+
+@section S1_8_experimental_builds Experimental Bazel and CMake builds
+
+In addition to the scons build the repository includes experimental Bazel and CMake builds.
+These builds currently support a limited range of options; both are similar to the scons multi_isa build. They compile all libraries with Arm® Neon™ support, as well as SVE and SVE2 libraries. The build is CPU only and does not include OpenCL support. Only the Linux environment is targeted for now. Both were successfully built with gcc / g++ version 10.2.
+
+@subsection S1_8_1_bazel_build Bazel build
+
+@subsubsection S1_8_1_1_file_structure File structure
+
+File structure for all files included in the Bazel build:
+
+ .
+ ├── .bazelrc
+ ├── BUILD
+ ├── WORKSPACE
+ ├── arm_compute
+ │  └── BUILD
+ ├── examples
+ │  └── BUILD
+ ├── include
+ │  └── BUILD
+ ├── scripts
+ │ ├── print_version_file.py
+ │  └── BUILD
+ ├── src
+ │  └── BUILD
+ ├── support
+ │  └── BUILD
+ ├── tests
+ │ ├── BUILD
+ │  └── framework
+ │  └── BUILD
+ └── utils
+ └── BUILD
+
+@subsubsection S1_8_1_2_build_options Build options
+
+Available build options:
+
+ - debug: Enable ['-O0','-g','-gdwarf-2'] compilation flags
+ - Werror: Enable -Werror compilation flag
+ - logging: Enable logging
+ - cppthreads: Enable C++11 threads backend
+ - openmp: Enable OpenMP backend
+
+@subsubsection S1_8_1_3_example_builds Example builds
+
+Build everything (libraries, examples, tests):
+
+ bazel build //...
+
+Build libraries:
+
+ bazel build //:all
+
+Build arm_compute only:
+
+ bazel build //:arm_compute
+
+Build examples:
+
+ bazel build //examples:all
+
+Build resnet50 example:
+
+ bazel build //examples:graph_resnet50
+
+Build validation and benchmarking:
+
+ bazel build //tests:all
+
+@subsection S1_8_2_cmake_build CMake build
+
+@subsubsection S1_8_2_1_file_structure File structure
+
+File structure for all files included in the CMake build:
+
+ .
+ ├── CMakeLists.txt
+ ├── cmake
+ │ ├── Options.cmake
+ │ ├── Version.cmake
+ │  └── toolchains
+ │  └── aarch64_linux_toolchain.cmake
+ ├── examples
+ │  └── CMakeLists.txt
+ ├── src
+ │ └── CMakeLists.txt
+ └── tests
+ ├── CMakeLists.txt
+ ├── benchmark
+ │ └── CMakeLists.txt
+ └── validation
+ └── CMakeLists.txt
+
+@subsubsection S1_8_2_2_build_options Build options
+
+Available build options:
+
+ - CMAKE_BUILD_TYPE: "Release" (default) enables ['-O3', '-DNDEBUG'] compilation flags, "Debug" enables ['-O0','-g','-gdwarf-2', '-DARM_COMPUTE_ASSERTS_ENABLED']
+ - ARM_COMPUTE_WERROR: Enable -Werror compilation flag
+ - ARM_COMPUTE_EXCEPTIONS: If disabled, ARM_COMPUTE_EXCEPTIONS_DISABLED is enabled
+ - ARM_COMPUTE_LOGGING: Enable logging
+ - ARM_COMPUTE_BUILD_EXAMPLES: Build examples
+ - ARM_COMPUTE_BUILD_TESTING: Build tests
+ - ARM_COMPUTE_CPPTHREADS: Enable C++11 threads backend
+ - ARM_COMPUTE_OPENMP: Enable OpenMP backend
+
+@subsubsection S1_8_2_3_example_builds Example builds
+
+To build libraries, examples and tests:
+
+ mkdir build
+ cd build
+ cmake .. -DCMAKE_BUILD_TYPE=Release -DARM_COMPUTE_OPENMP=1 -DARM_COMPUTE_WERROR=0 -DARM_COMPUTE_BUILD_EXAMPLES=1 -DARM_COMPUTE_BUILD_TESTING=1 -DCMAKE_INSTALL_LIBDIR=.
+ cmake --build . -j32
+
+@section S1_9_fixed_format Building with support for fixed format kernels
+
+@subsection S1_9_1_intro_to_fixed_format_kernels What are fixed format kernels?
+
+The GEMM kernels used for convolutions and fully-connected layers in Compute Library employ memory layouts optimized for each kernel implementation. This then requires the supplied weights to be re-ordered into a buffer ready for consumption by the GEMM kernel. Where Compute Library is being called from a framework or library which implements operator caching, the re-ordering of the input weights into an intermediate buffer may no longer be desirable. When using a cached operator, the caller may wish to re-write the weights tensor, and re-run the operator using the updated weights. With the default GEMM kernels in Compute Library, the GEMM will be executed with the old weights, leading to incorrect results.
+
+To address this, Compute Library provides a set of GEMM kernels which use a common blocked memory format. These kernels consume the input weights directly from the weights buffer and do not execute an intermediate pre-transpose step. With this approach, it is the responsibility of the user (in this case the calling framework) to ensure that the weights are re-ordered into the required memory format. @ref NEGEMM::has_opt_impl is a static function that queries whether a fixed-format kernel exists and, if so, returns the expected weight format. The supported weight formats are enumerated in @ref arm_compute::WeightFormat.
+
+@subsection S1_9_2_building_fixed_format Building with fixed format kernels
+
+Fixed format kernels are only available for the CPU backend. To build Compute Library with fixed format kernels set fixed_format_kernels=1:
+
+ scons Werror=1 debug=0 neon=1 opencl=0 embed_kernels=0 os=linux multi_isa=1 build=native cppthreads=1 openmp=0 fixed_format_kernels=1
+
+@section S1_10_doxygen Building the Doxygen Documentation
+
+This documentation has been generated using the following shell command:
+
+ $ ./scripts/generate_documentation.sh
+
+This requires Doxygen to be installed and available on your system.
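A small sketch for checking that prerequisite before invoking the script:

```shell
# Confirm Doxygen is on PATH before generating the documentation.
if command -v doxygen >/dev/null 2>&1; then
  doxygen_status="available"
  echo "doxygen $doxygen_status: $(doxygen --version)"
else
  doxygen_status="missing"
  echo "doxygen $doxygen_status: install it via your package manager first"
fi
```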
+
+*/
+
+} // namespace arm_compute
diff --git a/docs/user_guide/introduction.dox b/docs/user_guide/introduction.dox
new file mode 100644
index 0000000000..15c95f7103
--- /dev/null
+++ b/docs/user_guide/introduction.dox
@@ -0,0 +1,120 @@
+///
+/// Copyright (c) 2017-2024 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@mainpage Introduction
+@copydoc introduction
+
+@page introduction Introduction
+
+@tableofcontents
+
+The Compute Library is a collection of low-level machine learning functions optimized for both Arm CPUs and GPUs using SIMD technologies.
+
+Several builds of the library are available using various configurations:
+ - OS: Linux®, Android™, macOS or bare metal.
+ - Architecture: armv7a (32bit) or armv8a (64bit).
+ - Technology: Arm® Neon™ / OpenCL / Arm® Neon™ and OpenCL.
+ - Debug / Asserts / Release: Use a build with asserts enabled to debug your application and enable extra validation. Once you are sure your application works as expected you can switch to a release build of the library for maximum performance.
+
+@warning Deprecation Notice from 24.01: NCHW data format specific optimizations will gradually be removed from the code base in
+ future releases. The implication of this is that the user is expected to translate NCHW models into NHWC in
+ order to benefit from the optimizations.
+
+@b Minimum toolchains requirements are shown below:
+
+<table>
+<tr>
+ <th>Operating System
+ <th>Architecture
+ <th>Minimum Toolchain
+<tr>
+ <td rowspan="4">Linux®
+ <td>armv7a
+ <td>gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf
+ <tr>
+ <td>armv8a
+ <td rowspan="2">gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu
+ <tr>
+ <td>armv8.2-a
+ <tr>
+ <td>armv8.2-a-sve
+ <td>gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu
+<tr>
+ <td rowspan="3">Android™
+ <td>armv8a
+ <td rowspan="2">NDK r20b
+ <tr>
+ <td>armv8.2-a
+ <tr>
+ <td>armv8.2-a-sve
+ <td>NDK r23b
+<tr>
+ <td rowspan="1">macOS
+ <td>armv8.2-a
+ <td>Monterey (OS version): clang 13 (native)
+</table>
+
+@section S0_1_contact Contact / Support
+
+Please create an issue on <a href="https://github.com/ARM-software/ComputeLibrary/issues">Github</a>.
+
+In order to facilitate the work of the support team, please provide the build information of the library you are using. To get the version of the library you are using, simply run:
+
+ $ strings android-armv8a-cl-asserts/libarm_compute.so | grep arm_compute_version
+ arm_compute_version=v16.12 Build options: {'embed_kernels': '1', 'opencl': '1', 'arch': 'armv8a', 'neon': '0', 'asserts': '1', 'debug': '0', 'os': 'android', 'Werror': '1'} Git hash=f51a545d4ea12a9059fe4e598a092f1fd06dc858
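The version tag can be picked out of that string in a shell. A sketch, reusing the sample output above (the build options dict is elided for brevity; on a real library, pipe the `strings ... | grep arm_compute_version` output into the same sed expression):

```shell
# Sample line modelled on the output above; sed keeps only the version tag,
# i.e. everything after "arm_compute_version=" up to the first space.
line='arm_compute_version=v16.12 Build options: {...} Git hash=f51a545d4ea12a9059fe4e598a092f1fd06dc858'
version=$(printf '%s\n' "$line" | sed -n 's/^arm_compute_version=\([^ ]*\).*/\1/p')
echo "$version"
```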
+
+@section S0_2_prebuilt_binaries Pre-built binaries
+
+For each release we provide some pre-built binaries of the library [here](https://github.com/ARM-software/ComputeLibrary/releases).
+
+These binaries have been built using the following toolchains:
+ - Linux® armv7a: gcc-linaro-7.2.1-2017.11-x86_64_arm-linux-gnueabihf
+ - Linux® armv8a: gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu
+ - Linux® armv8.2-a: gcc-linaro-7.2.1-2017.11-x86_64_aarch64-linux-gnu
+ - Linux® armv8.2-a (multi-ISA binary): gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu
+ - Linux® armv8.2-a-sve: gcc-arm-10.2-2020.11-x86_64-aarch64-none-linux-gnu
+ - Android™ armv8a: clang++ / libc++ NDK r20b
+ - Android™ armv8.2-a: clang++ / libc++ NDK r20b
+ - Android™ armv8.2-a-sve: clang++ / libc++ NDK r23b
+
+@warning Make sure to use a compatible toolchain to build your application or you will get some std::bad_alloc errors at runtime.
+
+@section S0_3_file_organisation File organisation
+
+This archive contains:
+ - The arm_compute header and source files
+ - The latest Khronos OpenCL 1.2 C headers from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a>
+ - The latest Khronos cl2.hpp from the <a href="https://www.khronos.org/registry/cl/">Khronos OpenCL registry</a> (API version 2.1 when this document was written)
+ - The latest Khronos EGL 1.5 C headers from the <a href="https://www.khronos.org/registry/gles/">Khronos EGL registry</a>
+ - The sources for a stub version of libOpenCL.so, libGLESv1_CM.so, libGLESv2.so and libEGL.so to help you build your application.
+ - An examples folder containing a few examples to compile and link against the library.
+ - A utils folder containing headers with some boilerplate code used by the examples.
+ - This documentation.
+
+ For detailed information about file organization, please refer to the Files -> File List section of this documentation.
+
+*/
+} // namespace arm_compute
diff --git a/docs/01_library.dox b/docs/user_guide/library.dox
index 742a246582..5a337c374b 100644
--- a/docs/01_library.dox
+++ b/docs/user_guide/library.dox
@@ -1,5 +1,5 @@
///
-/// Copyright (c) 2017-2020 Arm Limited.
+/// Copyright (c) 2017-2021, 2023-2024 Arm Limited.
///
/// SPDX-License-Identifier: MIT
///
@@ -24,45 +24,27 @@
namespace arm_compute
{
/**
-@page architecture Library architecture
+@page architecture Library Architecture
@tableofcontents
-@section S4_1_1 Core vs Runtime libraries
+@section architecture_compute_library Compute Library architecture
-The Core library is a low level collection of algorithms implementations, it is designed to be embedded in existing projects and applications:
+The Compute Library is a collection of low level algorithm implementations known as kernels @ref IKernel.
+These kernels are implemented as operators @ref IOperator that do not allocate any memory (i.e. all the memory allocations/mappings have to be handled by the caller)
+and are designed to be embedded in existing projects and applications.
-- It doesn't allocate any memory (All the memory allocations/mappings have to be handled by the caller).
-- It doesn't perform any kind of multi-threading (but provide information to the caller about how the workload can be split).
+A higher-level interface wraps the operators into functions @ref IFunction that:
+- Perform memory allocation of images and tensors through the use of standard malloc().
+- Enable multi-threading of Arm® Neon™ code in a very basic way using a very simple pool of threads.
+- For OpenCL, use the default CLScheduler command queue for all mapping operations and kernels.
-The Runtime library is a very basic wrapper around the Core library which can be used for quick prototyping, it is basic in the sense that:
+For maximum performance, it is expected that users would re-implement an equivalent of the function interface that better suits their needs (with a smarter multi-threading strategy, load-balancing between Arm® Neon™ and OpenCL, etc.)
-- It allocates images and tensors by using standard malloc().
-- It multi-threads NEON code in a very basic way using a very simple pool of threads.
-- For OpenCL it uses the default CLScheduler command queue for all mapping operations and kernels.
-
-For maximum performance, it is expected that the users would re-implement an equivalent to the runtime library which suits better their needs (With a more clever multi-threading strategy, load-balancing between NEON and OpenCL, etc.)
-
-@section S4_1_2 Data-type and Data-layout support
-
-Compute Library supports a wide list of data-types, information can been directly found in the documentation of each kernel/function.
-The main data-types that the Machine Learning functions support are the following:
-- BFLOAT16: 16-bit non-standard brain floating point
-- F16: 16-bit half precision floating point
-- F32: 32-bit single precision floating point
-- QASYMM8: 8-bit unsigned asymmetric quantized
-- QASYMM8_SIGNED: 8-bit signed asymmetric quantized
-- QSYMM8_PER_CHANNEL: 8-bit signed symmetric quantized (Used for the weights)
-
-Moreover, Compute Library supports the following data layouts (fast changing dimension from right to left):
-- NHWC: The native layout of Compute Library that delivers the best performance where channels are in the fastest changing dimension
-- NCHW: Legacy layout where width is in the fastest changing dimension
-where N = batches, C = channels, H = height, W = width
-
-@section S4_1_3 Fast-math support
+@section architecture_fast_math Fast-math support
Compute Library supports different convolution methods; the fast-math flag is only used by the Winograd algorithm.
-When the fast-math flag is enabled, both NEON and CL convolution layers will try to dispatch the fastest implementation available, which may introduce a drop in accuracy as well. The different scenarios involving the fast-math flag are presented below:
+When the fast-math flag is enabled, both Arm® Neon™ and CL convolution layers will try to dispatch the fastest implementation available, which may introduce a drop in accuracy as well. The different scenarios involving the fast-math flag are presented below:
- For FP32:
- no-fast-math: Only supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7
- fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
@@ -70,183 +52,34 @@ When the fast-math flag is enabled, both NEON and CL convolution layers will try
- no-fast-math: No Winograd support
- fast-math: Supports Winograd 3x3,3x1,1x3,5x1,1x5,7x1,1x7,5x5,7x7
-@section S4_1_4 Thread-safety
-
-Although the library supports multi-threading during workload dispatch, thus parallelizing the execution of the workload at multiple threads, the current runtime module implementation is not thread-safe in the sense of executing different functions from separate threads.
-This lies to the fact that the provided scheduling mechanism wasn't designed with thread-safety in mind.
-As it is true with the rest of the runtime library a custom scheduling mechanism can be re-implemented to account for thread-safety if needed and be injected as the library's default scheduler.
-
-@section S4_2_windows_kernels_mt_functions Windows, kernels, multi-threading and functions
+@section bf16_acceleration BF16 acceleration
-@subsection S4_2_1_windows Windows
+Required toolchain: android-ndk-r23-beta5 or later.
-A @ref Window represents a workload to execute, it can handle up to @ref Coordinates::num_max_dimensions dimensions.
-Each dimension is defined by a start, end and step.
+To build for BF16: the "neon" flag should be set to "1" and "arch" must be "armv8.6-a", "armv8.6-a-sve", or "armv8.6-a-sve2". For example:
-It can split into subwindows as long as *all* the following rules remain true for all the dimensions:
+ scons arch=armv8.6-a-sve neon=1 opencl=0 extra_cxx_flags="-fPIC" benchmark_tests=0 validation_tests=0 validation_examples=1 os=android Werror=0 toolchain_prefix=aarch64-linux-android29
-- max[n].start() <= sub[n].start() < max[n].end()
-- sub[n].start() < sub[n].end() <= max[n].end()
-- max[n].step() == sub[n].step()
-- (sub[n].start() - max[n].start()) % max[n].step() == 0
-- (sub[n].end() - sub[n].start()) % max[n].step() == 0
-
-@subsection S4_2_2 Kernels
-
-Each implementation of the @ref IKernel interface (base class of all the kernels in the core library) works in the same way:
-
-OpenCL kernels:
-
-@code{.cpp}
-// Initialize the CLScheduler with the default context and default command queue
-// Implicitly initializes the CLKernelLibrary to use ./cl_kernels as location for OpenCL kernels files and sets a default device for which OpenCL programs are built.
-CLScheduler::get().default_init();
-
-cl::CommandQueue q = CLScheduler::get().queue();
-//Create a kernel object:
-MyKernel kernel;
-// Initialize the kernel with the input/output and options you want to use:
-kernel.configure( input, output, option0, option1);
-// Retrieve the execution window of the kernel:
-const Window& max_window = kernel.window();
-// Run the whole kernel in the current thread:
-kernel.run( q, max_window ); // Enqueue the kernel to process the full window on the default queue
-
-// Wait for the processing to complete:
-q.finish();
-@endcode
-
-NEON / CPP kernels:
-
-@code{.cpp}
-//Create a kernel object:
-MyKernel kernel;
-// Initialize the kernel with the input/output and options you want to use:
-kernel.configure( input, output, option0, option1);
-// Retrieve the execution window of the kernel:
-const Window& max_window = kernel.window();
-// Run the whole kernel in the current thread:
-kernel.run( max_window ); // Run the kernel on the full window
-@endcode
+To enable BF16 acceleration when running FP32, "fast-math" has to be enabled; this works only for the Arm® Neon™ convolution layer using the CPU GEMM path.
+In this scenario on CPU, the CpuGemmConv2d kernel converts the FP32 input tensor to BF16 at block level to exploit the arithmetic capabilities dedicated to BF16, then converts the result back to FP32 for the output tensor.
-@subsection S4_2_3 Multi-threading
+@section architecture_thread_safety Thread-safety
-The previous section shows how to run a NEON / CPP kernel in the current thread, however if your system has several CPU cores, you will probably want the kernel to use several cores. Here is how this can be done:
-
-@code{.cpp}
- ThreadInfo info;
- info.cpu_info = &_cpu_info;
-
- const Window &max_window = kernel->window();
- const unsigned int num_iterations = max_window.num_iterations(split_dimension);
- info.num_threads = std::min(num_iterations, _num_threads);
-
- if(num_iterations == 0)
- {
- return;
- }
-
- if(!kernel->is_parallelisable() || info.num_threads == 1)
- {
- kernel->run(max_window, info);
- }
- else
- {
- int t = 0;
- auto thread_it = _threads.begin();
-
- for(; t < info.num_threads - 1; ++t, ++thread_it)
- {
- Window win = max_window.split_window(split_dimension, t, info.num_threads);
- info.thread_id = t;
- thread_it->start(kernel, win, info);
- }
-
- // Run last part on main thread
- Window win = max_window.split_window(split_dimension, t, info.num_threads);
- info.thread_id = t;
- kernel->run(win, info);
-
- try
- {
- for(auto &thread : _threads)
- {
- thread.wait();
- }
- }
- catch(const std::system_error &e)
- {
- std::cerr << "Caught system_error with code " << e.code() << " meaning " << e.what() << '\n';
- }
- }
-@endcode
-
-This is a very basic implementation which was originally used in the NEON runtime library by all the NEON functions.
-
-@sa CPPScheduler
-
-@note Some kernels like for example @ref NEHistogramKernel need some local temporary buffer to perform their calculations. In order to avoid memory corruption between threads, the local buffer must be of size: ```memory_needed_per_thread * num_threads``` and a unique thread_id between 0 and num_threads must be assigned to the @ref ThreadInfo object passed to the ```run``` function.
-
-@subsection S4_2_4 Functions
-
-Functions will automatically allocate the temporary buffers mentioned above, and will automatically multi-thread kernels' executions using the very basic scheduler described in the previous section.
-
-Simple functions only call a single kernel (e.g @ref NEConvolution3x3), while more complex ones consist of several kernels pipelined together (e.g @ref NEGaussianPyramid, @ref NEHarrisCorners). Check their documentation to find out which kernels are used by each function.
-
-@code{.cpp}
-//Create a function object:
-MyFunction function;
-// Initialize the function with the input/output and options you want to use:
-function.configure( input, output, option0, option1);
-// Execute the function:
-function.run();
-@endcode
-
-@warning The Compute Library requires Mali OpenCL DDK r8p0 or higher (OpenCL kernels are compiled using the -cl-arm-non-uniform-work-group-size flag)
-
-@note All OpenCL functions and objects in the runtime library use the command queue associated with CLScheduler for all operations, a real implementation would be expected to use different queues for mapping operations and kernels in order to reach a better GPU utilization.
-
-@subsection S4_4_1_cl_scheduler OpenCL Scheduler and kernel library
-
-The Compute Library runtime uses a single command queue and context for all the operations.
-
-The user can get / set this context and command queue through CLScheduler's interface.
-
-The user can get / set the target GPU device through the CLScheduler's interface.
-
-@attention Make sure the application is using the same context as the library as in OpenCL it is forbidden to share objects across contexts. This is done by calling @ref CLScheduler::init() or @ref CLScheduler::default_init() at the beginning of your application.
-
-@attention Make sure the scheduler's target is not changed after function classes are created.
-
-All OpenCL kernels used by the library are built and stored in @ref CLKernelLibrary.
-If the library is compiled with embed_kernels=0 the application can set the path to the OpenCL kernels by calling @ref CLKernelLibrary::init(), by default the path is set to "./cl_kernels"
-
-@subsection S4_4_2_events_sync OpenCL events and synchronization
-
-In order to block until all the jobs in the CLScheduler's command queue are done executing the user can call @ref CLScheduler::sync() or create a sync event using @ref CLScheduler::enqueue_sync_event()
-
-For example:
-@snippet cl_events.cpp OpenCL events
-
-@subsection S4_4_2_cl_neon OpenCL / NEON interoperability
-
-You can mix OpenCL and NEON kernels and functions. However it is the user's responsibility to handle the mapping/unmapping of OpenCL objects, for example:
-
-@snippet neoncl_scale_median_gaussian.cpp NEON / OpenCL Interop
-
-@sa main_neoncl_scale_median_gaussian
+Although the library supports multi-threading during workload dispatch, thus parallelizing the execution of the workload at multiple threads, the current runtime module implementation is not thread-safe in the sense of executing different functions from separate threads.
+This is due to the fact that the provided scheduling mechanism wasn't designed with thread-safety in mind.
+As with the rest of the runtime library, a custom scheduling mechanism can be re-implemented to account for thread-safety if needed and injected as the library's default scheduler.
-@section S4_5_algorithms Algorithms
+@section architecture__algorithms Algorithms
All computer vision algorithms in this library have been implemented following the [OpenVX 1.1 specifications](https://www.khronos.org/registry/vx/specs/1.1/html/). Please refer to the Khronos documentation for more information.
-@section S4_6_images_tensors Images, padding, border modes and tensors
+@section architecture_images_tensors Images, padding, border modes and tensors
Most kernels and functions in the library process images; however, in order to be future proof, most of the kernels actually accept tensors. See below for more information about how they are related.
@attention Each memory object can be written by only one kernel; however, it can be read by several kernels. Writing to the same object from several kernels will result in undefined behavior. The kernel writing to an object must be configured before the kernel(s) reading from it.
-@subsection S4_6_1_padding_and_border Padding and border modes
+@subsection architecture_images_tensors_padding_and_border Padding and border modes
Several algorithms require a neighborhood around the current pixel to compute its value. This means the algorithm will not be able to process the borders of the image unless you give it more information about how those border pixels should be processed. The @ref BorderMode enum is used for this purpose.
@@ -256,29 +89,41 @@ You have 3 types of @ref BorderMode :
- @ref BorderMode::REPLICATE : Neighbor pixels outside of the image are treated as having the same value as the closest valid pixel.
- @ref BorderMode::CONSTANT : Neighbor pixels outside of the image are treated as having the same constant value. (The user can choose what this value should be).
-Moreover both OpenCL and NEON use vector loads and stores instructions to access the data in buffers, so in order to avoid having special cases to handle for the borders all the images and tensors used in this library must be padded.
+Moreover, both OpenCL and Arm® Neon™ use vector load and store instructions to access the data in buffers, so in order to avoid special cases for the borders, all the images and tensors used in this library must be padded.
-@subsubsection padding Padding
+@subsubsection architecture_images_tensors_padding Padding
There are different ways padding can be calculated:
- Accurate padding:
-@snippet neon_convolution.cpp Accurate padding
-
@note It's important to call allocate @b after the function is configured: if the image / tensor is already allocated then the function will shrink its execution window instead of increasing the padding. (See below for more details).
-- Manual padding / no padding / auto padding: You can allocate your images / tensors up front (before configuring your functions). In that case the function will use whatever padding is available and will shrink its execution window if there isn't enough padding available (which translates into a smaller valid region for the output). See also @ref valid_region).
+- Manual padding / no padding / auto padding: You can allocate your images / tensors up front (before configuring your functions). In that case the function will use whatever padding is available and will shrink its execution window if there isn't enough padding available (which translates into a smaller valid region for the output). See also @ref architecture_images_tensors_valid_region.
If you don't want to manually set the padding but still want to allocate your objects upfront then you can use auto_padding. It guarantees that the allocation will have enough padding to run any of the provided functions.
@code{.cpp}
-Image src, dst;
+Image src{}, dst{};
+NEScale scale{};
+
+// Create an empty grayscale 640x480 image
+src.allocator()->init(TensorInfo(640, 480, Format::U8));
+
+constexpr int scale_factor = 2;
+TensorInfo dst_tensor_info(src.info()->dimension(0) / scale_factor, src.info()->dimension(1) / scale_factor,
+ Format::U8);
-// Use auto padding for the input:
-src.info()->init_auto_padding(TensorShape(640u,480u), Format::U8);
+// Configure the destination image
+dst.allocator()->init(dst_tensor_info);
-// Use manual padding for the destination image
-dst.info()->init(src.info()->tensor_shape(), Format::U8, strides_in_bytes, offset_first_element_in_bytes, total_size_in_bytes);
+// Configure Scale function object:
+scale.configure(&src, &dst, ScaleKernelInfo{
+ InterpolationPolicy::NEAREST_NEIGHBOR,
+ BorderMode::UNDEFINED,
+ PixelValue(),
+ SamplingPolicy::CENTER,
+ false
+});
// Allocate all the images
src.allocator()->allocate();
@@ -286,18 +131,15 @@ dst.allocator()->allocate();
// Fill the input image with the content of the PPM image if a filename was provided:
fill_image(src);
-NEGaussian3x3 gauss;
-
-// Apply a Gaussian 3x3 filter to the source image (Note: if the padding provided is not enough then the execution window and valid region of the output will be shrunk)
-gauss.configure(&src, &dst, BorderMode::UNDEFINED);
-
-//Execute the functions:
-gauss.run();
+// Run the scale operation:
+scale.run();
@endcode
+The full example is provided in examples/neon_scale.cpp
+
@warning Some kernels need up to 3 neighbor values to calculate the value of a given pixel. Therefore, to be safe, we use a 4-pixel padding all around the image. In addition, some kernels read and write up to 32 pixels at the same time. To cover that case as well we add an extra 32 pixels of padding at the end of each row. As a result auto padded buffers waste a lot of memory and are less cache friendly. It is therefore recommended to use accurate padding or manual padding wherever possible.
-@subsubsection valid_region Valid regions
+@subsubsection architecture_images_tensors_valid_region Valid regions
Some kernels (like edge detectors for example) need to read values of neighboring pixels to calculate the value of a given pixel; it is therefore not possible to calculate the values of the pixels on the edges.
@@ -305,7 +147,7 @@ Another case is: if a kernel processes 8 pixels per iteration and the image's di
In order to know which pixels have been calculated, each kernel sets a valid region for each output image or tensor. See also @ref TensorInfo::valid_region(), @ref ValidRegion
-@subsection S4_6_2_tensors Tensors
+@subsection architecture_images_tensors_tensors Tensors
Tensors are multi-dimensional arrays with a maximum of @ref Coordinates::num_max_dimensions dimensions.
@@ -313,7 +155,7 @@ Depending on the number of dimensions tensors can be interpreted as various obje
@note Most algorithms process images (i.e a 2D slice of the tensor), therefore only padding along the X and Y axes is required (2D slices can be stored contiguously in memory).
-@subsection S4_6_3_description_conventions Images and Tensors description conventions
+@subsection architecture_images_tensors_description_conventions Images and Tensors description conventions
Image objects are defined by a @ref Format and dimensions expressed as [width, height, batch]
@@ -335,7 +177,7 @@ For example, to read the element located at the coordinates (x,y) of a float ten
float value = *reinterpret_cast<float*>(input.buffer() + input.info()->offset_element_in_bytes(Coordinates(x,y)));
@endcode
-@subsection S4_6_4_working_with_objects Working with Images and Tensors using iterators
+@subsection architecture_images_tensors_working_with_objects Working with Images and Tensors using iterators
The library provides some iterators to access objects' data.
Iterators are created by associating a data object (An image or a tensor for example) with an iteration window.
@@ -349,7 +191,7 @@ Here are a couple of examples of how to use the iterators to fill / read tensors
@snippet examples/neon_copy_objects.cpp Copy objects example
-@subsection S4_6_5_sub_tensors Sub-tensors
+@subsection architecture_images_tensors_sub_tensors Sub-tensors
Sub-tensors are aliases to existing Tensors; as a result, creating a sub-tensor does not result in any underlying memory allocation.
@@ -368,13 +210,13 @@ Where \a parent is the parent tensor which we want to create an alias for, \a te
@warning Limitation of the sub-tensor is that it cannot be extracted spatially, meaning sub-tensors should have the same width and height as the parent tensor. The main reasons for this is the fact that individual kernels might need to operate with a step size that is not a multiple of the sub-tensor spatial dimension. This could lead to elements being overwritten by different kernels operating on different sub-tensors of the same underlying tensor.
-@section S4_7_memory_manager MemoryManager
+@section architecture_memory_manager MemoryManager
@ref IMemoryManager is a memory managing interface that can be used to reduce the memory requirements of a given pipeline by recycling temporary buffers.
-@subsection S4_7_1_memory_manager_components MemoryGroup, MemoryPool and MemoryManager Components
+@subsection architecture_memory_manager_component MemoryGroup, MemoryPool and MemoryManager Components
-@subsubsection S4_7_1_1_memory_group MemoryGroup
+@subsubsection architecture_memory_manager_component_memory_group MemoryGroup
@ref IMemoryGroup defines the memory managing granularity.
@@ -382,13 +224,13 @@ MemoryGroup binds a number of objects to a bucket of memory requirements that ne
Requesting backing memory for a specific group can be done using @ref IMemoryGroup::acquire and releasing the memory back using @ref IMemoryGroup::release.
-@subsubsection S4_7_1_2_memory_pool MemoryPool
+@subsubsection architecture_memory_manager_component_memory_pool MemoryPool
@ref IMemoryPool defines a pool of memory that can be used to provide backing memory to a memory group.
@note @ref BlobMemoryPool is currently implemented which models the memory requirements as a vector of distinct memory blobs.
-@subsubsection S4_7_1_2_memory_manager_components MemoryManager Components
+@subsubsection architecture_memory_manager_component_memory_manager_components MemoryManager Components
@ref IMemoryManager consists of two components:
- @ref ILifetimeManager that keeps track of the lifetime of the registered objects of the memory groups and given an @ref IAllocator creates an appropriate memory pool that fulfils the memory requirements of all the registered memory groups.
@@ -396,7 +238,7 @@ Requesting backing memory for a specific group can be done using @ref IMemoryGro
@note @ref BlobLifetimeManager is currently implemented which models the memory requirements as a vector of distinct memory blobs.
-@subsection S4_7_2_working_with_memory_manager Working with the Memory Manager
+@subsection architecture_memory_manager_working_with_memory_manager Working with the Memory Manager
Using a memory manager to reduce the memory requirements of a pipeline can be summed up in the following steps:
Initially a memory manager must be set-up:
@@ -433,7 +275,7 @@ tmp2.allocator()->allocate(); // Flag that the lifetime of object tmp2 has
tmp3.allocator()->allocate(); // Flag that the lifetime of object tmp3 has ended
@endcode
-@warning The configuration step should be done sequentially by a single thread so that all the lifetimes are captured correclty.
+@warning The configuration step should be done sequentially by a single thread so that all the lifetimes are captured correctly.
When configuration of all the operations is finished then the memory manager has to be populated:
@code{.cpp}
@@ -452,7 +294,7 @@ memory_group.release(); // Release memory so that it can be reused
@note Execution of a pipeline can be done in a multi-threading environment as memory acquisition/release are thread safe.
@note If you are handling sensitive data and it's required to zero out the memory buffers before freeing, make sure to also zero out the intermediate buffers. You can access the buffers through the memory group's mappings.
-@subsection S4_7_3_memory_manager_function_support Function support
+@subsection architecture_memory_manager_function_support Function support
Most of the library's functions have been ported to use @ref IMemoryManager for their internal temporary buffers.
@@ -479,11 +321,11 @@ conv1.run();
conv2.run();
@endcode
-@section S4_8_import_memory Import Memory Interface
+@section architecture_import_memory Import Memory Interface
The implemented @ref TensorAllocator and @ref CLTensorAllocator objects provide an interface capable of importing existing memory to a tensor as backing memory.
-A simple NEON example can be the following:
+A simple Arm® Neon™ example can be the following:
@code{.cpp}
// External backing memory
void* external_ptr = ...;
@@ -501,7 +343,7 @@ It is important to note the following:
- The tensor mustn't be memory managed.
- Padding requirements should be accounted for by the client code. In other words, if padding is required by the tensor after the function configuration step, then the imported backing memory should account for it. Padding can be checked through the @ref TensorInfo::padding() interface.
-@section S4_9_opencl_tuner OpenCL Tuner
+@section architecture_opencl_tuner OpenCL Tuner
OpenCL kernels when dispatched to the GPU take two arguments:
- The Global Workgroup Size (GWS): That's the number of times to run an OpenCL kernel to process all the elements we want to process.
@@ -517,13 +359,33 @@ However this process takes quite a lot of time, which is why it cannot be enable
But, when the @ref CLTuner is disabled (Target = 1 for the graph examples), the @ref graph::Graph will try to reload the file containing the tuning parameters; then for each executed kernel the Compute Library will use the fine-tuned LWS if it was present in the file, or a default LWS value if it's not.
-@section S4_10_weights_manager Weights Manager
+@section architecture_cl_queue_priorities OpenCL Queue Priorities
+
+OpenCL 2.1 exposes the `cl_khr_priority_hints` extension that, if supported by an underlying implementation, allows the user to specify priority hints for the created command queues.
+It is important to note that this does not specify guarantees or explicit scheduling behavior; this is something that each implementation needs to expose.
+
+In some cases, priority queues can be used when there is an implicit internal priority between graphics and compute queues and thus allow some level of priority control between them.
+At the moment three priority levels can be specified:
+- CL_QUEUE_PRIORITY_HIGH_KHR
+- CL_QUEUE_PRIORITY_MED_KHR
+- CL_QUEUE_PRIORITY_LOW_KHR
+
+Compute Library allows extraction of the internal OpenCL queue and the direct injection of a user-defined queue into the @ref CLScheduler.
+This way the user can utilize this extension to define priorities between the queues and set up the OpenCL scheduler mechanism to utilize them.
+
+@code{.cpp}
+cl_queue_properties queue_properties[] = {CL_QUEUE_PRIORITY_KHR, CL_QUEUE_PRIORITY_HIGH_KHR, 0};
+cl_command_queue priority_queue = clCreateCommandQueueWithProperties(ctx, dev, queue_properties, &error);
+CLScheduler::get().set_queue(::cl::CommandQueue(priority_queue));
+@endcode
+
+@section architecture_weights_manager Weights Manager
@ref IWeightsManager is a weights managing interface that can be used to reduce the memory requirements of a given pipeline by reusing transformed weights across multiple function executions.
@ref IWeightsManager is responsible for managing weight tensors alongside their transformations.
@ref ITransformWeights provides an interface for running the desired transform function. This interface is used by the weights manager.
-@subsection S4_10_1_working_with_weights_manager Working with the Weights Manager
+@subsection architecture_weights_manager_working_with_weights_manager Working with the Weights Manager
Following is a simple example that uses the weights manager:
Initially a weights manager must be set-up:
@@ -538,9 +400,49 @@ wm->acquire(weights, &_reshape_weights_managed_function); // Acquire the address
wm->run(weights, &_reshape_weights_managed_function); // Run the transpose function
@endcode
-@section S5_0_experimental Experimental Features
+@section programming_model Programming Model
+@subsection programming_model_functions Functions
+
+Functions will automatically allocate the temporary buffers mentioned above, and will automatically multi-thread kernels' executions using the very basic scheduler described in the previous section.
+
+Simple functions only call a single kernel (e.g. NEConvolution3x3), while more complex ones consist of several kernels pipelined together (e.g. @ref NEFullyConnectedLayer). Check their documentation to find out which kernels are used by each function.
+
+@code{.cpp}
+//Create a function object:
+MyFunction function;
+// Initialize the function with the input/output and options you want to use:
+function.configure( input, output, option0, option1);
+// Execute the function:
+function.run();
+@endcode
+
+@warning The Compute Library requires Arm® Mali™ OpenCL DDK r8p0 or higher (OpenCL kernels are compiled using the -cl-arm-non-uniform-work-group-size flag)
+
+@note All OpenCL functions and objects in the runtime library use the command queue associated with CLScheduler for all operations; a real implementation would be expected to use different queues for mapping operations and kernels in order to reach a better GPU utilization.
+
+@subsection programming_model_scheduler OpenCL Scheduler
+
+The Compute Library runtime uses a single command queue and context for all the operations.
+
+The user can get / set this context and command queue through CLScheduler's interface.
+
+The user can get / set the target GPU device through the CLScheduler's interface.
+
+@attention Make sure the application is using the same context as the library as in OpenCL it is forbidden to share objects across contexts. This is done by calling @ref CLScheduler::init() or @ref CLScheduler::default_init() at the beginning of your application.
+
+@attention Make sure the scheduler's target is not changed after function classes are created.
-@subsection S5_1_run_time_context Run-time Context
+@subsection programming_model__events_sync OpenCL events and synchronization
+
+In order to block until all the jobs in the CLScheduler's command queue are done executing the user can call @ref CLScheduler::sync() or create a sync event using @ref CLScheduler::enqueue_sync_event()
+
+@subsection programming_model_cl_neon OpenCL / Arm® Neon™ interoperability
+
+You can mix OpenCL and Arm® Neon™ kernels and functions. However, it is the user's responsibility to handle the mapping/unmapping of OpenCL objects.
+
+@section architecture_experimental Experimental Features
+
+@subsection architecture_experimental_run_time_context Run-time Context
Some of the Compute Library components are modelled as singletons, thus posing limitations to supporting some use-cases and to ensuring a more client-controlled API.
Thus, we are introducing an aggregate service interface @ref IRuntimeContext which will encapsulate the services that the singletons were providing and allow better control of these by the client code.
@@ -550,6 +452,164 @@ Consequently, this will allow finer control of these services among pipelines wh
This feature introduces some changes to our API.
All the kernels/functions will now accept a Runtime Context object which will allow the function to use the mentioned services.
-Finally, we will try to adapt our code-base progressively to use the new mechanism but will continue supporting the legacy mechanism to allow a smooth transition. Changes will apply to all our three backends: NEON, OpenCL and OpenGL ES.
+Finally, we will try to adapt our code-base progressively to use the new mechanism but will continue supporting the legacy mechanism to allow a smooth transition. Changes will apply to all our backends: Neon™ and OpenCL.
+
+@subsection architecture_experimental_clvk CLVK
+
+Compute Library offers experimental support for [CLVK](https://github.com/kpet/clvk). If CLVK is installed in the system, users can select the backend when running a graph example with --target=clvk.
+If no target is specified and more than one OpenCL implementation is present, Compute Library will pick the first available.
+
+@section architecture_experimental_api Experimental Application Programming Interface
+
+@subsection architecture_experimental_api_overview Overview
+
+In this section we present Compute Library's experimental application programming interface (API) architecture along with
+a detailed explanation of its components. Compute Library's API consists of multiple high-level operators and
+even more internally distinct computational blocks that can be executed on a command queue.
+Operators can be bound to multiple Tensor objects and executed concurrently or asynchronously if needed.
+All operators and associated objects are encapsulated in a Context-based mechanism, which provides all related
+construction services.
+
+@subsection architecture_experimental_api_objects Fundamental objects
+
+Compute Library consists of a list of fundamental objects that are responsible for creating and orchestrating operator execution.
+Below we present these objects in more detail.
+
+@subsubsection architecture_experimental_api_objects_context AclContext or Context
+
+AclContext or Context acts as a central creational aggregate service. All other objects are bound to or created from a context.
+It provides, internally, common facilities such as
+- allocators for object creation or backing memory allocation
+- serialization interfaces
+- any other modules that affect the construction of objects (e.g., program cache for OpenCL).
+
+The following sections describe the parameters that can be given when creating a Context.
+
+@paragraph architecture_experimental_api_object_context_target AclTarget
+Context is initialized with a backend target (AclTarget) as different backends might have a different subset of services.
+Currently the following targets are supported:
+- #AclCpu: a generic CPU target that accelerates primitives through SIMD technologies
+- #AclGpuOcl: a target for GPU acceleration using OpenCL
+
+@paragraph architecture_experimental_api_object_context_execution_mode AclExecutionMode
+An execution mode (AclExecutionMode) can be passed as an argument that affects the operator creation.
+At the moment the following execution modes are supported:
+- #AclPreferFastRerun: Provides faster re-run. It can be used when the operators are expected to be executed multiple
+times under the same execution context
+- #AclPreferFastStart: Provides faster single execution. It can be used when the operators will be executed only once
+and reducing their latency is therefore important. (Currently not implemented.)
+
+@paragraph architecture_experimental_api_object_context_capabilities AclTargetCapabilities
+Context creation can also take a list of hardware capabilities as one of its parameters. This is currently
+available only for the CPU backend. A list of architecture capabilities can be passed to influence the selection
+of the underlying kernels. Such capabilities include, for example, explicitly enabling SVE or the dot product
+instruction.
+@note The underlying hardware should support the given capability list.
+
+@paragraph architecture_experimental_api_object_context_allocator Allocator
+An allocator object that implements @ref AclAllocator can be passed to the Context upon its creation.
+This user-provided allocator will be used for allocation of any internal backing memory.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClContext) or set (@ref AclSetClContext) the internal OpenCL context.
+
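+The shape of such a user-provided allocator can be sketched as follows (an illustrative stand-in for the callback-based design described above; CountingAllocator is a hypothetical name, not the @ref AclAllocator interface itself):
+
```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical sketch: the context routes all internal backing-memory
// requests through the user's callbacks, so the application can observe
// (or pool, or align) every allocation the library makes.
struct CountingAllocator
{
    void *allocate(std::size_t bytes)
    {
        ++allocations;
        return std::malloc(bytes);
    }
    void deallocate(void *ptr)
    {
        ++frees;
        std::free(ptr);
    }
    int allocations = 0; // how many blocks the library requested
    int frees       = 0; // how many blocks it released
};
```
+
+Routing allocations through such an object lets an application account for the library's internal backing memory.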
+@subsubsection architecture_experimental_api_objects_tensor AclTensor or Tensor
+
+A tensor is a mathematical object that can describe physical quantities, similar to a matrix,
+and can be considered a generalization of matrices to arbitrary
+dimensionalities. AclTensor is an abstracted interface that represents a tensor.
+
+In addition to the elements it represents, AclTensor
+also carries metadata such as shape, data type, data layout and strides, which not only
+fully describe the characteristics of the tensor but also specify
+how the object stored in memory should be traversed. @ref AclTensorDescriptor is a dedicated
+object to represent such metadata.
+
+@note The allocation of an AclTensor can be deferred until external memory is imported
+as backing memory to accomplish a zero-copy context.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClMem) the internal OpenCL memory object.
+
+As Tensors can reside in different memory spaces, @ref AclMapTensor and @ref AclUnmapTensor entrypoints
+are provided to map Tensors in and out of the host memory system, respectively.
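+The role of the stride metadata described above can be sketched with a self-contained example (illustrative only; dense_nhwc_strides is a hypothetical helper, not part of the library):
+
```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative only: byte strides of a dense 4D NHWC tensor, where the
// channel dimension changes fastest. This is the kind of information an
// AclTensorDescriptor carries so memory can be traversed correctly.
std::array<std::size_t, 4> dense_nhwc_strides(std::size_t h, std::size_t w,
                                              std::size_t c, std::size_t elem_size)
{
    std::array<std::size_t, 4> s{};
    s[0] = elem_size;   // step between adjacent channels (fastest changing)
    s[1] = s[0] * c;    // step between adjacent width positions
    s[2] = s[1] * w;    // step between adjacent height positions
    s[3] = s[2] * h;    // step between adjacent batches
    return s;
}
```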
+
+@subsubsection architecture_experimental_api_objects_queue AclQueue or Queue
+
+AclQueue acts as a runtime aggregate service. It provides facilities to schedule
+and execute operators using underlying hardware. It also contains services like
+tuning mechanisms (e.g., Local workgroup size tuning for OpenCL) that can be specified
+during operator execution.
+
+@note To enable interoperability with OpenCL, additional entrypoints are provided
+to extract (@ref AclGetClQueue) or set (@ref AclSetClQueue) the internal OpenCL queue.
+
+@subsection architecture_experimental_api_internal Internal
+@subsubsection architecture_experimental_api_internal_operator_vs_kernels Operators vs Kernels
+
+Internally, Compute Library separates the executable primitives into two categories, kernels and operators,
+which operate in a hierarchical way.
+
+A kernel is the lowest-level computation block whose responsibility is performing a task on a given group of data.
+For design simplicity, kernel computation does NOT involve the following:
+
+- Memory allocation: All the memory manipulation should be handled by the caller.
+- Multi-threading: The information on how the workload can be split is provided by kernels,
+so the caller can effectively distribute the workload to multiple threads.
+
+On the other hand, operators combine one or multiple kernels to achieve more complex calculations.
+The responsibilities of the operators can be summarized as follows:
+
+- Defining the scheduling policy and dispatching of the underlying kernels to the hardware backend
+- Providing information to the caller required by the computation (e.g., memory requirements)
+- Allocation of any required auxiliary memory if it isn't given by its caller explicitly
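+The kernel/operator split above can be sketched as follows (a hypothetical, self-contained illustration of the design, not Compute Library code):
+
```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A "kernel" in this sketch performs work on a given sub-range only: it never
// allocates memory or spawns threads, and it reports how much work there is
// so the caller can split it.
struct FillKernel
{
    std::size_t total_work() const { return data->size(); }
    void run(std::size_t begin, std::size_t end) const
    {
        for (std::size_t i = begin; i < end; ++i)
            (*data)[i] = value;
    }
    std::vector<int> *data;
    int               value;
};

// An "operator" owns the scheduling policy: here it naively splits the
// kernel's work into equal chunks, standing in for dispatch across threads.
void run_operator(const FillKernel &kernel, std::size_t num_workers)
{
    const std::size_t total = kernel.total_work();
    const std::size_t chunk = (total + num_workers - 1) / num_workers;
    for (std::size_t w = 0; w < num_workers; ++w)
    {
        const std::size_t begin = w * chunk;
        const std::size_t end   = std::min(begin + chunk, total);
        if (begin < end)
            kernel.run(begin, end); // in a real runtime: dispatched to worker w
    }
}
```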
+
+@subsection architecture_experimental_build_multi_isa Build multi-ISA binary
+
+Selecting multi_isa when building Compute Library creates a library that contains all the supported ISA features.
+Based on the CPU support, the appropriate kernel is selected at runtime for execution. Currently this option is
+supported in two configurations: (i) with armv8.2-a and (ii) with armv8-a. In both cases all the supported ISA features are enabled
+in the build.
+
+The arch option in a multi_isa build sets the minimum architecture required to run the resulting binary.
+For example, a multi_isa build for armv8-a will run on any armv8-a or later device; when the binary is executed on an armv8.2-a device
+it will use the additional CPU features present in that architecture: FP16 and dot product.
+To produce such a binary (multi_isa + armv8-a), the FP16 and dot product kernels in the library are compiled for the
+armv8.2-a target and all other common code for armv8-a.
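+The runtime selection performed by a multi_isa binary can be sketched as follows (hypothetical names; the real library detects CPU features at startup and picks among kernel variants compiled for different targets):
+
```cpp
#include <cassert>

// Sketch of multi-ISA dispatch: two variants of the same kernel, one built
// for the armv8-a baseline and one for armv8.2-a, selected once at runtime.
enum class CpuIsa
{
    armv8a,   // baseline: always usable
    armv8_2a  // adds FP16 and dot product
};

int kernel_generic(int x) { return 2 * x; } // baseline implementation
int kernel_dotprod(int x) { return 2 * x; } // same result, faster on armv8.2-a

using KernelFn = int (*)(int);

KernelFn select_kernel(CpuIsa detected)
{
    // Use the specialized kernel when the features exist, else fall back.
    return detected == CpuIsa::armv8_2a ? kernel_dotprod : kernel_generic;
}
```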
+
+@subsection architecture_experimental_per_operator_build Per-operator build
+
+Dependencies for all operators have been explicitly defined. This gives users the ability to generate Compute Library
+binaries that include a user-defined list of operators.
+
+An experimental flag, 'build_config', has been introduced, through which a JSON configuration file can be provided and consumed.
+An example config looks like:
+@code{.py}
+{
+ "operators": [
+ "Activation",
+ "DepthwiseConv2d",
+ "Conv2d",
+ "Permute",
+ "Pool2d",
+ "Reshape"
+ ],
+ "data_types": [
+ "NHWC"
+ ]
+}
+@endcode
+
+Supported data-types options are:
+- "NHWC"
+- "NCHW"
+
+The list of supported operators can be found in filelist.json in the root of the Compute Library repo.
+
+@subsection architecture_experimental_build_high_priority_operators Build high priority operators
+
+Selecting high_priority when building Compute Library creates a new library, libarm_compute_hp, which
+contains a selected subset of the library operators. Currently the operators are statically set.
+
*/
} // namespace arm_compute
diff --git a/docs/user_guide/operator_list.dox b/docs/user_guide/operator_list.dox
new file mode 100644
index 0000000000..e7f1823f8b
--- /dev/null
+++ b/docs/user_guide/operator_list.dox
@@ -0,0 +1,3258 @@
+///
+/// Copyright (c) 2021-2024 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/**
+@page operators_list Supported Operators
+
+@tableofcontents
+
+@section S9_1_operators_list Supported Operators
+
+Compute Library supports the operators listed in the table below.
+
+Compute Library supports a wide list of data-types; detailed information can be found directly in the documentation of each kernel/function.
+The main data-types that the Machine Learning functions support are the following:
+ <ul>
+ <li>BFLOAT16: 16-bit non-standard brain floating point
+ <li>QASYMM8: 8-bit unsigned asymmetric quantized
+ <li>QASYMM8_SIGNED: 8-bit signed asymmetric quantized
+    <li>QSYMM8_PER_CHANNEL: 8-bit signed symmetric quantized (used for weights)
+    <li>QSYMM8: 8-bit signed symmetric quantized
+    <li>QSYMM16: 16-bit signed symmetric quantized
+ <li>F32: 32-bit single precision floating point
+ <li>F16: 16-bit half precision floating point
+ <li>S32: 32-bit signed integer
+ <li>U8: 8-bit unsigned char
+ <li>All: Agnostic to any specific data type
+ </ul>
+
+Compute Library supports the following data layouts (fast changing dimension from right to left):
+ <ul>
+ <li>NHWC: The native layout of Compute Library that delivers the best performance where channels are in the fastest changing dimension
+ <li>NCHW: Legacy layout where width is in the fastest changing dimension
+ <li>NDHWC: New data layout for supporting 3D operators
+ <li>All: Agnostic to any specific data layout
+ </ul>
+where N = batches, C = channels, H = height, W = width, D = depth
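+To make the layouts concrete, the linear element offset of a coordinate (n, c, h, w) under each layout can be sketched as follows (illustrative helpers, not library code):
+
```cpp
#include <cassert>
#include <cstddef>

// Illustrative only: dense linear offsets for the same (n, c, h, w)
// coordinate under the two layouts. The innermost multiplier is the
// fastest-changing dimension: C for NHWC, W for NCHW.
std::size_t offset_nhwc(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                        std::size_t C, std::size_t H, std::size_t W)
{
    return ((n * H + h) * W + w) * C + c;
}

std::size_t offset_nchw(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                        std::size_t C, std::size_t H, std::size_t W)
{
    return ((n * C + c) * H + h) * W + w;
}
```
+
+Stepping to the next channel moves one element in NHWC but a whole H*W plane in NCHW, which is why NHWC is the layout that delivers the best performance for channel-wise kernels.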
+
+<table>
+<caption id="multi_row"></caption>
+<tr>
+ <th>Function
+ <th>Description
+ <th>Equivalent Android NNAPI Op
+ <th>Backends
+ <th>Data Layouts
+ <th>Data Types
+<tr>
+ <td rowspan="2">ActivationLayer
+ <td rowspan="2" style="width:200px;"> Function to simulate an activation layer with the specified activation function.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_ELU
+ <li>ANEURALNETWORKS_HARD_SWISH
+ <li>ANEURALNETWORKS_LOGISTIC
+ <li>ANEURALNETWORKS_RELU
+ <li>ANEURALNETWORKS_RELU1
+ <li>ANEURALNETWORKS_RELU6
+ <li>ANEURALNETWORKS_TANH
+ </ul>
+ <td>NEActivationLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>QSYMM16<td>QSYMM16
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLActivationLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>QSYMM16<td>QSYMM16
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="1">AddMulAdd
+ <td rowspan="1" style="width:200px;"> Performs a fused Add + Mul + Add [+ Relu-based-Activation] operation.
+ <td rowspan="1">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEAddMulAdd
+ <td>
+ <ul>
+ <li>Any
+ </ul>
+ <td>
+ <table>
+ <tr><th>input1<th>input2<th>bn_mul<th>bn_add<th>add_output<th>final_output
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8<td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16<td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">ArgMinMaxLayer
+ <td rowspan="2" style="width:200px;"> Function to calculate the index of the minimum or maximum values in a tensor based on an axis.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_ARGMAX
+ <li>ANEURALNETWORKS_ARGMIN
+ </ul>
+ <td>NEArgMinMaxLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>U32, S32
+ <tr><td>QASYMM8_SIGNED<td>U32, S32
+ <tr><td>S32<td>U32, S32, S64
+ <tr><td>F16<td>U32, S32
+ <tr><td>F32<td>U32, S32
+ </table>
+<tr>
+ <td>CLArgMinMaxLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>U32, S32
+ <tr><td>QASYMM8_SIGNED<td>U32, S32
+ <tr><td>S32<td>U32, S32
+ <tr><td>F16<td>U32, S32
+ <tr><td>F32<td>U32, S32
+ </table>
+<tr>
+ <td rowspan="1">ArithmeticAddition
+ <td rowspan="1" style="width:200px;"> Function to add 2 tensors.
+ <td rowspan="1">
+ <ul>
+ <li>ANEURALNETWORKS_ADD
+ </ul>
+ <td>NEArithmeticAddition
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>QSYMM16<td>QSYMM16<td>S32
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="1">ArithmeticSubtraction
+  <td rowspan="1" style="width:200px;"> Function to subtract 2 tensors.
+ <td rowspan="1">
+ <ul>
+ <li>ANEURALNETWORKS_SUB
+ </ul>
+ <td>NEArithmeticSubtraction
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>QSYMM16<td>QSYMM16<td>S32
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">BatchNormalizationLayer
+ <td rowspan="2" style="width:200px;"> Function to perform batch normalization.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEBatchNormalizationLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td>CLBatchNormalizationLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="2">BatchToSpaceLayer
+ <td rowspan="2" style="width:200px;"> Batch to space transformation.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_BATCH_TO_SPACE_ND
+ </ul>
+ <td>NEBatchToSpaceLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+    <tr><td>All<td>S32<td>All
+ </table>
+<tr>
+ <td>CLBatchToSpaceLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+    <tr><td>All<td>S32<td>All
+ </table>
+<tr>
+ <td rowspan="2">BitwiseAnd
+ <td rowspan="2" style="width:200px;"> Function to perform bitwise AND between 2 tensors.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_LOGICAL_AND
+ </ul>
+ <td>NEBitwiseAnd
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td>CLBitwiseAnd
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td rowspan="2">BitwiseNot
+ <td rowspan="2" style="width:200px;"> Function to perform bitwise NOT.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_LOGICAL_NOT
+ </ul>
+ <td>NEBitwiseNot
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td>CLBitwiseNot
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td rowspan="2">BitwiseOr
+ <td rowspan="2" style="width:200px;"> Function to perform bitwise OR between 2 tensors.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_LOGICAL_OR
+ </ul>
+ <td>NEBitwiseOr
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td>CLBitwiseOr
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td rowspan="2">BitwiseXor
+ <td rowspan="2" style="width:200px;"> Function to perform bitwise XOR between 2 tensors.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEBitwiseXor
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td>CLBitwiseXor
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td rowspan="2">BoundingBoxTransform
+ <td rowspan="2" style="width:200px;"> Transform proposal bounding boxes to target bounding box using bounding box deltas.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEBoundingBoxTransform
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM16<td>QASYMM8<td>QASYMM16
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLBoundingBoxTransform
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM16<td>QASYMM8<td>QASYMM16
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">Cast
+ <td rowspan="2" style="width:200px;"> Function to cast a tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CAST
+ </ul>
+ <td>NECast
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8_SIGNED<td>S16, S32, F32, F16
+ <tr><td>QASYMM8<td>U16, S16, S32, F32, F16
+ <tr><td>U8<td>U16, S16, S32, F32, F16
+ <tr><td>U16<td>U8, U32
+ <tr><td>S16<td>QASYMM8_SIGNED, U8, S32
+ <tr><td>F16<td>QASYMM8_SIGNED, QASYMM8, F32, S32, U8
+ <tr><td>S32<td>QASYMM8_SIGNED, QASYMM8, F16, F32, U8
+ <tr><td>F32<td>QASYMM8_SIGNED, QASYMM8, BFLOAT16, F16, S32, U8
+ </table>
+<tr>
+ <td>CLCast
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>S8, U16, S16, U32, S32, F16, F32
+ <tr><td>S8<td>U8, U16, S16, U32, S32, F16, F32
+ <tr><td>U16<td>U8, S8, S16, U32, S32, F16, F32
+ <tr><td>S16<td>U8, S8, U16, U32, S32, F16, F32
+ <tr><td>U32<td>U8, S8, U16, S16, S32, F16, F32
+ <tr><td>S32<td>U8, S8, U16, S16, U32, F16, F32
+ <tr><td>U64<td>U8, S8, U16, S16, U32, S32, F16, F32
+ <tr><td>S64<td>U8, S8, U16, S16, U32, S32, F16, F32
+ <tr><td>F16<td>U8, S8, U16, S16, S32, U32, F32
+ <tr><td>F32<td>U8, S8, U16, S16, S32, U32, F16
+ </table>
+<tr>
+ <td rowspan="2">ChannelShuffleLayer
+ <td rowspan="2" style="width:200px;"> Function to shuffle the channels of the input tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CHANNEL_SHUFFLE
+ </ul>
+ <td>NEChannelShuffleLayer
+ <td>
+ <ul>
+ <li>NCHW
+ <li>NHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLChannelShuffleLayer
+ <td>
+ <ul>
+ <li>NCHW
+ <li>NHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="1">Comparison
+ <td rowspan="1" style="width:200px;"> Function to compare 2 tensors.
+ <td rowspan="1">
+ <ul>
+ <li>ANEURALNETWORKS_EQUAL
+ <li>ANEURALNETWORKS_GREATER
+ <li>ANEURALNETWORKS_GREATER_EQUAL
+ <li>ANEURALNETWORKS_LESS
+ <li>ANEURALNETWORKS_LESS_EQUAL
+ <li>ANEURALNETWORKS_NOT_EQUAL
+ </ul>
+ <td>CLComparison
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>All<td>All<td>U8
+ </table>
+<tr>
+ <td rowspan="2">ConcatenateLayer
+ <td rowspan="2" style="width:200px;"> Function to concatenate tensors along a given axis.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CONCATENATION
+ </ul>
+ <td>NEConcatenateLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLConcatenateLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">ConvertFullyConnectedWeights
+ <td rowspan="2" style="width:200px;"> Function to transpose the weights for the fully connected layer.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEConvertFullyConnectedWeights
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLConvertFullyConnectedWeights
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">ConvolutionLayer
+ <td rowspan="2" style="width:200px;"> Function to compute a convolution layer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CONV_2D
+ </ul>
+ <td>NEConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">Conv3D
+ <td rowspan="2" style="width:200px;"> Function to compute a 3d convolution layer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CONV_3D
+ </ul>
+ <td>NEConv3D
+ <td>
+ <ul>
+ <li>NDHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLConv3D
+ <td>
+ <ul>
+ <li>NDHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">Copy
+ <td rowspan="2" style="width:200px;"> Function to copy a tensor.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NECopy
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLCopy
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="1">Crop
+  <td rowspan="1" style="width:200px;"> Performs a crop (a copy of a sub-region) of the input tensor into the output tensor.
+ <td rowspan="1">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>CLCrop
+ <td>
+ <ul>
+ <li>NHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>F32
+ </table>
+<tr>
+ <td rowspan="2">CropResize
+ <td rowspan="2" style="width:200px;"> Function to perform cropping and resizing.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NECropResize
+ <td>
+ <ul>
+ <li>NHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>All<td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLCropResize
+ <td>
+ <ul>
+ <li>NHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>All<td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">DeconvolutionLayer
+ <td rowspan="2" style="width:200px;"> Function to compute a deconvolution or transpose convolution.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_TRANSPOSE_CONV_2D
+ </ul>
+ <td>NEDeconvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLDeconvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="1">DeconvolutionLayerUpsample
+ <td rowspan="1" style="width:200px;"> Function to execute deconvolution upsample on OpenCL.
+ <td rowspan="1">
+ <ul>
+ <li>ANEURALNETWORKS_TRANSPOSE_CONV_2D
+ </ul>
+ <td>CLDeconvolutionLayerUpsample
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">DepthConvertLayer
+ <td rowspan="2" style="width:200px;"> Performs a down-scaling depth conversion.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEDepthConvertLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>F16, F32
+ <tr><td>U8<td>U16, S16, S32
+ <tr><td>U16<td>U8, U32
+ <tr><td>S16<td>U8, S32
+ <tr><td>BFLOAT16<td>F32
+ <tr><td>F16<td>QASYMM8, F32
+ <tr><td>F32<td>QASYMM8, F16, BFLOAT16
+ </table>
+<tr>
+ <td>CLDepthConvertLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>S8, U16, S16, U32, S32, F16, F32
+ <tr><td>U16<td>U8, S8, S16, U32, S32, F16, F32
+ <tr><td>S16<td>U8, S8, U16, U32, S32, F16, F32
+ <tr><td>U32<td>U8, S8, U16, S16, S32, F16, F32
+ <tr><td>S32<td>U8, S8, U16, S16, U32, F16, F32
+ <tr><td>F16<td>U8, S8, U16, S16, U32, F32
+ <tr><td>F32<td>U8, S8, U16, S16, U32, F16
+ </table>
+<tr>
+ <td rowspan="2">DepthToSpaceLayer
+ <td rowspan="2" style="width:200px;"> Depth to Space transformation.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_DEPTH_TO_SPACE
+ </ul>
+ <td>NEDepthToSpaceLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLDepthToSpaceLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">DepthwiseConvolutionLayer
+ <td rowspan="2" style="width:200px;"> Function to perform depthwise separable convolution.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_DEPTHWISE_CONV_2D
+ </ul>
+ <td>NEDepthwiseConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLDepthwiseConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">DequantizationLayer
+ <td rowspan="2" style="width:200px;"> Function to dequantize the values in a tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_DEQUANTIZE
+ </ul>
+ <td>NEDequantizationLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>F16, F32
+ <tr><td>QASYMM8_SIGNED<td>F16, F32
+ <tr><td>QSYMM8_PER_CHANNEL<td>F16, F32
+ <tr><td>QSYMM8<td>F16, F32
+ <tr><td>QSYMM16<td>F16, F32
+ </table>
+<tr>
+ <td>CLDequantizationLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>F16, F32
+ <tr><td>QASYMM8_SIGNED<td>F16, F32
+ <tr><td>QSYMM8_PER_CHANNEL<td>F16, F32
+ <tr><td>QSYMM8<td>F16, F32
+ <tr><td>QSYMM16<td>F16, F32
+ </table>
+<tr>
+ <td rowspan="1">DetectionPostProcessLayer
+  <td rowspan="1" style="width:200px;"> Function to generate the detection output based on center-size encoded boxes, class predictions and anchors by performing non-maximum suppression (NMS).
+ <td rowspan="1">
+ <ul>
+ <li>ANEURALNETWORKS_DETECTION_POSTPROCESSING
+ </ul>
+ <td>NEDetectionPostProcessLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0 - src2<th>dst0 - dst3
+ <tr><td>QASYMM8<td>F32
+ <tr><td>QASYMM8_SIGNED<td>F32
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">DirectConvolutionLayer
+ <td rowspan="2" style="width:200px;"> Function to compute direct convolution.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CONV_2D
+ </ul>
+ <td>NEDirectConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLDirectConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="1">DirectDeconvolutionLayer
+ <td rowspan="1" style="width:200px;"> Function to run the deconvolution layer.
+ <td rowspan="1">
+ <ul>
+ <li>ANEURALNETWORKS_TRANSPOSE_CONV_2D
+ </ul>
+ <td>CLDirectDeconvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="13">ElementwiseOperations
+  <td rowspan="13" style="width:200px;"> Function to perform elementwise operations. On Cpu: Div, Max, Min, Pow, SquaredDiff, and comparisons (Equal, Greater, GreaterEqual, Less, LessEqual, NotEqual). On CL: Add, Sub, Div, Max, Min, Pow, SquaredDiff.
+ <td rowspan="13">
+ <ul>
+ <li>ANEURALNETWORKS_MAXIMUM
+ <li>ANEURALNETWORKS_MINIMUM
+ <li>ANEURALNETWORKS_POW
+ <li>ANEURALNETWORKS_DIV
+ <li>ANEURALNETWORKS_ADD
+ <li>ANEURALNETWORKS_SUB
+ <li>ANEURALNETWORKS_EQUAL
+ <li>ANEURALNETWORKS_GREATER
+ <li>ANEURALNETWORKS_GREATER_EQUAL
+ <li>ANEURALNETWORKS_LESS
+ <li>ANEURALNETWORKS_LESS_EQUAL
+ <li>ANEURALNETWORKS_NOT_EQUAL
+ </ul>
+ <td>NEElementwiseMax
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>NEElementwiseMin
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>NEElementwiseSquaredDiff
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>NEElementwiseDivision
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>NEElementwisePower
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>NEElementwiseComparison
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>U8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>U8
+ <tr><td>S32<td>S32<td>U8
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>S16<td>S16<td>U8
+ <tr><td>F16<td>F16<td>U8
+ <tr><td>F32<td>F32<td>U8
+ </table>
+<tr>
+ <td>CLArithmeticAddition
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>U8<td>U8<td>S16
+ <tr><td>U8<td>S16<td>S16
+ <tr><td>S16<td>U8<td>S16
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLArithmeticSubtraction
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>U8<td>U8<td>S16
+ <tr><td>U8<td>S16<td>S16
+ <tr><td>S16<td>U8<td>S16
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLArithmeticDivision
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLElementwiseMax
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>U32<td>U32<td>U32
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLElementwiseMin
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>S32<td>S32<td>S32
+ <tr><td>U32<td>U32<td>U32
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLElementwiseSquaredDiff
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLElementwisePower
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="8">ElementwiseUnaryLayer
+ <td rowspan="8" style="width:200px;"> Function to perform: - Rsqrt - Exp - Neg - Log - Abs - Round - Sin
+ <td rowspan="8">
+ <ul>
+ <li>ANEURALNETWORKS_ABS
+ <li>ANEURALNETWORKS_EXP
+ <li>ANEURALNETWORKS_LOG
+ <li>ANEURALNETWORKS_NEG
+ <li>ANEURALNETWORKS_RSQRT
+ <li>ANEURALNETWORKS_SIN
+ </ul>
+ <td>NEElementwiseUnaryLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ <tr><td>S32<td>S32
+ </table>
+<tr>
+ <td>CLRsqrtLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLExpLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLNegLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ <tr><td>S32<td>S32
+ </table>
+<tr>
+ <td>CLSinLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLLogLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLAbsLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLRoundLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">FFT1D
+ <td rowspan="2" style="width:200px;"> Fast Fourier Transform 1D.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEFFT1D
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLFFT1D
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="2">FFT2D
+ <td rowspan="2" style="width:200px;"> Fast Fourier Transform 2D.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEFFT2D
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLFFT2D
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="2">FFTConvolutionLayer
+ <td rowspan="2" style="width:200px;"> Fast Fourier Transform Convolution.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CONV_2D
+ </ul>
+ <td>NEFFTConvolutionLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLFFTConvolutionLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="2">Fill
+      <td rowspan="2" style="width:200px;"> Set the values of a tensor to a given value.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_FILL
+ </ul>
+ <td>NEFill
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLFill
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="1">FillBorder
+ <td rowspan="1" style="width:200px;"> Function to fill the borders within the XY-planes.
+ <td rowspan="1">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEFillBorder
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">FlattenLayer
+      <td rowspan="2" style="width:200px;"> Reshape a tensor to be 1D.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_RESHAPE
+ </ul>
+ <td>NEFlattenLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLFlattenLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">Floor
+      <td rowspan="2" style="width:200px;"> Round each value down to the nearest integer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_FLOOR
+ </ul>
+ <td>NEFloor
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td>CLFloor
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="2">FullyConnectedLayer
+ <td rowspan="2" style="width:200px;"> Function to perform a fully connected / dense layer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_FULLY_CONNECTED
+ </ul>
+ <td>NEFullyConnectedLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLFullyConnectedLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">FuseBatchNormalization
+      <td rowspan="2" style="width:200px;"> Function to fuse a batch normalization node into a preceding convolution node.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEFuseBatchNormalization
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td>CLFuseBatchNormalization
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="2">Gather
+ <td rowspan="2" style="width:200px;"> Performs the Gather operation along the chosen axis.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_GATHER
+ </ul>
+ <td>NEGather
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLGather
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">GEMM
+ <td rowspan="2" style="width:200px;"> General Matrix Multiplication.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEGEMM
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>BFLOAT16<td>BFLOAT16<td>BFLOAT16<td>BFLOAT16
+ </table>
+<tr>
+ <td>CLGEMM
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>F16<td>F16<td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="1">GEMMConv2d
+ <td rowspan="1" style="width:200px;"> General Matrix Multiplication.
+ <td rowspan="1">
+ <ul>
+ <li>ANEURALNETWORKS_CONV_2D
+ </ul>
+ <td>NEGEMMConv2d
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>BFLOAT16<td>BFLOAT16<td>BFLOAT16<td>BFLOAT16
+ </table>
+<tr>
+ <td rowspan="2">GEMMConvolutionLayer
+ <td rowspan="2" style="width:200px;"> General Matrix Multiplication.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CONV_2D
+ </ul>
+ <td>NEGEMMConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>BFLOAT16<td>BFLOAT16<td>BFLOAT16<td>BFLOAT16
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLGEMMConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="1">GEMMDeconvolutionLayer
+ <td rowspan="1" style="width:200px;"> General Matrix Multiplication.
+ <td rowspan="1">
+ <ul>
+ <li>ANEURALNETWORKS_TRANSPOSE_CONV_2D
+ </ul>
+ <td>CLGEMMDeconvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">GEMMLowpMatrixMultiplyCore
+ <td rowspan="2" style="width:200px;"> General Matrix Multiplication.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEGEMMLowpMatrixMultiplyCore
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>S32
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>S32
+ <tr><td>QASYMM8<td>QSYMM8<td>S32<td>S32
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>S32
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>S32
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8<td>S32<td>S32
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLGEMMLowpMatrixMultiplyCore
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QSYMM8<td>S32<td>QASYMM8
+ <tr><td>QASYMM8<td>QASYMM8<td>S32<td>S32
+ <tr><td>QASYMM8<td>QSYMM8_PER_CHANNEL<td>S32<td>S32
+ <tr><td>QASYMM8<td>QSYMM8<td>S32<td>S32
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8<td>S32<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>S32<td>S32
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8_PER_CHANNEL<td>S32<td>S32
+ <tr><td>QASYMM8_SIGNED<td>QSYMM8<td>S32<td>S32
+ </table>
+<tr>
+ <td rowspan="2">GEMMLowpOutputStage
+ <td rowspan="2" style="width:200px;"> General Matrix Multiplication.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEGEMMLowpOutputStage
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>S32<td>S32<td>QASYMM8
+ <tr><td>S32<td>S32<td>QASYMM8_SIGNED
+ <tr><td>S32<td>S32<td>QSYMM16
+ </table>
+<tr>
+ <td>CLGEMMLowpOutputStage
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>S32<td>S32<td>QASYMM8
+ <tr><td>S32<td>S32<td>QASYMM8_SIGNED
+ <tr><td>S32<td>S32<td>QSYMM16
+ </table>
+<tr>
+ <td rowspan="2">GenerateProposalsLayer
+      <td rowspan="2" style="width:200px;"> Function to generate proposals for an RPN (Region Proposal Network).
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_GENERATE_PROPOSALS
+ </ul>
+ <td>NEGenerateProposalsLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QSYMM8<td>QSYMM16<td>QASYMM8
+ </table>
+<tr>
+ <td>CLGenerateProposalsLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QSYMM8<td>QSYMM16<td>QASYMM8
+ </table>
+<tr>
+ <td rowspan="2">InstanceNormalizationLayer
+      <td rowspan="2" style="width:200px;"> Function to perform an instance normalization on a given axis.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_INSTANCE_NORMALIZATION
+ </ul>
+ <td>NEInstanceNormalizationLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLInstanceNormalizationLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">L2NormalizeLayer
+      <td rowspan="2" style="width:200px;"> Function to perform an L2 normalization on a given axis.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_L2_NORMALIZATION
+ </ul>
+ <td>NEL2NormalizeLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLL2NormalizeLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="3">Logical
+ <td rowspan="3" style="width:200px;"> Function to perform: - Logical AND - Logical OR - Logical NOT
+ <td rowspan="3">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NELogicalAnd
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>U8<td>U8<td>U8
+ </table>
+<tr>
+ <td>NELogicalOr
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>U8<td>U8<td>U8
+ </table>
+<tr>
+ <td>NELogicalNot
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td rowspan="1">LogicalAnd
+ <td rowspan="1" style="width:200px;"> Function to perform Logical AND.
+ <td rowspan="1">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>CLLogicalAnd
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>U8<td>U8<td>U8
+ </table>
+<tr>
+ <td rowspan="1">LogicalOr
+ <td rowspan="1" style="width:200px;"> Function to perform Logical OR.
+ <td rowspan="1">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>CLLogicalOr
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>U8<td>U8<td>U8
+ </table>
+<tr>
+ <td rowspan="1">LogicalNot
+ <td rowspan="1" style="width:200px;"> Function to perform Logical NOT.
+ <td rowspan="1">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>CLLogicalNot
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>U8<td>U8
+ </table>
+<tr>
+ <td rowspan="2">LSTMLayer
+ <td rowspan="2" style="width:200px;"> Function to perform a single time step in a Long Short-Term Memory (LSTM) layer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_LSTM
+ </ul>
+ <td>NELSTMLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0 - src13<th>dst0 - dst3
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLLSTMLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0 - src13<th>dst0 - dst3
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">LSTMLayerQuantized
+      <td rowspan="2" style="width:200px;"> Function to perform quantized LSTM (Long Short-Term Memory).
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_QUANTIZED_LSTM
+ <li>ANEURALNETWORKS_QUANTIZED_16BIT_LSTM
+ </ul>
+ <td>NELSTMLayerQuantized
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0 - src8<th>src9 - src12<th>src13<th>src14<th>dst0<th>dst1
+ <tr><td>QASYMM8<td>S32<td>QSYMM16<td>QASYMM8<td>QSYMM16<td>QASYMM8
+ </table>
+<tr>
+ <td>CLLSTMLayerQuantized
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0 - src8<th>src9 - src12<th>src13<th>src14<th>dst0<th>dst1
+ <tr><td>QASYMM8<td>S32<td>QSYMM16<td>QASYMM8<td>QSYMM16<td>QASYMM8
+ </table>
+<tr>
+ <td rowspan="2">MatMul
+ <td rowspan="2" style="width:200px;"> Computes a matrix multiplication in batches.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_BATCH_MATMUL
+ </ul>
+ <td>NEMatMul
+ <td>
+ <ul>
+ <li>Any
+ </ul>
+ <td>
+ <table>
+ <tr><th>lhs<th>rhs<th>dst
+ <tr><td>F32<td>F32<td>F32
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>BFLOAT16<td>BFLOAT16<td>BFLOAT16
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ </table>
+<tr>
+ <td>CLMatMul
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>lhs<th>rhs<th>dst
+ <tr><td>F32<td>F32<td>F32
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ </table>
+<tr>
+ <td rowspan="2">MaxUnpoolingLayer
+ <td rowspan="2" style="width:200px;"> Function to perform MaxUnpooling.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEMaxUnpoolingLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLMaxUnpoolingLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">MeanStdDevNormalizationLayer
+ <td rowspan="2" style="width:200px;"> Function to execute mean and standard deviation normalization.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEMeanStdDevNormalizationLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td>CLMeanStdDevNormalizationLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="2">NormalizationLayer
+      <td rowspan="2" style="width:200px;"> Function to compute a normalization layer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_LOCAL_RESPONSE_NORMALIZATION
+ </ul>
+ <td>NENormalizationLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td>CLNormalizationLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ </table>
+<tr>
+ <td rowspan="1">NormalizePlanarYUVLayer
+      <td rowspan="1" style="width:200px;"> Function to compute a planar YUV normalization layer.
+ <td rowspan="1">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>CLNormalizePlanarYUVLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ <tr><td>F16<td>F16
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">PadLayer
+ <td rowspan="2" style="width:200px;"> Function to pad a tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_PAD
+ <li>ANEURALNETWORKS_PAD_V2
+ </ul>
+ <td>NEPadLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLPadLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">Permute
+ <td rowspan="2" style="width:200px;"> Function to transpose an ND tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_TRANSPOSE
+ </ul>
+ <td>NEPermute
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLPermute
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">PixelWiseMultiplication
+      <td rowspan="2" style="width:200px;"> Function to perform a pixel-wise multiplication.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_MUL
+ </ul>
+ <td>NEPixelWiseMultiplication
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>QSYMM16<td>QSYMM16<td>S32
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>U8<td>U8<td>S16
+ <tr><td>U8<td>S16<td>S16
+ <tr><td>S16<td>U8<td>S16
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>S32<td>F32
+ </table>
+<tr>
+ <td>CLPixelWiseMultiplication
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>QASYMM8<td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+    <tr><td>QSYMM16<td>QSYMM16<td>QSYMM16
+ <tr><td>QSYMM16<td>QSYMM16<td>S32
+ <tr><td>U8<td>U8<td>U8
+ <tr><td>U8<td>U8<td>S16
+ <tr><td>U8<td>S16<td>S16
+ <tr><td>S16<td>U8<td>S16
+ <tr><td>S16<td>S16<td>S16
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ <tr><td>S32<td>S32<td>S32
+ </table>
+<tr>
+ <td rowspan="2">PoolingLayer
+ <td rowspan="2" style="width:200px;"> Function to perform pooling with the specified pooling operation.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_AVERAGE_POOL_2D
+ <li>ANEURALNETWORKS_L2_POOL_2D
+ <li>ANEURALNETWORKS_MAX_POOL_2D
+ </ul>
+ <td>NEPoolingLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLPoolingLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">Pooling3dLayer
+      <td rowspan="2" style="width:200px;"> Function to perform 3D pooling with the specified pooling operation.
+ <td rowspan="2">
+ <ul>
+      <li>n/a
+ </ul>
+ <td>NEPooling3dLayer
+ <td>
+ <ul>
+ <li>NDHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLPooling3dLayer
+ <td>
+ <ul>
+ <li>NDHWC
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">PReluLayer
+ <td rowspan="2" style="width:200px;"> Function to compute the activation layer with the PRELU activation function.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_PRELU
+ </ul>
+ <td>NEPReluLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLPReluLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">PriorBoxLayer
+      <td rowspan="2" style="width:200px;"> Function to compute prior boxes and clip them.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEPriorBoxLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLPriorBoxLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">QLSTMLayer
+ <td rowspan="2" style="width:200px;"> Function to perform quantized LSTM (Long Short-Term Memory).
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_QUANTIZED_LSTM
+ <li>ANEURALNETWORKS_QUANTIZED_16BIT_LSTM
+ </ul>
+ <td>NEQLSTMLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+    <tr><th>src0<th>src1 - src6<th>src7 - src9<th>src10<th>src11<th>dst0<th>dst1 - dst2
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8<td>S32<td>QSYMM16<td>QASYMM8_SIGNED<td>QSYMM16<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLQLSTMLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+    <tr><th>src0<th>src1 - src6<th>src7 - src9<th>src10<th>src11<th>dst0<th>dst1 - dst2
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8<td>S32<td>QSYMM16<td>QASYMM8_SIGNED<td>QSYMM16<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">QuantizationLayer
+      <td rowspan="2" style="width:200px;"> Function to perform a quantization layer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_QUANTIZE
+ </ul>
+ <td>NEQuantizationLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8, QASYMM8_SIGNED, QASYMM16
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8, QASYMM8_SIGNED, QASYMM16
+ <tr><td>F16<td>QASYMM8, QASYMM8_SIGNED, QASYMM16
+ <tr><td>F32<td>QASYMM8, QASYMM8_SIGNED, QASYMM16
+ </table>
+<tr>
+ <td>CLQuantizationLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8, QASYMM8_SIGNED, QASYMM16
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8, QASYMM8_SIGNED, QASYMM16
+ <tr><td>F16<td>QASYMM8, QASYMM8_SIGNED, QASYMM16
+ <tr><td>F32<td>QASYMM8, QASYMM8_SIGNED, QASYMM16
+ </table>
+<tr>
+ <td rowspan="2">Range
+      <td rowspan="2" style="width:200px;"> Function to generate a sequence of numbers starting from START and extending by increments of 'STEP' up to but not including 'END'.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NERange
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>dst
+ <tr><td>U8
+ <tr><td>S8
+ <tr><td>U16
+ <tr><td>S16
+ <tr><td>U32
+ <tr><td>S32
+ <tr><td>F16
+ <tr><td>F32
+ </table>
+<tr>
+ <td>CLRange
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>dst
+ <tr><td>U8
+ <tr><td>S8
+ <tr><td>QASYMM8
+ <tr><td>U16
+ <tr><td>S16
+ <tr><td>U32
+ <tr><td>S32
+ <tr><td>F16
+ <tr><td>F32
+ </table>
+<tr>
+ <td rowspan="2">ReduceMean
+      <td rowspan="2" style="width:200px;"> Function to perform a reduce mean operation.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_MEAN
+ </ul>
+ <td>NEReduceMean
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLReduceMean
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">ReductionOperation
+      <td rowspan="2" style="width:200px;"> Function to perform a reduction with the following operations: - ARG_IDX_MAX: Index of the max value - ARG_IDX_MIN: Index of the min value - MEAN_SUM: Mean of sum - PROD: Product - SUM_SQUARE: Sum of squares - SUM: Sum - MIN: Min - MAX: Max
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_REDUCE_ALL
+ <li>ANEURALNETWORKS_REDUCE_ANY
+ <li>ANEURALNETWORKS_REDUCE_MAX
+ <li>ANEURALNETWORKS_REDUCE_MIN
+ <li>ANEURALNETWORKS_REDUCE_PROD
+ <li>ANEURALNETWORKS_REDUCE_SUM
+ </ul>
+ <td>NEReductionOperation
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ <tr><td>S32<td>S32
+ </table>
+<tr>
+ <td>CLReductionOperation
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ <tr><td>S32<td>S32
+ </table>
+<tr>
+ <td rowspan="1">ReorderLayer
+ <td rowspan="1" style="width:200px;"> Reorders a tensor to a different weights format.
+ <td rowspan="1">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEReorderLayer
+ <td>
+ <ul>
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">ReorgLayer
+      <td rowspan="2" style="width:200px;"> Performs a reorganization of the input tensor into the output tensor.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEReorgLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLReorgLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">ReshapeLayer
+ <td rowspan="2" style="width:200px;"> Function to reshape a tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_RESHAPE
+ <li>ANEURALNETWORKS_SQUEEZE
+ </ul>
+ <td>NEReshapeLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLReshapeLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">Reverse
+      <td rowspan="2" style="width:200px;"> Function to reverse a tensor along a given axis.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEReverse
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>All<td>U32, S32<td>All
+ </table>
+<tr>
+ <td>CLReverse
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>All<td>U32, S32<td>All
+ </table>
+<tr>
+ <td rowspan="2">RNNLayer
+      <td rowspan="2" style="width:200px;"> Function to perform a recurrent neural network (RNN) layer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_RNN
+ </ul>
+ <td>NERNNLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>src3<th>dst0<th>dst1
+ <tr><td>F16<td>F16<td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLRNNLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>src3<th>dst0<th>dst1
+ <tr><td>F16<td>F16<td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">ROIAlignLayer
+ <td rowspan="2" style="width:200px;"> Function to perform ROI alignment.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_ROI_ALIGN
+ </ul>
+ <td>NEROIAlignLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM16<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM16<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td>CLROIAlignLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32
+ <tr><td>QASYMM8<td>QASYMM16<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM16<td>QASYMM8_SIGNED
+ </table>
+<tr>
+ <td rowspan="2">ROIPoolingLayer
+ <td rowspan="2" style="width:200px;"> Function to perform ROI pooling.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_ROI_POOLING
+ </ul>
+ <td>NEROIPoolingLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F32<td>U16<td>F32
+ <tr><td>QASYMM8<td>U16<td>QASYMM8
+ </table>
+<tr>
+ <td>CLROIPoolingLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>dst
+ <tr><td>F16<td>U16<td>F16
+ <tr><td>F32<td>U16<td>F32
+ <tr><td>QASYMM8<td>U16<td>QASYMM8
+ </table>
+<tr>
+ <td rowspan="2">Scale
+      <td rowspan="2" style="width:200px;"> Function to resize a tensor using one of the following interpolation methods: - Bilinear - Nearest neighbor
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_RESIZE_BILINEAR
+ <li>ANEURALNETWORKS_RESIZE_NEAREST_NEIGHBOR
+ </ul>
+ <td>NEScale
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ <tr><td>U8<td>U8
+ <tr><td>S8<td>S8
+ <tr><td>S16<td>S16
+ </table>
+<tr>
+ <td>CLScale
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ <tr><td>U8<td>U8
+ <tr><td>S16<td>S16
+ </table>
+<tr>
+ <td rowspan="2">Select
+      <td rowspan="2" style="width:200px;"> Function to select values from two tensors depending on an input tensor of booleans.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_SELECT
+ </ul>
+ <td>NESelect
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>U8<td>All<td>All<td>All
+ </table>
+<tr>
+ <td>CLSelect
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>U8<td>All<td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">Slice
+ <td rowspan="2" style="width:200px;"> Function to perform tensor slicing.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_SLICE
+ </ul>
+ <td>NESlice
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLSlice
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">SoftmaxLayer
+ <td rowspan="2" style="width:200px;"> Function to compute a SoftmaxLayer and a Log SoftmaxLayer.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_LOG_SOFTMAX
+ <li>ANEURALNETWORKS_SOFTMAX
+ </ul>
+ <td>NESoftmaxLayerGeneric
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td>CLSoftmaxLayerGeneric
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>QASYMM8<td>QASYMM8
+ <tr><td>QASYMM8_SIGNED<td>QASYMM8_SIGNED
+ <tr><td>F16<td>F16
+ <tr><td>F32<td>F32
+ </table>
+<tr>
+ <td rowspan="2">SpaceToBatchLayer
+ <td rowspan="2" style="width:200px;"> Function to divide a tensor spatially.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_SPACE_TO_BATCH_ND
+ </ul>
+ <td>NESpaceToBatchLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>All<td>S32<td>S32<td>All
+ </table>
+<tr>
+ <td>CLSpaceToBatchLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>All<td>S32<td>S32<td>All
+ </table>
+<tr>
+ <td rowspan="2">SpaceToDepthLayer
+ <td rowspan="2" style="width:200px;"> Function to rearrange blocks of spatial data into depth.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_SPACE_TO_DEPTH
+ </ul>
+ <td>NESpaceToDepthLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLSpaceToDepthLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">Split
+ <td rowspan="2" style="width:200px;"> Function to split a tensor along a given axis.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_SPLIT
+ </ul>
+ <td>NESplit
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLSplit
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">StackLayer
+ <td rowspan="2" style="width:200px;"> Function to stack tensors along an axis.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEStackLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLStackLayer
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">StridedSlice
+ <td rowspan="2" style="width:200px;"> Function to extract a strided slice of a tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_STRIDED_SLICE
+ </ul>
+ <td>NEStridedSlice
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLStridedSlice
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">Tile
+ <td rowspan="2" style="width:200px;"> Function to construct a tensor by tiling a given tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_TILE
+ </ul>
+ <td>NETile
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLTile
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">Transpose
+ <td rowspan="2" style="width:200px;"> Function to transpose a 2D tensor.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_TRANSPOSE
+ </ul>
+ <td>NETranspose
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLTranspose
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">Unstack
+ <td rowspan="2" style="width:200px;"> Function to unpack a rank-R tensor into rank-(R-1) tensors.
+ <td rowspan="2">
+ <ul>
+ <li>n/a
+ </ul>
+ <td>NEUnstack
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td>CLUnstack
+ <td>
+ <ul>
+ <li>All
+ </ul>
+ <td>
+ <table>
+ <tr><th>src<th>dst
+ <tr><td>All<td>All
+ </table>
+<tr>
+ <td rowspan="2">WinogradConvolutionLayer
+      <td rowspan="2" style="width:200px;"> Function to perform Winograd convolution.
+ <td rowspan="2">
+ <ul>
+ <li>ANEURALNETWORKS_CONV_2D
+ </ul>
+ <td>NEWinogradConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ </table>
+<tr>
+ <td>CLWinogradConvolutionLayer
+ <td>
+ <ul>
+ <li>NHWC
+ <li>NCHW
+ </ul>
+ <td>
+ <table>
+ <tr><th>src0<th>src1<th>src2<th>dst
+ <tr><td>F16<td>F16<td>F16<td>F16
+ <tr><td>F32<td>F32<td>F32<td>F32
+ </table>
+</table>
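The functions in the table above share a common usage pattern on library tensors. As a standalone reference for what one of the simpler entries computes, the following hedged sketch reproduces the Transpose semantics (dst(x, y) = src(y, x)) on a plain 2D buffer; `transpose2d` is a hypothetical helper written for illustration only and is not part of the arm_compute API:

```cpp
// Standalone sketch of what the Transpose entry computes on a 2D tensor:
// dst(x, y) = src(y, x). The library's NETranspose/CLTranspose operate on
// arm_compute tensors; transpose2d is a hypothetical illustration helper.
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose a row-major rows x cols matrix into a cols x rows matrix.
std::vector<float> transpose2d(const std::vector<float> &src, std::size_t rows, std::size_t cols)
{
    std::vector<float> dst(src.size());
    for (std::size_t r = 0; r < rows; ++r)
    {
        for (std::size_t c = 0; c < cols; ++c)
        {
            // Element at (r, c) in the source lands at (c, r) in the destination.
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    return dst;
}
```

The "All" entries in the src/dst columns indicate that the same index remapping applies regardless of element data type.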
+
+*/
+} // namespace
diff --git a/docs/user_guide/release_version_and_change_log.dox b/docs/user_guide/release_version_and_change_log.dox
new file mode 100644
index 0000000000..d9c2c8476d
--- /dev/null
+++ b/docs/user_guide/release_version_and_change_log.dox
@@ -0,0 +1,1723 @@
+///
+/// Copyright (c) 2017-2024 Arm Limited.
+///
+/// SPDX-License-Identifier: MIT
+///
+/// Permission is hereby granted, free of charge, to any person obtaining a copy
+/// of this software and associated documentation files (the "Software"), to
+/// deal in the Software without restriction, including without limitation the
+/// rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+/// sell copies of the Software, and to permit persons to whom the Software is
+/// furnished to do so, subject to the following conditions:
+///
+/// The above copyright notice and this permission notice shall be included in all
+/// copies or substantial portions of the Software.
+///
+/// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+/// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+/// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+/// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+/// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+/// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+/// SOFTWARE.
+///
+namespace arm_compute
+{
+/** @page versions_changelogs Release Versions and Changelog
+
+@tableofcontents
+
+@section S2_1_versions Release versions
+
+All releases are numbered vYY.MM, where YY are the last two digits of the year and MM is the month number.
+If there is more than one release in a month then an extra sequential number is appended at the end:
+
+ v17.03 (First release of March 2017)
+ v17.03.1 (Second release of March 2017)
+ v17.04 (First release of April 2017)
+
+@note We aim to publish one major public release with new features per quarter. All releases in between will only contain bug fixes.
+@note Starting from release 22.05, the 'master' branch is no longer used; it has been replaced by 'main'. Please update your clone jobs accordingly.
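The vYY.MM[.n] scheme above can be sketched with a small parser; `parse_release` and `ReleaseVersion` are hypothetical illustration helpers, not part of the library:

```cpp
// Sketch of the vYY.MM[.n] release numbering described above:
// YY = last two digits of the year, MM = month number, and an optional
// extra sequential number n when a month has more than one release.
// parse_release / ReleaseVersion are hypothetical helpers for illustration.
#include <cassert>
#include <cstdio>
#include <string>

struct ReleaseVersion
{
    int year  = 0; // last two digits of the year
    int month = 0; // month number
    int seq   = 0; // extra sequential number, 0 when absent
};

// Parse tags such as "v17.03" or "v17.03.1".
ReleaseVersion parse_release(const std::string &tag)
{
    ReleaseVersion v;
    // For "v17.03" the third conversion fails and seq keeps its default of 0.
    std::sscanf(tag.c_str(), "v%d.%d.%d", &v.year, &v.month, &v.seq);
    return v;
}
```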
+
+@section S2_2_changelog Changelog
+
+v24.08 Public major release
+ - Optimize CPU activation functions using LUT-based implementation:
+ - Tanh function for FP16.
+
+v24.05 Public major release
+ - Add @ref CLScatter operator for FP32/16, S32/16/8, U32/16/8 data types
+ - Various fixes to enable FP16 kernels in armv8a multi_isa builds.
+ - Updated logic in the OpenMP scheduler to exclude LITTLE cores.
+
+v24.04 Public major release
+ - Add Bfloat16 data type support for @ref NEMatMul.
+ - Add support for SoftMax in SME2 for FP32, FP16, QASYMM8 and QASYMM8_SIGNED.
+ - Add support for in place accumulation to CPU GEMM kernels.
+ - Add low-precision Int8 * Int8 -> FP32 CPU GEMM which dequantizes after multiplication.
+ - Add is_dynamic flag to QuantizationInfo to signal to operators that it may change after configuration.
+ - Performance optimizations:
+ - Optimize start-up time of @ref NEConvolutionLayer for some input configurations where GeMM is selected as the convolution algorithm
+ - Optimize @ref NEConvolutionLayer for input tensor size > 1e7 bytes and weight tensor height > 7
+ - Optimize @ref NESoftmaxLayer for axis != 0 by natively supporting higher axes up to axis 3.
+
+v24.02.1 Public patch release
+ - Fix performance regression in fixed-format kernels
+ - Fix compile and runtime errors in arm_compute_validation for Windows on Arm(WoA)
+
+v24.02 Public major release
+ - Replace template writer with compute kernel writer in dynamic fusion.
+ - Performance optimizations:
+ - Parallelize @ref NEDepthwiseConvolutionLayer over batches if there is only 1 row
+
+v24.01 Public major release
+ - Remove the legacy 'libarm_compute_core' library. This library is an artifact of Compute Library's legacy library architecture and no longer serves any purpose.
+ You should link only to the main `libarm_compute` library for core functionality.
+ - Expand GPUTarget list with Mali™ G720 and G620.
+ - Optimize CPU activation functions using LUT-based implementation:
+ - Sigmoid function for FP16.
+ - New features
+ - Add support for FP16 in all multi_isa builds.
+ - Performance optimizations:
+ - Optimize @ref NESoftmaxLayer
+ - Optimize @ref NEDepthToSpaceLayer.
+
+v23.11 Public major release
+ - New features
+ - Add support for input data type U64/S64 in CLCast and NECast.
+ - Add support for output data type S64 in NEArgMinMaxLayer and CLArgMinMaxLayer
+ - Port the following kernels in the experimental Dynamic Fusion interface to use the new Compute Kernel Writer interface:
+ - @ref experimental::dynamic_fusion::GpuCkwResize
+ - @ref experimental::dynamic_fusion::GpuCkwPool2d
+ - @ref experimental::dynamic_fusion::GpuCkwDepthwiseConv2d
+ - @ref experimental::dynamic_fusion::GpuCkwMatMul
+   - Add support for the OpenCL™ command buffer with mutable dispatch extension.
+ - Add support for Arm® Cortex®-A520 and Arm® Cortex®-R82.
+ - Add support for negative axis values and inverted axis values in @ref arm_compute::NEReverse and @ref arm_compute::CLReverse.
+ - Add new OpenCL™ kernels:
+ - @ref opencl::kernels::ClMatMulLowpNativeMMULKernel support for QASYMM8 and QASYMM8_SIGNED, with batch support
+ - Performance optimizations:
+ - Optimize @ref cpu::CpuReshape
+ - Optimize @ref opencl::ClTranspose
+ - Optimize @ref NEStackLayer
+ - Optimize @ref CLReductionOperation.
+ - Optimize @ref CLSoftmaxLayer.
+ - Optimize start-up time of @ref NEConvolutionLayer for some input configurations where GeMM is selected as the convolution algorithm
+ - Reduce CPU Overhead by optimal flushing of CL kernels.
+ - Deprecate support for Bfloat16 in @ref cpu::CpuCast.
+ - Support for U32 axis in @ref arm_compute::NEReverse and @ref arm_compute::CLReverse will be deprecated in 24.02.
+ - Remove legacy PostOps interface. PostOps was the experimental interface for kernel fusion and is replaced by the new Dynamic Fusion interface.
+ - Update OpenCL™ API headers to v2023.04.17
+
+v23.08 Public major release
+ - Deprecate the legacy 'libarm_compute_core' library. This library is an artifact of Compute Library's legacy library architecture and no longer serves any purpose.
+ Users must no longer link their applications to this library and instead link only to the main `libarm_compute` library for core functionality.
+ - New features
+ - Rewrite CLArgMinMaxLayer for axis 0 and enable S64 output.
+ - Add multi-sketch support for dynamic fusion.
+ - Break up arm_compute/core/Types.h and utils/Utils.h a bit to reduce unused code in each inclusion of these headers.
+ - Add Fused Activation to CLMatMul.
+ - Implement FP32/FP16 @ref opencl::kernels::ClMatMulNativeMMULKernel using the MMUL extension.
+ - Use MatMul in fully connected layer with dynamic weights when supported.
+ - Optimize CPU depthwise convolution with channel multiplier.
+ - Add support in CpuCastKernel for conversion of S64/U64 to F32.
+ - Add new OpenCL™ kernels:
+ - @ref opencl::kernels::ClMatMulNativeMMULKernel support for FP32 and FP16, with batch support
+ - Enable transposed convolution with non-square kernels on CPU and GPU.
+ - Add support for input data type U64/S64 in CLCast.
+ - Add new Compute Kernel Writer (CKW) subproject that offers a C++ interface to generate tile-based OpenCL code in just-in-time fashion.
+ - Port the following kernels in the experimental Dynamic Fusion interface to use the new Compute Kernel Writer interface with support for FP16/FP32 only:
+ - @ref experimental::dynamic_fusion::GpuCkwActivation
+ - @ref experimental::dynamic_fusion::GpuCkwCast
+ - @ref experimental::dynamic_fusion::GpuCkwDirectConv2d
+ - @ref experimental::dynamic_fusion::GpuCkwElementwiseBinary
+ - @ref experimental::dynamic_fusion::GpuCkwStore
+ - Various optimizations and bug fixes.
+
+v23.05.1 Public patch release
+ - Enable CMake and Bazel option to build multi_isa without FP16 support.
+ - Fix compilation error in NEReorderLayer (aarch64 only).
+ - Disable invalid (false-negative) validation test with CPU scale layer on FP16.
+ - Various bug fixes
+
+v23.05 Public major release
+ - New features:
+ - Add new Arm® Neon™ kernels / functions:
+ - @ref NEMatMul for QASYMM8, QASYMM8_SIGNED, FP32 and FP16, with batch support.
+ - NEReorderLayer (aarch64 only)
+ - Add new OpenCL™ kernels / functions:
+ - @ref CLMatMul support for QASYMM8, QASYMM8_SIGNED, FP32 and FP16, with batch support.
+   - Add support for multiple dimensions in the indices parameter for both the Arm® Neon™ and OpenCL™ implementations of the Gather Layer.
+ - Add support for dynamic weights in @ref CLFullyConnectedLayer and @ref NEFullyConnectedLayer for all data types.
+   - Add support for cropping in the Arm® Neon™ and OpenCL™ implementations of the BatchToSpace Layer for all data types.
+ - Add support for quantized data types for the ElementwiseUnary Operators for Arm® Neon™.
+ - Implement RSQRT for quantized data types on OpenCL™.
+ - Add FP16 depthwise convolution kernels for SME2.
+ - Performance optimizations:
+ - Improve CLTuner exhaustive mode tuning time.
+ - Deprecate dynamic block shape in @ref NEBatchToSpaceLayer and @ref CLBatchToSpaceLayer.
+ - Various optimizations and bug fixes.
+
+v23.02.1 Public patch release
+ - Allow mismatching data layouts between the source tensor and weights for \link cpu::CpuGemmDirectConv2d CpuGemmDirectConv2d \endlink with fixed format kernels.
+ - Fixes for experimental CPU only Bazel and CMake builds.
+
+v23.02 Public major release
+ - New features:
+ - Rework the experimental dynamic fusion interface by identifying auxiliary and intermediate tensors, and specifying an explicit output operator.
+ - Add the following operators to the experimental dynamic fusion API:
+ - GpuAdd, GpuCast, GpuClamp, GpuDepthwiseConv2d, GpuMul, GpuOutput, GpuPool2d, GpuReshape, GpuResize, GpuSoftmax, GpuSub.
+ - Add SME/SME2 kernels for GeMM, Winograd convolution, Depthwise convolution and Pooling.
+ - Add new CPU operator AddMulAdd for float and quantized types.
+ - Add new flag @ref ITensorInfo::lock_paddings() to tensors to prevent extending tensor paddings.
+ - Add experimental support for CPU only Bazel and CMake builds.
+ - Performance optimizations:
+ - Optimize CPU base-e exponential functions for FP32.
+ - Optimize CPU StridedSlice by copying first dimension elements in bulk where possible.
+ - Optimize CPU quantized Subtraction by reusing the quantized Addition kernel.
+ - Optimize CPU ReduceMean by removing quantization steps and performing the operation in integer domain.
+ - Optimize GPU Scale and Dynamic Fusion GpuResize by removing quantization steps and performing the operation in integer domain.
+ - Update the heuristic for CLDepthwiseConvolutionNative kernel.
+ - Add new optimized OpenCL kernel to compute indirect convolution:
+ - \link opencl::kernels::ClIndirectConv2dKernel ClIndirectConv2dKernel \endlink
+ - Add new optimized OpenCL kernel to compute transposed convolution:
+ - \link opencl::kernels::ClTransposedConvolutionKernel ClTransposedConvolutionKernel \endlink
+ - Update recommended/minimum NDK version to r20b.
+ - Various optimizations and bug fixes.
+
+v22.11 Public major release
+ - New features:
+ - Add new experimental dynamic fusion API.
+ - Add CPU batch matrix multiplication with adj_x = false and adj_y = false for FP32.
+ - Add CPU MeanStdDevNorm for QASYMM8.
+ - Add CPU and GPU GELU activation function for FP32 and FP16.
+ - Add CPU swish activation function for FP32 and FP16.
+ - Performance optimizations:
+ - Optimize CPU bilinear scale for FP32, FP16, QASYMM8, QASYMM8_SIGNED, U8 and S8.
+ - Optimize CPU activation functions using LUT-based implementation:
+ - Sigmoid function for QASYMM8 and QASYMM8_SIGNED.
+ - Hard swish function for QASYMM8_SIGNED.
+ - Optimize CPU addition for QASYMM8 and QASYMM8_SIGNED using fixed-point arithmetic.
+ - Optimize CPU multiplication, subtraction and activation layers by considering tensors as 1D.
+ - Optimize GPU depthwise convolution kernel and heuristic.
+ - Optimize GPU Conv2d heuristic.
+ - Optimize CPU MeanStdDevNorm for FP16.
+ - Optimize CPU tanh activation function for FP16 using rational approximation.
+ - Improve GPU GeMMLowp start-up time.
+ - Various optimizations and bug fixes.
+
+v22.08 Public major release
+ - Various bug fixes.
+ - Disable unsafe FP optimizations causing accuracy issues in:
+ - \link opencl::kernels::ClDirectConv2dKernel ClDirectConv2dKernel \endlink
+ - \link opencl::kernels::ClDirectConv2dKernel ClDirectConv3dKernel \endlink
+ - @ref CLDepthwiseConvolutionLayerNativeKernel
+ - Add Dynamic Fusion of Elementwise Operators: Div, Floor, Add.
+ - Optimize the gemm_reshaped_rhs_nly_nt OpenCL kernel using the arm_matrix_multiply extension available for Arm® Mali™-G715 and Arm® Mali™-G615.
+ - Add support for the arm_matrix_multiply extension in the gemmlowp_mm_reshaped_only_rhs_t OpenCL kernel.
+ - Expand GPUTarget list with missing Mali™ GPUs product names: G57, G68, G78AE, G610, G510, G310.
+ - Extend the direct convolution 2d interface to configure the block size.
+ - Update ClConv2D heuristic to use direct convolution.
+ - Use official Khronos® OpenCL extensions:
+ - Add cl_khr_integer_dot_product extension support.
+ - Add support of OpenCL 3.0 non-uniform workgroup.
+ - Cpu performance optimizations:
+ - Add LUT-based implementation of Hard Swish and Leaky ReLU activation function for aarch64 build.
+ - Optimize Add layer by considering the input tensors as 1D array.
+ - Add fixed-format BF16, FP16 and FP32 Neon™ GEMM kernels to support variable weights.
+   - Add a new Winograd convolution kernel implementation and update the ACL \link arm_compute::cpu::CpuWinogradConv2d CpuWinogradConv2d\endlink operator.
+ - Add experimental support for native builds for Windows® on Arm™.
+ - Build flag interpretation change: arch=armv8.6-a now translates to the -march=armv8.6-a CXX flag instead of -march=armv8.2-a plus explicit selection of feature extensions.
+ - Build flag change: toolchain_prefix, compiler_prefix:
+ - Use empty string "" to suppress any prefixes.
+ - Use "auto" to use default (auto) prefixes chosen by the build script. This is the default behavior when unspecified.
+ - Any other string will be used as custom prefixes to the compiler and the rest of toolchain tools.
+ - The default behaviour when prefix is unspecified does not change, but its signifier has been changed from empty string "" to "auto".
+ - armv7a with Android build will no longer be tested or maintained.
+
+v22.05 Public major release
+ - Various bug fixes.
+ - Various optimizations.
+ - Add support for NDK r23b.
+ - Inclusive language adjustment. Please refer to @ref S5_0_inc_lang for details.
+ - New Arm® Neon™ kernels / functions:
+   - \link cpu::kernels::CpuPool3dKernel CpuPool3dKernel \endlink
+ - New OpenCL kernels / functions:
+   - \link opencl::kernels::ClPool3dKernel ClPool3dKernel \endlink
+ - Improve the start-up times for the following OpenCL kernels:
+ - \link opencl::kernels::ClWinogradInputTransformKernel ClWinogradInputTransformKernel \endlink
+ - \link opencl::kernels::ClWinogradOutputTransformKernel ClWinogradOutputTransformKernel \endlink
+ - \link opencl::kernels::ClWinogradFilterTransformKernel ClWinogradFilterTransformKernel \endlink
+ - \link opencl::kernels::ClHeightConcatenateKernel ClHeightConcatenateKernel \endlink
+ - Decouple the implementation of the following Cpu kernels into various data types (fp32, fp16, int):
+ - \link cpu::kernels::CpuDirectConv2dKernel CpuDirectConv2dKernel \endlink
+ - \link cpu::kernels::CpuDepthwiseConv2dNativeKernel CpuDepthwiseConv2dNativeKernel \endlink
+ - \link cpu::kernels::CpuGemmMatrixAdditionKernel CpuGemmMatrixAdditionKernel \endlink
+ - \link cpu::kernels::CpuGemmMatrixMultiplyKernel CpuGemmMatrixMultiplyKernel \endlink
+ - @ref NEFuseBatchNormalizationKernel
+ - @ref NEL2NormalizeLayerKernel
+
+v22.02 Public major release
+ - Various bug fixes.
+ - Various optimizations.
+ - Update A510 arm_gemm cpu Kernels.
+ - Inclusive language adjustment. Please refer to @ref S5_0_inc_lang for details.
+ - Improve the start-up time for the following OpenCL kernels:
+ - @ref CLScale
+ - @ref CLGEMM
+ - @ref CLDepthwiseConvolutionLayer
+ - \link opencl::kernels::ClIm2ColKernel ClIm2ColKernel \endlink
+ - \link opencl::kernels::ClDirectConv2dKernel ClDirectConv2dKernel \endlink
+ - Remove functions:
+ - CLRemap
+ - NERemap
+ - Remove padding from OpenCL kernels:
+ - \link opencl::kernels::ClDirectConv2dKernel ClDirectConv2dKernel \endlink
+ - Remove padding from Cpu kernels:
+ - \link cpu::kernels::CpuDirectConv2dKernel CpuDirectConv2dKernel \endlink
+ - Decouple the implementation of the following Cpu kernels into various data types (fp32, fp16, int):
+ - \link cpu::kernels::CpuActivationKernel CpuActivationKernel \endlink
+ - \link cpu::kernels::CpuAddKernel CpuAddKernel \endlink
+ - \link cpu::kernels::CpuElementwiseKernel CpuElementwiseKernel \endlink
+ - \link cpu::CpuSoftmaxGeneric CpuSoftmaxKernel \endlink
+ - @ref NEBoundingBoxTransformKernel
+ - @ref NECropKernel
+ - @ref NEComputeAllAnchorsKernel
+ - @ref NEInstanceNormalizationLayerKernel
+ - NEMaxUnpoolingLayerKernel
+ - @ref NEMeanStdDevNormalizationKernel
+ - @ref NERangeKernel
+ - @ref NEROIAlignLayerKernel
+ - @ref NESelectKernel
+
+v21.11 Public major release
+ - Various bug fixes.
+ - Various optimizations:
+ - Improve performance of bilinear and nearest neighbor Scale on both CPU and GPU for FP32, FP16, Int8, Uint8 data types
+ - Improve performance of Softmax on GPU for Uint8/Int8
+ - New OpenCL kernels / functions:
+ - @ref CLConv3D
+ - New Arm® Neon™ kernels / functions:
+ - @ref NEConv3D
+ - Support configurable build by a selected subset of operator list
+ - Support MobileBert on Neon™ backend
+ - Improve operator/function logging
+ - Remove padding from OpenCL kernels:
+ - ClPool2dKernel
+ - ClScaleKernel
+ - ClGemmMatrixMultiplyReshapedKernel
+ - Remove padding from Cpu kernels:
+ - CpuPool2dKernel
+ - Remove Y padding from OpenCL kernels:
+ - ClGemmMatrixMultiplyKernel
+ - ClGemmReshapedRHSMatrixKernel
+ - Remove legacy GeMM kernels in gemm_v1.cl
+
+v21.08 Public major release
+ - Various bug fixes.
+ - Various optimizations:
+ - Improve LWS (Local-Workgroup-Size) heuristic in OpenCL for GeMM, Direct Convolution and Winograd Transformations when OpenCL tuner is not used
+ - Improve QASYMM8/QSYMM8 performance on OpenCL for various Arm® Mali™ GPU architectures
+ - Add dynamic weights support in Fully connected layer (CPU/GPU)
+ - Various performance optimizations for floating-point data types (CPU/GPU)
+ - Add a reduced core library build arm_compute_core_v2
+ - Expose Operator API
+ - Support fat binary build for armv8.2-a via fat_binary build flag
+ - Add CPU discovery capabilities
+ - Add data type f16 support for:
+ - CLRemapKernel
+ - Port the following functions to stateless API:
+ - @ref CLConvolutionLayer
+ - @ref CLFlattenLayer
+ - @ref CLFullyConnectedLayer
+ - @ref CLGEMM
+ - @ref CLGEMMConvolutionLayer
+ - @ref CLGEMMLowpMatrixMultiplyCore
+ - @ref CLWinogradConvolutionLayer
+ - @ref NEConvolutionLayer
+ - @ref NEFlattenLayer
+ - @ref NEFullyConnectedLayer
+ - @ref NEGEMM
+ - @ref NEGEMMConv2d
+ - @ref NEGEMMConvolutionLayer
+ - @ref NEGEMMLowpMatrixMultiplyCore
+ - @ref NEWinogradConvolutionLayer
+ - Remove the following functions:
+ - CLWinogradInputTransform
+ - Remove CLCoreRuntimeContext
+ - Remove ICPPSimpleKernel
+ - Rename file arm_compute/runtime/CL/functions/CLElementWiseUnaryLayer.h to arm_compute/runtime/CL/functions/CLElementwiseUnaryLayer.h
+
+v21.05 Public major release
+ - Various bug fixes.
+ - Various optimizations.
+ - Various documentation updates:
+ - Add supported operators and corresponding Android NNAPI operators.
+ - Documentation reorg into user guide and contributor guide.
+ - Add support for a global allocator for OpenCL tensors
+ - Add experimental support for [CLVK](https://github.com/kpet/clvk).
+ - Add data type S32 support for:
+ - @ref opencl::kernels::ClArithmeticKernel
+ - Add data type QASYMM8 support for:
+ - @ref CLROIPoolingLayer
+ - @ref CLROIPoolingLayerKernel
+ - @ref NEROIPoolingLayer
+ - @ref NEROIPoolingLayerKernel
+ - Add per-channel quantization support for:
+ - @ref CLDeconvolutionLayer
+ - @ref CLDirectDeconvolutionLayer
+ - @ref NEConvolutionLayer
+ - @ref NEDeconvolutionLayer
+ - Remove padding from OpenCL kernels:
+ - @ref CLL2NormalizeLayerKernel
+ - CLDepthwiseConvolutionLayer3x3NHWCKernel
+ - @ref CLNormalizationLayerKernel
+ - @ref CLNormalizePlanarYUVLayerKernel
+ - @ref opencl::kernels::ClMulKernel
+ - @ref CLReductionOperationKernel
+ - @ref CLROIPoolingLayerKernel
+ - Remove computer vision support from Arm® Neon™ backend
+ - Remove the following functions:
+ - NEAbsoluteDifference
+ - NEAccumulate
+ - NEBox3x3
+ - NECannyEdge
+ - NEChannelCombine
+ - NEChannelExtract
+ - NEColorConvert
+ - NEConvolution
+ - NEDerivative
+ - NEDilate
+ - NEEqualizeHistogram
+ - NEErode
+ - NEFastCorners
+ - NEGaussian3x3
+ - NEGaussian5x5
+ - NEGaussianPyramid
+ - NEHOGDescriptor
+ - NEHOGDetector
+ - NEHOGGradient
+ - NEHOGMultiDetection
+ - NEHarrisCorners
+ - NEHistogram
+ - NEIntegralImage
+ - NELaplacianPyramid
+ - NELaplacianReconstruct
+ - NEMagnitude
+ - NEMeanStdDev
+ - NEMedian3x3
+ - NEMinMaxLocation
+ - NENonLinearFilter
+ - NEOpticalFlow
+ - NEPhase
+ - NEScharr3x3
+ - NESobel3x3
+ - NESobel5x5
+ - NESobel7x7
+ - NETableLookup
+ - NEThreshold
+ - NEWarpAffine
+ - NEWarpPerspectiveKernel
+ - Remove all GLES kernels / functions / tests / examples
+ - Remove computer vision support from CL backend
+ - Remove the following functions:
+ - CLAbsoluteDifference
+ - CLAccumulate
+ - CLBox3x3
+ - CLCannyEdge
+ - CLChannelCombine
+ - CLChannelExtract
+ - CLColorConvert
+ - CLConvolution
+ - CLDerivative
+ - CLDilate
+ - CLEqualizeHistogram
+ - CLErode
+ - CLFastCorners
+ - CLGaussian3x3
+ - CLGaussian5x5
+ - CLGaussianPyramid
+ - CLHOGDescriptor
+ - CLHOGDetector
+ - CLHOGGradient
+ - CLHOGMultiDetection
+ - CLHarrisCorners
+ - CLHistogram
+ - CLIntegralImage
+ - CLLaplacianPyramid
+ - CLLaplacianReconstruct
+ - CLMagnitude
+ - CLMeanStdDev
+ - CLMedian3x3
+ - CLMinMaxLocation
+ - CLNonLinearFilter
+ - CLOpticalFlow
+ - CLPhase
+ - CLScharr3x3
+ - CLSobel3x3
+ - CLSobel5x5
+ - CLSobel7x7
+ - CLTableLookup
+ - CLThreshold
+ - CLWarpAffine
+ - CLWarpPerspective
+
+v21.02 Public major release
+ - Various bug fixes.
+ - Various optimizations.
+ - Upgrade C++ standard to C++14
+ - Add macOS support
+ - Add Armv8-R AArch64 architecture support
+ - Add SVE/SVE2 support for:
+ - NEScaleKernel
+ - @ref NEActivationLayer
+ - @ref NEArithmeticAddition
+ - @ref NEBatchNormalizationLayerKernel
+ - cpu::kernels::CpuLogits1DSoftmaxKernel
+ - cpu::kernels::CpuLogits1DMaxKernel
+ - @ref cpu::kernels::CpuElementwiseUnaryKernel
+ - Remove padding from OpenCL kernels:
+ - CLDirectConvolutionLayerKernel
+ - @ref CLArgMinMaxLayerKernel
+ - @ref CLPadLayerKernel
+ - @ref CLROIAlignLayerKernel
+ - @ref CLRangeKernel
+ - CLScaleKernel
+ - @ref CLSelectKernel
+ - @ref CLBitwiseKernel
+ - @ref opencl::kernels::ClFloorKernel
+ - CLTransposeKernel
+ - Deprecate functions in CLTuner:
+ - add_lws_to_table
+ - import_lws_table
+ - lws_table
+ - Remove functions:
+ - NELocallyConnectedLayer / CLLocallyConnectedLayer
+ - NEIm2Col
+ - NECol2Im
+ - NEGEMMInterleave4x4
+ - NEGEMMTranspose1xW
+ - NEComputeAllAnchors / CLComputeAllAnchors
+ - NEGEMMAssemblyDispatch
+ - NEUpsampleLayer / CLUpsampleLayer
+ - Remove kernels:
+ - NEGEMMMatrixVectorMultiplyKernel
+ - NELocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedMatrixMultiplyKernel
+ - NEUpsampleLayerKernel / CLUpsampleLayerKernel
+ - Extend OpenCL tuner with workgroup batch size support
+    - Experimental extension for the OpenCL tuner to tune the batches of work groups distributed to compute units
+ - Add functionality to load the OpenCL GEMM heuristics at runtime
+ - The GEMM heuristic file (MLGO) can be used to update the default GEMM heuristics available for OpenCL
+ - Note: there might be performance regressions against v20.08 in Inception v3 using int8 data types on Arm Mali-G77 GPUs. Currently under investigation
+ - Note: data-type decoupling is in progress and experimental. Warnings about unused symbols might be raised
+
+v20.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Performance regressions may be observed when executing Depthwise Convolution on Arm® Neon™ with a depth multiplier > 1 for quantized data types.
+   This is planned to be resolved in the 21.02 release.
+ - Added new data type QASYMM8_SIGNED support for @ref NEROIAlignLayer.
+ - Added new data type S32 support for:
+ - NEArithmeticSubtraction
+ - NEArithmeticSubtractionKernel
+ - @ref NEPixelWiseMultiplication
+ - NEPixelWiseMultiplicationKernel
+ - NEElementwiseDivision
+ - NEDivisionOperationKernel
+ - Interface change
+   - Softmax axis now has the same meaning as in other major frameworks: axis defines the dimension on which
+     Softmax/Logsoftmax is performed, e.g. for an input of shape 4x5x6 and axis=1, softmax is applied to 4x6=24 vectors of size 5.
+     The supported value range of axis is [-rank, rank).
+ This change applies to the following functions:
+ - @ref NESoftmaxLayer
+ - @ref NELogSoftmaxLayer
+ - @ref CLSoftmaxLayer
+ - @ref CLLogSoftmaxLayer
+ - GCSoftmaxLayer
+ - New OpenCL kernels / functions:
+ - CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
+ - @ref CLLogicalNot
+ - @ref CLLogicalAnd
+ - @ref CLLogicalOr
+ - New Arm® Neon™ kernels / functions:
+ - @ref NELogicalNot
+ - @ref NELogicalAnd
+ - @ref NELogicalOr
+ - Removed padding from Arm® Neon™ kernels:
+ - NEComplexPixelWiseMultiplicationKernel
+ - NENonMaximaSuppression3x3Kernel
+ - NERemapKernel
+ - NEGEMMInterleave4x4Kernel
+ - NEDirectConvolutionLayerKernel
+ - NEScaleKernel
+ - NELocallyConnectedMatrixMultiplyKernel
+ - NEGEMMLowpOffsetContributionKernel
+ - NEGEMMTranspose1xWKernel
+ - NEPoolingLayerKernel
+ - NEConvolutionKernel
+ - NEDepthwiseConvolutionLayerNativeKernel
+ - NEGEMMLowpMatrixMultiplyKernel
+ - NEGEMMMatrixMultiplyKernel
+ - NEDirectConvolutionLayerOutputStageKernel
+ - @ref NEReductionOperationKernel
+ - NEGEMMLowpMatrixAReductionKernel
+ - NEGEMMLowpMatrixBReductionKernel
+ - Removed padding from OpenCL kernels:
+ - CLBatchConcatenateLayerKernel
+ - CLElementwiseOperationKernel
+ - @ref CLBatchNormalizationLayerKernel
+ - CLPoolingLayerKernel
+ - CLWinogradInputTransformKernel
+ - CLGEMMLowpMatrixMultiplyNativeKernel
+ - CLGEMMLowpMatrixAReductionKernel
+ - CLGEMMLowpMatrixBReductionKernel
+ - CLGEMMLowpOffsetContributionOutputStageKernel
+ - CLGEMMLowpOffsetContributionKernel
+ - CLWinogradOutputTransformKernel
+ - CLGEMMLowpMatrixMultiplyReshapedKernel
+ - @ref CLFuseBatchNormalizationKernel
+ - @ref CLDepthwiseConvolutionLayerNativeKernel
+ - CLDepthConvertLayerKernel
+ - CLCopyKernel
+ - CLDepthwiseConvolutionLayer3x3NHWCKernel
+ - CLActivationLayerKernel
+ - CLWinogradFilterTransformKernel
+ - CLWidthConcatenateLayerKernel
+ - CLWidthConcatenate4TensorsKernel
+ - CLWidthConcatenate2TensorsKernel
+ - CLLogits1DMaxShiftExpSumKernel
+ - CLLogits1DNormKernel
+ - CLHeightConcatenateLayerKernel
+ - CLGEMMMatrixMultiplyKernel
+ - CLGEMMLowpQuantizeDownInt32ScaleKernel
+ - CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
+ - CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+ - CLDepthConcatenateLayerKernel
+ - CLGEMMLowpQuantizeDownInt32ScaleByFixedPointKernel
+ - Removed OpenCL kernels / functions:
+ - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+ - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel
+ - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel
+ - Deprecated OpenCL kernels / functions (if a kernel is used only by a function that is being deprecated, the kernel is deprecated along with it):
+ - CLLocallyConnectedLayer
+ - CLLocallyConnectedMatrixMultiplyKernel
+ - CLAbsoluteDifference
+ - CLAbsoluteDifferenceKernel
+ - CLAccumulate
+ - CLAccumulateKernel
+ - CLAccumulateSquared
+ - CLAccumulateSquaredKernel
+ - CLAccumulateWeighted
+ - CLAccumulateWeightedKernel
+ - CLAccumulateWeightedFP16Kernel
+ - CLBox3x3
+ - CLBox3x3Kernel
+ - CLBox3x3FP16Kernel
+ - CLCannyEdge
+ - CLChannelCombine
+ - CLChannelCombineKernel
+ - CLChannelExtract
+ - CLChannelExtractKernel
+ - CLColorConvert
+ - CLColorConvertKernel
+ - CLConvolution3x3
+ - CLConvolutionRectangle
+ - CLConvolutionRectangleKernel
+ - CLConvolutionSquare
+ - CLConvolutionKernel
+ - CLDerivative
+ - CLDerivativeKernel
+ - CLDilate
+ - CLDilateKernel
+ - CLEqualizeHistogram
+ - CLErode
+ - CLErodeKernel
+ - CLFastCorners
+ - CLFastCornersKernel
+ - CLGaussian3x3
+ - CLGaussian3x3Kernel
+ - CLGaussian5x5
+ - CLGaussian5x5HorKernel
+ - CLGaussian5x5VertKernel
+ - CLGaussianPyramid
+ - CLGaussianPyramidHalf
+ - CLGaussianPyramidOrb
+ - CLHarrisCorners
+ - CLHarrisScoreKernel
+ - CLHarrisScoreFP16Kernel
+ - CLHistogram
+ - CLHistogramKernel
+ - CLHOGOrientationBinningKernel
+ - CLHOGBlockNormalizationKernel
+ - CLHOGDetectorKernel
+ - CLHOGNonMaximaSuppressionKernel
+ - CLHOGDescriptor
+ - CLHOGDetector
+ - CLHOGGradient
+ - CLHOGMultiDetection
+ - CLIntegralImage
+ - CLIntegralImageKernel
+ - CLLaplacianReconstruct
+ - CLLaplacianPyramid
+ - CLMagnitude
+ - CLMagnitudePhaseKernel
+ - CLMedian3x3
+ - CLMedian3x3Kernel
+ - CLMinMaxLocation
+ - CLMinMaxLocationKernel
+ - CLNonLinearFilter
+ - CLNonLinearFilterKernel
+ - CLNonMaximaSuppression3x3
+ - CLNonMaximaSuppression3x3FP16Kernel
+ - CLNonMaximaSuppression3x3Kernel
+ - CLOpticalFlow
+ - CLPhase
+ - CLRemap
+ - CLRemapKernel
+ - CLScharr3x3
+ - CLScharr3x3Kernel
+ - CLSobel3x3
+ - CLSobel3x3Kernel
+ - CLSobel5x5
+ - CLSobel5x5HorKernel
+ - CLSobel5x5VertKernel
+ - CLSobel7x7
+ - CLSobel7x7HorKernel
+ - CLSobel7x7VertKernel
+ - CLThreshold
+ - CLThresholdKernel
+ - CLWarpAffine
+ - CLWarpAffineKernel
+ - CLWarpPerspective
+ - CLWarpPerspectiveKernel
+ - Deprecated Arm® Neon™ kernels / functions (if a kernel is used only by a function that is being deprecated, the kernel is deprecated along with it):
+ - NELocallyConnectedLayer
+ - NELocallyConnectedMatrixMultiplyKernel
+ - NEAbsoluteDifference
+ - NEAbsoluteDifferenceKernel
+ - NEAccumulate
+ - NEAccumulateKernel
+ - NEAccumulateSquared
+ - NEAccumulateSquaredKernel
+ - NEAccumulateWeighted
+ - NEAccumulateWeightedKernel
+ - NEAccumulateWeightedFP16Kernel
+ - NEBox3x3
+ - NEBox3x3Kernel
+ - NEBox3x3FP16Kernel
+ - NECannyEdge
+ - NEChannelCombine
+ - NEChannelCombineKernel
+ - NEChannelExtract
+ - NEChannelExtractKernel
+ - NEColorConvert
+ - NEColorConvertKernel
+ - NEConvolution3x3
+ - NEConvolutionRectangle
+ - NEConvolutionRectangleKernel
+ - NEConvolutionSquare
+ - NEConvolutionKernel
+ - NEDerivative
+ - NEDerivativeKernel
+ - NEDilate
+ - NEDilateKernel
+ - NEEqualizeHistogram
+ - NEErode
+ - NEErodeKernel
+ - NEFastCorners
+ - NEFastCornersKernel
+ - NEGaussian3x3
+ - NEGaussian3x3Kernel
+ - NEGaussian5x5
+ - NEGaussian5x5HorKernel
+ - NEGaussian5x5VertKernel
+ - NEGaussianPyramid
+ - NEGaussianPyramidHalf
+ - NEGaussianPyramidOrb
+ - NEHarrisCorners
+ - NEHarrisScoreKernel
+ - NEHarrisScoreFP16Kernel
+ - NEHistogram
+ - NEHistogramKernel
+ - NEHOGOrientationBinningKernel
+ - NEHOGBlockNormalizationKernel
+ - NEHOGDetectorKernel
+ - NEHOGNonMaximaSuppressionKernel
+ - NEHOGDescriptor
+ - NEHOGDetector
+ - NEHOGGradient
+ - NEHOGMultiDetection
+ - NEIntegralImage
+ - NEIntegralImageKernel
+ - NELaplacianReconstruct
+ - NELaplacianPyramid
+ - NEMagnitude
+ - NEMagnitudePhaseKernel
+ - NEMedian3x3
+ - NEMedian3x3Kernel
+ - NEMinMaxLocation
+ - NEMinMaxLocationKernel
+ - NENonLinearFilter
+ - NENonLinearFilterKernel
+ - NENonMaximaSuppression3x3
+ - NENonMaximaSuppression3x3FP16Kernel
+ - NENonMaximaSuppression3x3Kernel
+ - NEOpticalFlow
+ - NEPhase
+ - NERemap
+ - NERemapKernel
+ - NEScharr3x3
+ - NEScharr3x3Kernel
+ - NESobel3x3
+ - NESobel3x3Kernel
+ - NESobel5x5
+ - NESobel5x5HorKernel
+ - NESobel5x5VertKernel
+ - NESobel7x7
+ - NESobel7x7HorKernel
+ - NESobel7x7VertKernel
+ - NEThreshold
+ - NEThresholdKernel
+ - NEWarpAffine
+ - NEWarpAffineKernel
+ - NEWarpPerspective
+ - NEWarpPerspectiveKernel
+ - Deprecated GLES kernels / functions (if a kernel is used only by a function that is being deprecated, the kernel is deprecated along with it):
+ - GCAbsoluteDifference
+ - GCActivationLayer
+ - GCArithmeticAddition
+ - GCBatchNormalizationLayer
+ - GCConcatenateLayer
+ - GCConvolutionLayer
+ - GCDepthwiseConvolutionLayer
+ - GCDirectConvolutionLayer
+ - GCDropoutLayer
+ - GCFillBorder
+ - GCFullyConnectedLayer
+ - GCGEMM
+ - GCGEMMInterleave4x4
+ - GCGEMMTranspose1xW
+ - GCNormalizationLayer
+ - GCNormalizePlanarYUVLayer
+ - GCPixelWiseMultiplication
+ - GCPoolingLayer
+ - GCScale
+ - GCSoftmaxLayer
+ - GCTensorShift
+ - GCTranspose
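+
+The softmax axis semantics introduced in the v20.11 interface change above can be illustrated with a small NumPy sketch (illustrative only, not library code):
+
```python
import numpy as np

def softmax(x, axis):
    # Shift by the max for numerical stability, then normalise
    # along the requested dimension.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

# For an input of shape 4x5x6 and axis=1, softmax is applied to
# 4*6 = 24 vectors of size 5; valid axis values lie in [-rank, rank).
x = np.random.rand(4, 5, 6).astype(np.float32)
y = softmax(x, axis=1)
assert np.allclose(np.sum(y, axis=1), 1.0)
```
+
+Since the range is [-rank, rank), axis=-2 addresses the same dimension as axis=1 for a rank-3 input.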
+
+v20.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Added new data type QASYMM8_SIGNED support for:
+ - @ref CLArgMinMaxLayer
+ - @ref CLArgMinMaxLayerKernel
+ - Added new data type U8 support for:
+ - @ref NECropKernel
+ - CLCropKernel
+ - Added align_corner support for nearest neighbor interpolation in:
+ - NEScaleKernel
+ - CLScaleKernel
+ - New OpenCL kernels / functions:
+ - @ref CLMaxUnpoolingLayerKernel
+ - New Arm® Neon™ kernels / functions:
+ - NEMaxUnpoolingLayerKernel
+ - New graph example:
+ - graph_yolov3_output_detector
+ - GEMMTuner improvements:
+ - Added fp16 support
+ - Output json files for easier integration
+ - Enabled tuning for export_to_cl_image_rhs option for RHS tensors
+ - More robust script for running benchmarks
+ - Removed padding from:
+ - NEPixelWiseMultiplicationKernel
+ - NEHeightConcatenateLayerKernel
+ - NEThresholdKernel
+ - NEBatchConcatenateLayerKernel
+ - NETransposeKernel
+ - @ref NEBatchNormalizationLayerKernel
+ - NEArithmeticSubtractionKernel
+ - @ref NEBoundingBoxTransformKernel
+ - NELogits1DMaxKernel
+ - NELogits1DSoftmaxKernel
+ - @ref NEROIPoolingLayerKernel
+ - @ref NEROIAlignLayerKernel
+ - NEYOLOLayerKernel
+ - NEUpsampleLayerKernel
+ - NEFloorKernel
+ - NEWidthConcatenateLayerKernel
+ - NEDepthConcatenateLayerKernel
+ - @ref NENormalizationLayerKernel
+ - @ref NEL2NormalizeLayerKernel
+ - NEFillArrayKernel
+ - NEDepthConvertLayerKernel
+ - @ref NERangeKernel
+ - @ref NEPriorBoxLayer
+ - Removed OpenCL kernels / functions:
+ - CLGEMMLowpQuantizeDownInt32ToUint8Scale
+ - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
+ - Removed Arm® Neon™ kernels / functions:
+ - NEGEMMLowpQuantizeDownInt32ToUint8Scale
+ - NEGEMMMatrixAccumulateBiasesKernel
+ - Deprecated functions / interfaces:
+ - Non-descriptor based interfaces for NEThreshold, CLThreshold
+ - Non-descriptor based interfaces for @ref NEScale, @ref CLScale and GCScale
+ - In @ref NESoftmaxLayer, @ref NELogSoftmaxLayer, @ref CLSoftmaxLayer, @ref CLLogSoftmaxLayer and GCSoftmaxLayer:
+   The default "axis" value is changed from 1 to 0; only axis 0 is supported.
+ - The support for quantized data types has been removed from @ref CLLogSoftmaxLayer due to implementation complexity.
+ - Removed padding requirement for the input (e.g. LHS of GEMM) and output in CLGEMMMatrixMultiplyNativeKernel, CLGEMMMatrixMultiplyReshapedKernel, CLGEMMMatrixMultiplyReshapedOnlyRHSKernel and CLIm2ColKernel (NHWC only)
+   - This change makes it possible to use @ref CLGEMMConvolutionLayer without extra padding for the input and output.
+   - Only the weights/bias of @ref CLGEMMConvolutionLayer may require padding for the computation.
+   - @ref CLGEMMConvolutionLayer may still require padding on Arm® Mali™ Midgard GPUs, since CLGEMMMatrixMultiplyKernel is called and currently requires padding.
+ - Added support for exporting the OpenCL buffer object to the OpenCL image object in CLGEMMMatrixMultiplyReshapedKernel and CLGEMMMatrixMultiplyReshapedOnlyRHSKernel.
+   - This makes it possible to export the OpenCL buffer used for the reshaped RHS matrix to the OpenCL image object.
+   - The padding requirement for the OpenCL image object is taken into account in CLGEMMReshapeRHSMatrixKernel.
+ - The reshaped RHS matrix stores the weights when GEMM is used to accelerate CLGEMMConvolutionLayer.
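+
+The align_corners option added to the scale kernels above changes how an output pixel is mapped back to a source coordinate. The sketch below shows the two common conventions; the names and the exact transform used by NEScaleKernel/CLScaleKernel are assumptions for illustration, defined precisely in the library sources:
+
```python
def src_coord(dst, src_size, dst_size, align_corners):
    # Map an output pixel index back to a source coordinate.
    if align_corners:
        # The first and last samples of input and output coincide.
        return dst * (src_size - 1) / (dst_size - 1)
    # Half-pixel-centre convention when corners are not aligned.
    return (dst + 0.5) * src_size / dst_size - 0.5

# Upscaling a 4-pixel row to 8 pixels: with align_corners the last
# output sample maps exactly onto the last input sample (index 3).
assert src_coord(7, 4, 8, True) == 3.0
assert src_coord(0, 4, 8, True) == 0.0
```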
+
+v20.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r18b.
+ - Updated recommended gcc version to Linaro 6.3.1.
+ - Added Bfloat16 type support
+ - Added Bfloat16 support in:
+ - NEWeightsReshapeKernel
+ - NEConvolutionLayerReshapeWeights
+ - NEIm2ColKernel
+ - NEIm2Col
+ - NEDepthConvertLayerKernel
+ - @ref NEDepthConvertLayer
+ - @ref NEGEMMConvolutionLayer
+ - NEGEMMAssemblyDispatch
+ - Added new data type QASYMM8_SIGNED support for:
+ - @ref CLDirectConvolutionLayer
+ - @ref CLDeconvolutionLayer
+ - @ref CLDirectDeconvolutionLayer
+ - @ref CLGEMMDeconvolutionLayer
+ - CLGEMMLowpMatrixMultiplyReshapedKernel
+ - CLGEMMLowpQuantizeDownInt32ScaleKernel
+ - CLGEMMLowpQuantizeDownInt32ScaleByFloatKernel
+ - @ref CLReductionOperation
+ - @ref CLReduceMean
+ - @ref NEScale
+ - NEScaleKernel
+ - NEUpsampleLayer
+ - @ref NECast
+ - @ref NEReductionOperation
+ - @ref NEReduceMean
+ - @ref NEArgMinMaxLayer
+ - @ref NEDeconvolutionLayer
+ - NEGEMMLowpQuantizeDownInt32ScaleKernel
+ - @ref CPPBoxWithNonMaximaSuppressionLimit
+ - @ref CPPDetectionPostProcessLayer
+ - @ref CPPPermuteKernel
+ - @ref CPPPermute
+ - @ref CPPTopKVKernel
+ - @ref CPPTopKV
+ - @ref CPPUpsample
+ - @ref CPPUpsampleKernel
+ - New OpenCL kernels / functions:
+ - @ref CLQLSTMLayer
+ - @ref CLQLSTMLayerNormalizationKernel
+ - New Arm® Neon™ kernels / functions:
+ - @ref NEQLSTMLayer
+ - @ref NEQLSTMLayerNormalizationKernel
+ - Added HARD_SWISH support in:
+ - CLActivationLayerKernel
+ - NEActivationLayerKernel
+ - Deprecated OpenCL kernels / functions:
+ - CLGEMMLowpQuantizeDownInt32ToUint8Scale
+ - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFloat
+ - Deprecated Arm® Neon™ kernels / functions:
+ - NEGEMMLowpQuantizeDownInt32ToUint8Scale
+ - Removed CPP kernels / functions:
+ - CPPFlipWeightsKernel
+ - Removed PoolingLayerInfo constructors without Data Layout.
+ - Removed CLDepthwiseConvolutionLayer3x3
+ - Removed NEDepthwiseConvolutionLayerOptimized
+ - Added support for Winograd 3x3,4x4 on Arm® Neon™ FP16:
+ - @ref NEWinogradConvolutionLayer
+ - CpuWinogradConv2dTransformInputKernel
+ - CpuWinogradConv2dTransformOutputKernel
+ - CpuWinogradConv2dTransformWeightsKernel
+ - Added CLCompileContext
+ - Added Arm® Neon™ GEMM kernel with 2D window support
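+
+The Bfloat16 type added above keeps float32's sign bit and 8-bit exponent while shortening the mantissa to 7 bits, so a conversion can be sketched as bit-level truncation (a minimal sketch; production kernels typically use round-to-nearest-even rather than plain truncation):
+
```python
import struct

def f32_to_bf16_trunc(x):
    # Reinterpret the float32 bit pattern and keep the top 16 bits:
    # bfloat16 shares float32's sign and exponent, so truncation
    # only drops mantissa precision.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 16

def bf16_to_f32(b):
    # Widen back to float32 by zero-filling the low mantissa bits.
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]

assert f32_to_bf16_trunc(1.0) == 0x3F80   # 1.0 is exactly representable
assert abs(bf16_to_f32(f32_to_bf16_trunc(3.14159)) - 3.14159) < 0.02
```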
+
+v20.02.1 Maintenance release
+ - Added Android-NN build script.
+
+v20.02 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Added new data type QASYMM8_SIGNED support for:
+ - @ref CLDepthwiseConvolutionLayer
+ - CLDepthwiseConvolutionLayer3x3
+ - @ref CLGEMMConvolutionLayer
+ - CLGEMMLowpMatrixMultiplyCore
+ - CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+ - CLGEMMLowpMatrixMultiplyNativeKernel
+ - @ref NEActivationLayer
+ - NEComparisonOperationKernel
+ - @ref NEConvolutionLayer
+ - @ref NEDepthwiseConvolutionLayer
+ - NEDepthwiseConvolutionLayer3x3Kernel
+ - NEDirectConvolutionLayerOutputStageKernel
+ - @ref NEElementwiseComparison
+ - @ref NEElementwiseMax
+ - @ref NEElementwiseMin
+ - @ref NEElementwiseSquaredDiff
+ - @ref NEFullyConnectedLayer
+ - NEGEMMMatrixVectorMultiplyKernel
+ - @ref NEPixelWiseMultiplication
+ - @ref NEPoolingLayer
+ - @ref NEPReluLayer
+ - Added support for QSYMM8_PER_CHANNEL in:
+ - NEDepthwiseConvolutionLayer3x3Kernel
+ - Added support for split sizes in:
+ - @ref CLSplit
+ - @ref NESplit
+ - New OpenCL kernels / functions:
+ - @ref CLFill
+ - CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / CLGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
+ - New Arm® Neon™ kernels / functions:
+ - @ref NEFill
+ - NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPointKernel / NEGEMMLowpQuantizeDownInt32ToInt8ScaleByFixedPoint
+ - Deprecated Arm® Neon™ functions / interfaces:
+ - CLDepthwiseConvolutionLayer3x3
+ - NEDepthwiseConvolutionLayerOptimized
+ - PoolingLayerInfo constructors without Data Layout.
+ - Added support for quantization with multiplier greater than 1 on Arm® Neon™ and CL.
+ - Added support for quantized inputs of type QASYMM8_SIGNED and QASYMM8 to @ref CLQuantizationLayer.
+ - Added the ability to build bootcode for bare metal.
+ - Added support for generating synthetic QASYMM8 graphs.
+ - Added support for F16 datatype in VGG16.
+ - Removed pre-built binaries for GLES.
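+
+The support for quantization multipliers greater than 1 mentioned above relies on the usual fixed-point decomposition M = M0 * 2^shift with M0 in [0.5, 1): a multiplier above 1 simply yields a positive (left) shift. A sketch of this gemmlowp-style convention (function name is illustrative, not the library's API):
+
```python
import math

def quantize_multiplier(real_multiplier):
    # Decompose M = M0 * 2^shift with M0 in [0.5, 1), storing M0 as a
    # Q0.31 fixed-point integer. M > 1 produces a positive shift.
    m0, shift = math.frexp(real_multiplier)
    quantized = int(round(m0 * (1 << 31)))
    if quantized == (1 << 31):      # rounding pushed M0 up to 1.0
        quantized //= 2
        shift += 1
    return quantized, shift

q, s = quantize_multiplier(2.5)     # multiplier > 1 -> positive shift
assert s > 0
assert abs((q / (1 << 31)) * (2 ** s) - 2.5) < 1e-6
```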
+
+v19.11.1 Public maintenance release
+ - Fix offset calculation in NEReductionOperationKernel.
+ - Fix data layout in NEScaleKernel for nhwc.
+ - Retain configuration step data layout to avoid side-effects.
+ - Perform sqrt in double domain for L2 pooling.
+ - Fix output shape calculation for Reduce Mean.
+ - Restrict cases where optimized NEPadLayer runs.
+
+v19.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r17c.
+ - Deprecated OpenCL kernels / functions:
+ - CLDepthwiseConvolutionLayerReshapeWeightsGenericKernel
+ - CLDepthwiseIm2ColKernel
+ - CLDepthwiseSeparableConvolutionLayer
+ - CLDepthwiseVectorToTensorKernel
+ - CLDirectConvolutionLayerOutputStageKernel
+ - Deprecated Arm® Neon™ kernels / functions:
+ - NEDepthwiseWeightsReshapeKernel
+ - NEDepthwiseIm2ColKernel
+ - NEDepthwiseSeparableConvolutionLayer
+ - NEDepthwiseVectorToTensorKernel
+ - NEDepthwiseConvolutionLayer3x3
+ - New OpenCL kernels / functions:
+ - @ref CLInstanceNormalizationLayerKernel / @ref CLInstanceNormalizationLayer
+ - @ref CLDepthwiseConvolutionLayerNativeKernel to replace the old generic depthwise convolution (see Deprecated
+ OpenCL kernels / functions)
+ - @ref CLLogSoftmaxLayer
+ - New Arm® Neon™ kernels / functions:
+ - @ref NEBoundingBoxTransformKernel / @ref NEBoundingBoxTransform
+ - @ref NEComputeAllAnchorsKernel / NEComputeAllAnchors
+ - @ref NEDetectionPostProcessLayer
+ - @ref NEGenerateProposalsLayer
+ - @ref NEInstanceNormalizationLayerKernel / @ref NEInstanceNormalizationLayer
+ - @ref NELogSoftmaxLayer
+ - @ref NEROIAlignLayerKernel / @ref NEROIAlignLayer
+ - Added QASYMM8 support for:
+ - @ref CLGenerateProposalsLayer
+ - @ref CLROIAlignLayer
+ - @ref CPPBoxWithNonMaximaSuppressionLimit
+ - Added QASYMM16 support for:
+ - @ref CLBoundingBoxTransform
+ - Added FP16 support for:
+ - CLGEMMMatrixMultiplyReshapedKernel
+ - Added new data type QASYMM8_PER_CHANNEL support for:
+ - CLDequantizationLayer
+ - @ref NEDequantizationLayer
+ - Added new data type QSYMM8_PER_CHANNEL support for:
+ - @ref CLConvolutionLayer
+ - @ref NEConvolutionLayer
+ - @ref CLDepthwiseConvolutionLayer
+ - @ref NEDepthwiseConvolutionLayer
+ - Added FP16 mixed-precision support for:
+ - CLGEMMMatrixMultiplyReshapedKernel
+ - CLPoolingLayerKernel
+ - Added FP32 and FP16 ELU activation for:
+ - @ref CLActivationLayer
+ - @ref NEActivationLayer
+ - Added asymmetric padding support for:
+ - @ref CLDirectDeconvolutionLayer
+ - @ref CLGEMMDeconvolutionLayer
+ - @ref NEDeconvolutionLayer
+ - Added SYMMETRIC and REFLECT modes for @ref CLPadLayerKernel / @ref CLPadLayer.
+ - Replaced the calls to NECopyKernel and NEMemsetKernel with @ref NEPadLayer in @ref NEGenerateProposalsLayer.
+ - Replaced the calls to CLCopyKernel and CLMemsetKernel with @ref CLPadLayer in @ref CLGenerateProposalsLayer.
+ - Improved performance for CL Inception V3 - FP16.
+ - Improved accuracy for CL Inception V3 - FP16 by enabling FP32 accumulator (mixed-precision).
+ - Improved Arm® Neon™ performance by enabling fusing batch normalization with convolution and depth-wise convolution layer.
+ - Improved Arm® Neon™ performance for MobileNet-SSD by improving the output detection performance.
+ - Optimized @ref CLPadLayer.
+ - Optimized CL generic depthwise convolution layer by introducing @ref CLDepthwiseConvolutionLayerNativeKernel.
+ - Reduced memory consumption by implementing weights sharing.
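+
+The SYMMETRIC and REFLECT padding modes added to CLPadLayer above follow the usual conventions: REFLECT mirrors about the edge without repeating the edge element, while SYMMETRIC repeats it. NumPy implements the same two modes, which makes for a quick illustration:
+
```python
import numpy as np

a = np.array([1, 2, 3, 4])

# REFLECT mirrors without repeating the edge element;
# SYMMETRIC mirrors including the edge element.
reflect = np.pad(a, 2, mode='reflect')
symmetric = np.pad(a, 2, mode='symmetric')

assert list(reflect) == [3, 2, 1, 2, 3, 4, 3, 2]
assert list(symmetric) == [2, 1, 1, 2, 3, 4, 4, 3]
```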
+
+v19.08.1 Public maintenance release
+ - Fix offset calculation in NEReductionOperationKernel.
+ - Fix data layout in NEScaleKernel for nhwc.
+ - Retain configuration step data layout to avoid side-effects.
+ - Perform sqrt in double domain for L2 pooling.
+ - Fix output shape calculation for Reduce Mean.
+ - Fix broadcast CLPixelwiseMultiplication with 5D tensors
+
+v19.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Deprecated Arm® Neon™ functions
+ - NEDepthConcatenateLayer
+ - NEWidthConcatenateLayer
+ - Deprecated OpenCL kernels / functions
+ - CLDepthConcatenateLayer
+ - CLGEMMInterleave4x4Kernel / CLGEMMInterleave4x4
+ - CLGEMMTranspose1xWKernel / CLGEMMTranspose1xW
+ - CLWidthConcatenateLayer
+ - New Arm® Neon™ kernels / functions:
+ - @ref NEAbsLayer
+ - @ref NECast
+ - @ref NEElementwisePower
+ - @ref NELogLayer
+ - @ref NELSTMLayerQuantized
+ - @ref NENegLayer
+ - @ref NEPReluLayer
+ - @ref NESinLayer
+ - NEBatchConcatenateLayerKernel
+ - @ref NEDepthToSpaceLayerKernel / @ref NEDepthToSpaceLayer
+ - NEDepthwiseConvolutionLayerNativeKernel
+ - NEGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+ - @ref NEMeanStdDevNormalizationKernel / @ref NEMeanStdDevNormalizationLayer
+ - @ref NESpaceToDepthLayerKernel / @ref NESpaceToDepthLayer
+ - New OpenCL kernels / functions:
+ - @ref CLAbsLayer
+ - @ref CLElementwisePower
+ - @ref CLLogLayer
+ - @ref CLLSTMLayerQuantized
+ - @ref CLNegLayer
+ - @ref CLPReluLayer
+ - @ref CLSinLayer
+ - CLBatchConcatenateLayerKernel
+ - @ref CLDepthToSpaceLayerKernel / @ref CLDepthToSpaceLayer
+ - CLGEMMLowpMatrixMultiplyNativeKernel
+ - CLGEMMLowpQuantizeDownInt32ToInt16ScaleByFixedPointKernel
+ - CLGEMMMatrixMultiplyNativeKernel
+ - CLMeanStdDevNormalizationKernel /CLMeanStdDevNormalizationLayer
+ - @ref CLSpaceToDepthLayerKernel / @ref CLSpaceToDepthLayer
+ - New examples:
+ - neon_opticalflow
+ - cl_cache
+ - neon_permute
+ - Added support for FP16 in @ref NEDeconvolutionLayer
+ - Added support for FP16 in @ref CLDeconvolutionLayer
+ - Added support for REDUCE_MIN and REDUCE_MAX in @ref ReductionOperation
+ - Enable the fusion of batch normalization with convolution and depthwise convolution layer for FP32 in the graph API (OpenCL only)
+ - Added support for fusing activation function and broadcast addition with the matrix multiplication for FP32 (OpenCL only)
+ - Re-factored the depthwise convolution layer kernel on Arm® Neon™ for generic cases
+ - Added an optimized depthwise convolution layer kernel for 5x5 filters (Neon™ only)
+ - Added support for an OpenCL kernel cache, with an example showing how to load prebuilt OpenCL kernels from a binary cache file
+ - Altered @ref QuantizationInfo interface to support per-channel quantization.
+ - The CLDepthwiseConvolutionLayer3x3 will be included by @ref CLDepthwiseConvolutionLayer to accommodate future optimizations.
+ - The NEDepthwiseConvolutionLayerOptimized will be included by @ref NEDepthwiseConvolutionLayer to accommodate future optimizations.
+ - Removed inner_border_right and inner_border_top parameters from @ref CLDeconvolutionLayer interface
+ - Removed inner_border_right and inner_border_top parameters from @ref NEDeconvolutionLayer interface
+ - Optimized the Arm® Neon™ assembly kernel for GEMMLowp. The new implementation fuses the output stage and quantization with the matrix multiplication kernel
+
+v19.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+ - @ref NEBatchToSpaceLayerKernel / @ref NEBatchToSpaceLayer
+ - NEComplexPixelWiseMultiplicationKernel / @ref NEComplexPixelWiseMultiplication
+ - @ref NECropKernel / @ref NECropResize
+ - NEDepthwiseConvolutionAssemblyDispatch
+ - @ref NEFFTDigitReverseKernel
+ - @ref NEFFTRadixStageKernel
+ - @ref NEFFTScaleKernel
+ - NEGEMMLowpOffsetContributionOutputStageKernel
+ - NEHeightConcatenateLayerKernel
+ - @ref NESpaceToBatchLayerKernel / @ref NESpaceToBatchLayer
+ - @ref NEFFT1D
+ - @ref NEFFT2D
+ - @ref NEFFTConvolutionLayer
+ - New OpenCL kernels / functions:
+ - CLComplexPixelWiseMultiplicationKernel / @ref CLComplexPixelWiseMultiplication
+ - CLCropKernel / @ref CLCropResize
+ - @ref CLDeconvolutionReshapeOutputKernel
+ - @ref CLFFTDigitReverseKernel
+ - @ref CLFFTRadixStageKernel
+ - @ref CLFFTScaleKernel
+ - CLGEMMLowpMatrixMultiplyReshapedOnlyRHSKernel
+ - CLGEMMMatrixMultiplyReshapedOnlyRHSKernel
+ - CLHeightConcatenateLayerKernel
+ - @ref CLDirectDeconvolutionLayer
+ - @ref CLFFT1D
+ - @ref CLFFT2D
+ - @ref CLFFTConvolutionLayer
+ - @ref CLGEMMDeconvolutionLayer
+ - New OpenGLES kernels / functions:
+ - GCConcatenateLayer
+ - Deprecated functions/interfaces
+ - GCDepthConcatenateLayer
+ - NEWidthConcatenateLayer
+ - NEDepthConcatenateLayer
+ - CLWidthConcatenateLayer
+ - CLDepthConcatenateLayer
+ - CLGEMMInterleave4x4
+ - CLGEMMTranspose1xW
+ - Support different quantization info in CLConcatLayer.
+ - Added checks for previously unsupported cases where input/output tensors have different quantization information.
+ - Add FP16 support checks.
+ - Fix output quantization in CLDeptwiseConv3x3 when activation is fused.
+ - New graph examples:
+ - graph_convolution
+ - graph_fully_connected
+ - graph_depthwise_convolution
+ - Deepspeech v0.4.1
+ - Add support for QASYMM8 in NEArithmeticSubtractionKernel.
+ - Add support for QASYMM8 in NEPixelWiseMultiplicationKernel.
+ - Add support for QASYMM8 in NEDeconvolution.
+ - Add support for DequantizationLayer for Neon/CL.
+ - Add support for dilation in CLDepthwiseConvolution.
+ - Fuse offset contribution with the output stage when we use NEGEMMLowpMatrixMultiplyCore.
+ - Optimize CLDeconvolution.
+ - Add StackLayer to the graph API.
+ - Add support for "reflect" padding mode in NEPad.
+ - Winograd 7x7 NHWC on OpenCL.
+ - Rework CL ML layers to run exclusively on CL.
+ - Support different quantization info in PoolingLayer.
+ - Implement and test import memory interfaces.
+ - Added new tests and removed old ones.
+ - Various clang-tidy fixes.
+
+v19.02 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+ - @ref NETileKernel / @ref NETile
+ - @ref NEFuseBatchNormalizationKernel / @ref NEFuseBatchNormalization
+ - NEElementwiseOperationKernel
+ - @ref NEElementwiseMax
+ - @ref NEElementwiseMin
+ - @ref NEElementwiseSquaredDiff
+ - @ref NESelectKernel / @ref NESelect
+ - @ref NESplit
+ - @ref NESlice
+ - @ref NEUnstack
+ - @ref NEStridedSliceKernel / @ref NEStridedSlice
+ - NEElementwiseUnaryKernel
+ - @ref NERsqrtLayer
+ - @ref NEExpLayer
+ - @ref NEReverseKernel / @ref NEReverse
+ - @ref NEArgMinMaxLayer
+ - @ref NEStackLayerKernel / @ref NEStackLayer
+ - @ref NERangeKernel / @ref NERange
+ - @ref NEPadLayer
+ - NEMemsetKernel
+ - @ref NEGatherKernel / @ref NEGather
+ - @ref NEElementwiseComparison
+ - @ref NEElementwiseComparisonStatic
+ - NEComparisonOperationKernel
+ - @ref NEElementwiseDivision
+ - New OpenCL kernels / functions:
+ - @ref CLSelectKernel / @ref CLSelect
+ - @ref CLTileKernel / @ref CLTile
+ - @ref CLComparisonKernel / @ref CLComparison
+ - @ref CLArgMinMaxLayer
+ - @ref CLElementwiseMax
+ - @ref CLElementwiseMin
+ - @ref CLElementwiseSquaredDiff
+ - @ref CLStackLayerKernel / @ref CLStackLayer
+ - @ref CLReverse / @ref CLReverseKernel
+ - @ref CLRsqrtLayer
+ - @ref CLExpLayer
+ - CLElementWiseUnaryLayerKernel
+ - CLGEMMReshapeLHSMatrixKernel
+ - CLGEMMReshapeRHSMatrixKernel
+ - CLGEMMMatrixMultiplyReshapedKernel
+ - @ref CLRangeKernel / @ref CLRange
+ - @ref CLUnstack
+ - @ref CLGatherKernel / @ref CLGather
+ - CLGEMMLowpMatrixMultiplyReshapedKernel
+ - New CPP kernels / functions:
+ - @ref CPPDetectionOutputLayer
+ - @ref CPPTopKV / @ref CPPTopKVKernel
+ - Added new examples:
+ - graph_ssd_mobilenet.cpp
+ - graph_mobilenet_v2.cpp
+ - graph_resnet12.cpp
+ - graph_srcnn955.cpp
+ - graph_vgg_vdsr.cpp
+ - graph_inception_resnet_v1.cpp
+ - Added 4D tensor support to
+ - @ref NESoftmaxLayer
+ - Fused activation in @ref CLWinogradConvolutionLayer
+ - Extended @ref NEPermute to support more cases
+ - Added Neon™/SVE GEMM Hybrid kernels
+ - Added u8 and s8 hybrid assembly kernels
+ - Introduced GEMM strategy name in NEGEMMAssemblyWrapper
+ - Improved @ref CLTuner
+ - Fused the bias addition within @ref CLGEMM
+ - Added support for QASYMM8 LOGISTIC activation in @ref NEActivationLayer
+ - Added NHWC data layout support to:
+ - @ref NEScale for F16
+ - @ref CLNormalizationLayer IN_MAP_2D for FP32/FP16
+ - @ref NEL2NormalizeLayer for FP32/FP16
+ - @ref NENormalizationLayer IN_MAP_2D for FP32/FP16
+ - @ref CLROIAlignLayer
+ - @ref CLGenerateProposalsLayer
+ - Added QASYMM8 support to the following kernels:
+ - NEArithmeticAdditionKernel
+ - @ref NEScale
+ - Added new tests and improved validation and benchmarking suites.
+ - Deprecated functions/interfaces
+ - Usage of inner_border_right and inner_border_top has been deprecated in @ref CLDeconvolutionLayer and @ref NEDeconvolutionLayer
+
+v18.11 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - New Arm® Neon™ kernels / functions:
+ - @ref NEChannelShuffleLayer / @ref NEChannelShuffleLayerKernel
+ - @ref NEReduceMean
+ - @ref NEReorgLayer / @ref NEReorgLayerKernel
+ - @ref NEPriorBoxLayer / @ref NEPriorBoxLayerKernel
+ - NEUpsampleLayer / NEUpsampleLayerKernel
+ - NEYOLOLayer / NEYOLOLayerKernel
+ - New OpenCL kernels / functions:
+ - @ref CLBatchToSpaceLayer / @ref CLBatchToSpaceLayerKernel
+ - @ref CLBoundingBoxTransform / @ref CLBoundingBoxTransformKernel
+ - @ref CLComputeAllAnchorsKernel
+ - @ref CLGenerateProposalsLayer
+ - @ref CLNormalizePlanarYUVLayer / @ref CLNormalizePlanarYUVLayerKernel
+ - @ref CLReorgLayer / @ref CLReorgLayerKernel
+ - @ref CLSpaceToBatchLayer / @ref CLSpaceToBatchLayerKernel
+ - @ref CLPadLayer
+ - @ref CLReduceMean
+ - @ref CLPriorBoxLayer / @ref CLPriorBoxLayerKernel
+ - @ref CLROIAlignLayer / @ref CLROIAlignLayerKernel
+ - @ref CLSlice
+ - @ref CLSplit
+ - @ref CLStridedSlice / @ref CLStridedSliceKernel
+ - CLUpsampleLayer / CLUpsampleLayerKernel
+ - CLYOLOLayer / CLYOLOLayerKernel
+ - New CPP kernels / functions:
+ - @ref CPPBoxWithNonMaximaSuppressionLimit / @ref CPPBoxWithNonMaximaSuppressionLimitKernel
+ - Added the validate method in:
+ - @ref NEDepthConvertLayer
+ - @ref NEFloor / @ref CLFloor
+ - NEGEMMMatrixAdditionKernel
+ - @ref NEReshapeLayer / @ref CLReshapeLayer
+ - @ref CLScale
+ - Added new examples:
+ - graph_shufflenet.cpp
+ - graph_yolov3.cpp
+ - Added documentation on how to add a new function or kernel.
+ - Improved Doxygen documentation by adding a list of the existing functions.
+ - Added 4D tensor support to
+ - CLWidthConcatenateLayer
+ - CLFlattenLayer
+ - @ref CLSoftmaxLayer
+ - Added dot product support to CLDepthwiseConvolutionLayer3x3NHWCKernel for non-unit strides
+ - Added SVE support
+ - Fused batch normalization into convolution layer weights in @ref CLFuseBatchNormalization
+ - Fused activation in CLDepthwiseConvolutionLayer3x3NCHWKernel, CLDepthwiseConvolutionLayer3x3NHWCKernel and @ref NEGEMMConvolutionLayer
+ - Added NHWC data layout support to:
+ - @ref CLChannelShuffleLayer
+ - @ref CLDeconvolutionLayer
+ - @ref CLL2NormalizeLayer
+ - Added QASYMM8 support to the following kernels:
+ - CLScaleKernel
+ - NEDepthwiseConvolutionLayer3x3Kernel
+ - CLPixelWiseMultiplicationKernel
+ - Added FP16 support to the following kernels:
+ - CLDepthwiseConvolutionLayer3x3NHWCKernel
+ - NEDepthwiseConvolutionLayer3x3Kernel
+ - @ref CLNormalizePlanarYUVLayerKernel
+ - @ref CLWinogradConvolutionLayer (5x5 kernel)
+ - More tests added to both validation and benchmarking suites.
+
+v18.08 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Updated recommended NDK version to r17b.
+ - Removed support for QS8/QS16 data types.
+ - Added support for grouped convolution in @ref CLConvolutionLayer.
+ - Added NHWC data layout support to:
+ - NEDepthConcatenateLayer / CLDepthConcatenateLayer
+ - @ref NEWinogradConvolutionLayer / @ref CLWinogradConvolutionLayer
+ - @ref CLDepthwiseConvolutionLayer
+ - @ref CLDirectConvolutionLayer
+ - @ref CLConvolutionLayer
+ - @ref CLScale
+ - CLIm2ColKernel
+ - New Arm® Neon™ kernels / functions:
+ - @ref NERNNLayer
+ - New OpenCL kernels / functions:
+ - @ref CLArithmeticDivision
+ - Introduced prepare() stage support in the graph API for GLES.
+ - Added support for memory reuse when trying to allocate smaller CLTensors.
+ - Enabled NHWC execution on graph examples.
+ - Added JPEG accessor for validation purposes.
+ - Added validate methods to some kernels / functions.
+
+v18.05 Public major release
+ - Various bug fixes.
+ - Various optimisations.
+ - Major redesign of the interface for the Neon™ kernels implemented in assembly.
+ - Removed arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore / arm_compute::NEHGEMMAArch64FP16Kernel
+ - Added NEGEMMAssemblyWrapper and AssemblyKernelGlue which are used to execute assembly kernels in Neon™ functions.
+ - Minor changes to the CPUInfo type to make it compatible with the new assembly gemm interface.
+ - Moved Neon™ assembly kernels to the folder src/core/Neon/kernels/arm_gemm.
+ - Improved doxygen documentation.
+ - Improved memory management for layer transitions.
+ - Added support for NHWC data layout in tensors.
+ - Added NHWC data layout support to:
+ - @ref NEGEMMConvolutionLayer
+ - @ref NEDirectConvolutionLayer
+ - @ref NEPoolingLayer / @ref CLPoolingLayer
+ - @ref NEBatchNormalizationLayer / @ref CLBatchNormalizationLayer
+ - @ref NEDepthwiseConvolutionLayer
+ - @ref NEScale
+ - NEIm2Col
+ - Added support for dilated convolutions in @ref NEConvolutionLayer and @ref CLConvolutionLayer.
+ - New OpenCL kernels / functions:
+ - @ref CLChannelShuffleLayer / @ref CLChannelShuffleLayerKernel
+ - CLConvertFullyConnectedWeightsKernel / @ref CLConvertFullyConnectedWeights
+ - @ref CLCopy / CLCopyKernel
+ - @ref CLLSTMLayer
+ - @ref CLRNNLayer
+ - CLWidthConcatenateLayer / CLWidthConcatenateLayerKernel
+ - CLWinogradFilterTransformKernel / @ref CLWinogradConvolutionLayer
+ - CLWinogradInputTransformKernel / CLWinogradInputTransform
+ - New Arm® Neon™ kernels / functions:
+ - NEConvertFullyConnectedWeightsKernel / @ref NEConvertFullyConnectedWeights.
+ - Created the validate method in @ref CLDepthwiseConvolutionLayer.
+ - Beta and gamma are no longer mandatory arguments in @ref NEBatchNormalizationLayer and @ref CLBatchNormalizationLayer.
+ - Added depth multiplier support in @ref NEDepthwiseConvolutionLayer and @ref CLDepthwiseConvolutionLayer.
+ - Added broadcast multiply support in @ref NEPixelWiseMultiplication / NEPixelWiseMultiplicationKernel.
+ - Ported the MobileNet example to the NHWC data layout.
+ - Enabled Winograd method in @ref CLConvolutionLayer.
+ - Renamed NEWinogradLayer to @ref NEWinogradConvolutionLayer.
+ - Updated @ref NEWinogradConvolutionLayer to use highly optimised assembly kernels in src/core/Neon/kernels/arm_gemm.
+ - Added memory manager support in GLES functions.
+ - Major refactoring of the graph API.
+ - Added GLES backend in the graph API.
+ - Added support for the memory manager in the graph API.
+ - Enabled Winograd Convolution method in the graph API.
+ - Added support for grouped convolutions in the graph API.
+ - Replaced NEDeconvolutionLayerUpsampleKernel with NEScaleKernel in @ref NEDeconvolutionLayer.
+ - Added fast maths flag in @ref CLConvolutionLayer.
+ - Added new tests and benchmarks in validation and benchmark frameworks
+ - Merged the Activation layer with the Convolution layer (Neon™, CL, GLES)
+ - Added support for OpenCL 2.0 SVM
+ - Added support for importing memory into OpenCL tensors.
+ - Added the prepare() method to perform any one-off pre-processing before running the function.
+ - Added new examples:
+ - graph_inception_v4.cpp
+ - graph_resnext50.cpp
+ - Added memory measurement instrument for CL.
+
+v18.03 Public maintenance release
+ - Various bug fixes.
+ - Fixed bug in @ref NEActivationLayer
+ - Fix in @ref CLTuner when using batches.
+ - Updated recommended NDK version to r16b (And fixed warnings).
+ - Fixed bug in validation code.
+ - Added Inception v4 graph example.
+ - Renamed NEWinogradLayer.cpp to @ref NEWinogradConvolutionLayer
+
+v18.02 Public major release
+ - Various Arm® Neon™ / OpenCL / GLES optimisations.
+ - Various bug fixes.
+ - Changed default number of threads on big.LITTLE systems.
+ - Refactored examples and added:
+ - graph_mobilenet_qasymm8
+ - graph_resnet
+ - graph_squeezenet_v1_1
+ - Renamed @ref CLConvolutionLayer into @ref CLGEMMConvolutionLayer and created a new @ref CLConvolutionLayer to select the fastest convolution method.
+ - Renamed @ref NEConvolutionLayer into @ref NEGEMMConvolutionLayer and created a new @ref NEConvolutionLayer to select the fastest convolution method.
+ - Added in place support to:
+ - @ref CLActivationLayer
+ - @ref CLBatchNormalizationLayer
+ - Added QASYMM8 support to:
+ - @ref CLActivationLayer
+ - @ref CLDepthwiseConvolutionLayer
+ - @ref NEDepthwiseConvolutionLayer
+ - @ref NESoftmaxLayer
+ - Added FP16 support to:
+ - CLDepthwiseConvolutionLayer3x3
+ - @ref CLDepthwiseConvolutionLayer
+ - Added broadcasting support to NEArithmeticAddition / @ref CLArithmeticAddition / @ref CLPixelWiseMultiplication
+ - Added fused batch normalization and activation to @ref CLBatchNormalizationLayer and @ref NEBatchNormalizationLayer
+ - Added support for non-square pooling to @ref NEPoolingLayer and @ref CLPoolingLayer
+ - New OpenCL kernels / functions:
+ - CLDirectConvolutionLayerOutputStageKernel
+ - New Arm® Neon™ kernels / functions
+ - Added name() method to all kernels.
+ - Added support for Winograd 5x5.
+ - NEPermuteKernel / @ref NEPermute
+ - CpuWinogradConv2dTransformInputKernel / NEWinogradLayer
+ - CpuWinogradConv2dTransformOutputKernel / NEWinogradLayer
+ - CpuWinogradConv2dTransformWeightsKernel / NEWinogradLayer
+ - Renamed NEWinogradLayerKernel into NEWinogradLayerBatchedGEMMKernel
+ - New GLES kernels / functions:
+ - GCTensorShiftKernel / GCTensorShift
+
+v18.01 Public maintenance release
+ - Various bug fixes
+ - Added some of the missing validate() methods
+ - Added @ref CLDeconvolutionLayerUpsampleKernel / @ref CLDeconvolutionLayer @ref CLDeconvolutionLayerUpsample
+ - Added CLPermuteKernel / @ref CLPermute
+ - Added method to clean the programs cache in the CL Kernel library.
+ - Added GCArithmeticAdditionKernel / GCArithmeticAddition
+ - Added GCDepthwiseConvolutionLayer3x3Kernel / GCDepthwiseConvolutionLayer3x3
+ - Added GCNormalizePlanarYUVLayerKernel / GCNormalizePlanarYUVLayer
+ - Added GCScaleKernel / GCScale
+ - Added GCWeightsReshapeKernel / GCConvolutionLayer
+ - Added FP16 support to the following GLES compute kernels:
+ - GCCol2ImKernel
+ - GCGEMMInterleave4x4Kernel
+ - GCGEMMTranspose1xWKernel
+ - GCIm2ColKernel
+ - Refactored Arm® Neon™ Winograd (NEWinogradLayerKernel)
+ - Added NEDirectConvolutionLayerOutputStageKernel
+ - Added QASYMM8 support to the following Arm® Neon™ kernels:
+ - NEDepthwiseConvolutionLayer3x3Kernel
+ - @ref NEFillBorderKernel
+ - NEPoolingLayerKernel
+ - Added new examples:
+ - graph_cl_mobilenet_qasymm8.cpp
+ - graph_inception_v3.cpp
+ - gc_dc.cpp
+ - More tests added to both validation and benchmarking suites.
+
+v17.12 Public major release
+ - Most machine learning functions on OpenCL support the new data type QASYMM8
+ - Introduced logging interface
+ - Introduced an OpenCL timer
+ - Reworked GEMMLowp interface
+ - Added new Arm® Neon™ assembly kernels for GEMMLowp, SGEMM and HGEMM
+ - Added validation method for most Machine Learning kernels / functions
+ - Added new graph examples such as googlenet, mobilenet, squeezenet, vgg16 and vgg19
+ - Added sgemm example for OpenCL
+ - Added absolute difference example for GLES compute
+ - Added new tests and benchmarks in validation and benchmark frameworks
+ - Added new kernels / functions for GLES compute
+
+ - New OpenGL ES kernels / functions
+ - GCAbsoluteDifferenceKernel / GCAbsoluteDifference
+ - GCActivationLayerKernel / GCActivationLayer
+ - GCBatchNormalizationLayerKernel / GCBatchNormalizationLayer
+ - GCCol2ImKernel
+ - GCDepthConcatenateLayerKernel / GCDepthConcatenateLayer
+ - GCDirectConvolutionLayerKernel / GCDirectConvolutionLayer
+ - GCDropoutLayerKernel / GCDropoutLayer
+ - GCFillBorderKernel / GCFillBorder
+ - GCGEMMInterleave4x4Kernel / GCGEMMInterleave4x4
+ - GCGEMMMatrixAccumulateBiasesKernel / GCGEMMMatrixAdditionKernel / GCGEMMMatrixMultiplyKernel / GCGEMM
+ - GCGEMMTranspose1xWKernel / GCGEMMTranspose1xW
+ - GCIm2ColKernel
+ - GCNormalizationLayerKernel / GCNormalizationLayer
+ - GCPixelWiseMultiplicationKernel / GCPixelWiseMultiplication
+ - GCPoolingLayerKernel / GCPoolingLayer
+ - GCLogits1DMaxKernel / GCLogits1DShiftExpSumKernel / GCLogits1DNormKernel / GCSoftmaxLayer
+ - GCTransposeKernel / GCTranspose
+
+ - New Arm® Neon™ kernels / functions
+ - arm_compute::NEGEMMLowpAArch64A53Kernel / arm_compute::NEGEMMLowpAArch64Kernel / arm_compute::NEGEMMLowpAArch64V8P4Kernel / arm_compute::NEGEMMInterleavedBlockedKernel / arm_compute::NEGEMMLowpAssemblyMatrixMultiplyCore
+ - arm_compute::NEHGEMMAArch64FP16Kernel
+ - NEDepthwiseConvolutionLayer3x3Kernel / NEDepthwiseIm2ColKernel / NEGEMMMatrixVectorMultiplyKernel / NEDepthwiseVectorToTensorKernel / @ref NEDepthwiseConvolutionLayer
+ - NEGEMMLowpOffsetContributionKernel / NEGEMMLowpMatrixAReductionKernel / NEGEMMLowpMatrixBReductionKernel / NEGEMMLowpMatrixMultiplyCore
+ - NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / NEGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
+ - NEWinogradLayer / NEWinogradLayerKernel
+
+ - New OpenCL kernels / functions
+ - CLGEMMLowpOffsetContributionKernel / CLGEMMLowpMatrixAReductionKernel / CLGEMMLowpMatrixBReductionKernel / CLGEMMLowpMatrixMultiplyCore
+ - CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPointKernel / CLGEMMLowpQuantizeDownInt32ToUint8ScaleByFixedPoint
+
+ - New graph nodes for Arm® Neon™ and OpenCL
+ - graph::BranchLayer
+ - graph::DepthConvertLayer
+ - graph::DepthwiseConvolutionLayer
+ - graph::DequantizationLayer
+ - graph::FlattenLayer
+ - graph::QuantizationLayer
+ - graph::ReshapeLayer
+
+v17.10 Public maintenance release
+ - Bug fixes:
+ - Check the maximum local workgroup size supported by OpenCL devices
+ - Minor documentation updates (Fixed instructions to build the examples)
+ - Introduced a graph::GraphContext
+ - Added a few new Graph nodes, and support for branches and grouping.
+ - Automatically enable cl_printf in debug builds
+ - Fixed bare metal builds for armv7a
+ - Added AlexNet and cartoon effect examples
+ - Fixed library builds: libraries are no longer built as supersets of each other. (This means applications using the runtime part of the library now need to link against both libarm_compute_core and libarm_compute.)
+
+v17.09 Public major release
+ - Experimental Graph support: initial implementation of a simple stream API to easily chain machine learning layers.
+ - Memory Manager (@ref BlobLifetimeManager, @ref BlobMemoryPool, @ref ILifetimeManager, @ref IMemoryGroup, @ref IMemoryManager, @ref IMemoryPool, @ref IPoolManager, @ref MemoryManagerOnDemand, @ref PoolManager)
+ - New validation and benchmark frameworks (the Boost and Google frameworks were replaced by an in-house framework).
+ - Most machine learning functions support both 8-bit and 16-bit fixed point (QS8, QS16) for both Arm® Neon™ and OpenCL.
+ - New Arm® Neon™ kernels / functions:
+ - arm_compute::NEGEMMAssemblyBaseKernel arm_compute::NEGEMMAArch64Kernel
+ - NEDequantizationLayerKernel / @ref NEDequantizationLayer
+ - NEFloorKernel / @ref NEFloor
+ - @ref NEL2NormalizeLayerKernel / @ref NEL2NormalizeLayer
+ - NEQuantizationLayerKernel NEMinMaxLayerKernel / @ref NEQuantizationLayer
+ - @ref NEROIPoolingLayerKernel / @ref NEROIPoolingLayer
+ - @ref NEReductionOperationKernel / @ref NEReductionOperation
+ - NEReshapeLayerKernel / @ref NEReshapeLayer
+
+ - New OpenCL kernels / functions:
+ - CLDepthwiseConvolutionLayer3x3NCHWKernel CLDepthwiseConvolutionLayer3x3NHWCKernel CLDepthwiseIm2ColKernel CLDepthwiseVectorToTensorKernel CLDepthwiseWeightsReshapeKernel / CLDepthwiseConvolutionLayer3x3 @ref CLDepthwiseConvolutionLayer CLDepthwiseSeparableConvolutionLayer
+ - CLDequantizationLayerKernel / CLDequantizationLayer
+ - CLDirectConvolutionLayerKernel / @ref CLDirectConvolutionLayer
+ - CLFlattenLayer
+ - CLFloorKernel / @ref CLFloor
+ - CLGEMMTranspose1xW
+ - CLGEMMMatrixVectorMultiplyKernel
+ - @ref CLL2NormalizeLayerKernel / @ref CLL2NormalizeLayer
+ - CLQuantizationLayerKernel CLMinMaxLayerKernel / @ref CLQuantizationLayer
+ - @ref CLROIPoolingLayerKernel / @ref CLROIPoolingLayer
+ - @ref CLReductionOperationKernel / @ref CLReductionOperation
+ - CLReshapeLayerKernel / @ref CLReshapeLayer
+
+v17.06 Public major release
+ - Various bug fixes
+ - Added 8-bit fixed point (QS8) support to the various Arm® Neon™ machine learning kernels.
+ - Added unit tests and benchmarks (AlexNet, LeNet)
+ - Added support for sub tensors.
+ - Added infrastructure to provide GPU specific optimisation for some OpenCL kernels.
+ - Added @ref OMPScheduler (OpenMP) scheduler for Arm® Neon™
+ - Added @ref SingleThreadScheduler scheduler for Arm® Neon™ (For bare metal)
+ - User can specify their own scheduler by implementing the @ref IScheduler interface.
+ - New OpenCL kernels / functions:
+ - @ref CLBatchNormalizationLayerKernel / @ref CLBatchNormalizationLayer
+ - CLDepthConcatenateLayerKernel / CLDepthConcatenateLayer
+ - CLHOGOrientationBinningKernel CLHOGBlockNormalizationKernel, CLHOGDetectorKernel / CLHOGDescriptor CLHOGDetector CLHOGGradient CLHOGMultiDetection
+ - CLLocallyConnectedMatrixMultiplyKernel / CLLocallyConnectedLayer
+ - CLWeightsReshapeKernel / CLConvolutionLayerReshapeWeights
+ - New C++ kernels:
+ - CPPDetectionWindowNonMaximaSuppressionKernel
+ - New Arm® Neon™ kernels / functions:
+ - @ref NEBatchNormalizationLayerKernel / @ref NEBatchNormalizationLayer
+ - NEDepthConcatenateLayerKernel / NEDepthConcatenateLayer
+ - NEDirectConvolutionLayerKernel / @ref NEDirectConvolutionLayer
+ - NELocallyConnectedMatrixMultiplyKernel / NELocallyConnectedLayer
+ - NEWeightsReshapeKernel / NEConvolutionLayerReshapeWeights
+
+v17.05 Public bug fixes release
+ - Various bug fixes
+ - Ported the remaining functions to use accurate padding.
+ - The library no longer links against OpenCL (it uses dlopen / dlsym at runtime instead to determine whether OpenCL is available).
+ - Added "free" method to allocator.
+ - Minimum version of g++ required for armv7 Linux changed from 4.8 to 4.9
+
+v17.04 Public bug fixes release
+
+ The following functions have been ported to use the new accurate padding:
+ - CLColorConvertKernel
+ - CLEdgeNonMaxSuppressionKernel
+ - CLEdgeTraceKernel
+ - CLGaussianPyramidHorKernel
+ - CLGaussianPyramidVertKernel
+ - CLGradientKernel
+ - NEChannelCombineKernel
+ - NEFillArrayKernel
+ - NEGaussianPyramidHorKernel
+ - NEGaussianPyramidVertKernel
+ - NEHarrisScoreFP16Kernel
+ - NEHarrisScoreKernel
+ - NEHOGDetectorKernel
+ - NELogits1DMaxKernel
+ - NELogits1DShiftExpSumKernel
+ - NELogits1DNormKernel
+ - NENonMaximaSuppression3x3FP16Kernel
+ - NENonMaximaSuppression3x3Kernel
+
+v17.03.1 First Major public release of the sources
+ - Renamed the library to arm_compute
+ - New CPP target introduced for C++ kernels shared between Arm® Neon™ and CL functions.
+ - New padding calculation interface introduced and ported most kernels / functions to use it.
+ - New OpenCL kernels / functions:
+ - CLGEMMLowpMatrixMultiplyKernel / CLGEMMLowp
+ - New Arm® Neon™ kernels / functions:
+ - @ref NENormalizationLayerKernel / @ref NENormalizationLayer
+ - NETransposeKernel / @ref NETranspose
+ - NELogits1DMaxKernel, NELogits1DShiftExpSumKernel, NELogits1DNormKernel / @ref NESoftmaxLayer
+ - NEIm2ColKernel, NECol2ImKernel, NEConvolutionLayerWeightsReshapeKernel / @ref NEConvolutionLayer
+ - NEGEMMMatrixAccumulateBiasesKernel / @ref NEFullyConnectedLayer
+ - NEGEMMLowpMatrixMultiplyKernel / NEGEMMLowp
+
+v17.03 Sources preview
+ - New OpenCL kernels / functions:
+ - CLGradientKernel, CLEdgeNonMaxSuppressionKernel, CLEdgeTraceKernel / CLCannyEdge
+ - GEMM refactoring + FP16 support: CLGEMMInterleave4x4Kernel, CLGEMMTranspose1xWKernel, CLGEMMMatrixMultiplyKernel, CLGEMMMatrixAdditionKernel / @ref CLGEMM
+ - CLGEMMMatrixAccumulateBiasesKernel / @ref CLFullyConnectedLayer
+ - CLTransposeKernel / @ref CLTranspose
+ - CLLKTrackerInitKernel, CLLKTrackerStage0Kernel, CLLKTrackerStage1Kernel, CLLKTrackerFinalizeKernel / CLOpticalFlow
+ - @ref CLNormalizationLayerKernel / @ref CLNormalizationLayer
+ - CLLaplacianPyramid, CLLaplacianReconstruct
+ - New Arm® Neon™ kernels / functions:
+ - NEActivationLayerKernel / @ref NEActivationLayer
+ - GEMM refactoring + FP16 support (Requires armv8.2 CPU): NEGEMMInterleave4x4Kernel, NEGEMMTranspose1xWKernel, NEGEMMMatrixMultiplyKernel, NEGEMMMatrixAdditionKernel / @ref NEGEMM
+ - NEPoolingLayerKernel / @ref NEPoolingLayer
+
+v17.02.1 Sources preview
+ - New OpenCL kernels / functions:
+ - CLLogits1DMaxKernel, CLLogits1DShiftExpSumKernel, CLLogits1DNormKernel / @ref CLSoftmaxLayer
+ - CLPoolingLayerKernel / @ref CLPoolingLayer
+ - CLIm2ColKernel, CLCol2ImKernel, CLConvolutionLayerWeightsReshapeKernel / CLConvolutionLayer
+ - CLRemapKernel / CLRemap
+ - CLGaussianPyramidHorKernel, CLGaussianPyramidVertKernel / CLGaussianPyramid, CLGaussianPyramidHalf, CLGaussianPyramidOrb
+ - CLMinMaxKernel, CLMinMaxLocationKernel / CLMinMaxLocation
+ - CLNonLinearFilterKernel / CLNonLinearFilter
+ - New Arm® Neon™ FP16 kernels (Requires armv8.2 CPU)
+ - NEAccumulateWeightedFP16Kernel
+ - NEBox3x3FP16Kernel
+ - NENonMaximaSuppression3x3FP16Kernel
+
+v17.02 Sources preview
+ - New OpenCL kernels / functions:
+ - CLActivationLayerKernel / @ref CLActivationLayer
+ - CLChannelCombineKernel / CLChannelCombine
+ - CLDerivativeKernel / CLChannelExtract
+ - CLFastCornersKernel / CLFastCorners
+ - CLMeanStdDevKernel / CLMeanStdDev
+ - New Arm® Neon™ kernels / functions:
+ - HOG / SVM: NEHOGOrientationBinningKernel, NEHOGBlockNormalizationKernel, NEHOGDetectorKernel, NEHOGNonMaximaSuppressionKernel / NEHOGDescriptor, NEHOGDetector, NEHOGGradient, NEHOGMultiDetection
+ - NENonLinearFilterKernel / NENonLinearFilter
+ - Introduced a CLScheduler to manage the default context and command queue used by the runtime library and create synchronisation events.
+ - Switched all the kernels / functions to use tensors instead of images.
+ - Updated documentation to include instructions to build the library from sources.
+
+v16.12 Binary preview release
+ - Original release
+
+ */
+} // namespace arm_compute
diff --git a/docs/02_tests.dox b/docs/user_guide/tests.dox
index c46e1f5663..510a1967ae 100644
--- a/docs/02_tests.dox
+++ b/docs/user_guide/tests.dox
@@ -1,5 +1,5 @@
///
-/// Copyright (c) 2017-2020 Arm Limited.
+/// Copyright (c) 2017-2021 Arm Limited.
///
/// SPDX-License-Identifier: MIT
///
@@ -26,7 +26,7 @@ namespace arm_compute
namespace test
{
/**
-@page tests Validation and benchmarks tests
+@page tests Validation and Benchmarks
@tableofcontents
@@ -353,7 +353,7 @@ You can use the `--instruments` option to select one or more instruments to meas
`PMU` will try to read the CPU PMU events from the kernel (They need to be enabled on your platform)
-`MALI` will try to collect Mali hardware performance counters. (You need to have a recent enough Mali driver)
+`MALI` will try to collect Arm® Mali™ hardware performance counters. (You need to have a recent enough Arm® Mali™ driver)
`WALL_CLOCK_TIMER` will measure time using `gettimeofday`: this should work on all platforms.
@@ -371,7 +371,7 @@ To run the OpenCL precommit validation tests:
LD_LIBRARY_PATH=. ./arm_compute_validation --mode=precommit --filter="^CL.*"
-To run the NEON precommit benchmark tests with PMU and Wall Clock timer in miliseconds instruments enabled:
+To run the Arm® Neon™ precommit benchmark tests with the PMU and Wall Clock timer (in milliseconds) instruments enabled:
LD_LIBRARY_PATH=. ./arm_compute_benchmark --mode=precommit --filter="^NEON.*" --instruments="pmu,wall_clock_timer_ms" --iterations=10