aboutsummaryrefslogtreecommitdiff
path: root/src/core
AgeCommit message (Collapse)Author
3 daysarm_gemm: fix SVE check on fast mode kernels.HEADrelease_candidatemainDavid Mansell
SVE BF16 kernels need to check for svebf16(), not just bf16(). Change-Id: I89494aac40166eba59719bed9822194a48ac282d Signed-off-by: David Mansell <David.Mansell@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11520 Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
3 daysChange reorder implementation to be vector length agnostic for OHWIo8 reorderRadu Salavat
As the reorder kernel is called with WeightFormat OHWIo8 for hardware that does not support it e.g. vector length 128, adapt the test case and add kernel implementation for this edge case. This fixes the mismatching values that appear when OHWIo8 fixture was run with 128 vector length. Resolves: ONCPUML-1523, COMPMID-6281 Signed-off-by: Radu Salavat <radu.salavat@arm.com> Change-Id: Iaa1a3b486d1725a2d6031051aa544082c1bbe913 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11421 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
5 daysNew SME2 heuristics.David Mansell
Change-Id: I69aa973e61df950060807a31230a1edd91add498 Signed-off-by: David Mansell <David.Mansell@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11514 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
6 daysAdd fp16 and integer data type support for ScatterNd in GpuGunes Bayir
Resolves: COMPMID-6899 Change-Id: I3743f2c9e5c21e1ec9f4c81d08c148666afad33a Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11505 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-by: Sang Won Ha <sangwon.ha@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
7 daysDisable SME2 Gemmlowp s8f32 kernel selection in case results needs to be ↵Gunes Bayir
accumulated Similar to https://review.mlplatform.org/c/ml/ComputeLibrary/+/11500, s8f32 kernels do not support accumulate mode. This patch modifies the kernel selection and also adds more tests to stress these test cases better. Partially Resolves: COMPMID-6995 Change-Id: I40e19446c012eb7334e4511e254cce0d635aa234 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11503 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Radu Salavat <radu.salavat@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
10 daysDisable SME2 Gemm kernel selection in case results needs to be accumulatedGunes Bayir
SME2 kernels use a different accumulation buffer and destination tensor is not copied to this buffer as initial value, thus causing mismatches. This patch modifies the kernel selection algorithm such that it does not select SME2 kernels if accumulation is required. Resolves: COMPMID-6995 Change-Id: I82da3cba41729f938a046f26b41b63ff5716c02d Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11500 Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
10 daysAdd update/index/output (m+1)/2d/(m+n) support for CLScatterGunes Bayir
Resolves: COMPMID-6894, COMPMID-6896 Change-Id: I9d29fd3701a7e0f28d83f81a6c42a7234c2587c3 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11477 Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Ramy Elgammal <ramy.elgammal@arm.com> Dynamic-Fusion: Ramy Elgammal <ramy.elgammal@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
11 daysAdd padding to the shift and multipliers buffersPablo Marquez Tello
* All per-channel requantizing hybrid assembly kernels require these buffers to be padded. * Resolves MLCE-1255 Change-Id: I892b8ee9b31e079189ec72f3fc6da4ce5efda974 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11491 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
14 daysScatter GPU Kernel Implementation for 1D tensors.Mohammed Suhail Munshi
Resolves: [COMPMID-6891, COMPMID-6892] Change-Id: I5b094fff1bff4c4c59cc44f7d6beab0e40133d8e Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11394 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-04-16fix compilation errors on linux with gcc12Sunita Nadampalli
Signed-off-by: Sunita Nadampalli <nadampal@amazon.com> Change-Id: I21eca31d97d6e2ca8279adb9db65f11540e72689 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11396 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
2024-04-15Add s8f32 kernels and dynamic QuantizationInfoJonathan Deakin
- Add support for QASYMM_SIGNED*QASYMM8_SIGNED->F32 in CpuGemmLowpMatrixMultiplyCore - Add s8f32 kernel using existing s8->s32 kernels with a new DequantizeFloat OutputStage, the structure is similar to Requantize32 but the opposite way around. - Add SME s8f32 kernels with integrated support for DequantizeFloat. - Add scale to CpuGemmLowpOffsetContributionKernel. - Add virtual dequantize scale to gemm_common, only implemented for gemm_interleaved. - Update year to 2024 in generate_build_files. - Add dynamic flag to QuantizationInfo which signals to operators that it can change after configuration - Add support for dynamic quantization in NEGEMMLowpMatrixMultiplyCore - Add dynamic quantization fixture by extending GEMMLowpGenericMatrixMultiplyCoreValidationFixture - Add GEMMLowpDequantizedMatrixMultiplyValidationFixture - Store k (number of cols of A) rather than k_offset in the offset contribution kernels so that we can recompute it when the other offsets change relates to: ONCPUML-1444 MLINFSW-439 Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com> Co-authored-by: David Mansell <David.Mansell@arm.com> Change-Id: I58a3acf2c09289a303e52eea6b336a696a5bc8da Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11022 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-04-11Add SME2 implementation of softmax for FP16Gunes Bayir
In addition to the softmax kernel, this patch fixes minor issues in the fp32 implementation. Resolves: COMPMID-6920 Change-Id: Ibbd9f0af5f2a93fba0e92d72ba437279c34149d3 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11402 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2024-04-11Add in place summation to CPU GEMM kernelsRadu Salavat
Instead of dispatching the sum postop for GEMM kernels to a separate kernel + add, that requires an extra destination sized allocation, plus 3 extra load/stores per element, just do it in the GEMM kernel. Resolves: ONCPUML-1442 Signed-off-by: Radu Salavat <radu.salavat@arm.com> Co-authored-by: Milos Puzovic <milos.puzovic@arm.com> Change-Id: I7a1f2da3300875fa1ac88b705a34390969518077 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11298 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-04-05Fix compiler errorPablo Marquez Tello
* This fixes the GCC 12 compiler error: Assuming signed overflow does not occur when simplifying conditional to constant [-Werror=strict-overflow] * Resolves ARMCL-1130 Change-Id: I01e10ebca2dbfcd166c1f4128921953e31016038 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11381 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-03-27Added new NEON fixed format fast math mode hybrid kernel with maximum height ↵Milos Puzovic
of 6 for accumulation and updated heuristics Change-Id: Ib52ea6825e164f4a8b8422eab7991b50af0b0d7c Signed-off-by: Milos Puzovic <milos.puzovic@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11354 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-03-21[ONCPUML-1451] Add matmul kernel to enable bf16 to bf16 operations via ↵Renato Arantes
PyTorch® autocast() function The full range of tests must be added with [MLINFSW-482] epic due to the lack of reordering kernels implemented in Acl. Co-Authored-By: David Mansell <David.Mansell@arm.com> Change-Id: I820d316295a1ec94fdc89c37e4144a268f914c36 Signed-off-by: Renato Arantes <renato.arantes@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11169 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-03-20Make Cpu/Gpu/Ref scalar/vectoral S32 division consistentGunes Bayir
- Neon(TM) implementation converts integers to float and performs the division because there is no vector integer division instructions. However, leftover loop still uses integer division, which makes results inconsistent depending on where we are in the tensor. - SVE path does it in integer domain. - OpenCL(TM) does it similar to Neon(TM) vector path. - Reference implementation does it in integer domain. These differences cause intermittent mismatches. This patch ensures all follow the same logic. On the other hand, the provided Neon(TM) implementation is faster than the Fp32 converted version. Resolves: COMPMID-6925 Change-Id: Ia12606d57f40a7d331b9b698f87fd4321496b275 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11316 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-03-18Fix quant. gemv kernel driver by adding set_quantized_bias()Gunes Bayir
arm_gemm fuses the actual bias addition with the output stage in quantized gemm. The output stage, in its very basic form, is: A_offset * B_offset - sum(A_row_i) * B_offset - sum(B_col_j) * A_offset Matrix B is usually constant (e.g. weight matrix in convolutions). Therefore, except the middle term above, the expression is constant across the same output row because the column sums of matrix B are pre-calculated. The bias is also usually constant. When it is, it makes sense to add the bias vector to the above sum and just perform a single addition on top of the output tensor. For this to happen, the column sum computation of B tensor must account for the bias. This is ensured by set_quantized_bias() method in the interface. This function passes the bias pointer and strides to arm_gemm. Gemv_pretransposed does not implement set_quantized_bias() and uses the parent function, which does nothing. Therefore, the bias is not added to the output. This causes tests to fail. Resolves: COMPMID-6928 Change-Id: Iba24fabc65fdc47edb12db6abff2fb47784c0743 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11310 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
2024-03-14arm_gemm: Fix bias handling for sme2 FP16 GEMV.David Mansell
Resolves: COMPMID-6927 Signed-off-by: David Mansell <David.Mansell@arm.com> Change-Id: Ib426fdc11ddbdbd0028d64547f3eaf312ca5fcce Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11301 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
2024-03-12Fix WoA nightly failurePablo Marquez Tello
* Resolves COMPMID-6931 Change-Id: I3ed0c509807e26bddfcd20be71b12ec4cbb5cce6 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11277 Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2024-02-22Fix segfault in DWC in WoAPablo Marquez Tello
* Resolves MLCE-1219 Change-Id: If997180ec88c35d6af05a06c8c5ef95681e67c05 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11182 Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2024-02-22Fix OpenBSD® build failure caused by patch 11144Gunes Bayir
include of alloca.h should be guarded against _WIN64 and __OpenBSD__ Partially Resolves: COMPMID-6595 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Change-Id: I6a52ec129d92e290d033f75baeb4a598669daae0 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11180 Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2024-02-21Integrate new pretranspose_b_array with extra fused transpose of BGunes Bayir
This patch fuses the transposition taking place in Acl with the transformations done in arm_gemm (called pretranspose_b_array) if the underlying kernel and transform supports it. This should improve start-up time (as it's for constant Rhs matrices) and memory footprint. The transformations in arm_gemm are kernel specific. The Rhs matrix is transformed into certain layouts to improve the performance. Resolves: COMPMID-6595 Change-Id: Id2932dd966e59f903c279417bebcea83d9a42464 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11144 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-02-14Fix compiler errors in cl-clangPablo Marquez Tello
* cl-clang is used to build ACL natively in WoA * Resolves MLCE-1209 Change-Id: I040e84f526f16324138a074badf764ac099090e3 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11126 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-02-07Parallelize CPU depthwise over batch if only 1 rowJonathan Deakin
This patch also fixes a bug where the split dimension was wrong in CpuDepthwiseConv2dAssemblyDispatch::run. It was set to DimY, which is cols, but it should have been DimZ. This was rarely an issue in practice because typically the number of cols are greater than the number of threads anyway. Relates to: ONCPUML-1443 Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com> Change-Id: Ifed2fce22ddeb7cd77e6a6ae1083694427f91e04 Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11083 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2024-02-06arm_gemm: SME: Remove artificial single-thread constraint on quantized int8 ↵David Mansell
kernels. Change-Id: I81b71ecc0d2e776d132091e074798a79b3141bec Signed-off-by: David Mansell <David.Mansell@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11085 Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2024-01-25arm_gemm: convolution: optimize convolver.hpp.David Mansell
The code in convolver.hpp generates pointers into either the appropriate point in the input activation tensor or the padding buffer for each kernel point of each output point of the convolution. This is done at runtime interspersed with the data transform and matrix multiplication steps. As such, it can have a significant impact on performance, particularly for low input channel counts. This change improves the performance of this code by streamlining the checks for out of range input points (which must be directed to the padding buffer). The previous implementation checked all four borders for every point. The revised code does the checks one at a time, and for any failing check applies the result to as many output points as possible without repeating the other checks. Signed-off-by: David Mansell <David.Mansell@arm.com> Change-Id: I36a4fa114b425c1bcba2be40acf36718522519f5 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11004 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
2024-01-23Fix for unchecked return value detected in Coverity checks.Anitha Raj
Resolves: COMPMID-6753 Signed-off-by: Anitha Raj <anitha.raj@arm.com> Change-Id: I80df0479eb4c7cc2c5380df708844cc9ffdd2aed Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11001 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2024-01-18Fix divide-by-zero compilation errorViet-Hoa Do
* CONVERT_TO_TENSOR4D_STRUCT_NO_STEP is implemented and used in some CL kernels in the way that causes divide-by-zero issue. - Since the steps are all zeros, the issue might have been ignored by the compiler. Resolves: COMPMID-6795 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I0fb38fc62d63671b8abefa39b3d9b3ca6f49c7fe Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10967 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-01-17Fix minor issue, clean lut codeMohammed Suhail Munshi
Resolves: [COMPMID-6799] Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Change-Id: I47baeeea75f1d03609d1fa1e9a10d2f53d5694f7 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10969 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2024-01-12Fix potential threading issue in LUTManagerMohammed Suhail Munshi
- Locks pointer before checking for validity to prevent race condition Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Change-Id: I6872b10d058ee7f3707ba641f44bb6116e26880a Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10960 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-01-12[ONCPUML-1387] Add ACL based reorder for f32 to bf16 data type conversion.Renato Arantes
The reorders supported at the moment are: ab->BA4b4a ab->BA8b4a Co-Authored-By: David Mansell <David.Mansell@arm.com> Change-Id: Ic466465629ce3bcdcee0089e251485b79b60e1f3 Signed-off-by: Renato Arantes <renato.arantes@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10775 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2024-01-10Use look up table for fp16 activationMohammed Suhail Munshi
- Enables FP16 lut for logistic activation - Adds LUTManager to re-use lut where appropriate. Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Change-Id: I94667b63b452a8e58a1eb59cb0b5866178954523 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10864 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-12-22Fix nightly issue caused by gemm_reshaped_only_rhs_mmul kernelGunes Bayir
The issue appears when this kernel is used by convolution operators because the stride calculations consider only simple matrix multiplication. In conv2d triggered runs, Rhs does not have the same dimension as Lhs and Dst. Also, cases where Lhs and Dst are interpreted as 3d, where their X and Y dimensions (in convolution sense) are collapsed into one. Resolves: COMPMID-6764 Change-Id: If443e6eb8f7a5cca1acc58b37c598122a013e69b Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10913 Benchmark: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2023-12-22Add Mali™-G720 and Mali™-G620 as GpuTargetsGunes Bayir
This patch adds adds the latest Gpus as Gpu Target and sets up kernel selection heuristics for MatMul to address some nightly issues. Resolves: COMPMID-6766 Change-Id: I29dbb08c5ecfb3fcd63230b0b1675ab557074aca Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10902 Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-12-14Fix validation error in CL generate proposals kernelGunes Bayir
This fix modifies some of the conversions done in the generate proposals kernel that causes DDK issues while compiling the kernel. The issues are mostly related to conversion from i64 to fp16, and it doesn't affect fp32. Firstly, type identifier size_t is converted into unsigned int. But, this alone was compiling but causing mismatches, even in older devices, where it was passing before. Therefore, the fp16 conversion delayed until vector construction where the integers are now converted to fp32, and then fp16. This, although may not be ideal, seems like the best solution. Resolves: COMPMID-6756 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Change-Id: Iee61216c908fe51431985b80c3653fc32add4741 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10879 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2023-12-08Fix validation error in graph_ssd_mobilenetGunes Bayir
The graph example has fixed quantization information given for certain layers, and some of the offsets exceed the 8-bit range for Int8 data type. This shouldn't have been the case and the offsets should respect the 8-bit quantization specification laid out here: https://www.tensorflow.org/lite/performance/quantization_spec However, the mechanism added in the helper function introduces robustness in case of such irregularities with little/no cost; and therefore added as a fix. Resolves: COMPMID-6748 Change-Id: If39bf323382f109fa100ee2b87ce63cc7bc89759 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10858 Reviewed-by: SiCong Li <sicong.li@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2023-12-08Fix unit tests failing in CL/UNIT/TensorAllocatorGunes Bayir
The function pointer for clImportMemoryARM should be loaded in a portable way as recommended by Khronos® as outlined here: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#getting-opencl-api-extension-function-pointers using clGetExtensionFunctionAddressForPlatform() call. All extensions should ideally be loaded using the above mentioned function. Resolves: COMPMID-6732 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Change-Id: I482b6bde721267d5e8c08301e5780d28a9c5ba85 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10852 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2023-12-07Optimize CPU depth-to-spaceViet-Hoa Do
Resolves: COMPMID-6622 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: Ibac276618bdda125dcbb9c851c547f12739b15b4 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10749 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-12-05Optimize CpuSoftmaxKernel for axis=0Gunes Bayir
Implement a single kernel instead of having two consecutive ones. In the previous setup, one kernel was calculating the maximum value in the axis, and this maximum was being subtracted from each data while calculating the softmax, i.e. softmax(x_i) = exp(x_i - max) / sum_i( exp(x_i - max) ) This patch integrates these two stages into a single kernel for Neon™ for all data types. This will save some memory because we don't need to hold the max values in a separate auxiliary tensor. It also introduces some other optimizations that will ease memory pressure when the data type is float/half, by using the dst tensor as temporary storage for already exponentiated inputs. It removes the references to SVE and SVE2 implementations, and most of the associated files; but, it leaves the implementations as these may be used in the future. Resolves: COMPMID-6500 Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Change-Id: Icff9976d1214c4c6cbe15a62ca60b8a77d3784cc Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10688 Reviewed-by: SiCong Li <sicong.li@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-11-28Changes to enable FP16 in armv8a multi_isaPablo Marquez Tello
* This is the initial patch to start working on enabling fp16 in all multi_isa builds. More changes are required in the way we register the kernels using the macro REGISTER_FP16_NEON. * In this patch we add the capability to build the fp16 files in listed in filelist.json with the correct arch option to enable FP16 * This patch is required towards building an universal multi_isa binary where fp16 is enable. * Enable REGISTER_FP16_NEON macro for all builds by removing __ARM_FEATURE_FP16_VECTOR_ARITHMETIC guard from the macro definition. The macro has to be used across all types of builds. Change-Id: I99f4c273f6ee04cad3c097e5e374200f48568fa9 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10682 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-11-27BatchNorm changes to enable fp16 in armv8a multi_isa buildsPablo Marquez Tello
* Moved NCHW kernels fp16 and fp32 to their corresponding files src/cpu/kernels/fuse_batch_normalization/nchw/neon/fp16.cpp and src/cpu/kernels/fuse_batch_normalization/nchw/neon/fp32.cpp * Changes in filelist.json to include the new fp16 and fp32 files * Moved the template batch_normalization_nchw to impl.h as we need to instantiate it from fp16.cpp and fp32.cpp * Pooling layer: removed the guard __ARM_FEATURE_FP16_VECTOR_ARITHMETIC that prevented the FP16 kernel execution. * Partially resolves MLCE-1102 Change-Id: Ia8c85e9ffb76c9e387f9ae2685e5df5e52c8dc27 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10777 Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-11-16NormalizationLayer changes to enable fp16 in armv8a multi_isa buildsPablo Marquez Tello
* Moved the template arm_compute::normalize_float to impl.h because we need to instantiate it from both NENormalizationLayerKernel.cpp and src/cpu/kernels/norm_layer/generic/neon/fp16.cpp * Changes in filelist.json: added a new fp16.cpp file for the float16_t kernels * Replaced the guard __ARM_FEATURE_FP16_VECTOR_ARITHMETIC in NENormalizationLayerKernel by ARM_COMPUTE_ENABLE_FP16 so that the fp16 kernels can be compiled in for multi_isa builds * Moved fp32 kernels to the corresponding file src/cpu/kernels/norm_layer/generic/neon/fp32.cpp * Partially resolves MLCE-1102 Change-Id: I3f2eb2ed0b6c7f68092b17872b85082fbb5f39e2 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10739 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-11-15Fix various coverity issuesSiCong Li
Resolves COMPMID-6677 Signed-off-by: SiCong Li <sicong.li@arm.com> Change-Id: I99bf2385f6edc0836faacb31f5c66ed4fb051e40 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10729 Benchmark: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2023-11-15Fix device issue with CL softmaxViet-Hoa Do
* Performing the second pass in reverse order doesn't seem to work reliably in some specific devices. This patch introduces another approach to workaround the device issue. Resolves: COMPMID-6669 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I591f05ff06f8439ebe4d32093441ae871a292f4c Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10730 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: SiCong Li <sicong.li@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2023-11-09Remove duplicate definitions of BF16 fixed format kernels.David Mansell
Change-Id: Ie68b0a19040cc6b5bf47fca406989f39aa8d7b81 Signed-off-by: David Mansell <David.Mansell@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10687 Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2023-11-08Optimize CpuGemmConv2d start-up timeSiCong Li
When weight has no holes, we can replace CpuWeightsReshapeKernel with: - Collapse by reinterpreting weight's 3 spatial dimensions - Perform CpuTranspose For more details see the documentation in src/cpu/operators/CpuGemmConv2d.cpp This is one optimization since the CpuTranspose is better performing than CpuWeightsReshapeKernel A second optimization is to fuse this transpose with other weight transformations (e.g. pretranspose_B_array in CpuGemmAssemblyDispatch) However this second optimization depends on how the underlying gemm methods (the fall back path: CpuGemmMatrixMultiplyKernel or the assembly path: CpuGemmAssemblyDispatch) chooses to fuse the transpose. Therefore, this patch moves the transpose down from CpuGemmConv2d, to the individual gemm operators where the fusion decision needs to be made, by passing an extra "transpose_b" flag to CpuGemm New transpose_b flag in different scopes (they are all the same, but with different names because pretranspose_b has a different meaning in GemmAssemblyDispatch): GEMMInfo::pretranspose_B -> AsmGemmInfo::transpose_b New auxilliary tensors holding the transposed b result: - CpuGemm optimized path: CpuGemmAssemblyDispatch::PrePretransposedB - CpuGemm fallback path: CpuGemm::PreTransposedRHS Note that this patch does not yet have the second optimization (COMPMID-6595), but it prepares for it. Relates to COMPMID-6595 Resolves COMPMID-6499 Change-Id: I999a2da9da4b2b15369a3cc06d7872c86e0190ea Signed-off-by: SiCong Li <sicong.li@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10526 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Anitha Raj <Anitha.Raj@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-11-01Fix compilation error with clang and multi-isaViet-Hoa Do
* Fix SVE header being included in non-SVE file. Resolves: COMPMID-6613 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: Ic7f662a239b761b83e67e11b6cc03f7d5f5cd051 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10573 Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2023-10-31[GPU] Update Reverse layer to allow negative axis and reversed axis orderAdnan AlSinan
- Adds option to use negative axis and inverted axis. - Adds validation tests for the above. Resolves COMPMID-6459 Change-Id: I88afd845d078f92c82ec8529ce7241fccd4c417e Signed-off-by: Adnan AlSinan <adnan.alsinan@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10523 Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2023-10-31Fix SVE kernel using SVE2 instructionViet-Hoa Do
Resolves: COMPMID-6493 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I038d91ba266e1e8bf124336bcd272ec77e92038c Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10490 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Anitha Raj <Anitha.Raj@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>