Age | Commit message (Collapse) | Author |
|
- Neon(TM) implementation converts integers to float and performs the division because there are no vector integer division instructions. However, the leftover loop still uses integer division, which makes results inconsistent depending on where we are in the tensor.
- SVE path does it in integer domain.
- OpenCL(TM) does it similarly to the Neon(TM) vector path.
- Reference implementation does it in integer domain.
These differences cause intermittent mismatches. This patch ensures all follow the same logic.
On the other hand, the provided Neon(TM) implementation is faster than the Fp32 converted version.
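The kind of mismatch described above can be reproduced in scalar code. The following is a minimal sketch, assuming 32-bit integers, a float32 round trip and truncation back to an integer; the function names are illustrative and do not come from the library.

```cpp
#include <cassert>
#include <cstdint>

// Division carried out entirely in the integer domain.
int32_t div_int(int32_t a, int32_t b)
{
    assert(b != 0);
    return a / b;
}

// Division carried out by converting both operands to float32 first,
// then truncating the quotient back to an integer (assumed rounding).
int32_t div_via_float(int32_t a, int32_t b)
{
    assert(b != 0);
    const float q = static_cast<float>(a) / static_cast<float>(b);
    return static_cast<int32_t>(q);
}
```

For small operands the two paths agree, but an operand such as 16777217 (2^24 + 1) is not representable in float32, so the float path returns a different quotient than the integer path.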
Resolves: COMPMID-6925
Change-Id: Ia12606d57f40a7d331b9b698f87fd4321496b275
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11316
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
* Perform final sum in fp32 to avoid overflow
* Resolves ARMCL-1128
Change-Id: I89799baf81045697f7bc44017fcb6a440635caff
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11311
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
arm_gemm fuses the actual bias addition with the output stage in quantized gemm.
The output stage, in its very basic form, is:
A_offset * B_offset - sum(A_row_i) * B_offset - sum(B_col_j) * A_offset
Matrix B is usually constant (e.g. weight matrix in convolutions). Therefore, except the middle term above, the expression is constant across the same output row because the column sums of matrix B are pre-calculated.
The bias is also usually constant. When it is, it makes sense to add the bias vector to the above sum and just perform a single addition on top of the output tensor.
For this to happen, the column sum computation of the B tensor must account for the bias. This is ensured by the set_quantized_bias() method in the interface, which passes the bias pointer and strides to arm_gemm.
Gemv_pretransposed does not implement set_quantized_bias() and uses the parent function, which does nothing. Therefore, the bias is not added to the output. This causes tests to fail.
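The idea of folding a constant bias into the pre-computed column term can be sketched as below. This is only an illustration under stated assumptions: it uses the textbook expansion of sum_k (A[i][k] - a_off) * (B[k][j] - b_off), in which the constant term carries a factor K that the simplified expression above leaves implicit, and all names (Matrix, fold_bias_into_col_terms, gemm_with_folded_bias, gemm_reference) are hypothetical rather than arm_gemm APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<int32_t>>; // row-major, small sizes only

// Per-column constant: K*a_off*b_off - a_off*sum_k(B[k][j]) + bias[j].
// It can be computed once when B and the bias are constant.
std::vector<int32_t> fold_bias_into_col_terms(const Matrix &B, int32_t a_off, int32_t b_off,
                                              const std::vector<int32_t> &bias)
{
    const std::size_t K = B.size();
    const std::size_t N = bias.size();
    std::vector<int32_t> col_terms(N);
    for (std::size_t j = 0; j < N; ++j)
    {
        int32_t col_sum = 0;
        for (std::size_t k = 0; k < K; ++k)
        {
            col_sum += B[k][j];
        }
        col_terms[j] = static_cast<int32_t>(K) * a_off * b_off - a_off * col_sum + bias[j];
    }
    return col_terms;
}

// Output stage: raw accumulator - b_off*row_sum(A_i) + folded column term.
Matrix gemm_with_folded_bias(const Matrix &A, const Matrix &B, int32_t b_off,
                             const std::vector<int32_t> &col_terms)
{
    const std::size_t M = A.size(), K = B.size(), N = col_terms.size();
    Matrix out(M, std::vector<int32_t>(N, 0));
    for (std::size_t i = 0; i < M; ++i)
    {
        assert(A[i].size() == K);
        int32_t row_sum = 0;
        for (std::size_t k = 0; k < K; ++k)
        {
            row_sum += A[i][k];
        }
        for (std::size_t j = 0; j < N; ++j)
        {
            int32_t acc = 0;
            for (std::size_t k = 0; k < K; ++k)
            {
                acc += A[i][k] * B[k][j];
            }
            out[i][j] = acc - b_off * row_sum + col_terms[j];
        }
    }
    return out;
}

// Reference: offsets applied element by element, bias added separately.
Matrix gemm_reference(const Matrix &A, const Matrix &B, int32_t a_off, int32_t b_off,
                      const std::vector<int32_t> &bias)
{
    const std::size_t M = A.size(), K = B.size(), N = bias.size();
    Matrix out(M, std::vector<int32_t>(N, 0));
    for (std::size_t i = 0; i < M; ++i)
    {
        for (std::size_t j = 0; j < N; ++j)
        {
            int32_t acc = 0;
            for (std::size_t k = 0; k < K; ++k)
            {
                acc += (A[i][k] - a_off) * (B[k][j] - b_off);
            }
            out[i][j] = acc + bias[j];
        }
    }
    return out;
}
```

If the column terms are computed without the bias, as happens when the bias is never passed down, the output misses the bias entirely, which matches the failure mode described in the message.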
Resolves: COMPMID-6928
Change-Id: Iba24fabc65fdc47edb12db6abff2fb47784c0743
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11310
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
|
|
Resolves: COMPMID-6927
Signed-off-by: David Mansell <David.Mansell@arm.com>
Change-Id: Ib426fdc11ddbdbd0028d64547f3eaf312ca5fcce
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11301
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
|
|
* Validate output shape in CpuPool2dAssemblyWrapperKernel
* Resolves ARMCL-625
Change-Id: I4fd91c1b15ecb17efc39fd3e82a92210e4f182b2
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11290
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
Resolves: COMPMID-6501
Signed-off-by: Omar Al Khatib <omar.alkhatib@arm.com>
Change-Id: I0abd3cbb5f861301f407c443988fb7efaa205b5d
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11056
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
* Resolves COMPMID-6931
Change-Id: I3ed0c509807e26bddfcd20be71b12ec4cbb5cce6
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11277
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
Indirect GEMM uses optimized assembly path while Direct Conv uses the fallback Acl kernel for convolution.
In certain cases, where the input tensor is large and filter size is greater than 7 (e.g. 9x9 filters), heuristics fall back to Direct Conv algorithm where it could still prefer the assembly path if the data layout is NHWC. This is more important when SME2 kernels are present.
Resolves: COMPMID-6900
Change-Id: Ia611c975eee0423615113fcaeaa8f9eef0421456
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11254
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Anitha Raj <Anitha.Raj@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
* Resolves ARMCL-1129
Change-Id: I3e4e08d5ec401a274912c09674ef4a3245d65489
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11242
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
Fix the performance regression in CpuGemmConv2d caused by importing memory at every run for fixed-format kernels. This is done by adding a bypass_import parameter to the aux. tensor handler class (CpuAuxTensorHandler) and using it in CpuGemmConv2d so that the memory import happens if and only if the associated tensor is used in the gemm pack.
Also, improve the documentation of CpuAuxTensorHandler.
Resolves: ARMCL-1126
Co-authored by: SiCong Li <sicong.li@arm.com>
Change-Id: Idb26bdb2d19419074a6e7f2497a1741ae200603f
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11240
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
* This fixes the failure in the unit test CPU/UNIT/Context/CpuCapabilities.
* Resolves MLCE-1221
Change-Id: Ib5b3e8a7276939f6644783550caa245ee3f4fe7b
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11235
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
|
|
* Resolves MLCE-1219
Change-Id: If997180ec88c35d6af05a06c8c5ef95681e67c05
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11182
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
include of alloca.h should be guarded against _WIN64 and __OpenBSD__
Partially Resolves: COMPMID-6595
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: I6a52ec129d92e290d033f75baeb4a598669daae0
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11180
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
This patch fuses the transposition taking place in Acl with the transformations done in arm_gemm (called pretranspose_b_array) if the underlying kernel and transform support it. This should improve start-up time (as it's for constant Rhs matrices) and memory footprint. The transformations in arm_gemm are kernel specific. The Rhs matrix is transformed into certain layouts to improve the performance.
Resolves: COMPMID-6595
Change-Id: Id2932dd966e59f903c279417bebcea83d9a42464
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11144
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
Resolves: [COMPMID-6681]
Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>
Change-Id: I325b9d478dd1d04a45533bb7708cf76e98ee0cee
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11058
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
* cl-clang is used to build ACL natively in WoA
* Resolves MLCE-1209
Change-Id: I040e84f526f16324138a074badf764ac099090e3
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11126
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
Incorrect conditional meant that we were parallelizing over batches when
we should have been parallelizing over rows.
Relates to: ONCPUML-1443 COMPMID-6875
Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com>
Change-Id: I61d43bb2a94e8a6887d4cc5d1ae2ebb03295dff7
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11120
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
* Resolves ARMCL-1123
Change-Id: I4f8432ba41fa50bf787fb068c3672ac06b858bdd
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11117
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
Gpu code in dynamic fusion is now written by stable CKW. We do not need the CKW prototype and the older writer implementation, i.e. TemplateWriter.
It also removes the need for the flag -DACL_INTERNAL_TEST_CKW_IN_DF to compile and test dynamic fusion operator.
Resolves: COMPMID-6715
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: I9f9453311e79d9be612bd4754240d832f98503e8
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11116
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
Tanh in dynamic fusion is a simple operator with no A and B coefficients, as its public interface implies. Tanh operator follows the TOSA specification.
Customization of tanh calculation with a and b can be achieved via fusion as below:
out = a * tanh(b * in) -->
x = b * in
y = tanh(x)
out = a * y
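The decomposition above can be checked numerically with a scalar sketch. The function name is illustrative; each statement stands in for one fused elementwise operator and is not the dynamic fusion API.

```cpp
#include <cassert>
#include <cmath>

// The three steps from the message, one elementwise operator each.
float scaled_tanh_fused(float in, float a, float b)
{
    const float x = b * in;       // elementwise multiply
    const float y = std::tanh(x); // plain tanh, no coefficients
    return a * y;                 // elementwise multiply
}
```

Calling scaled_tanh_fused(in, a, b) gives the same value as evaluating a * tanh(b * in) directly, up to floating-point rounding.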
Resolves: COMPMID-6873
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: I818765192f631ae82c2094b0fc376fb87bae4fa4
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11109
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
Softmax and Reshape operators are not supported in CKW. This patch marks them as not supported because we're deprecating template writer.
Resolves: COMPMID-6872
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: Ied97e5d21297c28be120c62760d33e6e832dd3b8
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11107
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
This patch also fixes a bug where the split dimension was wrong in
CpuDepthwiseConv2dAssemblyDispatch::run. It was set to DimY, which is
cols, but it should have been DimZ. This was rarely an issue in practice
because typically the number of cols is greater than the number of
threads anyway.
Relates to: ONCPUML-1443
Co-authored-by: Milos Puzovic <Milos.Puzovic@arm.com>
Change-Id: Ifed2fce22ddeb7cd77e6a6ae1083694427f91e04
Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11083
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
kernels.
Change-Id: I81b71ecc0d2e776d132091e074798a79b3141bec
Signed-off-by: David Mansell <David.Mansell@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11085
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
CpuGemmLowpMatrixBReductionKernel::run_internal randomly segfaults
because it reads out of bounds with vloadq. This doesn't trigger with
the unit tests because the read isn't out of bounds for the process, but
it can be seen clearly by running the following in debug mode
./examples/neon_gemm_qasymm8 1 1 1
The vloadq at src/cpu/kernels/CpuGemmLowpMatrixReductionKernel.cpp:353
accesses a quadword even though the input is a single byte.
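A common way to avoid this class of over-read is to use the wide load only while a full block remains and to finish with byte-wise accesses. The sketch below is portable C++ in which an inner loop stands in for the 16-byte vector load; it illustrates the pattern only and is not the change made to the kernel. The names sum_bytes and kBlock are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlock = 16; // bytes covered by one quadword load

// Sums `len` bytes. Full 16-byte blocks stand in for the vector path;
// the remainder is read one byte at a time so no access goes past `len`.
int32_t sum_bytes(const uint8_t *data, std::size_t len)
{
    assert(data != nullptr || len == 0);
    int32_t     sum = 0;
    std::size_t i   = 0;
    for (; i + kBlock <= len; i += kBlock)
    {
        for (std::size_t lane = 0; lane < kBlock; ++lane)
        {
            sum += data[i + lane];
        }
    }
    for (; i < len; ++i)
    {
        sum += data[i];
    }
    return sum;
}
```

With a single-byte input, as in the 1 1 1 example above, the block loop never runs and only one byte is read.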
relates to: ONCPUML-1444 MLINFSW-439 COMPMID-6844
Change-Id: I2ae5260c9f38d6d8149a6bcd5dc146b911209784
Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10966
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
- Refactor all kernels to work with the CKW stable API
- Add support for sub-tile in the op_load/op_store CKW operator
- Fix mismatch in resize
- Add comments in all kernels written with CKW to help developers
understand the structure of the code
- Add texture image support in depthwise convolution written with CKW
- Add support for different block sizes in depthwise convolution
- Remove the use of the dynamic fusion helper functions.
- Add support for floor in the op_unary() of CKW
Resolves: COMPMID-6708, COMPMID-6743, COMPMID-6530
Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
Change-Id: I8104ce4d04a3138a1aeb0b84940e1f1c89e76069
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10914
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
The code in convolver.hpp generates pointers into either the
appropriate point in the input activation tensor or the padding buffer
for each kernel point of each output point of the convolution. This is
done at runtime interspersed with the data transform and matrix
multiplication steps. As such, it can have a significant impact on
performance, particularly for low input channel counts.
This change improves the performance of this code by streamlining the
checks for out of range input points (which must be directed to the
padding buffer). The previous implementation checked all four borders
for every point. The revised code does the checks one at a time, and
for any failing check applies the result to as many output points as
possible without repeating the other checks.
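A simplified one-dimensional analogue of this idea is sketched below. It is not the code in convolver.hpp: the names, the use of -1 as a stand-in for the padding buffer, and the restriction to a single row and a single kernel column kx are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kPadding = -1; // marker meaning "point at the padding buffer"

// Baseline: test both borders for every output point.
std::vector<int> map_columns_per_point(int out_w, int in_w, int stride, int pad, int kx)
{
    std::vector<int> cols(out_w);
    for (int ox = 0; ox < out_w; ++ox)
    {
        const int ix = ox * stride - pad + kx;
        cols[ox]     = (ix >= 0 && ix < in_w) ? ix : kPadding;
    }
    return cols;
}

// Streamlined: work out once where each border stops failing, then fill
// the in-range span without any further checks.
std::vector<int> map_columns_by_range(int out_w, int in_w, int stride, int pad, int kx)
{
    assert(stride > 0 && out_w >= 0);
    std::vector<int> cols(out_w, kPadding);
    const int left  = pad - kx;        // ix >= 0   <=>  ox * stride >= left
    const int right = in_w + pad - kx; // ix < in_w <=>  ox * stride <  right
    const int first = std::min(out_w, left <= 0 ? 0 : (left + stride - 1) / stride);
    const int end   = std::min(out_w, right <= 0 ? 0 : (right + stride - 1) / stride);
    for (int ox = first; ox < end; ++ox)
    {
        cols[ox] = ox * stride - pad + kx;
    }
    return cols;
}
```

Both functions produce the same mapping; the second one replaces the per-point comparisons with two range computations per row.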
Signed-off-by: David Mansell <David.Mansell@arm.com>
Change-Id: I36a4fa114b425c1bcba2be40acf36718522519f5
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11004
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
|
|
Resolves: COMPMID-6746
Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Change-Id: I96c158820469af3e54dca0c5909c888106eb1940
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11005
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
Resolves: COMPMID-6753
Signed-off-by: Anitha Raj <anitha.raj@arm.com>
Change-Id: I80df0479eb4c7cc2c5380df708844cc9ffdd2aed
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/11001
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
* The tensor info objects created by calling create_tensor_info
are now solely owned by the context object. The user only receives
pointers to those objects.
- Internally pointers to tensor info objects are used in various
places. It's safer for dynamic fusion to manage these objects
directly rather than relying on the users.
- The validation test is updated to use the modified API.
* Make various changes in dynamic fusion API to make it more
friendly (e.g. making some of the objects moveable).
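The ownership model described above can be sketched with standard smart pointers. SketchContext and SketchTensorInfo are hypothetical stand-ins, not the dynamic fusion classes; only the method name create_tensor_info is taken from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tensor info object.
struct SketchTensorInfo
{
    int id;
};

// Hypothetical stand-in for the context that owns every tensor info.
class SketchContext
{
public:
    SketchContext() = default;
    // Moveable but not copyable: the owned objects must have one owner.
    SketchContext(SketchContext &&)            = default;
    SketchContext &operator=(SketchContext &&) = default;

    // The context keeps ownership; the caller receives a non-owning pointer
    // that stays valid while the context (or its move target) is alive.
    SketchTensorInfo *create_tensor_info()
    {
        const int next_id = static_cast<int>(_infos.size());
        _infos.push_back(std::make_unique<SketchTensorInfo>(SketchTensorInfo{next_id}));
        return _infos.back().get();
    }

    std::size_t num_tensor_infos() const
    {
        return _infos.size();
    }

private:
    std::vector<std::unique_ptr<SketchTensorInfo>> _infos;
};
```

Because each object lives behind its own heap allocation, the pointers handed out remain stable when more objects are created or when the context is moved.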
Partially resolves: COMPMID-6707
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Change-Id: Ifee70e53c05f8e7b72bf9ef123701ff291c5ee80
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10990
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
* CONVERT_TO_TENSOR4D_STRUCT_NO_STEP is implemented and used
in some CL kernels in a way that causes a divide-by-zero issue.
- Since the steps are all zeros, the issue might have been
ignored by the compiler.
Resolves: COMPMID-6795
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Change-Id: I0fb38fc62d63671b8abefa39b3d9b3ca6f49c7fe
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10967
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
Resolves: [COMPMID-6799]
Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>
Change-Id: I47baeeea75f1d03609d1fa1e9a10d2f53d5694f7
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10969
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
- Locks pointer before checking for validity to prevent race condition
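A generic illustration of this pattern, assuming the pointer in question behaves like a std::weak_ptr (the message does not say which type is involved), is shown below. The function name is hypothetical.

```cpp
#include <cassert>
#include <memory>

// Racy pattern (shown only as a comment):
//   if (!weak.expired()) { use(*weak.lock()); } // object may die in between
//
// Safer pattern: lock first, then test the resulting shared_ptr.
int read_or_default(const std::weak_ptr<int> &weak, int fallback)
{
    if (const std::shared_ptr<int> strong = weak.lock())
    {
        return *strong; // `strong` keeps the object alive for this scope
    }
    return fallback;
}
```

Locking first turns the validity check and the acquisition into one atomic step, so another thread cannot release the object between them.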
Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>
Change-Id: I6872b10d058ee7f3707ba641f44bb6116e26880a
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10960
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
The reorders supported at the moment are:
ab->BA4b4a
ab->BA8b4a
Co-Authored-By: David Mansell <David.Mansell@arm.com>
Change-Id: Ic466465629ce3bcdcee0089e251485b79b60e1f3
Signed-off-by: Renato Arantes <renato.arantes@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10775
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
Suppress a false positive compiler warning caused by a bug in GCC https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104165
This issue is known to be reproducible in some versions of GCC 11, 12 and 13.
Remove a redundant std::move flagged by -Werror=redundant-move
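For context, a generic instance of the pattern that GCC reports under -Wredundant-move is sketched below. It is not the line changed by this patch, and the function name is illustrative.

```cpp
#include <cassert>
#include <vector>

// Flagged form (comment only): returning a by-value parameter is already
// treated as a move, so the explicit call adds nothing.
//   std::vector<int> append_one(std::vector<int> v)
//   { v.push_back(1); return std::move(v); }

// Equivalent form without the redundant call.
std::vector<int> append_one(std::vector<int> values)
{
    values.push_back(1);
    return values; // implicitly moved on return
}
```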
Resolves: COMPMID-6777
Change-Id: I782e87b5e3df4c09195e67a37f49d122dc918224
Signed-off-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10950
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: <felixjohnny.thomasmathibalan@arm.com>
Comments-Addressed: <felixjohnny.thomasmathibalan@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
- Enables FP16 lut for logistic activation
- Adds LUTManager to re-use lut where appropriate.
Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com>
Change-Id: I94667b63b452a8e58a1eb59cb0b5866178954523
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10864
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
- For quantized RELU activation, de-quantization and re-quantization
are not required since only a comparison against the quantization
bias is required.
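The equivalence can be sketched for unsigned 8-bit data as follows. This assumes that "quantization bias" refers to the quantization offset (zero point) and that input and output share the same scale and offset; the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Long route: dequantize, apply ReLU, requantize with the same parameters.
uint8_t relu_via_float(uint8_t q, float scale, int32_t offset)
{
    const float real    = scale * static_cast<float>(static_cast<int32_t>(q) - offset);
    const float relu    = std::max(0.0f, real);
    const long  requant = std::lround(relu / scale) + offset;
    return static_cast<uint8_t>(std::min<long>(255, std::max<long>(0, requant)));
}

// Short route: a single comparison against the quantization offset.
uint8_t relu_quantized(uint8_t q, int32_t offset)
{
    assert(offset >= 0 && offset <= 255);
    return static_cast<uint8_t>(std::max<int32_t>(q, offset));
}
```

Since the scale is positive, a real value is negative exactly when its quantized value is below the offset, so clamping at the offset is equivalent to clamping the real value at zero.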
Resolves: COMPMID-6340
Change-Id: I574bd220f3d0d893b7f7c4819a883e2a131f61f4
Signed-off-by: Sangwon Ha <sangwon.ha@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10916
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: <felixjohnny.thomasmathibalan@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
The issue appears when this kernel is used by convolution operators because the stride calculations consider only simple matrix multiplication.
In conv2d-triggered runs, Rhs does not have the same dimension as Lhs and Dst. There are also cases where Lhs and Dst are interpreted as 3D, with their X and Y dimensions (in the convolution sense) collapsed into one.
Resolves: COMPMID-6764
Change-Id: If443e6eb8f7a5cca1acc58b37c598122a013e69b
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10913
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
This patch adds the latest Gpus as Gpu Target and sets up kernel selection heuristics for MatMul to address some nightly issues.
Resolves: COMPMID-6766
Change-Id: I29dbb08c5ecfb3fcd63230b0b1675ab557074aca
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10902
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
While writing this gemm kernel, code pieces, including validations, were adapted from ClGemmMatrixMultiplyReshapedOnlyRhsKernel, and this validation should be about reinterpret_input_as_3d. This reveals a test gap for this kernel: there are currently no tests stressing this condition, but this is not going to be addressed as part of the bug ticket.
The corresponding snippet in ClGemmMatrixMultiplyReshapedOnlyRhsKernel is
if (gemm_info.reinterpret_input_as_3d)
{
    ARM_COMPUTE_RETURN_ERROR_ON(src0->dimension(1) * src0->dimension(2) != m);
}
else
{
    ARM_COMPUTE_RETURN_ERROR_ON(src0->dimension(1) != m);
}
Resolves: COMPMID-6757
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: I4363effcaf2b43ff3674a3443058384338fb9714
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10891
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
|
|
This reverts commit 270576a9fbeeda5210483931388e62f9a1059dd9.
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: Ia4e965156af46a9afd78819e90fd2a033a97fc2b
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10888
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
This fix modifies some of the conversions done in the generate proposals kernel that cause DDK issues while compiling the kernel.
The issues are mostly related to conversion from i64 to fp16 and do not affect fp32. First, the type identifier size_t is converted into unsigned int. This change alone compiled but caused mismatches, even on older devices where the tests passed before. Therefore, the fp16 conversion is delayed until vector construction, where the integers are now converted to fp32 and then to fp16. Although this may not be ideal, it seems to be the best solution.
Resolves: COMPMID-6756
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: Iee61216c908fe51431985b80c3653fc32add4741
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10879
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
While writing this gemm kernel, code pieces, including validations, were adapted from ClGemmMatrixMultiplyReshapedOnlyRhsKernel, and this validation should be about reinterpret_input_as_3d, which is not dealt with in this kernel. The mmul kernel only deals with reinterpret_output_as_3d, which is equivalent to depth_output_gemm3d != 0. This reveals a test gap for this kernel: there are currently no tests stressing this condition, but this is not going to be addressed as part of the bug ticket.
The corresponding snippet in ClGemmMatrixMultiplyReshapedOnlyRhsKernel is
if (gemm_info.reinterpret_input_as_3d)
{
    ARM_COMPUTE_RETURN_ERROR_ON(src0->dimension(1) * src0->dimension(2) != m);
}
else
{
    ARM_COMPUTE_RETURN_ERROR_ON(src0->dimension(1) != m);
}
Resolves: COMPMID-6757
Change-Id: I73b203594b22098a5374c1fac6969ee769969901
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10874
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
* Changes in filelist.json: moved fp16 code from common to fp16
* Replaced the guard __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
with ENABLE_FP16_KERNELS.
* Resolves COMPMID-6755
Change-Id: I4da1c53d3f9e4734e5e67125265ab4e3fc0dcbe4
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10865
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
The graph example has fixed quantization information given for certain layers, and some of the offsets exceed the 8-bit range for Int8 data type.
This shouldn't have been the case and the offsets should respect the 8-bit quantization specification laid out here: https://www.tensorflow.org/lite/performance/quantization_spec
However, the mechanism added in the helper function introduces robustness against such irregularities at little or no cost, and is therefore added as a fix.
Resolves: COMPMID-6748
Change-Id: If39bf323382f109fa100ee2b87ce63cc7bc89759
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10858
Reviewed-by: SiCong Li <sicong.li@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
|
|
The function pointer for clImportMemoryARM should be loaded in a portable way as recommended by Khronos® as outlined here:
https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#getting-opencl-api-extension-function-pointers
using clGetExtensionFunctionAddressForPlatform() call.
All extensions should ideally be loaded using the above mentioned function.
Resolves: COMPMID-6732
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: I482b6bde721267d5e8c08301e5780d28a9c5ba85
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10852
Benchmark: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
|
|
Resolves: COMPMID-6622
Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Change-Id: Ibac276618bdda125dcbb9c851c547f12739b15b4
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10749
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
This reverts commit ded5b182675e3166e947a8eb637b5b1e925816ab.
Resolves COMPMID-6735
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Change-Id: I9b69ca1ec80a671171d3f52081c4b8c61a676617
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10838
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: <felixjohnny.thomasmathibalan@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
Implement a single kernel instead of having two consecutive ones. In the previous setup, one kernel was calculating the maximum value in the axis, and this maximum was being subtracted from each data element while calculating the softmax, i.e.
softmax(x_i) = exp(x_i - max) / sum_i( exp(x_i - max) )
This patch integrates these two stages into a single kernel for Neon™ for all data types. This will save some memory because we don't need to hold the max values in a separate auxiliary tensor.
It also introduces some other optimizations that will ease memory pressure when the data type is float/half, by using the dst tensor as temporary storage for already exponentiated inputs.
It removes the references to SVE and SVE2 implementations, and most of the associated files; but, it leaves the implementations as these may be used in the future.
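The single-kernel structure can be sketched in scalar C++ for float data. This is an illustration of the approach described above, not the Neon kernel itself; the function name and the use of std::vector are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over one axis in a single routine. `dst` doubles as scratch
// space for the exponentials, so no auxiliary tensor for the max values
// or for temporary results is needed.
void softmax_1d(const std::vector<float> &src, std::vector<float> &dst)
{
    assert(!src.empty());
    dst.resize(src.size());

    // Stage 1: maximum along the axis, kept in a local variable.
    const float max_val = *std::max_element(src.begin(), src.end());

    // Stage 2: exponentiate the shifted inputs straight into dst.
    float sum = 0.0f;
    for (std::size_t i = 0; i < src.size(); ++i)
    {
        dst[i] = std::exp(src[i] - max_val);
        sum += dst[i];
    }

    // Stage 3: normalize in place.
    const float inv_sum = 1.0f / sum;
    for (float &value : dst)
    {
        value *= inv_sum;
    }
}
```

Subtracting the maximum keeps every exponent at or below zero, so large inputs such as 1000.0f do not overflow.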
Resolves: COMPMID-6500
Signed-off-by: Gunes Bayir <gunes.bayir@arm.com>
Change-Id: Icff9976d1214c4c6cbe15a62ca60b8a77d3784cc
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10688
Reviewed-by: SiCong Li <sicong.li@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
* This is the initial patch to start working on enabling fp16 in all
multi_isa builds. More changes are required in the way we register
the kernels using the macro REGISTER_FP16_NEON.
* In this patch we add the capability to build the fp16 files listed in
filelist.json with the correct arch option to enable FP16
* This patch is required towards building a universal multi_isa binary
where fp16 is enabled.
* Enable REGISTER_FP16_NEON macro for all builds by removing
__ARM_FEATURE_FP16_VECTOR_ARITHMETIC guard from the macro definition.
The macro has to be used across all types of builds.
Change-Id: I99f4c273f6ee04cad3c097e5e374200f48568fa9
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10682
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Reviewed-by: Jakub Sujak <jakub.sujak@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|
|
* Moved NCHW kernels fp16 and fp32 to their corresponding files
src/cpu/kernels/fuse_batch_normalization/nchw/neon/fp16.cpp and
src/cpu/kernels/fuse_batch_normalization/nchw/neon/fp32.cpp
* Changes in filelist.json to include the new fp16 and fp32 files
* Moved the template batch_normalization_nchw to impl.h as we
need to instantiate it from fp16.cpp and fp32.cpp
* Pooling layer: removed the guard __ARM_FEATURE_FP16_VECTOR_ARITHMETIC that
prevented the FP16 kernel execution.
* Partially resolves MLCE-1102
Change-Id: Ia8c85e9ffb76c9e387f9ae2685e5df5e52c8dc27
Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com>
Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/10777
Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
Tested-by: Arm Jenkins <bsgcomp@arm.com>
Benchmark: Arm Jenkins <bsgcomp@arm.com>
|