aboutsummaryrefslogtreecommitdiff
path: root/src
AgeCommit message (Collapse)Author
2022-11-11Fix compiler warnings in dynamic fusionSiCong Li
Remove conflicting Padding2D from the unused comparison operator in the prototpye Resolve unused variables in release mode Resolves COMPMID-5683 Signed-off-by: SiCong Li <sicong.li@arm.com> Change-Id: I19d74c57e51e6cf64003ddcbc74227608bb866d2 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8590 Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-11-09Fix CPU multiplication layer threading overheadViet-Hoa Do
* When the tensors are reinterpreted as 1D, any thing smaller than 10KB won't be splitted into different thread. Resolves: COMPMID-5630 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: Icff7089e37c85c8b325f099008a080a5805d36a2 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8581 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-11-04Fix compiler warnings in dynamic fusionSiCong Li
Resolves: COMPMID-5686 Change-Id: I608c359583c44f2f04f29faddd1c6b38a381de60 Signed-off-by: SiCong Li <sicong.li@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8562 Benchmark: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2022-11-04Fix activation block in gemm.clGian Marco Iodice
- Replace VEC_SIZE with N0. VEC_SIZE was used in the old gemm kernel and not used anymore in the existing ones Resolves COMPMID-5678 Change-Id: Ia770200b9d6e24c51c57347e4634fb8eadd10385 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8556 Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2022-11-02Partially Revert "Add threshold for floating-point SOFT_RELU activation"Gunes Bayir
Revert the range removal in tests for soft relu and bring the former implementation in CL backend back. Resolves: COMPMID-5677 Change-Id: I35d5ac03a134299041ce97aabc9fff2d4380d09f Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8551 Reviewed-by: Milos Puzovic <milos.puzovic@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>
2022-11-01Fix fixed-point quantized additionViet-Hoa Do
* The range check condition is incorrect. The maximum value of 8-bit quantized data is 255, not 1023. * Use 256 to make the check slightly stricter than it should. Resolves: COMPMID-5458 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I5c071680531574b409a7ce71a732f6480caa10d8 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8537 Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-11-01Updateable weights in depthwise convolutionMilos Puzovic
Check whether weights are defined as constant, if they are not constant then do not pack them if they are already packed so that they can be updated. Signed-off-by: Milos Puzovic <Milos.Puzovic@arm.com> Change-Id: I73447e31e3660b05f8f40e04ea4ea2003eb9b802 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8539 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-11-01Add threshold for floating-point SOFT_RELU activationMilos Puzovic
Added missing threshold for calculating SOFT_RELU when SVE and CL implementations are used. As a result removed from the testing bounds for input values that were set to be in the interval [-40, 40]. Resolves: COMPMID-5658 Signed-off-by: Milos Puzovic <Milos.Puzovic@arm.com> Change-Id: I3d14df60125e36e4eb85aeb222f4fb0cc5741521 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8536 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-11-01Add check for Batch Matmul in GemmAssemblyDispatchMohammed Suhail Munshi
Relates to : COMPMID-5507 Change-Id: Ia2c4ea153ac2524ffa6b2a9c10f3a0318a8a67a1 Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8509 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: SiCong Li <sicong.li@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-11-01Rewrite dynamic fusionSiCong Li
The new version introduces the following major changes: * Change public interface to simplify and standardize the user experience - Use the term "Workload" uniformly - Simplify operator interface to be a set of static methods: validate_op(), create_op() * Separate the kernel writing into its own component (template_writer). This is to allow the co-development of GpuKernelWriter, and to allow easy replacement once GpuKernelWriter is mature. * Optimize the core fusion algorithm used by the component graph. The details can be found in GpuKernelComponentGraph::fuse() * Use Gpu instead of Cl prefixes for most of the Workload interfaces (except for runtime and kernel components, which have to be language specific) This allows the potential extension to other Gpu langauges in the future. * Refactor runtime memory interface so that auxiliary tensor handling is separate from the user tensor passing. This is because the former is less stable and may require extension in the future. * Hide source code object from the user as it is not required at the moment * Deprecate the old prototype entirely by disabling it in SCons build Resolves COMPMID-5510, COMPMID-5512, COMPMID-5513 Change-Id: If69d2362856f2de4503546b7b6cf48a525cf3079 Signed-off-by: SiCong Li <sicong.li@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8406 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-11-01Rework direct convolution heuristic on OpenCLGian Marco Iodice
Resolves COMPMID-5634 Change-Id: I075de70d509d0c4430b4bcf3f218384e237a3a56 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://eu-gerrit-1.euhpc.arm.com/c/VisualCompute/ComputeLibrary/+/453708 Tested-by: bsgcomp <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: bsgcomp <bsgcomp@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8473 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com>
2022-10-27Fix fixed-point quantized additionViet-Hoa Do
* Use the same rounding function for the left-over part with the vectorized part. Resolves: COMPMID-5640 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I07450b2a43390b77539b78cd5d3e6772bdc38548 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8520 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-10-24Add FP16 tanh based on rational approximationJonathan Deakin
Use rational approximation with optimised coefficients to calculate tanh_f16. Method is ~2.5x faster than previous and has lower relative and absolute error. This will fix https://github.com/ARM-software/ComputeLibrary/issues/1002 Credit to George Steed for suggesting use of vcageq instead of min+max Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com> Change-Id: Id70da3aab666c68b0d798266a837b59c00937bf7 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8480 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-10-20Update reinterpret tensor as 1D for CPU addViet-Hoa Do
* Use the same implementation as other layers. Resolves: COMPMID-5108 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I5a50259b398b71ca1f61b5ee8daa539bf8263fac Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8501 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-10-20Add test in GEMMLowp for batch matmulMohammed Suhail Munshi
- Adds tests for batched matrix multiplication - Bugfix for issue : 3d tensors input tensors with offsets in GemmLowp results in mismatches Resolves : [COMPMID-5507] Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Change-Id: I68e036fbca642c1841dd4321033045aadc8f5636 Reviewed-on: https://eu-gerrit-1.euhpc.arm.com/c/VisualCompute/ComputeLibrary/+/461298 Comments-Addressed: bsgcomp <bsgcomp@arm.com> Tested-by: bsgcomp <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8482 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-10-19Fix FFTConvolutionLayer testViet-Hoa Do
* Do not change the tensor info after configure stage. * By fixing this, the 1D optimization for activation layer can be applied to all data types and tensor layout. Resolves: COMPMID-5644 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I557f9bb84e5e456c28d6b423584887d7a3648ad4 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8470 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Mohmun02 <MohammedSuhail.Munshi@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-10-12Optimize Neon™ Logistic ActivationMohammed Suhail Munshi
- Use a 1d execution window to improve memory access pattern. Resolves: [COMPMID-5465] Signed-off-by: Mohammed Suhail Munshi <MohammedSuhail.Munshi@arm.com> Change-Id: Ida30669ffa06eb002ca43a6edf15e25a6eaad2f6 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8344 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-10-12Adding documentation section explaining how BF16 is usedRamy Elgammal
Resolves: COMPMID-5494 Signed-off-by: Ramy Elgammal <ramy.elgammal@arm.com> Change-Id: I8f512745855b8ca21181a9ab21323bfff6aeb866 Reviewed-on: https://eu-gerrit-1.euhpc.arm.com/c/VisualCompute/ComputeLibrary/+/458884 Tested-by: bsgcomp <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Comments-Addressed: bsgcomp <bsgcomp@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8391 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-10-10Fix LUT-based activation layerViet-Hoa Do
* Use the window instead of the tensor shape to determine the number of elements in the x-dimension. * Remove the LUT implementation in 32-bit build. Resolves: COMPMID-5641 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I0a79aa38d8f6a105ad01785bd94571f5a2ecb348 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8380 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
2022-10-07Workaround CL compiler issue on FP16Viet-Hoa Do
Resolves: COMPMID-5600 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I5196d1639c48d0b8a116d47ed1d6c7334dc8f41e Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8374 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-10-07Optimize Neon™ SUB operator by squashing execution windowJakub Sujak
Resolves: COMPMID-5462 Change-Id: I2c7151c8faf4016cc33592fff04d492d7cbc8fd6 Signed-off-by: Jakub Sujak <jakub.sujak@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8366 Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-10-06Rework DepthwiseConvolution heuristic on OpenCLGian Marco Iodice
Resolves COMPMID-5632 Change-Id: I2bdbe69a610ca2510fbd74d5d412842679299762 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8365 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by: Jakub Sujak <jakub.sujak@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-10-06Improve start-up time in gemmlowp reshaped rhs only.Adnan AlSinan
- Does not Pass RESULT_MULTIPLIER, RESULT_SHIFT When PER_CHANNEL_QUANTIZATION is defined. Resolves COMPMID-5499 Signed-off-by: Adnan AlSinan <adnan.alsinan@arm.com> Change-Id: Ie433eaf83c003a6d5ccfeb89eb2783528dc2b48e Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8316 Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-10-04Update GEMM reshaped rhs only heuristicGian Marco Iodice
Resolves COMPMID-5631 Change-Id: I37d1d0d043f8d44d782d2225091af607ad131b58 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8364 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com>
2022-10-03Force CL kernel compilation with 64 registersViet-Hoa Do
* For DDK version 30 and higher, force the CL compiler to use 64 registers for NHWC direct convolution. Resolves: COMPMID-5508 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I7d9ecc3b5a4eceaff44542cd26f6f05e30ab2c1f Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8351 Benchmark: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
2022-10-03Fix Batch Matmul nightly failureAdnan AlSinan
Resolves COMPMID-5601 Signed-off-by: Adnan AlSinan <adnan.alsinan@arm.com> Change-Id: I1baf92c4751d784d017c0b2f7de1fc09e42ce69c Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8309 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-10-03Optimize CPU add layer on quantized dataViet-Hoa Do
* Use fixed-point arithmetic where possible. * Various optimization for the FP32-based implementation. This implementation is kept as the fall-back solution in case of unrealistic quantization parameters that exceed the range of fixed-point solution. Resolves: COMPMID-5458 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: I221d2d3801ecaae4fe0b7cf6ae8ef00ca3743665 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8317 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-28Fix overflow in NEActivationLayer for FP16 typePablo Marquez Tello
* Resolves MLCE-924 Change-Id: I3cc3d30893c2ee0865eacafdc1d9ba3d5b876d32 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8326 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-26Add FP32 Neon™ swish activationJonathan Deakin
Change-Id: Id37b59adbc8c4cbe218d1652aeb02a0b4ce42c66 Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8256 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-23CPU GEMM: Fix overreads in SVE merges.David Mansell
SVE merges for interleaved kernels were not guarding bias reads with the correct predicates, leading to overreads and crashes in some cases. Fix to use the appropriate predicate. Resolves: COMPMID-5627 Change-Id: Ib049531c4a3bea56e90623b7b9f0d6a7ab4db2c8 Signed-off-by: David Mansell <David.Mansell@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8315 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com>
2022-09-22Fix unresolved symbol for target armv7a + AndroidPablo Marquez Tello
* Resolves COMPMID-5599 Change-Id: I4c1df48eda289c82ca567f6808fccd0b09065223 Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8302 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-16Fix validation in validate_image2d_support_on_rhsGian Marco Iodice
Resolves COMPMID-5533 Change-Id: Ice3d9469c7486a700c58fb61fc692b13f368d202 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8148 Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-16Fix bug in QASYMM8_SIGNED to F32 cast layerViet-Hoa Do
Resolves: COMPMID-5580 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: Ia731560c23a6ab2e0ead5a857fbabb9cbc25154c Reviewed-on: https://eu-gerrit-1.euhpc.arm.com/c/VisualCompute/ComputeLibrary/+/452428 Tested-by: bsgcomp <bsgcomp@arm.com> Reviewed-by: Pablo Tello <pablo.tello@arm.com> Comments-Addressed: bsgcomp <bsgcomp@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8268 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
2022-09-16Optimize Quantized/Integer Bilinear Scale for Neon™Gunes Bayir
This patch introduces several performance optimizations regarding the Bilinear Scale operator with REPLICATE Border mode. Changes apply only to NHWC. This patch - Reduces the memory footprint by disabling precomputation of indices and weights when they're not used - Rewrites the kernels for QASYMM8/QASYMM8_SIGNED/U8(Uint8) - Adds S8(Int8) Bilinear Scale for Border mode REPLICATE - Removes Bilinear Scale SVE kernels for Quantized and Integer types and adjust the heuristics to choose the Neon™ implementation - Adds new test cases where the input and output of the Bilinear Scale operator have different quantization scale and offset Resolves: COMPMID-5453, COMPMID-5454 Change-Id: I3d251e76e0c6978fd5a0a1795ec62ab536bec93c Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8250 Reviewed-by: SiCong Li <sicong.li@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-14Interpreting tensor as 1D for CPU multiplicationViet-Hoa Do
* Also fix a bug in mul_U8_U8_U8. Resolves: COMPMID-5460 Signed-off-by: Viet-Hoa Do <viet-hoa.do@arm.com> Change-Id: Ie1edafeae7aaad91164caeeb04661a8974a7fc1b Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8244 Reviewed-by: SiCong Li <sicong.li@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-14Fix invalid memory access for dynamically fused Cl Elementwise kernelsSiCong Li
The M0 and N0 were incorrectly set for the case of broadcasting when the elementwise component is non-root. This is because we previously always use rhs tensor to derive the load M0, N0. But for non-root components, the addend/divisor tensor can be in the lhs or rhs. Thus this would fail in case the addend/divisor is in the lhs. - Also fixes broken Dynamic Fusion test Resolves COMPMID-5482 Signed-off-by: SiCong Li <sicong.li@arm.com> Change-Id: I37f27ffa392781387db15739b1666f1dad28c554 Reviewed-on: https://eu-gerrit-1.euhpc.arm.com/c/VisualCompute/ComputeLibrary/+/445890 Tested-by: bsgcomp <bsgcomp@arm.com> Reviewed-by: Mohammed Suhail Munshi <mohammedsuhail.munshi@arm.com> Comments-Addressed: bsgcomp <bsgcomp@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8111 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-14Adding GELU activationMurray Kornelsen
OpenCL implementation uses built in erf. NEON implementation requires new vectorized erf. Uses the following approximation: erf(x) = 1 - 1 / (1 + a1x + a2x^2 + a3x^3 + a4x^4)^4 a1 = 0.278393, a2 = 0.230389, a3 = 0.000972, a4 = 0.078108 From https://en.wikipedia.org/wiki/Error_function#Numerical_approximations Signed-off-by: Murray Kornelsen <murray.kornelsen@mail.mcgill.ca> Change-Id: I2d3964b2c26a4334166b17135f9104bc6324fad2 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/7921 Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Pablo Marquez Tello <pablo.tello@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-14INT8 Quantized MeanStdDevNorm (LayerNorm)Murray Kornelsen
Implements LayerNorm for qasymm8 tensors. Uses uint8x16 loads and stores. Summation is performed in integer arithmetic (vpaddl) Normalization is performed in float32 before requantizing back to int8. Signed-off-by: Murray Kornelsen <murray.kornelsen@mail.mcgill.ca> Change-Id: I2407c8b34717fb47adab98791bd76fb8a3c62f4a Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/7922 Comments-Addressed: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Viet-Hoa Do <viet-hoa.do@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-12Add test for NEGEMM to test a batched matrix multiplication with variable ↵Adnan AlSinan
input tensors - Add a test for CPU to batched matrix multiplication with variable input tensors - Disable assembly kernel when using _reshape_b_only_on_first_run flag Resolves COMPMID-5501 Signed-off-by: Adnan AlSinan <adnan.alsinan@arm.com> Change-Id: If96b182584617806a9dfe597dbfaf05241b123c2 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8234 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-09Rework heuristic in ClConv2dGian Marco Iodice
The heuristic has been tweaked to call direct convolution when we think it can be faster than gemm-based convolution. The main change is affecting the selection of the convolution method on the first layer. In general, the question we should ask for the first convolution layer of a model is: when the execution time of im2col + gemm < direct?. Since im2col does not depend on the OFM, it means that when OFM is big enough, the contribution of im2col is small and the GEMM approach is preferable. From internal experiments, the OFM threshold is 64. Resolves COMPMID-5504, COMPMID-5504, COMPMID-5477 Change-Id: If1bd1fa93c185ffa874388e29866244e62ca3494 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8231 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-09-09Optimize FP32/16 Bilinear Scale Kernel for Neon™Gunes Bayir
This patch removes index and weight pre-computations where it's not used and reduces some calculations inside the inner-most loop of Scale. Resolves: COMPMID-5452 Change-Id: Ie149b1b76a90a8cb659ada0f97aef78caf69932f Signed-off-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8220 Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com>
2022-09-09Add a macro guard in all OpenCL kernels in gemmlowp.clGian Marco Iodice
Resolves COMPMID-5498 Change-Id: I474f3f963257014255d082aab0ccbe3efe5aa067 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8222 Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Ramy Elgammal <ramy.elgammal@arm.com> Reviewed-by: Ramy Elgammal <ramy.elgammal@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-08Disable Winograd on fp16 if fast-math = falseRamy Elgammal
- That would force CpuConv2d::get_convolution_method() choose GEMM_CONV2D or GEMM methods instead. Resolves: COMPMID-5531 Signed-off-by: Ramy Elgammal <ramy.elgammal@arm.com> Change-Id: I4dec8772e8c150da003d9a89c1d036057c4d28b0 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8233 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com>
2022-09-07Optimize depthwise convolution on OpenCLGian Marco Iodice
The optimization concerns the case where the depth multiplier is > 1. The depth multiplier for loop has been removed from the OpenCL kernel and the GWS has been mapped to the output shape. In this way, we can still perform a tile with N0 columns and improve the performance of depthwise conv over 80% when depth multiplier is > 1. Resolves COMPMID-5568 Change-Id: I604e287d4eeb31c54b9cc6c3072a698cd0e3e136 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8184 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-02F16 Specialization for MeanStdDevNormMurray Kornelsen
Ran into issues with f16 meanstddevnorm. Essentially, with large enough tensors and/or large values in tensors, output becomes all 0. This is due to the variance computation. In f16, it reaches infinity quite easily, then the division results in 0. This change modifies the OpenCL and NEON implementations to compute the sum of squares and the variance using f32, while other operations remain f16. Update: Found that the square operation also benefits from f32, rather than squaring in f16 and accumulating f32. Signed-off-by: Murray Kornelsen <murray.kornelsen@mail.mcgill.ca> Change-Id: Ide00afd84ec6d26fec4d53b073e295814f08ba46 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/7959 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Comments-Addressed: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-02Enable Winograd-based conv2d when IFM>=8 on GpuGian Marco Iodice
From an internal performance evaluation, it seems that Winograd-based Conv2D offers better performance than alternative methods such as direct convolution and gemm-based conv already from IFM=8. Before the condition was for IFM>=16 Resolves COMPMID-5532 Change-Id: I9ff04835d6fd07f5f0abeec9645c9d9cc913b6b7 Signed-off-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8147 Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-09-01Use parent buffer in CLSubTensor. This avoids calling enqueueMapBuffer ↵Murray Kornelsen
repeatedly when mapping multiple children of the same parent. Signed-off-by: Murray Kornelsen <murray.kornelsen@mail.mcgill.ca> Change-Id: I7ff554915a37320dfd94f15a6fb01a72a235cf39 Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/7920 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Reviewed-by: Pablo Marquez Tello <pablo.tello@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-08-24Fix add for tensors with non-matching stridesJonathan Deakin
Previously, the add_as_1d_array kernels were used on tensors with non-matching strides which caused the wrong elements to be added. The fix is to check that the strides are equal when selecting the addition kernel. Change-Id: I914ca2b95e5b8ed1875ec5ebe129bdfe2845496b Signed-off-by: Jonathan Deakin <jonathan.deakin@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8120 Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-08-24Fix validation problem in CLQLSTMLayerPablo Marquez Tello
* Resolves MLCE-799 Change-Id: I3d5b2afbf65d159aa5e645743c1e139110dfc20e Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8093 Tested-by: Arm Jenkins <bsgcomp@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Reviewed-by: Gian Marco Iodice <gianmarco.iodice@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com>
2022-08-23Fix macos build errorsPablo Marquez Tello
* Resolves MLCE-903 Change-Id: I39fb3b4b395a37c0f32481830f94d85ec15e205f Signed-off-by: Pablo Marquez Tello <pablo.tello@arm.com> Reviewed-on: https://review.mlplatform.org/c/ml/ComputeLibrary/+/8124 Reviewed-by: Gunes Bayir <gunes.bayir@arm.com> Comments-Addressed: Arm Jenkins <bsgcomp@arm.com> Benchmark: Arm Jenkins <bsgcomp@arm.com> Tested-by: Arm Jenkins <bsgcomp@arm.com>