* Convert AvgPool ops with stride_width > 3 and VALID padding to
Conv2D so that they can be optimized to run on the NPU.
Change-Id: I06ab412357f0b09b1498f9019a9d1963a324ad34
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
* Fix a bug that caused filter padding to not be added proportionally
  to the hardware padding added to the IFM.
* Update the needed_total_padding function that calculates hardware
  padding to also account for cases in which the IFM width is not
  divisible by the stride width.
* Update the supported ops constraint on Conv2D strides to mark ops
  with stride width > 3 and an IFM width that is not divisible by the
  optimization resize factor as unsupported.
* Update the unit tests that verify whether ops are correctly reported
  as supported.
Change-Id: I62f14cca890b779ca787a9603fa37c873ad522f8
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
Change-Id: I4f466a7bac77d8bb6fa7243ea2e7c9f3be6d0585
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
Update import of Sized from collections to collections.abc to work with
Python 3.10
Change-Id: Iae281db9402331972ad13660d04523608b23614d
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
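The commit above reflects a Python 3.10 change: the abstract base
classes were removed from `collections` (they had been deprecated
aliases since Python 3.3) and now live only in `collections.abc`. A
minimal sketch of the corrected import:

```python
# Before (fails with ImportError on Python 3.10+):
#   from collections import Sized
from collections.abc import Sized  # works on Python 3.3 and later

# Sized matches any object that implements __len__
assert isinstance([1, 2, 3], Sized)
assert not isinstance(42, Sized)
```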
|
- Added RSQRT int8 support, implemented as a LUT.
- Added a test for supported operators
- Updated SUPPORTED_OPS.md
Change-Id: I34904772e044be8d22a6dfe426edf85358a205b7
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- When optimizing for Size, the scheduler does not try to add weight
  buffering to the schedule, since that would add extra SRAM usage to
  the peak usage. However, all other ops that use less SRAM than the
  peak have memory available that could be used for weight buffering
  and hence improve performance.
- Removed the limitation of only running the optimize-schedule step
  when optimizing for Performance. Whether optimizing for Performance
  or Size, the scheduler flow is the same except that the limit for
  max SRAM usage differs.
Change-Id: I6880b35655e37b4916a9c15150f0b8e5126a1cd8
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- Cascading was recently enabled for Resize ops. A Resize op is
  transformed into several ops, the last of which is a
  DepthwiseConv2DBias using NEAREST resampling mode. This resampling/
  upscaling was not taken into account when calculating the IFM box
  size, causing the coordinates to go out of bounds.
- When generating the high-level command stream there is a check for
  whether an op is a resize op; if so, an upscaling factor is
  calculated. The fix is to change this check to instead test whether
  the operator uses NEAREST resampling mode, and if so apply the
  scaling factor.
Change-Id: I5308a383cc3310c53004ccfe2d6fabf256478a26
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- Added a fix when building the minimum schedule to force the stripe
  to be even for is_nearest ops. This is required in order to allow
  cascading for resize ops.
- Removed the limitation in the cascade builder that prevented resize
  ops from being cascaded.
Change-Id: I05150102b91531ecba786936494f1817a4472f42
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- Added release information
- Minor changes to SUPPORTED_OPS.md including version info
Change-Id: I91fae4c40c6c1f25b874268b18d077a9babd4875
Signed-off-by: Tim Hall <tim.hall@arm.com>
|
half_pixel_center=True
Signed-off-by: Alexander Hansson <Alexander.Hansson@arm.com>
Change-Id: I0e9db22c97a9e2fbfee618262ffc43532cfcee2c
|
Signed-off-by: Alexander Hansson <Alexander.Hansson@arm.com>
Change-Id: I35fd042d572f62122ac681c231798c9f2163fc00
|
- Fixed an issue with the fusing of PAD and AVERAGE_POOL_2D whereby
  rounding away from zero didn't work: it requires the zero point to
  be at zero, but the input padding required the zero point to be set
  to the desired value. This affected both int8 and int16. The
  solution was to remove this dependency by using the bias prior to
  the scaling
- Refactored the rounding-away-from-zero mode
Change-Id: I8f2df69df06d2a9722315c346646e5a901cb2c3b
Signed-off-by: Tim Hall <tim.hall@arm.com>
|
Fixed serialization of the attribute container and shared_name, which
was accidentally lost when fixing the crash for a faulty LSTM model.
Change-Id: Ibd11da65735112bed4b1c8bcc4ef048bc093ebc4
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
* Fix import order in test_build.py
* Fix the setuptools_scm dependency version. Previously the version
  was restricted to < 6, which also imposed a version restriction on
  the Setuptools library.
  Because an older version of Setuptools was used, running
  test_build.py::test_build_correct_readme_links would generate a
  UNKNOWN.egg-info directory in the src directory instead of an
  ethos_u_vela.egg-info directory.
Change-Id: I113ca25b23b39d43fa288e6eda16377f4f5b4143
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
Add a --verbose-progress CLI option that enables printing progress
information in the compiler driver and scheduler.
Change-Id: I99ac8c6a654e60391d5c11e28b89250405daa53a
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
Remove unused parameter rescale for faf
Change-Id: Id388d307f3eb0d27bce813ab58e3c9a5f4ba89ae
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
|
* Implement a general optimization solution for strided CONV2D that
  supports a stride_w with no upper bound.
* Implement filter zero padding to allow the optimization in cases
  where the filter width is not divisible by the stride width.
  E.g.: filter width = 8, stride width = 3 ->
  filter width = 8 + 1 (zero padding) = 9, stride width = 3
* Implement partial optimization to reduce the stride to the
  hardware-supported strides (i.e. 2 and 3) when optimizing down to
  stride = 1 is not possible because the IFM width is not divisible
  by the stride width.
* Implement optimization for when SAME padding is used. If the
  pre-optimization and post-optimization padding do not match, add
  zero padding to the filter so that the post-optimization IFM
  padding matches.
Change-Id: Ia66b0d107281fa9993f6bf4d0c26627ee743253b
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
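The filter zero-padding arithmetic in the example above can be
sketched as follows; `filter_zero_padding` is a hypothetical helper
for illustration, not Vela's actual function:

```python
def filter_zero_padding(filter_w: int, stride_w: int) -> int:
    """Number of zero columns to append so that the filter width
    becomes divisible by the stride width."""
    return (stride_w - filter_w % stride_w) % stride_w

# Example from the commit message: filter width 8, stride width 3
pad = filter_zero_padding(8, 3)   # 1 column of zero padding
padded_w = 8 + pad                # 9, which is divisible by 3
```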
|
This reverts commit 72c6a2414205e033279f80b622cdf479c05a4f5b.
Reason for revert: Fix performance regression caused by breaking cascades in certain models
Change-Id: I5aba6e3c59ab27c5129f4a3f0c320ed18df78943
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
- The reference calculates the rounding differently for int8 and
  int16 Conv2d. However, internally a Conv2d can be changed to a
  FullyConnect, in which case the rounding must still be calculated
  following the Conv2d reference.
- The fix is to check the original type to decide whether NATURAL
  rounding should be used. int16 Conv2d uses NATURAL rounding in the
  reference.
Change-Id: I80d48b54372ef7b978ee2e9384a01934dd454e24
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
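A hedged sketch of the original-type check described above; the
attribute names (`original_type`, `ifm_dtype`) are hypothetical
stand-ins for Vela's internals:

```python
from types import SimpleNamespace

def use_natural_rounding(op) -> bool:
    # Decide the rounding mode from the op's *original* type: an int16
    # Conv2d keeps NATURAL rounding even after being rewritten into a
    # FullyConnect. (Attribute names are illustrative only.)
    return op.original_type == "Conv2D" and op.ifm_dtype == "int16"

# A Conv2d that was internally rewritten still rounds like a Conv2d
rewritten = SimpleNamespace(original_type="Conv2D", ifm_dtype="int16")
plain_fc = SimpleNamespace(original_type="FullyConnect", ifm_dtype="int16")
```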
|
Updated the Q0_15_SCALE constant to match the updated value
in the reference.
Change-Id: Id680748c532d41fea9760ec76c0b65c0c3e73a13
Signed-off-by: Fredrik Svedberg <fredrik.svedberg@arm.com>
|
Treat Dynamic Weights as FeatureMap to avoid issues during scheduling
caused by having non-constant ops that produce tensors used as weights.
Change-Id: I2b9ee7fb62a150c5052c6c3b1a6d34f22e9426a9
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
- The reference calculates the scale slightly differently for Conv2d
  and FullyConnect. Recently a fix was submitted to address this
  issue. However, internally a Conv2d can be changed to a
  FullyConnect, in which case the scale must still be calculated
  following the Conv2d reference.
- The fix is to check the original type to decide whether the
  FullyConnect scale should be used.
Change-Id: I5a9fb49126f0df63712b73fb5520fdc604cee378
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
We now read the operator code version, store it in the operator and
write it out to the optimized file.
Signed-off-by: wilisa01 <william.isaksson@arm.com>
Change-Id: Idba672531d2e2a0203a85d3ffca9cf65ace85b47
|
Change-Id: I50b85953bff13bd6ec0648dec5d86b8ac749137a
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
* Add test to verify that the metadata produced in the PKG-INFO file of
the sdist contains the correctly formatted links extracted from
README.md
Change-Id: I300094470fd115b1143aa8c663837e8a77428f24
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
removed redundant row
Change-Id: I8b90df3b45ed863c93572b33f695b06094103015
Signed-off-by: wilisa01 <william.isaksson@arm.com>
|
fixed by using sched op instead of last pass op
Change-Id: I2e03d39462ca07372d85c71e78189bd8c58a1b9c
Signed-off-by: wilisa01 <william.isaksson@arm.com>
|
- The assert triggers when a constant tensor is assigned to buffer 0,
  which is a violation since buffer 0 must remain empty.
- The test case that triggered this problem revealed an error in the
  reader code. If an input tensor has constant data it should use a
  Const op. Before this fix it was assigned a Placeholder op and the
  tensor ended up in the scratch area instead of the permanent area.
Change-Id: I4f92fb5ec1f0dc594defbaca0335eabe68fd5137
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- Added int8 and int16 Exp support, implemented as LUTs.
- Added generic 8-bit and 16-bit LUT table functions following the
  implementation in the latest reference. If new ops are added by the
  reference, they can easily be implemented in Vela using the generic
  functions.
- Moved convert_to_lut to lut.py to keep all LUT-related code in one
  file.
- Updated SUPPORTED_OPS.md
Change-Id: I388e76ea4b39162313599a5341cfb9bad71a782c
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
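The generic 8-bit LUT idea can be sketched as: evaluate the float
function on every representable int8 input, requantize the result and
clamp. The function name and the scale/zero-point values below are
illustrative assumptions, not Vela's actual API:

```python
import math

def make_int8_lut(fn, ifm_scale, ifm_zp, ofm_scale, ofm_zp):
    """Build a 256-entry LUT mapping each int8 input to quantized fn(x)."""
    lut = []
    for q in range(-128, 128):
        x = ifm_scale * (q - ifm_zp)           # dequantize the input
        y = fn(x)                              # apply the op in float
        q_out = round(y / ofm_scale) + ofm_zp  # requantize the output
        lut.append(max(-128, min(127, q_out))) # clamp to int8 range
    return lut

# Example: an Exp LUT with illustrative scales
lut = make_int8_lut(math.exp, 0.05, 0, 0.05, 0)
```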
|
- The array allocated in get_temporal_memory_usage is too small, so
  the first error is that not all LiveRange elements are added to the
  temporal memory usage. The second error happens because
  use_fast_storage_for_feature_maps correctly tries to update the
  temporal memory usage array but asserts due to an out-of-bounds
  access. The array is too small because LiveRangeClass reports the
  wrong end time, caused by inconsistencies in how the mark-usage is
  done for subgraph tensors.
- The fix is to mark the tensors with the current_time value. Also
  changed so that tensors are marked consistently in both extract
  functions. This means that the end time value to use in
  get_temporal_memory_usage is current_time + 1.
- Also made a small update to avoid updating current_time twice when
  handling subgraphs.
Change-Id: Ib7e3681e370e097e433acb235740dfd69fa3ce8b
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- When compiling a model that only contains CPU ops, Vela
  unnecessarily adds an empty buffer.
- This extra buffer is added because the fast scratch tensor always
  occupies index 1.
- Since scratch and fast_scratch do not have any constant data, they
  can use buffer 0.
Change-Id: I25e1fb124deed7069641bde1f571b522c5bf763a
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
Change-Id: Iaeb8f2cea0d3b576a6b138e64a882c701ac88ccb
|
Mean operators with a height larger than 64 are reshaped, but the IFM
shape was then reset to the original value, causing an output diff.
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
Change-Id: I3a89d4efac53173cbd6fe0a5c0542e028bed42ad
|
Updated FlatBuffers autogenerated files to TensorFlow 2.11
Change-Id: Ia39d30b06e9a37c9ab119d501ebf442f32167afe
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
|
- Weights are internally cloned and reshaped/transposed when running
  on the NPU. This already happens in the reader. If the op is passed
  through to the CPU, there is code that writes back these clones but
  with another round of reshape/transpose. This adds extra tensors to
  the optimized file compared to the original file if the original
  tensors are subgraph inputs.
- If the op is passed through to the CPU, the clones should not be
  written to the file. Solved this by setting the src_tensor when
  making the clone.
Change-Id: I9f55d542c099882882920bffe8e15b43b2ca2c8d
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- Fixed a problem where the fused activation got lost
when the op was passed through to the CPU
- The fix is to always make sure the attribute is not removed
Change-Id: I612cfa8f6f0a0465459080762094fe61e7ddc1c3
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- Fixed an issue whereby a zero length buffer was written out instead
of an empty buffer
- Added a warning message to highlight when this type of semantically
incorrect empty buffer is read from an input network
Change-Id: Iac3bc71a2dbfda53737bbeb6e7f895552f0f13d0
Signed-off-by: Tim Hall <tim.hall@arm.com>
|
- Added checking and reporting of missing operator attributes when
reading and writing TFLite file
- Added a TFLite semantic check to ensure that all required attribute
fields of builtin operators are read
- Added some sanity checks for RESHAPE operators that run on the
Ethos-U
- Stopped CPU operators from having their attributes modified
Change-Id: I05700681acdb09554f5945819717c08a9457295c
Signed-off-by: Tim Hall <tim.hall@arm.com>
|
- The latest reference has changed the implementation of the Mean op
  and now only contains one variant.
- Updated the Vela implementation to match the reference. The full
  sum is first calculated and then divided by the number of elements.
- Removed the avg pool variant and its test case.
- Updated SUPPORTED_OPS.md
Change-Id: I4275e36e3697fa837f119f2cefd7c0ff94231605
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
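The single-variant behaviour described above (one full accumulation,
then a single divide) can be illustrated with a minimal sketch; this
is not Vela's implementation, just the arithmetic shape of it:

```python
def mean_full_sum(values):
    """Compute the mean as one full sum followed by a single divide,
    mirroring the described reference behaviour (illustrative only)."""
    total = 0
    for v in values:
        total += v              # the full sum is accumulated first...
    return total / len(values)  # ...then divided by the element count

assert mean_full_sum([2, 4, 6]) == 4.0
```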
|
Added int8 and int16 UNIDIRECTIONAL_SEQUENCE_LSTM support.
The implementation does not include support for:
* CIFG
* Peephole
* Projection
* Normalisation
This change also:
* Removed the unused Op.BlockLSTM operation type.
* Removed the only-one-consumer limitation on putting the SplitSliceRead
  on the tensor consumer(s), if all consumers fulfill the requirements
* Added Op.VariableTensorWrite as an Operation.memory_function to make
sure writes to variable tensors:
* Always use linear mode
* Are not moved to fast scratch
* Are not fused with other elementwise operation tensor ranges
Change-Id: Ief831738924ac3d1f2ba6d41f10bd6dc969911f3
Signed-off-by: Fredrik Svedberg <fredrik.svedberg@arm.com>
|
- Added 64-bit support for ArgMax
- Updated constraints for ArgMax and regenerated SUPPORTED_OPS.md
Change-Id: I4ef7d2e6fccab0088b87757f6afe40a006c77bbd
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- Quantization for the OFM was added to the ArgMax operator as a
  workaround to avoid a crash in the weight compressor. This
  quantization is now removed.
- The weight compressor expects all tensors to have a quantization.
  Updated the code to use scale = 1.0 and zero point = 0 for tensors
  without quantization.
Change-Id: I6816dce2db55f7d795d19f88d7fbe7ee419347fc
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
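The fallback described above can be sketched as follows; the names
(`Quantization`, `effective_quantization`) are hypothetical stand-ins
for Vela's internals:

```python
class Quantization:
    """Minimal stand-in for a tensor quantization record (illustrative)."""
    def __init__(self, scale=1.0, zero_point=0):
        self.scale = scale
        self.zero_point = zero_point

def effective_quantization(quant):
    # The weight compressor expects every tensor to carry a quantization;
    # fall back to the identity quantization (scale 1.0, zero point 0)
    # when none is set.
    return quant if quant is not None else Quantization()

default_q = effective_quantization(None)  # identity fallback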
|
- Updated ARG_MAX to support IFM rank less than 4
- Regenerated SUPPORTED_OPS.md
Change-Id: Icd8e72733279413cbea49021325e1ab06fdc6011
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
Remove the op_index constraint and force linear format for all Conv2D
ops that have strides that can be optimised.
Change-Id: Idef3508ab074ea9abeacac030eaaa15a00ad1211
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
- Add support for ArgMax along the depth dimension, with a depth
  limit of 127.
- Only 8-bit input and 32-bit output are supported
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
Change-Id: I5f6f0503135bebabbb1ca637f9729587b7c60740
|
- There is a latent bug when calculating the memory usage parallel to
  a sub schedule. The error is in the calculation done when
  optimizing the sub schedules: the cascade size is subtracted from
  the snapshot usage to determine non-local memory usage. The problem
  is that the cascade memory usage actually also includes non-local
  memory, so the end result will be zero. This is normally not a
  problem, but it becomes one when starting to optimize sub schedules
  when optimizing for Size.
- The solution is to not include the non-local usage in the cascade
  info; the scheduler already has this information.
- Corrected the usage of the persistent initial IFM. This size should
  not be included for Dedicated SRAM, since only intermediate buffers
  are in SRAM.
- Added some comments to clarify the code in the cascade builder.
Change-Id: I473b36e0d69550ab6565f4ef028195636b362997
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
Refactored move_constant_data in the scheduler. The use case currently
only works for LUT tensors, so the logic is simplified accordingly. To
make it work for other tensors, one would also have to take memory
usage into consideration when building cascades, and
use_fast_storage_for_feature_maps would also be affected.
Change-Id: Ic8de53b65a2c17d34515002d7f184d0ab1830222
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- The logic for bypassing memory-only ops is complicated and still
  does not fix all corner cases.
- This patch simplifies the logic by always bypassing the op by
  replacing the IFM with the OFM. If that is not possible, the
  memory-only op is changed to a memcpy op.
- The bypassing was previously done in two steps but is now reduced
  to one.
Change-Id: I545dd65e0ec77c70be479a5ada2d277cac3a027c
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
- Reshape ops can be bypassed and there is no need for the NPU to
  process them. There are use cases where the IFM must be preserved,
  so a memcpy is needed. This is implemented by an AvgPool.
- In order to reduce the cost of the AvgPool, the IFM can be copied
  by DMA. This is faster, and it can also be turned into a real NOP
  in cases where the IFM and the OFM can use the same memory space.
- Added a new memcpy op. Only the NHWC format is supported, since DMA
  cannot change the format on the fly.
- Allow the OFM to reuse the IFM for the memcpy op
- Make sure the DMA copy size is 16-byte aligned
Change-Id: I3605a48d47646ff60d2bb3644dd3a23f872235a7
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
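The 16-byte alignment of the DMA copy size can be expressed with the
standard round-up-to-a-power-of-two idiom:

```python
def align_up(size: int, alignment: int = 16) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

assert align_up(1) == 16
assert align_up(16) == 16
assert align_up(17) == 32
```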
|
Fixed scale calculations for FullyConnected to match the reference.
Also removed unused low_precision_scaling.
Change-Id: I4b766febff4a0010acd3de708bb49be458d22bf3
Signed-off-by: Fredrik Svedberg <fredrik.svedberg@arm.com>
|