Age | Commit message | Author |
|
- npu_performance now uses write/read shapes instead of the ifm/ofm shapes
for memory cycle estimations.
- Also fixes a would-be bug in the tflite_graph_optimiser, where one
read shape is not Shape4D.
Change-Id: I2067069a713d2cf9e65a5cc227e803de79940fff
Signed-off-by: William Isaksson <william.isaksson@arm.com>
|
|
Performance estimation now uses the parent_tensor mem_area instead of
the scheduler_op mem_area, because the mem_area is only set on the
parent_tensor by the scheduler.
Signed-off-by: wilisa01 <william.isaksson@arm.com>
Change-Id: I11f73686bfbd6958a8920c5e264a5f95cc3f23d1
|
|
- When optimizing for Size the scheduler does not try to add weight
buffering to the schedule since this would add extra SRAM usage to
the peak usage. However, for all other ops that use less SRAM than
the peak there is memory available that could be used for weight
buffering and hence improve the performance.
- Removed limitation to only run optimize schedule when optimizing
for Performance. Regardless of optimizing for Performance or Size the
scheduler flow is the same except that the limit for max SRAM usage is
different.
Change-Id: I6880b35655e37b4916a9c15150f0b8e5126a1cd8
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
|
- Added fix when building the minimum schedule forcing the stripe
to be even for is_nearest ops. This is required in order to be
able to allow cascading for resize ops.
- Remove limitation in cascade builder that prevents resize ops
from being cascaded.
Change-Id: I05150102b91531ecba786936494f1817a4472f42
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
|
Add --verbose-progress CLI option used to enable printing progress
information in the compiler driver and scheduler.
Change-Id: I99ac8c6a654e60391d5c11e28b89250405daa53a
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
|
Added int8 and int16 UNIDIRECTIONAL_SEQUENCE_LSTM support.
The implementation does not include support for:
* CIFG
* Peephole
* Projection
* Normalisation
This change also:
* Removed unused Op.BlockLSTM operation type.
* Removed the single-consumer limitation on putting the SplitSliceRead
on the tensor consumer(s), if all consumers fulfil the requirements
* Added Op.VariableTensorWrite as an Operation.memory_function to make
sure writes to variable tensors:
* Always use linear mode
* Are not moved to fast scratch
* Are not fused with other elementwise operation tensor ranges
Change-Id: Ief831738924ac3d1f2ba6d41f10bd6dc969911f3
Signed-off-by: Fredrik Svedberg <fredrik.svedberg@arm.com>
|
|
Remove op_index constraint and force linear format for all Conv2D that
have strides that can be optimised.
Change-Id: Idef3508ab074ea9abeacac030eaaa15a00ad1211
Signed-off-by: Raul Farkas <raul.farkas@arm.com>
|
|
- There is a latent bug when calculating the memory usage parallel to the
sub schedule. The error is in the calculation done when optimizing the sub
schedules. There the cascade size is subtracted from the snapshot usage
to determine the non-local memory usage. The problem is that the cascade
memory usage also includes non-local memory, so the end result will be
zero. This is normally not a problem, but it will be once sub schedules
are optimized when optimizing for Size.
- The solution is to not include the non-local usage in the cascade
info, since the scheduler already has this information.
- Corrected usage of persistent initial IFM. This size should not be
included for Dedicated SRAM since only intermediate buffers are in SRAM.
- Added some comments to clarify the code in the cascade builder.
Change-Id: I473b36e0d69550ab6565f4ef028195636b362997
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
|
Refactored move_constant_data in the scheduler. The use case currently
only works for LUT tensors, so the logic has been simplified. To make it
work for other tensors one would also have to take memory usage into
consideration when building cascades, and
use_fast_storage_for_feature_maps would also be affected.
Change-Id: Ic8de53b65a2c17d34515002d7f184d0ab1830222
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
|
- There is a problem with large networks containing many NPU
subgraphs. Scheduling takes too long because the snapshot
memory calculation always does a complete update for the
full graph.
- A complete run is needed at the end to calculate all the
time indexes correctly. However, when scheduling an NPU subgraph
it is enough to extract live ranges for the current schedule
and its operators.
Change-Id: Iccb7d6728119c1428ad0b45a2ac34e92158c15bd
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
|
- Previously a feature was added in order to reduce SRAM usage
when optimizing for Size. An investigation has now been done
that shows that this feature is also beneficial when optimizing for
Performance and hence this patch removes the Size only limitation.
Change-Id: I5b130db43cbda47e09d4196ab1daa5a21e35ae00
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
|
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
Change-Id: I026facce572ddce4249e05529f2bb1d285552ab9
|
|
IFMs in persistent memory should not be included in the memory
op SRAM calculation.
Change-Id: Iaac4d2ad8b206c5fb727e5815477cb3611a13e0e
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
|
|
- The cascade builder estimates how much SRAM an operator
uses when calculating the cascades. If an elementwise operator
is included in a cascade, the IFM2 will always be a constant/scalar
in permanent memory, so the size of the
IFM2 should not be included in the SRAM estimate.
- The scheduler did not take into account that the IFM can be reused
for the OFM when calculating the op memory usage, resulting in
a negative number for non-local memory usage. Corrected the
calculation and added an assert to detect future problems.
Change-Id: Id7ec8fe1ec5560290f34579a7b9203a75067aba2
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
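The second point of the commit above can be illustrated with a minimal sketch. It assumes that when the OFM reuses the IFM buffer only the larger of the two buffers is live; the function names and parameters are hypothetical and are not Vela's actual API.

```python
def op_local_memory(ifm_size: int, ofm_size: int, ifm_reused_for_ofm: bool) -> int:
    """Memory used by the op's own IFM and OFM (hypothetical helper)."""
    if ifm_reused_for_ofm:
        # Assumption: the OFM overwrites the IFM buffer, so only one buffer is live.
        return max(ifm_size, ofm_size)
    return ifm_size + ofm_size


def non_local_memory(snapshot_usage: int, ifm_size: int, ofm_size: int,
                     ifm_reused_for_ofm: bool) -> int:
    """Memory in use at this point of the schedule that does not belong to the op."""
    non_local = snapshot_usage - op_local_memory(ifm_size, ofm_size, ifm_reused_for_ofm)
    # Guard against counting a reused buffer twice, which would give a negative value.
    assert non_local >= 0, "non-local memory usage must not be negative"
    return non_local
```

With a snapshot of 150 bytes and a 100-byte IFM reused as a 100-byte OFM, counting both buffers would give -50, whereas counting the shared buffer once gives 50.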
|
|
- Update copyright notices to use SPDX format and add OSS mail as contact.
- Update years on files where it had been missed.
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
Change-Id: I7e9715ea4e17b76252728c708e46df12ad67ab1f
|
|
- Due to a SPLIT op, the following ADD op got an IFM shape
that is bigger than its original shape, but that is handled
by read_offset and read_shapes. The problem was that
the IFM was not considered to be primary and an erroneous
swap was done.
- Make it even clearer when the swap is allowed.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: I0aefa04234f66c935f269267ae8ed1d77da64c81
|
|
- Remove very long live ranges that stand out compared to
their neighbors. These can be seen on large networks with complex
structure. If they are chosen instead of shorter live ranges,
it will be difficult for the HillClimb Allocator to find a perfect
fit in the final allocation.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: I6cf23adfdc06c1e93e12e9cf816453d940ff31f7
|
|
- Refactored an erroneous if statement that allowed illegal
swapping between ifm1 and ifm2 for elementwise operators.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: Iec571f710824432edac9104d960f199f33a1b241
|
|
- The algorithm for trying out different stripes in order
to optimize a sub schedule/cascade has a problem: it
can split the initial cascade into several smaller cascades.
This increases IFM/OFM DRAM
bandwidth and performance drops.
- Changed the stripe algorithm to prefer long cascades.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: I4f38b381597b7094819e9dd463aa1876e4e6bc62
|
|
- The cascade builder is using the ifm_ifm2_correct_order
function in order to decide if the operator is cascadable or not.
The problem is that this function expects a full shape or no shape
and the cascade builder did not provide that, so the operator was
reported to be non cascadable.
- The fix is to provide a full 4D shape, also refactoring
ifm_ifm2_correct_order to use 4D shape to avoid confusion
in the future.
- Refactoring code so that the scheduler can perform a
correct ifm and ifm2 swap.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: I9a86c4690612f332afa428456a07e67698852495
|
|
Fixed output diff when cascading elementwise operators with
reversed operand order.
Signed-off-by: Fredrik Svedberg <fredrik.svedberg@arm.com>
Change-Id: Iac2e28cfb53037b929459af213f4fa7715b3e6de
|
|
Output diffs were found to be caused by odd input stripe heights,
despite the input being an upscaling operator.
Signed-off-by: erik.andersson@arm.com <erik.andersson@arm.com>
Change-Id: Ia3791d815250364cfe7a38c3ed0e30768d64ca08
|
|
- When compiling for shared SRAM the old scheduler has an option so
that it produces schedules using less SRAM than the new scheduler manages to
produce. The old scheduler was able to create more/longer cascades.
In order to improve the new scheduler, the following has been
implemented:
- Take persistent IFMs into account when creating the min schedule.
- Choose longer cascades when it is possible to reduce the total
SRAM usage compared to using shorter cascades.
- Updated calculation for estimated SRAM usage for elementwise ops.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: I209bbf2d94425e4f6aacb1d151b3b2aa65c0870b
|
|
Added check to see if additional stripe data is needed from producer op
when cascading to make sure the stripes are not overwriting data still
being used. Also changed scheduler to make sure ResizeBilinear always
runs with even stripe height.
Signed-off-by: Fredrik Svedberg <fredrik.svedberg@arm.com>
Change-Id: If7d723e6be29575c2b55c400eebbe8275a1aa328
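A minimal sketch of forcing an even stripe height, as described in the commit above. It assumes the height is rounded up; the commit does not state whether the scheduler rounds up or down, and `force_even_stripe_height` is a hypothetical name.

```python
def force_even_stripe_height(stripe_height: int) -> int:
    """Round a positive stripe height up to the nearest even number (assumed policy)."""
    if stripe_height <= 0:
        raise ValueError("stripe height must be positive")
    # Odd heights have a remainder of 1, which bumps them to the next even value.
    return stripe_height + (stripe_height % 2)
```

A real scheduler would presumably also have to keep the result within the height of the feature map; that is left out here.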
|
|
Enabled elementwise cascading for binary/single variable IFM operators.
Signed-off-by: erik.andersson@arm.com <erik.andersson@arm.com>
Change-Id: I1c0867875fdc5c4980224fb570185c11e719d5cd
|
|
- The fast storage allocator is supposed to add all feature maps
that do not fit in SRAM to an evicted list. However, in the
case when conflicting tensors were handled the list was not updated.
- This patch makes sure to update the list correctly.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: Ibeb3b4e4927f22a8206784a478f1ac38bd7f5a87
|
|
- The fast storage allocator only looked at tensor size, giving priority
to larger tensors. The problem with this method is that it does not
consider the actual read/write access of the tensor. So, a smaller
tensor can cause more memory transactions than a bigger one.
- The solution is to calculate the read/write access of the tensor and
add that score to the decision when deciding where to place the tensors.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: I59eb9bd3a44a0238b576cfd8f09ff27012b99070
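To illustrate the idea in the commit above, here is a minimal sketch that ranks feature maps by estimated traffic rather than by size alone. The `FeatureMap` class, its fields and the score formula (size multiplied by the number of accesses) are assumptions made for this example, not the allocator's actual scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureMap:
    """Hypothetical stand-in for a tensor considered for fast storage."""
    name: str
    size_bytes: int
    reads: int
    writes: int


def access_score(fm: FeatureMap) -> int:
    # Assumed score: total bytes moved if the tensor is read and written in full.
    return fm.size_bytes * (fm.reads + fm.writes)


def placement_order(feature_maps: List[FeatureMap]) -> List[FeatureMap]:
    """Return the feature maps with the highest estimated traffic first."""
    return sorted(feature_maps, key=access_score, reverse=True)


candidates = [
    FeatureMap("big_cold", size_bytes=4096, reads=1, writes=1),
    FeatureMap("small_hot", size_bytes=1024, reads=15, writes=1),
]
ordered_names = [fm.name for fm in placement_order(candidates)]
```

Here the smaller tensor is ranked first because it is accessed far more often, which a size-only ordering would miss.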
|
|
- For allocations that have a hard memory limit the Hill Climb allocator
should be given more attempts to find a solution that would fit.
- The fix is to use a memory limit when there is a hard constraint, and
a minimum iteration count, reset on every improvement, when there is a soft
constraint.
- Added a maximum number of iterations CLI option.
Signed-off-by: Tim Hall <tim.hall@arm.com>
Change-Id: I19ff53a0b68412de280263626778a3102cbe52fa
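The stopping rules described in the commit above can be sketched as follows. This is a simplified model under stated assumptions: the search is reduced to a `propose` callback that returns a candidate memory size, and all names (`hill_climb`, `memory_limit`, `min_iterations`, `max_iterations`) are hypothetical rather than Vela's own.

```python
from typing import Callable, Optional


def hill_climb(initial_size: int,
               propose: Callable[[int], int],
               memory_limit: Optional[int] = None,
               min_iterations: int = 100,
               max_iterations: int = 1000) -> int:
    """Return the smallest allocation size found.

    Hard constraint (memory_limit given): keep trying until the size fits,
    bounded only by max_iterations.
    Soft constraint (no limit): stop after min_iterations attempts without
    improvement; the counter is reset on every improvement.
    """
    best = initial_size
    since_improvement = 0
    for _ in range(max_iterations):
        if memory_limit is not None:
            if best <= memory_limit:
                break
        elif since_improvement >= min_iterations:
            break
        candidate = propose(best)
        if candidate < best:
            best = candidate
            since_improvement = 0
        else:
            since_improvement += 1
    return best


# Deterministic usage example with scripted candidates.
hard_candidates = iter([100, 90, 95, 80, 70, 60])
hard_result = hill_climb(120, lambda _: next(hard_candidates),
                         memory_limit=80, max_iterations=10)

soft_candidates = iter([100, 110, 105, 50])
soft_result = hill_climb(120, lambda _: next(soft_candidates),
                         min_iterations=2, max_iterations=10)
```

In the soft case the search gives up after two attempts without improvement and never sees the candidate of 50, while the hard case keeps going until the limit of 80 is met.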
|
|
- Problem is due to a divide by zero
- Fix is simply to detect and assign zero. This could also affect
improvement_sram
Signed-off-by: Tim Hall <tim.hall@arm.com>
Change-Id: I29a67710a17ef22656fb5ecfe9476953ffa5533d
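A minimal sketch of the guard described in the commit above, assuming the improvement is computed as a ratio against a baseline. The function name `relative_improvement` and its parameters are hypothetical and not taken from Vela.

```python
def relative_improvement(baseline: float, new_value: float) -> float:
    """Fractional improvement of new_value over baseline.

    If the baseline is zero there is nothing to improve, so the division
    is skipped and zero is returned instead of raising ZeroDivisionError.
    """
    if baseline == 0:
        return 0.0
    return (baseline - new_value) / baseline
```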
|
|
- Added support to print per operator sram usage and performance
information
- Added new CLI option --verbose-performance to control this feature
Signed-off-by: Tim Hall <tim.hall@arm.com>
Change-Id: I368599b410e5d441d9804871fc51b7a1049d85b3
|
|
Allow the schedule to be used when the calculations show zero total improvement
but also show a DRAM improvement.
When testing on a real target, total performance is improved.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: Ib4f2a37710dc7954b72b48c38fce4817ccd7187b
|
|
Uses separate tensors for the individual weight buffers
in case of weight double buffering.
Each weight buffer tensor gets its own individual live range.
This patch is a clone of a previously reverted patch, but with some
additional bug fixes applied.
Signed-off-by: Rickard Bolin <rickard.bolin@arm.com>
Change-Id: I868c70d15821eb9f1399186f2da6e7345f6ee343
|
|
- Because bigger weight buffer sizes are being used, there are use cases
where feature maps are evicted from SRAM, causing the total performance to drop.
- A way to improve this is to limit the memory for those weight buffer ops,
to get the feature maps back into SRAM, and see if total performance is improved.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: Ibfaff330677185186af9f6362dfbe04824a329f6
|
|
This reverts commit cc5f4de1c35ba44fca7ff6295c6ae846f8242344.
Signed-off-by: Tim Hall <tim.hall@arm.com>
Change-Id: I0fa5babfe9ad9ec668720d04fe1c16d9a9092131
|
|
Corrected calculation for used buffering depth. Before this change there
were scenarios when it was set to smaller sizes than needed.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: I162859ade78487e848510c6a605685e4568c7068
|
|
Update version of Black to 22.3.0 due to updated dependencies.
Updates to fix reported issues due to new version.
Signed-off-by: Jonas Ohlsson <jonas.ohlsson@arm.com>
Change-Id: I60056aae452093ce8dcea1f499ecced22b25eef1
|
|
Uses separate tensors for the individual weight buffers
in case of weight double buffering.
Each weight buffer tensor gets its own individual live range.
Change-Id: I724a8c61a7045615fbd2ed9535663076ac8edd13
Signed-off-by: Louis Verhaard <louis.verhaard@arm.com>
|
|
- Fixed a bug due to ResizeBilinear modifying the attributes of a
shared IFM
- The ifm_resampling_mode is now an attribute of an operator rather
than a tensor
- Changed all calls to try_block_config() to use the attribute rather
than recalculating it in multiple places
Signed-off-by: Tim Hall <tim.hall@arm.com>
Change-Id: I4641e9cd6b049bd4186776d98e3e751c5e5bcc06
|
|
Add mypy to pre-commit and clean up all reported errors.
Signed-off-by: Jonas Ohlsson <jonas.ohlsson@arm.com>
Change-Id: If7dc869f5fecdb0e2db40f14e7d9db21aa33df71
|
|
Fast storage allocator did not always return an optimal
allocation.
Signed-off-by: Louis Verhaard <louis.verhaard@arm.com>
Change-Id: Ic758b6c4a82dc2633c4752b0c204a27ed36f651b
|
|
Update the version of flake8 used in pre-commit to facilitate
adding mypy to pre-commit.
Signed-off-by: Jonas Ohlsson <jonas.ohlsson@arm.com>
Change-Id: I457dec87b77487ca6f14ff4a679c4cc927b272b0
|
|
* Original weights and encoded NPU weights now report the correct size instead
of zero when running vela with the --verbose-weights flag
(code to update the aforementioned attributes was missing)
* Removed print references to unencoded NPU weight size
Change-Id: I6d3e41c04cc46d24eeb54cab89818a35e5df27be
Signed-off-by: Ayaan Masood <Ayaan.Masood@arm.com>
|
|
Reduce memory footprint when using optimization strategy Size
for elementwise operations.
Signed-off-by: Johan Alfven <johan.alfven@arm.com>
Change-Id: I30380aed587c31adbf7615f74179b4c5da686773
|
|
Ported the improved spilling behaviour from Regor
into Vela. This replaces use_fast_storage_for_feature_maps
with allocate_feature_maps and introduces the class called
FastStorageComponentAllocator.
Signed-off-by: erik.andersson@arm.com <erik.andersson@arm.com>
Change-Id: I34785840c905a79750a62863773015b00fb43387
|
|
Added checks to avoid merging elementwise op live ranges for subgraph
inputs and outputs, which sometimes caused problems when parts of the
network run on CPU.
Signed-off-by: Fredrik Svedberg <fredrik.svedberg@arm.com>
Change-Id: Id07ab277a205b8550d19a276559f8904b9a4b4be
|
|
Resolves a bug where an IndexError would occur
if the same tensor was assigned to both IFM
and IFM2 of a binary elementwise operator
due to duplicates being allowed in operator
inputs but not in pass inputs.
Signed-off-by: Dwight Lidman <dwight.lidman@arm.com>
Change-Id: I39a6206a6252f6a848be9f9d4c5a8dc749c71699
|
|
Fixed output diff for some architectures due to incorrect IFM buffer size
calculation when using NearestNeighbour upscaling.
Signed-off-by: Fredrik Svedberg <fredrik.svedberg@arm.com>
Change-Id: I0d6d1efc606603cdd6188ae282e7f6babfd7e24e
|
|
Additional check added for when constant data can be moved
to fast storage.
Do not move constant data for concat.
Signed-off-by: Patrik Gustavsson <patrik.gustavsson@arm.com>
Change-Id: Ib8b5fd1483ee9fabe48e9874a5723af9b7c5231a
|
|
This commit fixes one assert regarding rolling buffers for 3D tensors.
It also addresses another issue where incorrect weight buffering was
proposed for cascaded operators.
Signed-off-by: Jacob Bohlin <jacob.bohlin@arm.com>
Change-Id: I2501f35e5668b3085d917751cfc8002d250973d8
|
|
- Fixed index error in memory_snapshot
- When removing a cascade, its references are also removed
Change-Id: I2b35dc52671d8ce115eb32bfdd93584391d1fc6d
Signed-off-by: Louis Verhaard <louis.verhaard@arm.com>
|