path: root/docs/sections/memory_considerations.md
Diffstat (limited to 'docs/sections/memory_considerations.md')
-rw-r--r--  docs/sections/memory_considerations.md  151
1 file changed, 71 insertions(+), 80 deletions(-)
diff --git a/docs/sections/memory_considerations.md b/docs/sections/memory_considerations.md
index fc81f8f..89baf41 100644
--- a/docs/sections/memory_considerations.md
+++ b/docs/sections/memory_considerations.md
@@ -7,7 +7,7 @@
- [Understanding memory usage from Vela output](#understanding-memory-usage-from-vela-output)
- [Total SRAM used](#total-sram-used)
- [Total Off-chip Flash used](#total-off_chip-flash-used)
- - [Non-default configurations](#non-default-configurations)
+ - [Memory mode configurations](#memory-mode-configurations)
- [Tensor arena and neural network model memory placement](#tensor-arena-and-neural-network-model-memory-placement)
- [Memory usage for ML use-cases](#memory-usage-for-ml-use_cases)
- [Memory constraints](#memory-constraints)
@@ -94,52 +94,88 @@ buffers is. These are:
### Total SRAM used
-When the neural network model is compiled with Vela, a summary report that includes memory usage is generated. For
-example, compiling the keyword spotting model
+When the neural network model is compiled with Vela, a summary report that includes memory usage is generated.
+For example, compiling the keyword spotting model
[ds_cnn_clustered_int8](https://github.com/ARM-software/ML-zoo/blob/master/models/keyword_spotting/ds_cnn_large/tflite_clustered_int8/ds_cnn_clustered_int8.tflite)
-with Vela produces, among others, the following output:
+with the Vela command:
+
+```commandline
+vela \
+ --accelerator-config=ethos-u55-128 \
+ --optimise Performance \
+ --config scripts/vela/default_vela.ini \
+ --memory-mode=Shared_Sram \
+ --system-config=Ethos_U55_High_End_Embedded \
+ ds_cnn_clustered_int8.tflite
+```
+
+It produces, among others, the following output:
```log
-Total SRAM used 70.77 KiB
-Total Off-chip Flash used 430.78 KiB
+Total SRAM used 146.31 KiB
+Total Off-chip Flash used 452.42 KiB
```
The `Total SRAM used` here shows the required memory to store the `tensor arena` for the TensorFlow Lite Micro
framework. This is the amount of memory required to store the input, output, and intermediate buffers. In the preceding
-example, the tensor arena requires 70.77 KiB of available SRAM.
+example, the tensor arena requires 146.31 KiB of available SRAM.
> **Note:** Vela can only estimate the SRAM required for graph execution. It has no way of estimating the memory used by
> internal structures from TensorFlow Lite Micro framework.
-Therefore, we recommend that you top this memory size by at least 2KiB. We also recoomend that you also carve out the
+Therefore, we recommend that you increase this memory size by at least 2 KiB. We also recommend that you carve out a
`tensor arena` of this size, and then place it in the SRAM of the target system.
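A minimal sketch of this carve-out, assuming a GCC-style toolchain; the section name matches the `.bss.NoInit.activation_buf_sram` selector used by the scatter file shown later in this document, and the size constants are the Vela estimate from the preceding example rounded up, plus the recommended headroom:

```c
#include <stddef.h>
#include <stdint.h>

/* Vela-reported "Total SRAM used" for ds_cnn_clustered_int8 (Shared_Sram):
 * 146.31 KiB, rounded up here to 147 KiB. */
#define VELA_SRAM_ESTIMATE (147u * 1024u)

/* Headroom for TensorFlow Lite Micro's internal structures (>= 2 KiB). */
#define TFLM_HEADROOM (2u * 1024u)

#define ACTIVATION_BUF_SZ (VELA_SRAM_ESTIMATE + TFLM_HEADROOM)

/* Carve out the tensor arena; the section name lets the linker place it
 * in the SRAM region of the target system. */
static uint8_t tensor_arena[ACTIVATION_BUF_SZ]
    __attribute__((aligned(16), section(".bss.NoInit.activation_buf_sram")));
```

The 16-byte alignment mirrors the `ALIGN 16` used for the SRAM execution region in the scatter file.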
### Total Off-chip Flash used
The `Total Off-chip Flash` parameter indicates the minimum amount of flash required to store the neural network model.
-In the preceding example, the system must have a minimum of 430.78 KiB of available flash memory to store the `.tflite`
+In the preceding example, the system must have a minimum of 452.42 KiB of available flash memory to store the `.tflite`
file contents.
> **Note:** The Arm® *Corstone™-300* system uses the DDR region as a flash memory. The timing adapter sets up the AXI
> bus that is wired to the DDR to mimic both bandwidth and latency characteristics of a flash memory device.
-## Non-default configurations
+## Memory mode configurations
+
+The preceding example outlines a typical configuration for the *Ethos-U55* NPU, which corresponds to the default
+Vela memory mode setting.
+The evaluation kit supports all of the *Ethos-U* NPU memory modes:
+
+| *Ethos™-U* NPU | Default Memory Mode | Other Memory Modes supported |
+|------------------|------------------------|--------------------------------|
+| *Ethos™-U55* | `Shared_Sram` | `Sram_Only` |
+| *Ethos™-U65* | `Dedicated_Sram` | `Shared_Sram` |
-The preceding example outlines a typical configuration, and this corresponds to the default Vela setting. However, the
-system SRAM can also be used to store the neural network model along with the `tensor arena`. Vela supports optimizing
-the model for this configuration with its `Sram_Only` memory mode.
+For further information on the default settings, please refer to: [default_vela.ini](../../scripts/vela/default_vela.ini).
-For further information, please refer to: [vela.ini](../../scripts/vela/vela.ini).
+For *Ethos-U55* NPU, the system SRAM can also be used to store the neural network model along with the `tensor arena`.
+Vela supports optimizing the model for this configuration with its `Sram_Only` memory mode.
+Although the Vela settings for this configuration suggest that only the AXI0 bus is used, an informational message is
+generated when compiling the model, for example:
+
+```log
+vela \
+ --accelerator-config=ethos-u55-128 \
+ --optimise Performance \
+ --config scripts/vela/default_vela.ini \
+ --memory-mode=Sram_Only \
+ --system-config=Ethos_U55_High_End_Embedded \
+ ds_cnn_clustered_int8.tflite
+
+Info: Changing const_mem_area from Sram to OnChipFlash. This will use the same characteristics as Sram.
+```
-To make use of a neural network model that is optimized for this configuration, the linker script for the target
-platform must be changed. By default, the linker scripts are set up to support the default configuration only.
+This means that the neural network model is always placed in the flash region. In this case, the timing adapters for the
+AXI buses are set to the same values, to mimic both the bandwidth and latency characteristics of an SRAM memory device.
+See [Ethos-U55 NPU timing adapter default configuration](../../scripts/cmake/timing_adapter/ta_config_u55_high_end.cmake).
For script snippets, please refer to: [Memory constraints](./memory_considerations.md#memory-constraints).
> **Note:**
>
-> 1. The the `Shared_Sram` memory mode represents the default configuration.
-> 2. The `Dedicated_Sram` mode is only applicable for the Arm® *Ethos™-U65*.
+> 1. The `Shared_Sram` memory mode represents the default configuration.
+> 2. The `Dedicated_Sram` memory mode is only applicable for the Arm® *Ethos™-U65*.
+> 3. The `Sram_Only` memory mode is only applicable for the Arm® *Ethos™-U55*.
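The mode-dependent placement described above can be sketched at the source level. This is illustrative only: the macro names here are hypothetical (the evaluation kit actually selects the memory mode through its CMake configuration), but the two section names match the `.bss.NoInit.activation_buf_sram` and `activation_buf_dram` selectors in the scatter file snippets later in this document:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical build flag standing in for the CMake-level memory mode choice. */
#define MEMORY_MODE_SHARED_SRAM 1

#if defined(MEMORY_MODE_SRAM_ONLY) || defined(MEMORY_MODE_SHARED_SRAM)
/* Ethos-U55 modes: the activation buffer lives in SRAM. */
#define ARENA_SECTION ".bss.NoInit.activation_buf_sram"
#else
/* Ethos-U65 Dedicated_Sram mode: the activation buffer lives in DDR. */
#define ARENA_SECTION "activation_buf_dram"
#endif

/* Small probe object, placed in whichever section the mode selects. */
static uint8_t arena_probe[16]
    __attribute__((aligned(16), section(ARENA_SECTION)));
```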
## Tensor arena and neural network model memory placement
@@ -147,18 +183,15 @@ The evaluation kit uses the name `activation buffer` for the `tensor arena` in t
Every use-case application has a corresponding `<use_case_name>_ACTIVATION_BUF_SZ` parameter that governs the maximum
available size of the `activation buffer` for that particular use-case.
-The linker script is set up to place this memory region in SRAM. However, if the memory required is more than what the
-target platform supports, this buffer needs to be placed on flash instead. Every target platform has a profile
-definition in the form of a `CMake` file.
+The linker script is set up to place this memory region in SRAM for *Ethos-U55* and in flash for *Ethos-U65*.
+Every target platform has a profile definition in the form of a `CMake` file.
For further information and an example, please refer to: [Corstone-300 profile](../../scripts/cmake/subsystem-profiles/corstone-sse-300.cmake).
The parameter `ACTIVATION_BUF_SRAM_SZ` defines the maximum SRAM size available for the platform. This is propagated
-through the build system. If the `<use_case_name>_ACTIVATION_BUF_SZ` for a given use-case is *more* than the
-`ACTIVATION_BUF_SRAM_SZ` for the target build platform, then the `activation buffer` is placed on the flash memory
-instead.
+through the build system.
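One way such a propagated budget can be enforced is a compile-time check. This is a sketch under assumed values: the macro names mirror the build parameters described above, but both numbers here are illustrative, not the real platform figures:

```c
/* Illustrative values: in the real build, both macros are propagated
 * through the CMake build system. */
#define ACTIVATION_BUF_SRAM_SZ (0x00200000u) /* platform SRAM budget (2 MiB) */
#define KWS_ACTIVATION_BUF_SZ  (0x00100000u) /* hypothetical use-case need   */

/* Fail the build early if the use-case arena cannot fit the SRAM budget. */
_Static_assert(KWS_ACTIVATION_BUF_SZ <= ACTIVATION_BUF_SRAM_SZ,
               "activation buffer exceeds the platform SRAM budget");
```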
-The neural network model is always placed in the flash region. However, this can be changed in the linker script.
+The neural network model is always placed in the flash region (even in the case of the `Sram_Only` memory mode, as mentioned earlier).
## Memory usage for ML use-cases
@@ -168,12 +201,12 @@ memory requirements for the different use-cases of the evaluation kit.
> **Note:** The SRAM usage does not include memory used by TensorFlow Lite Micro and must be topped up as explained
> under [Total SRAM used](#total-sram-used).
-- [Keyword spotting model](https://github.com/ARM-software/ML-zoo/tree/master/models/keyword_spotting/ds_cnn_large/tflite_clustered_int8)
+- [Keyword spotting model](https://github.com/ARM-software/ML-zoo/tree/68b5fbc77ed28e67b2efc915997ea4477c1d9d5b/models/keyword_spotting/ds_cnn_large/tflite_clustered_int8)
requires
- 70.7 KiB of SRAM
- 430.7 KiB of flash memory.
-- [Image classification model](https://github.com/ARM-software/ML-zoo/tree/master/models/image_classification/mobilenet_v2_1.0_224/tflite_uint8)
+- [Image classification model](https://github.com/ARM-software/ML-zoo/tree/e0aa361b03c738047b9147d1a50e3f2dcb13dbcb/models/image_classification/mobilenet_v2_1.0_224/tflite_uint8)
requires
- 638.6 KiB of SRAM
- 3.1 MB of flash memory.
@@ -199,38 +232,8 @@ scatter file is as follows:
;---------------------------------------------------------
LOAD_REGION_0 0x00000000 0x00080000
{
- ;-----------------------------------------------------
- ; First part of code mem - 512kiB
- ;-----------------------------------------------------
- itcm.bin 0x00000000 0x00080000
- {
- *.o (RESET, +First)
- * (InRoot$$Sections)
-
- ; Essentially only RO-CODE, RO-DATA is in a
- ; different region.
- .ANY (+RO)
- }
-
- ;-----------------------------------------------------
- ; 128kiB of 512kiB DTCM is used for any other RW or ZI
- ; data. Note: this region is internal to the Cortex-M
- ; CPU.
- ;-----------------------------------------------------
- dtcm.bin 0x20000000 0x00020000
- {
- ; Any R/W and/or zero initialised data
- .ANY(+RW +ZI)
- }
- ;-----------------------------------------------------
- ; 384kiB of stack space within the DTCM region. See
- ; `dtcm.bin` for the first section. Note: by virtue of
- ; being part of DTCM, this region is only accessible
- ; from Cortex-M55.
- ;-----------------------------------------------------
- ARM_LIB_STACK 0x20020000 EMPTY ALIGN 8 0x00060000
- {}
+...
;-----------------------------------------------------
; SSE-300's internal SRAM of 4MiB - reserved for
@@ -240,8 +243,11 @@ LOAD_REGION_0 0x00000000 0x00080000
;-----------------------------------------------------
isram.bin 0x31000000 UNINIT ALIGN 16 0x00400000
{
- ; activation buffers a.k.a tensor arena
- *.o (.bss.NoInit.activation_buf)
+ ; Cache area (if used)
+ *.o (.bss.NoInit.ethos_u_cache)
+
+ ; activation buffers, a.k.a. tensor arena, when the memory mode is Sram_Only
+ *.o (.bss.NoInit.activation_buf_sram)
}
}
@@ -251,7 +257,7 @@ LOAD_REGION_0 0x00000000 0x00080000
LOAD_REGION_1 0x70000000 0x02000000
{
;-----------------------------------------------------
- ; 32 MiB of DRAM space for neural network model,
+ ; 32 MiB of DDR space for neural network model,
; input vectors and labels. If the activation buffer
; size required by the network is bigger than the
; SRAM size available, it is accommodated here.
@@ -261,33 +267,18 @@ LOAD_REGION_1 0x70000000 0x02000000
; nn model's baked in input matrices
*.o (ifm)
- ; nn model
+ ; nn model's default space
*.o (nn_model)
; labels
*.o (labels)
- ; if the activation buffer (tensor arena) doesn't
- ; fit in the SRAM region, we accommodate it here
- *.o (activation_buf)
+ ; activation buffers, a.k.a. tensor arena, when the memory mode is Dedicated_Sram
+ *.o (activation_buf_dram)
}
- ;-----------------------------------------------------
- ; First 256kiB of BRAM (FPGA SRAM) used for RO data.
- ; Note: Total BRAM size available is 2MiB.
- ;-----------------------------------------------------
- bram.bin 0x11000000 ALIGN 8 0x00040000
- {
- ; RO data (incl. unwinding tables for debugging)
- .ANY (+RO-DATA)
- }
+...
- ;-----------------------------------------------------
- ; Remaining part of the 2MiB BRAM used as heap space.
- ; 0x00200000 - 0x00040000 = 0x001C0000 (1.75 MiB)
- ;-----------------------------------------------------
- ARM_LIB_HEAP 0x11040000 EMPTY ALIGN 8 0x001C0000
- {}
}
```