Diffstat (limited to 'docs/sections/memory_considerations.md')
-rw-r--r--  docs/sections/memory_considerations.md  151
1 file changed, 71 insertions(+), 80 deletions(-)
diff --git a/docs/sections/memory_considerations.md b/docs/sections/memory_considerations.md
index fc81f8f..89baf41 100644
--- a/docs/sections/memory_considerations.md
+++ b/docs/sections/memory_considerations.md
@@ -7,7 +7,7 @@
   - [Understanding memory usage from Vela output](#understanding-memory-usage-from-vela-output)
     - [Total SRAM used](#total-sram-used)
     - [Total Off-chip Flash used](#total-off_chip-flash-used)
-  - [Non-default configurations](#non-default-configurations)
+  - [Memory mode configurations](#memory-mode-configurations)
   - [Tensor arena and neural network model memory placement](#tensor-arena-and-neural-network-model-memory-placement)
   - [Memory usage for ML use-cases](#memory-usage-for-ml-use_cases)
   - [Memory constraints](#memory-constraints)
@@ -94,52 +94,88 @@ buffers is. These are:
 
 ### Total SRAM used
 
-When the neural network model is compiled with Vela, a summary report that includes memory usage is generated. For
-example, compiling the keyword spotting model
+When the neural network model is compiled with Vela, a summary report that includes memory usage is generated.
+For example, compiling the keyword spotting model
 [ds_cnn_clustered_int8](https://github.com/ARM-software/ML-zoo/blob/master/models/keyword_spotting/ds_cnn_large/tflite_clustered_int8/ds_cnn_clustered_int8.tflite)
-with Vela produces, among others, the following output:
+with the Vela command:
+
+```commandline
+vela \
+    --accelerator-config=ethos-u55-128 \
+    --optimise Performance \
+    --config scripts/vela/default_vela.ini \
+    --memory-mode=Shared_Sram \
+    --system-config=Ethos_U55_High_End_Embedded \
+    ds_cnn_clustered_int8.tflite
+```
+
+produces, among others, the following output:
 
 ```log
-Total SRAM used                                 70.77 KiB
-Total Off-chip Flash used                      430.78 KiB
+Total SRAM used                                146.31 KiB
+Total Off-chip Flash used                      452.42 KiB
 ```
 
 The `Total SRAM used` here shows the required memory to store the `tensor arena` for the TensorFlow Lite Micro
 framework. This is the amount of memory required to store the input, output, and intermediate buffers. In the preceding
-example, the tensor arena requires 70.77 KiB of available SRAM.
+example, the tensor arena requires 146.31 KiB of available SRAM.
 
 > **Note:** Vela can only estimate the SRAM required for graph execution. It has no way of estimating the memory used by
 > internal structures from TensorFlow Lite Micro framework.
 
-Therefore, we recommend that you top this memory size by at least 2KiB. We also recoomend that you also carve out the
+Therefore, we recommend that you top up this memory size by at least 2 KiB. We also recommend that you carve out the
 `tensor arena` of this size, and then place it on the SRAM of the target system.
 
 ### Total Off-chip Flash used
 
 The `Total Off-chip Flash` parameter indicates the minimum amount of flash required to store the neural network model.
-In the preceding example, the system must have a minimum of 430.78 KiB of available flash memory to store the `.tflite`
+In the preceding example, the system must have a minimum of 452.42 KiB of available flash memory to store the `.tflite`
 file contents.
 
 > **Note:** The Arm® *Corstone™-300* system uses the DDR region as a flash memory. The timing adapter sets up the AXI
 > bus that is wired to the DDR to mimic both bandwidth and latency characteristics of a flash memory device.
 
-## Non-default configurations
+## Memory mode configurations
+
+The preceding example outlines a typical configuration for the *Ethos-U55* NPU, and this corresponds to the default
+Vela memory mode setting.
+
+The evaluation kit supports all the *Ethos-U* NPU memory modes:
+
+| *Ethos™-U* NPU | Default memory mode | Other memory modes supported |
+|----------------|---------------------|------------------------------|
+| *Ethos™-U55*   | `Shared_Sram`       | `Sram_Only`                  |
+| *Ethos™-U65*   | `Dedicated_Sram`    | `Shared_Sram`                |
 
-The preceding example outlines a typical configuration, and this corresponds to the default Vela setting. However, the
-system SRAM can also be used to store the neural network model along with the `tensor arena`. Vela supports optimizing
-the model for this configuration with its `Sram_Only` memory mode.
+For further information on the default settings, please refer to:
+[default_vela.ini](../../scripts/vela/default_vela.ini).
 
-For further information, please refer to: [vela.ini](../../scripts/vela/vela.ini).
+For the *Ethos-U55* NPU, the system SRAM can also be used to store the neural network model along with the
+`tensor arena`. Vela supports optimizing the model for this configuration with its `Sram_Only` memory mode.
+Although the Vela settings for this configuration suggest that only the AXI0 bus is used, a warning is generated
+when the model is compiled, for example:
+
+```log
+vela \
+    --accelerator-config=ethos-u55-128 \
+    --optimise Performance \
+    --config scripts/vela/default_vela.ini \
+    --memory-mode=Sram_Only \
+    --system-config=Ethos_U55_High_End_Embedded \
+    ds_cnn_clustered_int8.tflite
+
+Info: Changing const_mem_area from Sram to OnChipFlash. This will use the same characteristics as Sram.
+```
 
-To make use of a neural network model that is optimized for this configuration, the linker script for the target
-platform must be changed. By default, the linker scripts are set up to support the default configuration only.
+This means that the neural network model is always placed in the flash region. In this case, the timing adapters
+for the AXI buses are set to the same values to mimic both the bandwidth and latency characteristics of an SRAM
+memory device.
+See [Ethos-U55 NPU timing adapter default configuration](../../scripts/cmake/timing_adapter/ta_config_u55_high_end.cmake).
 
 For script snippets, please refer to: [Memory constraints](./memory_considerations.md#memory-constraints).
 
 > **Note:**
 >
-> 1. The the `Shared_Sram` memory mode represents the default configuration.
-> 2. The `Dedicated_Sram` mode is only applicable for the Arm® *Ethos™-U65*.
+> 1. The `Shared_Sram` memory mode represents the default configuration.
+> 2. The `Dedicated_Sram` memory mode is only applicable for the Arm® *Ethos™-U65*.
+> 3. The `Sram_Only` memory mode is only applicable for the Arm® *Ethos™-U55*.
 
 ## Tensor arena and neural network model memory placement
 
@@ -147,18 +183,15 @@ The evaluation kit uses the name `activation buffer` for the `tensor arena` in the TensorFlow Lite Micro framework.
 Every use-case application has a corresponding `<use_case_name>_ACTIVATION_BUF_SZ` parameter that governs the maximum
 available size of the `activation buffer` for that particular use-case.
 
-The linker script is set up to place this memory region in SRAM. However, if the memory required is more than what the
-target platform supports, this buffer needs to be placed on flash instead. Every target platform has a profile
-definition in the form of a `CMake` file.
+The linker script is set up to place this memory region in SRAM for the *Ethos-U55* and in flash for the *Ethos-U65*.
+Every target platform has a profile definition in the form of a `CMake` file.
 
 For further information and an example, please refer to:
 [Corstone-300 profile](../../scripts/cmake/subsystem-profiles/corstone-sse-300.cmake).
 
 The parameter `ACTIVATION_BUF_SRAM_SZ` defines the maximum SRAM size available for the platform. This is propagated
-through the build system. If the `<use_case_name>_ACTIVATION_BUF_SZ` for a given use-case is *more* than the
-`ACTIVATION_BUF_SRAM_SZ` for the target build platform, then the `activation buffer` is placed on the flash memory
-instead.
+through the build system.
 
-The neural network model is always placed in the flash region. However, this can be changed in the linker script.
+The neural network model is always placed in the flash region (even in the case of the `Sram_Only` memory mode, as
+mentioned earlier).
 
 ## Memory usage for ML use-cases
 
@@ -168,12 +201,12 @@ memory requirements for the different use-cases of the evaluation kit.
 
 > **Note:** The SRAM usage does not include memory used by TensorFlow Lite Micro and must be topped up as explained
 > under [Total SRAM used](#total-sram-used).
 
-- [Keyword spotting model](https://github.com/ARM-software/ML-zoo/tree/master/models/keyword_spotting/ds_cnn_large/tflite_clustered_int8)
+- [Keyword spotting model](https://github.com/ARM-software/ML-zoo/tree/68b5fbc77ed28e67b2efc915997ea4477c1d9d5b/models/keyword_spotting/ds_cnn_large/tflite_clustered_int8)
   requires
   - 70.7 KiB of SRAM
   - 430.7 KiB of flash memory.
 
-- [Image classification model](https://github.com/ARM-software/ML-zoo/tree/master/models/image_classification/mobilenet_v2_1.0_224/tflite_uint8)
+- [Image classification model](https://github.com/ARM-software/ML-zoo/tree/e0aa361b03c738047b9147d1a50e3f2dcb13dbcb/models/image_classification/mobilenet_v2_1.0_224/tflite_uint8)
   requires
   - 638.6 KiB of SRAM
   - 3.1 MB of flash memory.
@@ -199,38 +232,8 @@ scatter file is as follows:
 ;---------------------------------------------------------
 LOAD_REGION_0       0x00000000                  0x00080000
 {
-    ;-----------------------------------------------------
-    ; First part of code mem - 512kiB
-    ;-----------------------------------------------------
-    itcm.bin        0x00000000                  0x00080000
-    {
-        *.o (RESET, +First)
-        * (InRoot$$Sections)
-
-        ; Essentially only RO-CODE, RO-DATA is in a
-        ; different region.
-        .ANY (+RO)
-    }
-
-    ;-----------------------------------------------------
-    ; 128kiB of 512kiB DTCM is used for any other RW or ZI
-    ; data. Note: this region is internal to the Cortex-M
-    ; CPU.
-    ;-----------------------------------------------------
-    dtcm.bin        0x20000000                  0x00020000
-    {
-        ; Any R/W and/or zero initialised data
-        .ANY(+RW +ZI)
-    }
-
-    ;-----------------------------------------------------
-    ; 384kiB of stack space within the DTCM region. See
-    ; `dtcm.bin` for the first section. Note: by virtue of
-    ; being part of DTCM, this region is only accessible
-    ; from Cortex-M55.
-    ;-----------------------------------------------------
-    ARM_LIB_STACK   0x20020000 EMPTY ALIGN 8    0x00060000
-    {}
+...
 
     ;-----------------------------------------------------
     ; SSE-300's internal SRAM of 4MiB - reserved for
@@ -240,8 +243,11 @@ LOAD_REGION_0       0x00000000                  0x00080000
     ;-----------------------------------------------------
     isram.bin       0x31000000 UNINIT ALIGN 16  0x00400000
     {
-        ; activation buffers a.k.a tensor arena
-        *.o (.bss.NoInit.activation_buf)
+        ; Cache area (if used)
+        *.o (.bss.NoInit.ethos_u_cache)
+
+        ; activation buffers a.k.a tensor arena when memory mode sram only
+        *.o (.bss.NoInit.activation_buf_sram)
     }
 }
 
@@ -251,7 +257,7 @@ LOAD_REGION_0       0x00000000                  0x00080000
 LOAD_REGION_1       0x70000000                  0x02000000
 {
     ;-----------------------------------------------------
-    ; 32 MiB of DRAM space for neural network model,
+    ; 32 MiB of DDR space for neural network model,
     ; input vectors and labels. If the activation buffer
     ; size required by the network is bigger than the
     ; SRAM size available, it is accommodated here.
@@ -261,33 +267,18 @@ LOAD_REGION_1       0x70000000                  0x02000000
         ; nn model's baked in input matrices
         *.o (ifm)
 
-        ; nn model
+        ; nn model's default space
        *.o (nn_model)
 
         ; labels
         *.o (labels)
 
-        ; if the activation buffer (tensor arena) doesn't
-        ; fit in the SRAM region, we accommodate it here
-        *.o (activation_buf)
+        ; activation buffers a.k.a tensor arena when memory mode dedicated sram
+        *.o (activation_buf_dram)
     }
 
-    ;-----------------------------------------------------
-    ; First 256kiB of BRAM (FPGA SRAM) used for RO data.
-    ; Note: Total BRAM size available is 2MiB.
-    ;-----------------------------------------------------
-    bram.bin        0x11000000 ALIGN 8          0x00040000
-    {
-        ; RO data (incl. unwinding tables for debugging)
-        .ANY (+RO-DATA)
-    }
+...
 
-    ;-----------------------------------------------------
-    ; Remaining part of the 2MiB BRAM used as heap space.
-    ; 0x00200000 - 0x00040000 = 0x001C0000 (1.75 MiB)
-    ;-----------------------------------------------------
-    ARM_LIB_HEAP    0x11040000 EMPTY ALIGN 8    0x001C0000
-    {}
 }
 ```