From 6448932cc1c612d78e62c778ebb228b3cbe96a58 Mon Sep 17 00:00:00 2001
From: Kshitij Sisodia <kshitij.sisodia@arm.com>
Date: Tue, 27 Apr 2021 18:39:47 +0100
Subject: MLECO-1891: Document memory considerations

Change-Id: I8d775cbd5372d1142244daf6877355a3d5417a9e
Signed-off-by: George Gekov <george.gekov@arm.com>
---
 docs/documentation.md                  |  14 +-
 docs/sections/building.md              | 102 +-------------
 docs/sections/memory_considerations.md | 238 +++++++++++++++++++++++++++++++++
 3 files changed, 251 insertions(+), 103 deletions(-)
 create mode 100644 docs/sections/memory_considerations.md

diff --git a/docs/documentation.md b/docs/documentation.md
index f39e681..566feab 100644
--- a/docs/documentation.md
+++ b/docs/documentation.md
@@ -1,9 +1,9 @@
 # Arm® ML embedded evaluation kit
 
-## Table of Content
+## Table of Contents
 
 - [Arm® ML embedded evaluation kit](./documentation.md#arm-ml-embedded-evaluation-kit)
-  - [Table of Content](./documentation.md#table-of-content)
+  - [Table of Contents](./documentation.md#table-of-content)
   - [Trademarks](./documentation.md#trademarks)
   - [Prerequisites](./documentation.md#prerequisites)
     - [Additional reading](./documentation.md#additional-reading)
@@ -14,13 +14,14 @@
   - [Running code samples applications](./documentation.md#running-code-samples-applications)
   - [Implementing custom ML application](./documentation.md#implementing-custom-ml-application)
   - [Testing and benchmarking](./documentation.md#testing-and-benchmarking)
+  - [Memory considerations](./documentation.md#memory-considerations)
   - [Troubleshooting](./documentation.md#troubleshooting)
   - [Contribution guidelines](./documentation.md#contribution-guidelines)
     - [Coding standards and guidelines](./documentation.md#coding-standards-and-guidelines)
     - [Code Reviews](./documentation.md#code-reviews)
     - [Testing](./documentation.md#testing)
-  - [Communication](./documentation.md#communication)  
-  - [Licenses](./documentation.md#licenses)  
+  - [Communication](./documentation.md#communication)
+  - [Licenses](./documentation.md#licenses)
   - [Appendix](./documentation.md#appendix)
 
 ## Trademarks
@@ -238,7 +239,6 @@ See:
   - [Add custom inputs](./sections/building.md#add-custom-inputs)
   - [Add custom model](./sections/building.md#add-custom-model)
   - [Optimize custom model with Vela compiler](./sections/building.md#optimize-custom-model-with-vela-compiler)
-  - [Memory constraints](./sections/building.md#memory-constraints)
   - [Automatic file generation](./sections/building.md#automatic-file-generation)
 
 ## Deployment
@@ -290,6 +290,10 @@ See:
 
 See [Testing and benchmarking](./sections/testing_benchmarking.md).
 
+## Memory Considerations
+
+See [Memory considerations](./sections/memory_considerations.md)
+
 ## Troubleshooting
 
 See:
diff --git a/docs/sections/building.md b/docs/sections/building.md
index 6241286..c53b1f5 100644
--- a/docs/sections/building.md
+++ b/docs/sections/building.md
@@ -2,23 +2,22 @@
 
 ## Contents
 
-- [Building the Code Samples application from sources](#building-the-ml-embedded-code-sample-applications-from-sources)
+- [Building the ML embedded code sample applications from sources](#building-the-ml-embedded-code-sample-applications-from-sources)
   - [Contents](#contents)
   - [Build prerequisites](#build-prerequisites)
   - [Build options](#build-options)
   - [Build process](#build-process)
     - [Preparing build environment](#preparing-build-environment)
     - [Create a build directory](#create-a-build-directory)
-    - [Configuring the build for `MPS3: SSE-300`](#configuring-the-build-for-mps3-sse-300)
-    - [Configuring the build for `MPS3: SSE-200`](#configuring-the-build-for-mps3-sse-200)
+    - [Configuring the build for MPS3: SSE-300](#configuring-the-build-for-mps3-sse-300)
+    - [Configuring the build for MPS3: SSE-200](#configuring-the-build-for-mps3-sse-200)
     - [Configuring native unit-test build](#configuring-native-unit-test-build)
-    - [Configuring the build for `simple_platform`](#configuring-the-build-for-simple_platform)
+    - [Configuring the build for simple_platform](#configuring-the-build-for-simple_platform)
     - [Building the configured project](#building-the-configured-project)
   - [Building timing adapter with custom options](#building-timing-adapter-with-custom-options)
   - [Add custom inputs](#add-custom-inputs)
   - [Add custom model](#add-custom-model)
   - [Optimize custom model with Vela compiler](#optimize-custom-model-with-vela-compiler)
-  - [Memory constraints](#memory-constraints)
   - [Automatic file generation](#automatic-file-generation)
 
 This section assumes the use of an **x86 Linux** build machine.
@@ -674,99 +673,6 @@ Vela Compiler documentation for more details.
 > **Note:** By default, use of the Ethos-U55 NPU is enabled in the CMake configuration.
 This could be changed by passing `-DETHOS_U55_ENABLED`.
 
-## Memory constraints
-
-Both the MPS3 Fixed Virtual Platform and the MPS3 FPGA platform share
-the linker script (scatter file) for SSE-300 design. The design is set
-by the CMake configuration parameter `TARGET_SUBSYSTEM` as described in
-the previuous section.
-
-The memory map exposed by this design is presented in Appendix 1. This
-can be used as a reference when editing the scatter file, especially to
-make sure that region boundaries are respected. The snippet from MPS3's
-scatter file is presented below:
-
-```
-;---------------------------------------------------------
-; First load region
-;---------------------------------------------------------
-LOAD_REGION_0 0x00000000 0x00080000
-{
-    ;-----------------------------------------------------
-    ; First part of code mem -- 512kiB
-    ;-----------------------------------------------------
-    itcm.bin 0x00000000 0x00080000
-    {
-        *.o (RESET, +First)
-        * (InRoot$$Sections)
-        .ANY (+RO)
-    }
-
-    ;-----------------------------------------------------
-    ; 128kiB of 512kiB bank is used for any other RW or ZI
-    ; data. Note: this region is internal to the Cortex-M CPU
-    ;-----------------------------------------------------
-    dtcm.bin 0x20000000 0x00020000
-    {
-        .ANY(+RW +ZI)
-    }
-
-    ;-----------------------------------------------------
-    ; 128kiB of stack space within the DTCM region
-    ;-----------------------------------------------------
-    ARM_LIB_STACK 0x20020000 EMPTY ALIGN 8 0x00020000
-    {}
-
-    ;-----------------------------------------------------
-    ; 256kiB of heap space within the DTCM region
-    ;-----------------------------------------------------
-
-    ARM_LIB_HEAP 0x20040000 EMPTY ALIGN 8 0x00040000
-    {}
-
-    ;-----------------------------------------------------
-    ; SSE-300's internal SRAM
-    ;-----------------------------------------------------
-    isram.bin 0x21000000 UNINIT ALIGN 16 0x00080000
-    {
-        ; activation buffers a.k.a tensor arena
-        *.o (.bss.NoInit.activation_buf)
-    }
-}
-
-;---------------------------------------------------------
-; Second load region
-;---------------------------------------------------------
-LOAD_REGION_1 0x60000000 0x02000000
-{
-    ;-----------------------------------------------------
-    ; 32 MiB of DRAM space for nn model and input vectors
-    ;-----------------------------------------------------
-    dram.bin 0x60000000 ALIGN 16 0x02000000
-    {
-        ; nn model's baked in input matrices
-        *.o (ifm)
-
-        ; nn model
-        *.o (nn_model)
-
-        ; if the activation buffer (tensor arena) doesn't
-        ; fit in the SRAM region, we accommodate it here
-        *.o (activation_buf)
-    }
-}
-```
-
-It is worth noting that in the bitfile implementation, only the BRAM,
-internal SRAM and DDR memory regions are accessible to the Ethos-U55 NPU
-block. In the above snippet, the internal SRAM region memory can be seen
-to be utilized by activation buffers with a limit of 512kiB. If used,
-this region will be written to by the Ethos-U55 NPU block frequently. A bigger
-region of memory for storing the model is placed in the DDR region,
-under LOAD_REGION_1. The two load regions are necessary as the MPS3's
-motherboard configuration controller limits the load size at address
-0x00000000 to 512kiB. This has implications on how the application **is
-deployed** on MPS3 as explained under the section 3.8.3.
 
 ## Automatic file generation
 
diff --git a/docs/sections/memory_considerations.md b/docs/sections/memory_considerations.md
new file mode 100644
index 0000000..7db0eba
--- /dev/null
+++ b/docs/sections/memory_considerations.md
@@ -0,0 +1,238 @@
+# Memory considerations
+
+## Table of Contents
+
+- [Memory considerations](#memory-considerations)
+  - [Table of Contents](#table-of-contents)
+  - [Introduction](#introduction)
+  - [Understanding memory usage from Vela output](#understanding-memory-usage-from-vela-output)
+    - [Total SRAM used](#total-sram-used)
+    - [Total Off-chip Flash used](#total-off-chip-flash-used)
+  - [Non-default configurations](#non-default-configurations)
+  - [Tensor arena and neural network model memory placement](#tensor-arena-and-neural-network-model-memory-placement)
+  - [Memory usage for ML use cases](#memory-usage-for-ml-use-cases)
+  - [Memory constraints](#memory-constraints)
+
+## Introduction
+
+This section provides useful details on how the Machine Learning use cases of the
+evaluation kit use the system memory. Although the guidance provided here is with
+respect to the Arm® Corstone™-300 system, it is fairly generic and is applicable
+for other platforms too. Arm® Corstone™-300 is composed of Arm® Cortex™-M55 and
+Arm® Ethos™-U55 and the memory map for the Arm® Cortex™-M55 core can be found in the
+[Appendix 1](./appendix.md).
+
+The Arm® Ethos™-U55 NPU interacts with the system via two AXI interfaces. The first one
+is envisaged to be the higher bandwidth and/or lower latency interface. In a typical
+system would this be wired to an SRAM as it would be required to service frequent R/W
+traffic. The second interface is expected to have higher latency and/or lower bandwidth
+characteristics and would typically be wired to a flash device servicing read-only
+traffic. In this configuration, the Arm® Cortex™-M55 CPU and Arm® Ethos™-U55 NPU read the contents
+of the neural network model (`.tflite file`) from the flash memory region with Arm® Ethos™-U55
+requesting these read transactions over its second AXI bus.
+The input and output tensors, along with any intermediate computation buffers, would be
+placed on SRAM and both the Arm® Cortex™-M55 CPU and Arm® Ethos™-U55 NPU would be reading/writing
+to this region when running an inference. The Arm® Ethos™-U55 NPU will be requesting these R/W
+transactions over the first AXI bus.
+
+## Understanding memory usage from Vela output
+
+### Total SRAM used
+
+When the neural network model is compiled with Vela, a summary report that includes memory
+usage is generated. For example, compiling the keyword spotting model [ds_cnn_clustered_int8](https://github.com/ARM-software/ML-zoo/blob/master/models/keyword_spotting/ds_cnn_large/tflite_clustered_int8/ds_cnn_clustered_int8.tflite)
+with Vela produces, among others, the following output:
+
+```
+Total SRAM used                                 70.77 KiB
+Total Off-chip Flash used                      430.78 KiB
+```
+
+The `Total SRAM used` here indicates the required memory to store the `tensor arena` for the
+TensorFlow Lite Micro framework. This is the amount of memory required to store the input,
+output and intermediate buffers. In the example above, the tensor arena requires 70.77 KiB
+of available SRAM. Note that Vela can only estimate the SRAM required for graph execution.
+It has no way of estimating the memory used by internal structures from TensorFlow Lite
+Micro framework. Therefore, it is recommended to top this memory size by at least 2KiB and
+carve out the `tensor arena` of this size and place it on the target system's SRAM.
+
+### Total Off-chip Flash used
+
+The `Total Off-chip Flash` parameter indicates the minimum amount of flash required to store
+the neural network model. In the example above, the system needs to have a minimum of 430.78
+KiB of available flash memory to store `.tflite` file contents.
+
+> Note: For Arm® Corstone™-300 system we use the DDR region as a flash memory. The timing
+> adapter sets up AXI bus wired to the DDR to mimic bandwidth and latency
+> characteristics of a flash memory device.
+
+## Non-default configurations
+
+The above example outlines a typical configuration, and this corresponds to the default Vela
+setting. However, the system SRAM can also be used to store the neural network model along with
+the `tensor arena`. Vela supports optimizing the model for this configuration with its `Sram_Only`
+memory mode. See [vela.ini](../../scripts/vela/vela.ini). To make use of a neural network model
+optimised for this configuration, the linker script for the target platform would need to be
+changed. By default, the linker scripts are set up to support the default configuration only. See
+[Memory constraints](#Memory-constraints) for snippet of a script.
+
+> Note
+> 1. The default configuration is represented by `Shared_Sram` memory mode.
+> 2. `Dedicated_Sram` mode is only applicable for Arm® Ethos™-U65.
+
+## Tensor arena and neural network model memory placement
+
+The evaluation kit uses the name `activation buffer` for what is called `tensor arena` in the
+TensorFlow Lite Micro framework. Every use case application has a corresponding
+`<use_case_name>_ACTIVATION_BUF_SZ` parameter that governs the maximum available size of the
+`activation buffer` for that particular use case.
+
+The linker script is set up to place this memory region in SRAM. However, if the memory required
+is more than what the target platform supports, this buffer needs to be placed on flash instead.
+Every target platform has a profile definition in the form of a CMake file. See [Corstone-300
+profile](../../scripts/cmake/subsystem-profiles/corstone-sse-300.cmake) for example. The parameter
+`ACTIVATION_BUF_SRAM_SZ` defines the maximum SRAM size available for the platform. This is
+propagated through the build system and if the `<use_case_name>_ACTIVATION_BUF_SZ` for a given
+use case is more than the `ACTIVATION_BUF_SRAM_SZ` for the target build platform, the `activation buffer`
+is placed on the flash instead.
+
+The neural network model is always placed in the flash region. However, this can be changed easily
+in the linker script.
+
+## Memory usage for ML use cases
+
+The following numbers have been obtained from Vela for `Shared_Sram` memory mode and the SRAM and
+flash memory requirements for the different use cases of the evaluation kit. Note that the SRAM usage
+does not include memory used by TensorFlow Lite Micro and this will need to be topped up as explained
+under [Total SRAM used](#Total-SRAM-used).
+
+- [Keyword spotting model](https://github.com/ARM-software/ML-zoo/tree/master/models/keyword_spotting/ds_cnn_large/tflite_clustered_int8) requires
+  -  70.7 KiB of SRAM
+  -  430.7 KiB of flash memory.
+
+- [Image classification model](https://github.com/ARM-software/ML-zoo/tree/master/models/image_classification/mobilenet_v2_1.0_224/tflite_uint8) requires
+  - 638.6 KiB of SRAM
+  - 3.1 MB of flash memory.
+
+- [Automated speech recognition](https://github.com/ARM-software/ML-zoo/tree/master/models/speech_recognition/wav2letter/tflite_int8) requires
+  - 635.3 KiB of SRAM
+  - 21.1 MB of flash memory.
+
+## Memory constraints
+
+Both the MPS3 Fixed Virtual Platform and the MPS3 FPGA platform share the linker script for Arm® Corstone™-300
+design. The design is set by the CMake configuration parameter `TARGET_SUBSYSTEM` as described in
+[build options](./building.md#Build-options).
+
+The memory map exposed by this design is presented in [Appendix 1](./appendix.md). This can be used as a reference
+when editing the linker script, especially to make sure that region boundaries are respected. The snippet from the
+scatter file is presented below:
+
+```
+;---------------------------------------------------------
+; First load region (ITCM)
+;---------------------------------------------------------
+LOAD_REGION_0       0x00000000                  0x00080000
+{
+    ;-----------------------------------------------------
+    ; First part of code mem - 512kiB
+    ;-----------------------------------------------------
+    itcm.bin        0x00000000                  0x00080000
+    {
+        *.o (RESET, +First)
+        * (InRoot$$Sections)
+
+        ; Essentially only RO-CODE, RO-DATA is in a
+        ; different region.
+        .ANY (+RO)
+    }
+
+    ;-----------------------------------------------------
+    ; 128kiB of 512kiB DTCM is used for any other RW or ZI
+    ; data. Note: this region is internal to the Cortex-M
+    ; CPU.
+    ;-----------------------------------------------------
+    dtcm.bin        0x20000000                  0x00020000
+    {
+        ; Any R/W and/or zero initialised data
+        .ANY(+RW +ZI)
+    }
+
+    ;-----------------------------------------------------
+    ; 384kiB of stack space within the DTCM region. See
+    ; `dtcm.bin` for the first section. Note: by virtue of
+    ; being part of DTCM, this region is only accessible
+    ; from Cortex-M55.
+    ;-----------------------------------------------------
+    ARM_LIB_STACK   0x20020000 EMPTY ALIGN 8    0x00060000
+    {}
+
+    ;-----------------------------------------------------
+    ; SSE-300's internal SRAM of 4MiB - reserved for
+    ; activation buffers.
+    ; This region should have 3 cycle read latency from
+    ; both Cortex-M55 and Ethos-U55
+    ;-----------------------------------------------------
+    isram.bin       0x31000000  UNINIT ALIGN 16 0x00400000
+    {
+        ; activation buffers a.k.a tensor arena
+        *.o (.bss.NoInit.activation_buf)
+    }
+}
+
+;---------------------------------------------------------
+; Second load region (DDR)
+;---------------------------------------------------------
+LOAD_REGION_1       0x70000000                  0x02000000
+{
+    ;-----------------------------------------------------
+    ; 32 MiB of DRAM space for neural network model,
+    ; input vectors and labels. If the activation buffer
+    ; size required by the network is bigger than the
+    ; SRAM size available, it is accommodated here.
+    ;-----------------------------------------------------
+    dram.bin        0x70000000 ALIGN 16         0x02000000
+    {
+        ; nn model's baked in input matrices
+        *.o (ifm)
+
+        ; nn model
+        *.o (nn_model)
+
+        ; labels
+        *.o (labels)
+
+        ; if the activation buffer (tensor arena) doesn't
+        ; fit in the SRAM region, we accommodate it here
+        *.o (activation_buf)
+    }
+
+    ;-----------------------------------------------------
+    ; First 256kiB of BRAM (FPGA SRAM) used for RO data.
+    ; Note: Total BRAM size available is 2MiB.
+    ;-----------------------------------------------------
+    bram.bin        0x11000000          ALIGN 8 0x00040000
+    {
+        ; RO data (incl. unwinding tables for debugging)
+        .ANY (+RO-DATA)
+    }
+
+    ;-----------------------------------------------------
+    ; Remaining part of the 2MiB BRAM used as heap space.
+    ; 0x00200000 - 0x00040000 = 0x001C0000 (1.75 MiB)
+    ;-----------------------------------------------------
+    ARM_LIB_HEAP    0x11040000 EMPTY ALIGN 8    0x001C0000
+    {}
+}
+
+```
+
+It is worth noting that for the Arm® Corstone™-300 FPGA and FVP implementations, only the BRAM,
+internal SRAM and DDR memory regions are accessible to the Arm® Ethos™-U55 NPU block. In the above
+snippet, the internal SRAM region memory can be seen to be utilized by activation buffers with
+a limit of 4MiB. If used by a Vela optimised neural network model, this region will be written to
+by the Arm® Ethos™-U55 NPU block frequently. A bigger region of memory for storing the neural
+network model is placed in the DDR/flash region under LOAD_REGION_1. The two load regions are necessary
+as the MPS3's motherboard configuration controller limits the load size at address 0x00000000 to 1MiB.
+This has implications on how the application **is deployed** on MPS3 as explained under the section
+[Deployment on MPS3](./deployment.md#MPS3-board).
-- 
cgit v1.2.1