From d9bb3cf4fcc0c96bb9f6e9920eff18bf04d92258 Mon Sep 17 00:00:00 2001 From: Maksims Svecovs Date: Mon, 15 Aug 2022 13:06:09 +0100 Subject: MLECO-3244: Documentation on timing adapters Documents current differences between timing adapters implementations on Corstone-300 and Corstone-310 platforms. Signed-off-by: Maksims Svecovs Change-Id: I3161dc929bd01217a4992be869f13377a58e5471 --- docs/sections/building.md | 133 +--------------------------------- docs/sections/timing_adapters.md | 153 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 154 insertions(+), 132 deletions(-) create mode 100644 docs/sections/timing_adapters.md (limited to 'docs') diff --git a/docs/sections/building.md b/docs/sections/building.md index a7b64aa..f6b71a8 100644 --- a/docs/sections/building.md +++ b/docs/sections/building.md @@ -20,7 +20,7 @@ - [Configuring the build for simple-platform](./building.md#configuring-the-build-for-simple_platform) - [Building with CMakePresets](./building.md#building-with-cmakepresets) - [Building the configured project](./building.md#building-the-configured-project) - - [Building timing adapter with custom options](./building.md#building-timing-adapter-with-custom-options) + - [Building timing adapter with custom options](./timing_adapters.md#building-timing-adapter-with-custom-options) - [Add custom inputs](./building.md#add-custom-inputs) - [Add custom model](./building.md#add-custom-model) - [Optimize custom model with Vela compiler](./building.md#optimize-custom-model-with-vela-compiler) @@ -641,137 +641,6 @@ Where for each implemented use-case under the `source/use-case` directory, the f > **Note:** For the specific use-case commands, refer to the relative section in the use-case documentation. -## Building timing adapter with custom options - -The sources also contain the configuration for a timing adapter utility for the *Ethos-U* NPU driver. 
The timing -adapter allows the platform to simulate user provided memory bandwidth and latency constraints. - -The timing adapter driver aims to control the behavior of two AXI buses used by *Ethos-U* NPU. One is for SRAM memory -region, and the other is for flash or DRAM. - -The SRAM is where intermediate buffers are expected to be allocated and therefore, this region can serve frequent Read -and Write traffic generated by computation operations while executing a neural network inference. - -The flash or DDR is where we expect to store the model weights and therefore, this bus would only usually be used for RO -traffic. - -It is used for MPS3 FPGA and for Fast Model environment. - -The CMake build framework allows the parameters to control the behavior of each bus with following parameters: - -- `MAXR`: Maximum number of pending read operations allowed. `0` is inferred as infinite and the default value is `4`. - -- `MAXW`: Maximum number of pending write operations allowed. `0` is inferred as infinite and the default value is `4`. - -- `MAXRW`: Maximum number of pending read and write operations allowed. `0` is inferred as infinite and the default - value is `8`. - -- `RLATENCY`: Minimum latency, in cycle counts, for a read operation. This is the duration between `ARVALID` and - `RVALID` signals. The default value is `50`. - -- `WLATENCY`: Minimum latency, in cycle counts, for a write operation. This is the duration between `WVALID` and - `WLAST`, with `BVALID` being deasserted. The default value is `50`. - -- `PULSE_ON`: The number of cycles where addresses are let through. The default value is `5100`. - -- `PULSE_OFF`: The number of cycles where addresses are blocked. The default value is `5100`. - -- `BWCAP`: Maximum number of 64-bit words transferred per pulse cycle. A pulse cycle is defined by `PULSE_ON` - and `PULSE_OFF`. `0` is inferred as infinite and the default value is `625`. 
- - > **Note:** The bandwidth cap `BWCAP` operates on the transaction level and, because of its simple implementation, - > the accuracy is limited. - > When set to a small value it allows only a small number of transactions for each pulse cycle. - > Once the counter has reached or exceeded the configured cap, no transactions will be allowed before the next pulse - > cycle. In order to minimize this effect some possible solutions are: - > - > - scale up all the parameters to a reasonably large value. - > - scale up `BWCAP` as a multiple of the burst length (in this case bulk traffic will not face rounding errors in - > the bandwidth cap). - -- `MODE`: Timing adapter operation mode. Default value is `0`. - - - `Bit 0`: `0`=simple, `1`=latency-deadline QoS throttling of read versus write, - - - `Bit 1`: `1`=enable random AR reordering (`0`=default), - - - `Bit 2`: `1`=enable random R reordering (`0`=default), - - - `Bit 3`: `1`=enable random B reordering (`0`=default) - -For the CMake build configuration of the timing adapter, the SRAM AXI is assigned `index 0` and the flash, or DRAM, AXI -bus has `index 1`. - -To change the bus parameter for the build a "***TA_\_*"** prefix should be added to the above. For example, -**TA0_MAXR=10** sets the maximum pending reads to 10 on the SRAM AXI bus. - -As an example, if we have the following parameters for the flash, or DRAM, region: - -- `TA1_MAXR` = "2" - -- `TA1_MAXW` = "0" - -- `TA1_MAXRW` = "0" - -- `TA1_RLATENCY` = "64" - -- `TA1_WLATENCY` = "32" - -- `TA1_PULSE_ON` = "320" - -- `TA1_PULSE_OFF` = "80" - -- `TA1_BWCAP` = "50" - -For a clock rate of 500MHz, this would translate to: - -- The maximum duty cycle for any operation is:\ - ![Maximum duty cycle formula](../media/F1.png) - -- Maximum bit rate for this bus (64-bit wide) is:\ - ![Maximum bit rate formula](../media/F2.png) - -- With a read latency of 64 cycles, and maximum pending reads as 2, each read could be a maximum of 64 or 128 bytes. 
As - defined for the *Ethos-U* NPU AXI bus attribute. - - The bandwidth is calculated solely by read parameters: - - ![Bandwidth formula](../media/F3.png) - - This is higher than the overall bandwidth dictated by the bus parameters of: - - ![Overall bandwidth formula](../media/F4.png) - -This suggests that the read operation is only limited by the overall bus bandwidth. - -Timing adapter requires recompilation to change parameters. Default timing adapter configuration file pointed to by -`TA_CONFIG_FILE` build parameter is located in the `scripts/cmake folder` and contains all options for `AXI0` and `AXI1` -as previously described. - -here is an example of `scripts/cmake/timing_adapter/ta_config_u55_high_end.cmake`: - -```cmake -# Timing adapter options -set(TA_INTERACTIVE OFF) - -# Timing adapter settings for AXI0 -set(TA0_MAXR "8") -set(TA0_MAXW "8") -set(TA0_MAXRW "0") -set(TA0_RLATENCY "32") -set(TA0_WLATENCY "32") -set(TA0_PULSE_ON "3999") -set(TA0_PULSE_OFF "1") -set(TA0_BWCAP "4000") -... -``` - -An example of the build with a custom timing adapter configuration: - -```commandline -cmake .. -DTA_CONFIG_FILE=scripts/cmake/timing_adapter/my_ta_config.cmake -``` - ## Add custom inputs The application performs inference on input data found in the folder set by the CMake parameters, for more information diff --git a/docs/sections/timing_adapters.md b/docs/sections/timing_adapters.md new file mode 100644 index 0000000..ab05490 --- /dev/null +++ b/docs/sections/timing_adapters.md @@ -0,0 +1,153 @@ +# Building timing adapter with custom options + +The sources contain the configuration for a timing adapter utility for the *Arm® Ethos™-U* NPU driver. The timing +adapter allows the platform to simulate user provided memory bandwidth and latency constraints. + +The timing adapter driver aims to control the behavior of two AXI buses used by *Ethos-U* NPU. One is for SRAM memory +region, and the other is for flash or DRAM. 
+
+The SRAM is where intermediate buffers are expected to be allocated. This region therefore serves the frequent read
+and write traffic generated by computation operations while executing a neural network inference.
+
+The flash or DDR is where the model weights are expected to be stored, so this bus is usually used only for read-only
+traffic.
+
+The timing adapter is used for the MPS3 FPGA and for the Fast Model environment.
+
+The CMake build framework provides the following parameters to control the behavior of each bus:
+
+- `MAXR`: Maximum number of pending read operations allowed. `0` is interpreted as infinite and the default value is `4`.
+
+- `MAXW`: Maximum number of pending write operations allowed. `0` is interpreted as infinite and the default value is `4`.
+
+- `MAXRW`: Maximum number of pending read and write operations allowed. `0` is interpreted as infinite and the default
+  value is `8`.
+
+- `RLATENCY`: Minimum latency, in cycle counts, for a read operation. This is the duration between `ARVALID` and
+  `RVALID` signals. The default value is `50`.
+
+- `WLATENCY`: Minimum latency, in cycle counts, for a write operation. This is the duration between `WVALID` and
+  `WLAST`, with `BVALID` being deasserted. The default value is `50`.
+
+- `PULSE_ON`: The number of cycles where addresses are let through. The default value is `5100`.
+
+- `PULSE_OFF`: The number of cycles where addresses are blocked. The default value is `5100`.
+
+- `BWCAP`: Maximum number of 64-bit words transferred per pulse cycle. A pulse cycle is defined by `PULSE_ON`
+  and `PULSE_OFF`. `0` is interpreted as infinite and the default value is `625`.
+
+  > **Note:** The bandwidth cap `BWCAP` operates on the transaction level and, because of its simple implementation,
+  > the accuracy is limited.
+  > When set to a small value it allows only a small number of transactions for each pulse cycle.
+  > Once the counter has reached or exceeded the configured cap, no transactions will be allowed before the next pulse
+  > cycle. To minimize this effect, some possible solutions are:
+  >
+  > - scale up all the parameters to a reasonably large value.
+  > - scale up `BWCAP` as a multiple of the burst length (in this case bulk traffic will not face rounding errors in
+  >   the bandwidth cap).
+
+- `MODE`: Timing adapter operation mode. Default value is `0`.
+
+  - `Bit 0`: `0`=simple, `1`=latency-deadline QoS throttling of read versus write,
+
+  - `Bit 1`: `1`=enable random AR reordering (`0`=default),
+
+  - `Bit 2`: `1`=enable random R reordering (`0`=default),
+
+  - `Bit 3`: `1`=enable random B reordering (`0`=default)
+
+For the CMake build configuration of the timing adapter, the SRAM AXI bus is assigned `index 0` and the flash, or DRAM,
+AXI bus has `index 1`.
+
+To change a bus parameter for the build, a `TA<index>_` prefix should be added to the parameter names above. For example,
+**TA0_MAXR=10** sets the maximum pending reads to 10 on the SRAM AXI bus.
+
+As an example, if we have the following parameters for the flash, or DRAM, region:
+
+- `TA1_MAXR` = "2"
+
+- `TA1_MAXW` = "0"
+
+- `TA1_MAXRW` = "0"
+
+- `TA1_RLATENCY` = "64"
+
+- `TA1_WLATENCY` = "32"
+
+- `TA1_PULSE_ON` = "320"
+
+- `TA1_PULSE_OFF` = "80"
+
+- `TA1_BWCAP` = "50"
+
+For a clock rate of 500 MHz, this would translate to:
+
+- The maximum duty cycle for any operation is:\
+  ![Maximum duty cycle formula](../media/F1.png)
+
+- Maximum bit rate for this bus (64-bit wide) is:\
+  ![Maximum bit rate formula](../media/F2.png)
+
+- With a read latency of 64 cycles and a maximum of 2 pending reads, each read can be at most 64 or 128 bytes, as
+  defined by the *Ethos-U* NPU AXI bus attribute.
+
+  The bandwidth is calculated solely by read parameters:
+
+  ![Bandwidth formula](../media/F3.png)
+
+  This is higher than the overall bandwidth dictated by the bus parameters of:
+
+  ![Overall bandwidth formula](../media/F4.png)
+
+This suggests that the read operation is only limited by the overall bus bandwidth.
+
+The timing adapter requires recompilation to change parameters. The default timing adapter configuration file, pointed
+to by the `TA_CONFIG_FILE` build parameter, is located in the `scripts/cmake` folder and contains all options for `AXI0`
+and `AXI1` as previously described.
+
+Here is an example of `scripts/cmake/timing_adapter/ta_config_u55_high_end.cmake`:
+
+```cmake
+# Timing adapter options
+set(TA_INTERACTIVE OFF)
+
+# Timing adapter settings for AXI0
+set(TA0_MAXR "8")
+set(TA0_MAXW "8")
+set(TA0_MAXRW "0")
+set(TA0_RLATENCY "32")
+set(TA0_WLATENCY "32")
+set(TA0_PULSE_ON "3999")
+set(TA0_PULSE_OFF "1")
+set(TA0_BWCAP "4000")
+...
+```
+
+An example of the build with a custom timing adapter configuration:
+
+```commandline
+cmake .. -DTA_CONFIG_FILE=scripts/cmake/timing_adapter/my_ta_config.cmake
+```
+## Differences between timing adapter implementations in Arm® Corstone™-300 and Arm® Corstone™-310
+
+Corstone-300 FVP and FPGA implement timing adapters that are tied to AXI masters M0 and M1 on the Ethos-U NPU.
+
+Corstone-310 **FPGA** implements timing adapter blocks differently: they are placed on each of the main
+memories present on the FPGA: SRAM, QSPI flash, DDR and user memory.
+Moreover, this timing adapter placement does not translate well to FVP, so the current Corstone-310 FVP implementation
+does not support the feature. Additionally, the base addresses of the timing adapter blocks have changed for Corstone-310:
+
+#### Timing adapters for Corstone-300 FVP and FPGA:
+| TA# | Interface TA is placed on | Base address (non-secure/secure) | Size  |
+|-----|---------------------------|----------------------------------|-------|
+| 0   | M0/AXI0 for Ethos-U NPU   | 0x4810_3000/0x5810_3000          | 0.5KB |
+| 1   | M1/AXI1 for Ethos-U NPU   | 0x4810_3200/0x5810_3200          | 0.5KB |
+#### Timing adapters for Corstone-310 FPGA:
+| TA# | Interface TA is placed on | Base address (non-secure/secure) | Size |
+|-----|---------------------------|----------------------------------|------|
+| 0   | FPGA SRAM                 | 0x4170_0000/0x5170_0000          | 4KB  |
+| 1   | QSPI flash device         | 0x4170_1000/0x5170_1000          | 4KB  |
+| 2   | DDR                       | 0x4170_2000/0x5170_2000          | 4KB  |
+| 3   | User memory               | 0x4170_3000/0x5170_3000          | 4KB  |
+
+With this in mind, when targeting Corstone-310, the evaluation kit should be built with timing adapters disabled altogether via the `-DETHOS_U_NPU_TIMING_ADAPTER_ENABLED=OFF` flag. Because timing adapters do not affect CPU-driven traffic for Corstone-300, building both platforms without the support for timing adapters allows for a CPU performance comparison.
\ No newline at end of file
-- 
cgit v1.2.1
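
Note on the worked `TA1` example in `timing_adapters.md`: the four formulas are only present as images (`F1.png` to `F4.png`), so the arithmetic cannot be read from the patch itself. The following is a minimal sketch of that arithmetic, assuming the formulas follow directly from the parameter definitions (duty cycle from `PULSE_ON`/`PULSE_OFF`, a read-only bound from `MAXR` and `RLATENCY`, and a cap from `BWCAP` per pulse cycle). The function names are hypothetical and are not part of the evaluation kit.

```python
# Sketch: evaluate the worked TA1 (flash/DRAM) example with integer arithmetic.
# The formulas below are assumptions inferred from the parameter definitions;
# the document's own formulas are only available as images (F1.png to F4.png).

CLOCK_HZ = 500_000_000  # 500 MHz clock rate from the worked example
WORD_BYTES = 8          # BWCAP counts 64-bit words; the bus is 64 bits wide

# TA1 values quoted in the document
TA1 = {"MAXR": 2, "RLATENCY": 64, "PULSE_ON": 320, "PULSE_OFF": 80, "BWCAP": 50}


def pulse_period(cfg):
    """Length of one pulse cycle in clock cycles."""
    return cfg["PULSE_ON"] + cfg["PULSE_OFF"]


def duty_cycle(cfg):
    """Fraction of each pulse cycle during which addresses are let through."""
    return cfg["PULSE_ON"] / pulse_period(cfg)


def max_bit_rate(cfg, clock_hz=CLOCK_HZ):
    """Peak bits per second on the 64-bit bus, scaled by the duty cycle."""
    return clock_hz * WORD_BYTES * 8 * cfg["PULSE_ON"] // pulse_period(cfg)


def read_bound(cfg, bytes_per_read, clock_hz=CLOCK_HZ):
    """Bytes per second if only MAXR and RLATENCY limit read traffic."""
    return cfg["MAXR"] * bytes_per_read * clock_hz // cfg["RLATENCY"]


def bwcap_bound(cfg, clock_hz=CLOCK_HZ):
    """Bytes per second permitted by BWCAP over one full pulse cycle."""
    return cfg["BWCAP"] * WORD_BYTES * clock_hz // pulse_period(cfg)


if __name__ == "__main__":
    print("duty cycle:", duty_cycle(TA1))
    print("max bit rate (bit/s):", max_bit_rate(TA1))
    print("read bound, 64-byte reads (B/s):", read_bound(TA1, 64))
    print("BWCAP bound (B/s):", bwcap_bound(TA1))
```

Under these assumptions the duty cycle is 0.8, the peak bit rate is 25.6 Gbit/s, the read-only bound is 1 GB/s for 64-byte reads (2 GB/s for 128-byte reads), and the `BWCAP` bound is 500 MB/s. The read-only bound exceeds the `BWCAP` bound, which agrees with the document's conclusion that reads are limited only by the overall bus bandwidth.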