diff options
Diffstat (limited to 'docs/sections/timing_adapters.md')
-rw-r--r-- | docs/sections/timing_adapters.md | 153 |
1 files changed, 153 insertions, 0 deletions
diff --git a/docs/sections/timing_adapters.md b/docs/sections/timing_adapters.md new file mode 100644 index 0000000..ab05490 --- /dev/null +++ b/docs/sections/timing_adapters.md @@ -0,0 +1,153 @@ +# Building timing adapter with custom options + +The sources contain the configuration for a timing adapter utility for the *Arm® Ethos™-U* NPU driver. The timing +adapter allows the platform to simulate user provided memory bandwidth and latency constraints. + +The timing adapter driver aims to control the behavior of two AXI buses used by *Ethos-U* NPU. One is for SRAM memory +region, and the other is for flash or DRAM. + +The SRAM is where intermediate buffers are expected to be allocated and therefore, this region can serve frequent Read +and Write traffic generated by computation operations while executing a neural network inference. + +The flash or DDR is where we expect to store the model weights and therefore, this bus would only usually be used for RO +traffic. + +It is used for MPS3 FPGA and for Fast Model environment. + +The CMake build framework allows the parameters to control the behavior of each bus with following parameters: + +- `MAXR`: Maximum number of pending read operations allowed. `0` is inferred as infinite and the default value is `4`. + +- `MAXW`: Maximum number of pending write operations allowed. `0` is inferred as infinite and the default value is `4`. + +- `MAXRW`: Maximum number of pending read and write operations allowed. `0` is inferred as infinite and the default + value is `8`. + +- `RLATENCY`: Minimum latency, in cycle counts, for a read operation. This is the duration between `ARVALID` and + `RVALID` signals. The default value is `50`. + +- `WLATENCY`: Minimum latency, in cycle counts, for a write operation. This is the duration between `WVALID` and + `WLAST`, with `BVALID` being deasserted. The default value is `50`. + +- `PULSE_ON`: The number of cycles where addresses are let through. The default value is `5100`. + +- `PULSE_OFF`: The number of cycles where addresses are blocked. The default value is `5100`. + +- `BWCAP`: Maximum number of 64-bit words transferred per pulse cycle. A pulse cycle is defined by `PULSE_ON` + and `PULSE_OFF`. `0` is inferred as infinite and the default value is `625`. + + > **Note:** The bandwidth cap `BWCAP` operates on the transaction level and, because of its simple implementation, + > the accuracy is limited. + > When set to a small value it allows only a small number of transactions for each pulse cycle. + > Once the counter has reached or exceeded the configured cap, no transactions will be allowed before the next pulse + > cycle. In order to minimize this effect some possible solutions are: + > + > - scale up all the parameters to a reasonably large value. + > - scale up `BWCAP` as a multiple of the burst length (in this case bulk traffic will not face rounding errors in + > the bandwidth cap). + +- `MODE`: Timing adapter operation mode. Default value is `0`. + + - `Bit 0`: `0`=simple, `1`=latency-deadline QoS throttling of read versus write, + + - `Bit 1`: `1`=enable random AR reordering (`0`=default), + + - `Bit 2`: `1`=enable random R reordering (`0`=default), + + - `Bit 3`: `1`=enable random B reordering (`0`=default) + +For the CMake build configuration of the timing adapter, the SRAM AXI is assigned `index 0` and the flash, or DRAM, AXI +bus has `index 1`. + +To change the bus parameter for the build a "***TA_\<index>_*"** prefix should be added to the above. For example, +**TA0_MAXR=10** sets the maximum pending reads to 10 on the SRAM AXI bus. + +As an example, if we have the following parameters for the flash, or DRAM, region: + +- `TA1_MAXR` = "2" + +- `TA1_MAXW` = "0" + +- `TA1_MAXRW` = "0" + +- `TA1_RLATENCY` = "64" + +- `TA1_WLATENCY` = "32" + +- `TA1_PULSE_ON` = "320" + +- `TA1_PULSE_OFF` = "80" + +- `TA1_BWCAP` = "50" + +For a clock rate of 500MHz, this would translate to: + +- The maximum duty cycle for any operation is:\ + ![Maximum duty cycle formula](../media/F1.png) + +- Maximum bit rate for this bus (64-bit wide) is:\ + ![Maximum bit rate formula](../media/F2.png) + +- With a read latency of 64 cycles, and maximum pending reads as 2, each read could be a maximum of 64 or 128 bytes. As + defined for the *Ethos-U* NPU AXI bus attribute. + + The bandwidth is calculated solely by read parameters: + + ![Bandwidth formula](../media/F3.png) + + This is higher than the overall bandwidth dictated by the bus parameters of: + + ![Overall bandwidth formula](../media/F4.png) + +This suggests that the read operation is only limited by the overall bus bandwidth. + +Timing adapter requires recompilation to change parameters. Default timing adapter configuration file pointed to by +`TA_CONFIG_FILE` build parameter is located in the `scripts/cmake folder` and contains all options for `AXI0` and `AXI1` +as previously described. + +here is an example of `scripts/cmake/timing_adapter/ta_config_u55_high_end.cmake`: + +```cmake +# Timing adapter options +set(TA_INTERACTIVE OFF) + +# Timing adapter settings for AXI0 +set(TA0_MAXR "8") +set(TA0_MAXW "8") +set(TA0_MAXRW "0") +set(TA0_RLATENCY "32") +set(TA0_WLATENCY "32") +set(TA0_PULSE_ON "3999") +set(TA0_PULSE_OFF "1") +set(TA0_BWCAP "4000") +... +``` + +An example of the build with a custom timing adapter configuration: + +```commandline +cmake .. -DTA_CONFIG_FILE=scripts/cmake/timing_adapter/my_ta_config.cmake +``` +## Differences between timing adapter implementations in Arm® Corstone™-300 and Arm® Corstone™-310 + +Corstone-300 FVP and FPGA implements timing adapters that are tied to AXI masters M0 and M1 on the Ethos-U NPU. + +Corstone-310 **FPGA** implements timing adapter blocks differently and those are placed on each of the main +memories present on FPGA: SRAM, QSPI flash, DDR and user memory. +Moreover, this timer adapter placement does not translate well to FVP, so current Corstone-310 FVP implementation does +not support the feature. Additionally - base addresses of timer adapters blocks have changed for Corestone-310: + +#### Timer Adapters for Corstone-300 FVP and FPGA: +| TA# | Interface TA is placed on | Base address (non-secure/secure) | Size | +|-----|---------------------------|----------------------------------|-------| +| 0 | M0/AXI0 for Ethos-U NPU | 0x4810_3000/0x5810_3000 | 0.5KB | +| 1 | M1/AXI1 for Ethos-U NPU | 0x4810_3200/0x5810_3200 | 0.5KB | +#### Timer Adapter for Corstone-310 FPGA: +| TA# | Interface TA is placed on | Base address (non-secure/secure) | Size | +|-----|---------------------------|----------------------------------|------| +| 0 | FPGA SRAM | 0x4170_0000/0x5170_0000 | 4KB | +| 1 | QSPI flash device | 0x4170_1000/0x5170_1000 | 4KB | +| 2 | DDR | 0x4170_1000/0x5170_2000 | 4KB | +| 3 | User memory | 0x4170_3000/0x5170_3000 | 4KB | + +With this in mind, when targeting Corstone-310, evaluation kit should be built with timing adapters disabled altogether via `-DETHOS_U_NPU_TIMING_ADAPTER_ENABLED=OFF` flag. Because timing adapters do not affect CPU-driven traffic for Corstone-300, building both platforms without the support for timing adapters allows for a CPU performance comparison.
\ No newline at end of file |