From d8339a75c9b655c0507e34238078fdad068b4023 Mon Sep 17 00:00:00 2001
From: Tim Hall
Date: Thu, 27 May 2021 18:49:40 +0100
Subject: MLBEDSW-4034: New Scheduler Size or Performance Optimisation

 - Merged dev/scheduler at 83639f90e8c828f70de6e29142355a940224959b

Signed-off-by: Tim Hall
Change-Id: I0050529d4b42da93768c7264296434dd877fb5b4
---
 PERFORMANCE.md | 238 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 238 insertions(+)
 create mode 100644 PERFORMANCE.md

diff --git a/PERFORMANCE.md b/PERFORMANCE.md
new file mode 100644
index 00000000..6bcfbbfe
--- /dev/null
+++ b/PERFORMANCE.md
@@ -0,0 +1,238 @@

# Vela Performance Estimation Summary

This is a description of the performance estimation summary that Vela prints
after each compilation.

The following is an example of the output.

```
$ vela my_network.tflite

Network summary for my_network
Accelerator configuration                 Ethos_U55_256
System configuration                      internal-default
Memory mode                               internal-default
Accelerator clock                         500 MHz
Design peak SRAM bandwidth                4.00 GB/s
Design peak Off-chip Flash bandwidth      0.50 GB/s

Total SRAM used                           0.95 KiB
Total Off-chip Flash used                 106.98 KiB

51 passes fused into 51
0/106 (0.0%) operations falling back to the CPU
Average SRAM bandwidth                    0.04 GB/s
Input SRAM bandwidth                      0.01 MB/batch
Weight SRAM bandwidth                     0.00 MB/batch
Output SRAM bandwidth                     0.00 MB/batch
Total SRAM bandwidth                      0.01 MB/batch
Total SRAM bandwidth per input            0.01 MB/inference (batch size 1)

Average Off-chip Flash bandwidth          0.46 GB/s
Input Off-chip Flash bandwidth            0.01 MB/batch
Weight Off-chip Flash bandwidth           0.09 MB/batch
Output Off-chip Flash bandwidth           0.00 MB/batch
Total Off-chip Flash bandwidth            0.10 MB/batch
Total Off-chip Flash bandwidth per input  0.10 MB/inference (batch size 1)

Neural network macs                       86952 MACs/batch
Network Tops/s                            0.00 Tops/s

NPU cycles                                21298 cycles/batch
SRAM Access cycles                        2261 cycles/batch
DRAM Access cycles                        0 cycles/batch
On-chip Flash Access cycles               0 cycles/batch
Off-chip Flash Access cycles              112755 cycles/batch
Total cycles                              114098 cycles/batch

Batch Inference time                      0.23 ms, 4382.18 inferences/s (batch size 1)
```

## Configuration

The first section of the summary shows the configuration used for
optimizing the network.

```
Accelerator configuration                 Ethos_U55_256
System configuration                      internal-default
Memory mode                               internal-default
Accelerator clock                         500 MHz
Design peak SRAM bandwidth                4.00 GB/s
Design peak Off-chip Flash bandwidth      0.50 GB/s
```

### Accelerator configuration

This shows the selected accelerator configuration. It identifies the Embedded
NPU that the compiler is targeting. **NOTE: It is extremely important to select
the correct device, otherwise a run-time check in the driver will fail.**
To select a different accelerator configuration use the CLI option
`--accelerator-config`, see [OPTIONS.md](OPTIONS.md#Accelerator-Configuration).

### System configuration

The selected system configuration from the provided configuration file or
`internal-default`. **NOTE: It is very important to select a system
configuration that correctly describes the target embedded system.** This is
because the compiler makes some of its optimization decisions based upon this
information. Failing to select the correct configuration could result in
run-time errors, bit-inexact operation, or suboptimal operation of the Embedded
NPU. To select a different system configuration use the CLI option
`--system-config`, see [OPTIONS.md](OPTIONS.md#System-Config).
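
The sketch below shows how the accelerator configuration, system configuration
and memory mode (described in the next section) might be selected together in a
single invocation. It is an illustration only: the configuration file name and
the section names are placeholders, and the exact options and valid values are
listed in [OPTIONS.md](OPTIONS.md).

```
# Example only: compile for an Ethos-U55 with 256 MACs, selecting a system
# configuration and memory mode from a user-supplied configuration file
# (file and section names below are placeholders)
vela my_network.tflite \
    --accelerator-config ethos-u55-256 \
    --config my_vela_config.ini \
    --system-config My_System_Config \
    --memory-mode My_Memory_Mode
```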

### Memory mode

The selected memory mode from the provided configuration file or
`internal-default`. **NOTE: It is very important to select a memory
mode that correctly describes the target embedded system.** This is
because the compiler makes some of its optimization decisions based upon this
information. To select a different memory mode use the CLI option
`--memory-mode`, see [OPTIONS.md](OPTIONS.md#Memory-Mode).

### Accelerator clock

The accelerator clock for the given system configuration.

### Design peak memory bandwidth

The design peak memory bandwidth for the given system configuration.
It gives the theoretical maximum bandwidth of the memory based upon the
[System Configuration](OPTIONS.md#Configuration-File) parameters specified
and the AXI port width of the Ethos-U NPU.

## Memory Usage

The next section of the summary shows the memory usage for the various memory
types in the system.

```
Total SRAM used                           0.95 KiB
Total Off-chip Flash used                 106.98 KiB
```

The contents of this section, and their meaning, change depending upon the
system configuration and memory mode.

## Operator information

This section gives information about cascading and operators.
The first line shows the number of passes (i.e. operations) and how many NPU
passes they have been fused or combined into.
The second line shows how many operators in the network fall back to the CPU
(i.e. are not supported by the NPU).

```
51 passes fused into 51
0/106 (0.0%) operations falling back to the CPU
```

## Estimated memory bandwidth

The next section shows the estimated memory bandwidth for each memory type.
Figures are given as an average, per batch, and per data type (input, weight
and output).

```
Average SRAM bandwidth                    0.04 GB/s
Input SRAM bandwidth                      0.01 MB/batch
Weight SRAM bandwidth                     0.00 MB/batch
Output SRAM bandwidth                     0.00 MB/batch
Total SRAM bandwidth                      0.01 MB/batch
Total SRAM bandwidth per input            0.01 MB/inference (batch size 1)

Average Off-chip Flash bandwidth          0.46 GB/s
Input Off-chip Flash bandwidth            0.01 MB/batch
Weight Off-chip Flash bandwidth           0.09 MB/batch
Output Off-chip Flash bandwidth           0.00 MB/batch
Total Off-chip Flash bandwidth            0.10 MB/batch
Total Off-chip Flash bandwidth per input  0.10 MB/inference (batch size 1)
```

### Average bandwidth

This shows the average memory bandwidth usage for the memory type.

### Input bandwidth

This shows the memory bandwidth usage for reading feature maps for the memory
type per batch.

### Weight bandwidth

This shows the memory bandwidth usage for reading and writing weights for the
memory type per batch.

### Output bandwidth

This shows the memory bandwidth usage for writing feature maps for the memory
type per batch.

### Total bandwidth

This shows the total memory bandwidth usage for the memory type per batch and
per inference.

## Weights data

This section is only visible if the CLI option `--verbose-weights` is provided.

```
Original Weights Size                     84.91 KiB
NPU Weights Size                          94.00 KiB
NPU Encoded Weights Size                  89.30 KiB
```

### Original Weights Size

This is the total size of all weights in the network before optimization.

### NPU Weights Size

This is the total size of the weights rearranged and padded to fit the NPU's
block-based processing.

### NPU Encoded Weights Size

This is the total size of the [NPU Weights](#NPU-Weights-Size) after being
encoded for the NPU.
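
For reference, a minimal sketch of the kind of invocation that adds the weights
summary above to the output, reusing the example network from the start of this
document:

```
$ vela my_network.tflite --verbose-weights
```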

## Estimated performance

The final sections show the estimated compute requirements and performance of
the network.

```
Neural network macs                       86952 MACs/batch
Network Tops/s                            0.00 Tops/s

NPU cycles                                21298 cycles/batch
SRAM Access cycles                        2261 cycles/batch
DRAM Access cycles                        0 cycles/batch
On-chip Flash Access cycles               0 cycles/batch
Off-chip Flash Access cycles              112755 cycles/batch
Total cycles                              114098 cycles/batch

Batch Inference time                      0.23 ms, 4382.18 inferences/s (batch size 1)
```

### Neural network macs

This shows the estimated number of MACs in the network per batch. This number
includes MACs from convolutions and vector products; it does not include MACs
from operations such as elementwise and pooling operations.

### Network Tops/s

This shows the estimated TOPs/s for the network, which is an alternative
representation of [Neural network macs](#Neural-network-macs).

### Cycles

This shows the estimated number of cycles per batch for the NPU, for memory
accesses and in total. The total is formed by summing, for each pass, only the
cycle count of the single action that consumes the most cycles in that pass,
i.e. if memory access consumes the most cycles for a pass then only the memory
access cycles of that pass contribute to the total.

### Batch Inference time

This shows the estimated inference time and inferences per second per batch.
**NOTE: This is just an estimate, for more accurate numbers we recommend
running the compiled network in the software model.**
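
As a rough cross-check, and assuming that the batch inference time is simply
the total cycle count divided by the accelerator clock, the example figures
above are self-consistent:

```
Batch Inference time ≈ Total cycles / Accelerator clock
                     = 114098 cycles / 500 MHz
                     ≈ 0.23 ms, i.e. approximately 4382 inferences/s
```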