# Evaluate TensorFlow Lite script

This script runs a TfLite model through ExecuteNetwork, evaluating its performance and accuracy on all available backends and with a selection of the available performance options. It is designed to be used on an aarch64 Linux target.

## Usage
__evaluate_network.sh -e \<Path to ExecuteNetwork> -m \<TfLite model to test>__

The script takes two mandatory parameters. The first, -e, is the directory containing the prebuilt ExecuteNetwork binary. The second, -m, is the path to the TfLite model to be evaluated. For example:

```bash
evaluate_network.sh -e ./build/release/armnn/test -m ./my_tflite_model.tflite
```
## Prerequisites of your built ExecuteNetwork binary

* Built for a Linux target (Android is not yet supported by this script)
* CpuRef must be enabled (-DARMNNREF=1)
* The TfLite delegate must be enabled (-DBUILD_CLASSIC_DELEGATE=1)
* The TfLite parser must be enabled (-DBUILD_TF_LITE_PARSER=1)
* Any backend you want to test against, e.g. -DARMCOMPUTENEON=1 -DARMCOMPUTECL=1 (an example configure line is sketched below).
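
For reference, a configure line that enables all of the above might look something like this. This is a sketch only: the source path is a placeholder and any cross-compilation settings for your aarch64 target are omitted.

```bash
# Sketch only: enable CpuRef, the TfLite delegate and parser, plus the Neon
# and CL backends. Adjust the source path and toolchain for your own build.
cmake ../armnn \
      -DARMNNREF=1 \
      -DBUILD_CLASSIC_DELEGATE=1 \
      -DBUILD_TF_LITE_PARSER=1 \
      -DARMCOMPUTENEON=1 \
      -DARMCOMPUTECL=1
```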

## Prerequisites of the model
* The model must be fully supported by Arm NN.

## What tests are performed?

* Initial validation
  * Checks that the mandatory parameters point to valid locations.
  * Determines what backends are both built into the ExecuteNetwork binary and able to execute on the current platform.
  * Checks that the TfLite delegate is supported by the binary.
  * Checks that the model is fully supported by Arm NN.
* Accuracy: for each available backend it will
  * Execute the model with input tensors set to all zeros.
  * Compare the results against running the model via the TfLite reference implementation.
  * The result is expressed as the RMS error between the output tensors.
* Performance: for each available backend it will do the following (a rough manual equivalent of one of these runs is sketched after this list):
  * Execute an inference 6 times.
  * Print the measured "model load" and "model optimization" times.
  * Print the execution time of the first inference. This is considered the "initial inference" and is generally longer, as some kernel compilation may be required.
  * Average the remaining 5 inference times.
  * For the CpuAcc backend, if available, it will re-run the 6 performance inferences with:
    * "number-of-threads" set to each value between 1 and 12, printing the average inference time for each.
    * "fp16-turbo-mode" enabled, printing the average inference time.
    * "enable-fast-math" enabled, printing the average inference time.
  * For the GpuAcc backend, if available, it will re-run the 6 performance inferences with:
    * "fp16-turbo-mode" enabled, printing the average inference time.
    * "enable-fast-math" enabled, printing the average inference time.
    * "tuning-level/tuning-path" cycling through tuning levels 1 to 3, printing the average inference time for each.

## Worked examples

The following examples were run on an Odroid N2+ (4x Cortex-A73, 2x Cortex-A53, Mali-G52 GPU).

First, using an int8 MobileNet v1 TfLite model:
```
~/$ ./evaluate_network.sh -e . -m ./mobilenet_v1_1.0_224_quant.tflite
Using Execute Network from			: ./ExecuteNetwork
Available backends on this executable		: GpuAcc CpuAcc CpuRef 
Is the delegate supported on this executable?	: Yes
Is the model fully supported by Arm NN?		: Yes
===================================================================================
BACKEND		ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)
GpuAcc		OK 		5.68		3.63			121.14			47.242	
CpuAcc		OK 		9.47		8.37			141.30			45.366	
CpuRef		OK 		6.54		3.21			8570.74			8585.3	

CpuAcc optimizations.
============================
The value of "number-of-threads" parameter by default is decided on by the backend.
Cycle through number-of-threads=1 -> 12 and see if any are faster than the default.

 "--number-of-threads 3" resulted in a faster average inference by 11.348 ms. (34.018 v 45.366)
 "--number-of-threads 4" resulted in a faster average inference by 6.992 ms. (38.374 v 45.366)
 "--number-of-threads 5" resulted in a faster average inference by 2.664 ms. (42.702 v 45.366)
 "--number-of-threads 6" resulted in a faster average inference by 2.060 ms. (43.306 v 45.366)
 "--number-of-threads 7" resulted in a faster average inference by 18.016 ms. (27.35 v 45.366)
 "--number-of-threads 8" resulted in a faster average inference by 18.792 ms. (26.574 v 45.366)
 "--number-of-threads 9" resulted in a faster average inference by 15.294 ms. (30.072 v 45.366)
 "--number-of-threads 10" resulted in a faster average inference by 16.820 ms. (28.546 v 45.366)
 "--number-of-threads 11" resulted in a faster average inference by 16.130 ms. (29.236 v 45.366)
 "--number-of-threads 12" resulted in a faster average inference by 16.134 ms. (29.232 v 45.366)

Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models.
ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)		DELTA(ms)
OK 		28.40		5.68			94.65			41.84				3.526  (41.84 v 45.366)

Now tryng "enable-fast-math".
ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)		DELTA(ms)
OK 		61.05		5.79			92.53			42.036				3.330  (42.036 v 45.366)

GpuAcc optimizations.
============================

Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models.
ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)		DELTA(ms)
OK 		18.86		3.92			78.16			42.738				4.504  (42.738 v 47.242)

Now tryng "enable-fast-math".
ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)		DELTA(ms)
OK 		6.60		3.88			78.06			43.47				3.772  (43.47 v 47.242)

Now tryng "tuning-level/tuning-path".
 "--tuning-level 1" resulted in a faster average inference by -3.652 ms. (43.59 v 47.242)
 "--tuning-level 2" resulted in a faster average inference by -3.718 ms. (43.524 v 47.242)
 "--tuning-level 3" resulted in a faster average inference by -4.624 ms. (42.618 v 47.242)
```
Looking at the results, the fastest execution mechanism for this model is CpuAcc with --number-of-threads 8. Its average inference time (26.574 ms) is almost half that of the default CpuAcc execution (45.366 ms). Unsurprisingly, with an int8 model the GPU parameters did not improve its execution times by much.
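
Based on those numbers, the model could then be run with the best settings found, along these lines. Again a sketch: only --number-of-threads comes from the output above, while the -m and -c option names are assumed from a typical ExecuteNetwork build.

```bash
# Sketch: run the int8 mobilenet model on CpuAcc with the thread count that
# gave the best average inference time above (26.574 ms v 45.366 ms default).
./ExecuteNetwork -m ./mobilenet_v1_1.0_224_quant.tflite \
                 -c CpuAcc \
                 --number-of-threads 8
```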

This next example uses an fp32 ResNet50 v2 TfLite model.
```
~/$ ./evaluate_network.sh -e . -m ./resnet50_v2_batch_fixed_fp32.tflite 
Using Execute Network from			: ./ExecuteNetwork
Available backends on this executable		: GpuAcc CpuAcc CpuRef 
Is the delegate supported on this executable?	: Yes
Is the model fully supported by Arm NN?		: Yes
===================================================================================
BACKEND		ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)
GpuAcc		OK 		144.54		31.19			779.37			220.274	
CpuAcc		OK 		152.36		28.55			1309.72			284.556	
CpuRef		OK 		5.13		8.70			39374.79			39349.9	

CpuAcc optimizations.
============================
The value of "number-of-threads" parameter by default is decided on by the backend.
Cycle through number-of-threads=1 -> 12 and see if any are faster than the default.

 "--number-of-threads 2" resulted in a faster average inference by 7.078 ms. (277.478 v 284.556)
 "--number-of-threads 3" resulted in a faster average inference by 80.326 ms. (204.23 v 284.556)
 "--number-of-threads 4" resulted in a faster average inference by 116.096 ms. (168.46 v 284.556)
 "--number-of-threads 5" resulted in a faster average inference by 64.658 ms. (219.898 v 284.556)
 "--number-of-threads 6" resulted in a faster average inference by 76.662 ms. (207.894 v 284.556)
 "--number-of-threads 7" resulted in a faster average inference by 63.524 ms. (221.032 v 284.556)
 "--number-of-threads 8" resulted in a faster average inference by 108.138 ms. (176.418 v 284.556)
 "--number-of-threads 9" resulted in a faster average inference by 117.110 ms. (167.446 v 284.556)
 "--number-of-threads 10" resulted in a faster average inference by 115.042 ms. (169.514 v 284.556)
 "--number-of-threads 11" resulted in a faster average inference by 100.866 ms. (183.69 v 284.556)
 "--number-of-threads 12" resulted in a faster average inference by 97.302 ms. (187.254 v 284.556)

Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models.
ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)		DELTA(ms)
OK 		184.41		37.74			1486.33			278.828				5.728  (278.828 v 284.556)

Now tryng "enable-fast-math".
ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)		DELTA(ms)
OK 		183.09		44.90			1438.94			279.976				4.580  (279.976 v 284.556)

GpuAcc optimizations.
============================

Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models.
ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)		DELTA(ms)
OK 		5.20		277.20			303.70			184.028				36.246  (184.028 v 220.274)

Now tryng "enable-fast-math".
ACCURACY	MODEL LOAD(ms)	OPTIMIZATION(ms)	INITIAL INFERENCE(ms)	AVERAGE INFERENCE(ms)		DELTA(ms)
OK 		190.88		27.11			775.53			222.564				**No improvement**

Now tryng "tuning-level/tuning-path".
 "--tuning-level 1" did not result in a faster average inference time. (223.06 v 220.274)
 "--tuning-level 2" did not result in a faster average inference time. (222.72 v 220.274)
 "--tuning-level 3" did not result in a faster average inference time. (222.958 v 220.274)
```
Again, for this model CpuAcc with --number-of-threads 9 produced the fastest inference (167.446 ms). However, you can see how adding --fp16-turbo-mode to GpuAcc (184.028 ms, down from 220.274 ms) almost brings it to the same performance level as CpuAcc.
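
If you would rather keep the inference on the GPU, the fp16 result above could be applied in a similar way. Again a sketch: only --fp16-turbo-mode comes from the output above, while the other option names are assumed from a typical ExecuteNetwork build.

```bash
# Sketch: run the fp32 resnet50 model on GpuAcc with fp16-turbo-mode, which
# reduced the average inference time above from 220.274 ms to 184.028 ms.
./ExecuteNetwork -m ./resnet50_v2_batch_fixed_fp32.tflite \
                 -c GpuAcc \
                 --fp16-turbo-mode
```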