authorColm Donelan <colm.donelan@arm.com>2023-09-07 10:36:17 +0100
committerColm Donelan <colm.donelan@arm.com>2023-11-08 14:19:12 +0000
commit6f68ad2de312c79bddad2f07f23084f2fec06bda (patch)
tree8c2d1c18183a0518bca559f6a8749d198f8353b2
parentbba0d1385e9a9df8f43643f767ec4350dca745ca (diff)
downloadarmnn-6f68ad2de312c79bddad2f07f23084f2fec06bda.tar.gz
IVGCVSW-7881 Add a script that evaluates the performance of a network.
This script will attempt to execute a tflite model through all available backends reporting performance and accuracy checks.

Signed-off-by: Colm Donelan <colm.donelan@arm.com>
Change-Id: Id60626eddc48b48c497a7f52f3fbd10aa036d997
-rw-r--r--    tests/ExecuteNetwork/evaluate_network.md    162
-rwxr-xr-x    tests/ExecuteNetwork/evaluate_network.sh    358
2 files changed, 520 insertions, 0 deletions
diff --git a/tests/ExecuteNetwork/evaluate_network.md b/tests/ExecuteNetwork/evaluate_network.md
new file mode 100644
index 0000000000..b979183a41
--- /dev/null
+++ b/tests/ExecuteNetwork/evaluate_network.md
@@ -0,0 +1,162 @@
+# Evaluate TensorFlow Lite network script.
+
+This script runs a TfLite model through ExecuteNetwork, evaluating its performance and accuracy on every available backend and with a selection of the available performance options. It is designed to be used on an aarch64 Linux target.
+
+## Usage
+__evaluate_network.sh -e \<Path to ExecuteNetwork> -m \<Tflite model to test>__
+
+The script takes two mandatory parameters. The first, -e, is the directory containing the prebuilt ExecuteNetwork binary. The second, -m, is the path to the TfLite model to be evaluated. For example:
+
+```bash
+evaluate_network.sh -e ./build/release/armnn/test -m ./my_tflite_model.tflite
+```
+## Prerequisites of your built ExecuteNetwork binary
+
+* Built for a Linux target (Android is not yet supported by this script)
+* CpuRef must be enabled (-DARMNNREF=1)
+* The TfLite delegate must be enabled (-DBUILD_CLASSIC_DELEGATE=1)
+* The TfLite parser must be enabled (-DBUILD_TF_LITE_PARSER=1)
+* Any backend you want to test against, e.g. -DARMCOMPUTENEON=1 -DARMCOMPUTECL=1 (an example configure command is shown below).
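+
+For illustration, a CMake configure step covering these options might look like the following. This is a sketch rather than a complete configure line: the source and build directory paths are placeholders, and a real Arm NN build needs further options (for example the paths to the Arm Compute Library and TensorFlow sources) described in the Arm NN build instructions.
+
+```bash
+# Sketch of a configure command showing only the options this script requires.
+# Paths are placeholders; additional options will be needed for a real build.
+cmake -S ./armnn -B ./build/release \
+      -DARMNNREF=1 \
+      -DBUILD_CLASSIC_DELEGATE=1 \
+      -DBUILD_TF_LITE_PARSER=1 \
+      -DARMCOMPUTENEON=1 \
+      -DARMCOMPUTECL=1
+```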
+
+## Prerequisites of the model
+* The model must be fully supported by Arm NN (a manual check is shown below).
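+
+The script verifies this automatically by running the model once through the CpuRef backend. A roughly equivalent manual check, using the same flags as the script (the model path is a placeholder), is:
+
+```bash
+# If any layer is unsupported, ExecuteNetwork reports it as
+# "is not supported on requested backend CpuRef".
+./ExecuteNetwork -m ./my_tflite_model.tflite -c CpuRef -N
+```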
+
+## What tests are performed?
+
+* Initial validation
+  * Checks that the mandatory parameters point to valid locations.
+  * Determines which backends are both built into the ExecuteNetwork binary and able to execute on the current platform.
+  * Checks that the TfLite delegate is supported by the binary.
+  * Checks that the model is fully supported by Arm NN.
+* Accuracy: for each available backend it will
+  * Execute the model with input tensors set to all zeros.
+  * Compare the results against running the model via the TfLite reference implementation.
+  * Express the result as an RMS error between the resulting output tensors.
+* Performance: for each available backend it will (example invocations are shown after this list)
+  * Execute an inference 6 times.
+  * Print the measured "model load" and "model optimization" times.
+  * Print the execution time of the first inference. This is considered the "initial inference". Generally, it is longer because some kernel compilation may be required.
+  * Average the remaining 5 inference times.
+  * For the CpuAcc backend, if available, re-run the 6 performance inferences with:
+    * "number-of-threads" set to each value from 1 to 12, printing the average inference time for each.
+    * "fp16-turbo-mode" enabled, printing the average inference time.
+    * "enable-fast-math" enabled, printing the average inference time.
+  * For the GpuAcc backend, if available, re-run the 6 performance inferences with:
+    * "fp16-turbo-mode" enabled, printing the average inference time.
+    * "enable-fast-math" enabled, printing the average inference time.
+    * "tuning-level/tuning-path" cycling through tuning levels 1 to 3, printing the average inference time for each.
+
+## Worked examples
+
+The following examples were run on an Odroid N2+ (4x Cortex-A73, 2x Cortex-A53, Mali-G52 GPU).
+
+First using an int8 mobilenet v1 TfLite model:
+```
+~/$ ./evaluate_network.sh -e . -m ./mobilenet_v1_1.0_224_quant.tflite
+Using Execute Network from : ./ExecuteNetwork
+Available backends on this executable : GpuAcc CpuAcc CpuRef
+Is the delegate supported on this executable? : Yes
+Is the model fully supported by Arm NN? : Yes
+===================================================================================
+BACKEND ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms)
+GpuAcc OK 5.68 3.63 121.14 47.242
+CpuAcc OK 9.47 8.37 141.30 45.366
+CpuRef OK 6.54 3.21 8570.74 8585.3
+
+CpuAcc optimizations.
+============================
+By default the value of the "number-of-threads" parameter is decided by the backend.
+Cycle through number-of-threads=1 -> 12 and see if any are faster than the default.
+
+ "--number-of-threads 3" resulted in a faster average inference by 11.348 ms. (34.018 v 45.366)
+ "--number-of-threads 4" resulted in a faster average inference by 6.992 ms. (38.374 v 45.366)
+ "--number-of-threads 5" resulted in a faster average inference by 2.664 ms. (42.702 v 45.366)
+ "--number-of-threads 6" resulted in a faster average inference by 2.060 ms. (43.306 v 45.366)
+ "--number-of-threads 7" resulted in a faster average inference by 18.016 ms. (27.35 v 45.366)
+ "--number-of-threads 8" resulted in a faster average inference by 18.792 ms. (26.574 v 45.366)
+ "--number-of-threads 9" resulted in a faster average inference by 15.294 ms. (30.072 v 45.366)
+ "--number-of-threads 10" resulted in a faster average inference by 16.820 ms. (28.546 v 45.366)
+ "--number-of-threads 11" resulted in a faster average inference by 16.130 ms. (29.236 v 45.366)
+ "--number-of-threads 12" resulted in a faster average inference by 16.134 ms. (29.232 v 45.366)
+
+Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models.
+ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms) DELTA(ms)
+OK 28.40 5.68 94.65 41.84 3.526 (41.84 v 45.366)
+
+Now tryng "enable-fast-math".
+ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms) DELTA(ms)
+OK 61.05 5.79 92.53 42.036 3.330 (42.036 v 45.366)
+
+GpuAcc optimizations.
+============================
+
+Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models.
+ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms) DELTA(ms)
+OK 18.86 3.92 78.16 42.738 4.504 (42.738 v 47.242)
+
+Now tryng "enable-fast-math".
+ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms) DELTA(ms)
+OK 6.60 3.88 78.06 43.47 3.772 (43.47 v 47.242)
+
+Now tryng "tuning-level/tuning-path".
+ "--tuning-level 1" resulted in a faster average inference by -3.652 ms. (43.59 v 47.242)
+ "--tuning-level 2" resulted in a faster average inference by -3.718 ms. (43.524 v 47.242)
+ "--tuning-level 3" resulted in a faster average inference by -4.624 ms. (42.618 v 47.242)
+```
+Looking at the results, the fastest execution mechanism for this model is CpuAcc with --number-of-threads 8. With that setting the average inference is almost twice as fast as the default CpuAcc execution. Unsurprisingly, with an int8 model the GPU parameters didn't improve its execution times by much.
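+
+Having identified the best configuration from the sweep, the model can then be run directly with those options through ExecuteNetwork, for example (flags as used by the script):
+
+```bash
+# Re-run the fastest configuration found by evaluate_network.sh for this model.
+./ExecuteNetwork -m ./mobilenet_v1_1.0_224_quant.tflite -c CpuAcc --number-of-threads 8
+```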
+
+This next example uses an fp32 resnet50 v2 TfLite model.
+```
+~/$ ./evaluate_network.sh -e . -m ./resnet50_v2_batch_fixed_fp32.tflite
+Using Execute Network from : ./ExecuteNetwork
+Available backends on this executable : GpuAcc CpuAcc CpuRef
+Is the delegate supported on this executable? : Yes
+Is the model fully supported by Arm NN? : Yes
+===================================================================================
+BACKEND ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms)
+GpuAcc OK 144.54 31.19 779.37 220.274
+CpuAcc OK 152.36 28.55 1309.72 284.556
+CpuRef OK 5.13 8.70 39374.79 39349.9
+
+CpuAcc optimizations.
+============================
+By default the value of the "number-of-threads" parameter is decided by the backend.
+Cycle through number-of-threads=1 -> 12 and see if any are faster than the default.
+
+ "--number-of-threads 2" resulted in a faster average inference by 7.078 ms. (277.478 v 284.556)
+ "--number-of-threads 3" resulted in a faster average inference by 80.326 ms. (204.23 v 284.556)
+ "--number-of-threads 4" resulted in a faster average inference by 116.096 ms. (168.46 v 284.556)
+ "--number-of-threads 5" resulted in a faster average inference by 64.658 ms. (219.898 v 284.556)
+ "--number-of-threads 6" resulted in a faster average inference by 76.662 ms. (207.894 v 284.556)
+ "--number-of-threads 7" resulted in a faster average inference by 63.524 ms. (221.032 v 284.556)
+ "--number-of-threads 8" resulted in a faster average inference by 108.138 ms. (176.418 v 284.556)
+ "--number-of-threads 9" resulted in a faster average inference by 117.110 ms. (167.446 v 284.556)
+ "--number-of-threads 10" resulted in a faster average inference by 115.042 ms. (169.514 v 284.556)
+ "--number-of-threads 11" resulted in a faster average inference by 100.866 ms. (183.69 v 284.556)
+ "--number-of-threads 12" resulted in a faster average inference by 97.302 ms. (187.254 v 284.556)
+
+Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models.
+ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms) DELTA(ms)
+OK 184.41 37.74 1486.33 278.828 5.728 (278.828 v 284.556)
+
+Now tryng "enable-fast-math".
+ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms) DELTA(ms)
+OK 183.09 44.90 1438.94 279.976 4.580 (279.976 v 284.556)
+
+GpuAcc optimizations.
+============================
+
+Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models.
+ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms) DELTA(ms)
+OK 5.20 277.20 303.70 184.028 36.246 (184.028 v 220.274)
+
+Now tryng "enable-fast-math".
+ACCURACY MODEL LOAD(ms) OPTIMIZATION(ms) INITIAL INFERENCE(ms) AVERAGE INFERENCE(ms) DELTA(ms)
+OK 190.88 27.11 775.53 222.564 **No improvment**
+
+Now tryng "tuning-level/tuning-path".
+ "--tuning-level 1" did not result in a faster average inference time. (223.06 v 220.274)
+ "--tuning-level 2" did not result in a faster average inference time. (222.72 v 220.274)
+ "--tuning-level 3" did not result in a faster average inference time. (222.958 v 220.274)
+```
+Again, for this model CpuAcc with --number-of-threads 9 produced the fastest inference. However, you can see how adding --fp16-turbo-mode to GpuAcc brings it almost to the same performance level as CpuAcc.
\ No newline at end of file
diff --git a/tests/ExecuteNetwork/evaluate_network.sh b/tests/ExecuteNetwork/evaluate_network.sh
new file mode 100755
index 0000000000..931167dda8
--- /dev/null
+++ b/tests/ExecuteNetwork/evaluate_network.sh
@@ -0,0 +1,358 @@
+#!/bin/bash
+#set -x
+#
+# Copyright © 2023 Arm Ltd and Contributors. All rights reserved.
+# SPDX-License-Identifier: MIT
+#
+# This script will run a TfLite model through ExecuteNetwork trying all available backends to measure
+# both speed and accuracy. In addition, it will try some of the performance options that are available.
+#
+# Prerequisites: ExecuteNetwork must be built with:
+# * CpuRef enabled (-DARMNNREF=1)
+# * TfLite delegate enabled (-DBUILD_CLASSIC_DELEGATE=1)
+# * TfLite parser enabled (-DBUILD_TF_LITE_PARSER=1)
+# * Any backend you want to test against. E.g. -DARMCOMPUTENEON=1 -DARMCOMPUTECL=1
+# * The model must be fully supported by Arm NN.
+#
+# Usage:
+# evaluate_network.sh -e <Path to ExecuteNetwork> -m <Tflite model to test>
+#
+# Sample usage:
+# evaluate_network.sh -e ./build/release/armnn/test -m ./my_tflite_model.tflite
+#
+
+CMD=$( basename "$0" )
+
+usage() {
+ echo "Usage: $CMD -e <Path to ExecuteNetwork> -m <Test model>"
+ echo "Options: -e <Path to ExecuteNetwork>"
+ echo " -m <Test model>"
+ exit 1
+}
+
+# Errors if the previous command had a non-zero exit code.
+function AssertZeroExitCode {
+ EXITCODE=$?
+ if [ $EXITCODE -ne 0 ]; then
+ echo -e "Previous command exited with code $EXITCODE"
+ exit 1
+ fi
+}
+
+OPTION_COUNTER=0
+while getopts "e:m:" opt; do
+ ((OPTION_COUNTER+=1))
+ case "$opt" in
+ h|\?) usage;;
+ e) EXECUTE_NETWORK_PATH="$OPTARG";;
+ m) MODEL="$OPTARG";;
+ esac
+done
+shift $((OPTIND - 1))
+
+# Both parameters are mandatory.
+if [ -z "$EXECUTE_NETWORK_PATH" ] || [ -z "$MODEL" ]; then
+ usage
+ exit 1
+fi
+
+# Check the path to execute network will find the executable.
+if [ -x "$EXECUTE_NETWORK_PATH/ExecuteNetwork" ]; then
+ echo -e "Using Execute Network from\t\t\t: $EXECUTE_NETWORK_PATH/ExecuteNetwork"
+ EXECUTE_NETWORK="$EXECUTE_NETWORK_PATH/ExecuteNetwork"
+else
+ echo "Execute Network does not exist at \"$EXECUTE_NETWORK_PATH/ExecuteNetwork\""
+ usage
+ exit 1
+fi
+
+# Check that the model exists and has a supported extension.
+if [ -f $MODEL ]; then
+ if [[ ! $MODEL =~ (tflite)$ ]]; then
+ echo "Only .tflite files are supported."
+ exit 1
+ fi
+else
+ echo Model file: "\"$MODEL\" could not be found."
+ usage
+ exit 1
+fi
+
+# Find out the available backends. Unfortunately the list of backends spans multiple lines.
+# This means we have to do this in several steps.
+echo -n -e "Available backends on this executable\t\t:"
+HELP_OUTPUT=`$EXECUTE_NETWORK --help`
+BACKENDS=`echo $HELP_OUTPUT | sed 's/.*: \[//' | sed 's/\].*//' | sed 's/,//g'`
+# Remove the leading space to make it look prettier.
+BACKENDS="${BACKENDS:1}"
+if [ -z "$BACKENDS" ]; then
+ echo ""
+ echo "Execute Network reported no available backends!"
+ exit 1
+else
+ echo " $BACKENDS"
+ # We really need the CpuRef to be in there.
+ if [[ ! $BACKENDS =~ "CpuRef" ]]; then
+ echo ""
+ echo "Fatal: Please recompile ExecuteNetwork to include the CpuRef backend. (-DARMNNREF=1)"
+ exit 1
+ fi
+fi
+
+
+# This is where the real work starts.
+# Model execution can take a long time. Trap ctrl-c and tell the user.
+trap ctrl_c INT
+
+function ctrl_c() {
+ echo -e "Interrupted.\nNo patience eh? Try a smaller model."
+ exit 1
+}
+
+
+# We need to check that the delegate is supported, otherwise we can't run the model through the TfLite runtime.
+echo -n -e "Is the delegate supported on this executable?\t:"
+TFLITE_EXECUTION=`$EXECUTE_NETWORK -m $MODEL -T tflite -c CpuRef -N`
+# Check for an error message about building with the delegate.
+if [[ $TFLITE_EXECUTION =~ "Tensorflow-Lite delegate support" ]]; then
+ echo ""
+ echo "Fatal: Please recompile ExecuteNetwork with TfLite delegate support enabled. (-DBUILD_CLASSIC_DELEGATE=1)"
+ exit 1
+else
+ echo " Yes"
+fi
+
+# Run through CpuRef to see if Arm NN supports the model.
+echo -n -e "Is the model fully supported by Arm NN?\t\t:"
+REF_EXECUTION=`$EXECUTE_NETWORK -m $MODEL -c CpuRef -N`
+# If it failed look for the most common reason - an unsupported layer.
+if [ $? -ne 0 ]; then
+ if [[ $REF_EXECUTION =~ "is not supported on requested backend CpuRef" ]]; then
+ echo -e " No - One or more layers are not supported by Arm NN"
+ else
+ echo -e " No - Execution using CpuRef backend failed."
+ fi
+ echo -e "The Reported problems were\t:"
+ echo `echo "$REF_EXECUTION" | sed '/Warning\|ERROR\|Fatal/!d'`
+ echo "To recreate this error try: \"$EXECUTE_NETWORK -m $MODEL -c CpuRef\" "
+ exit 1
+fi
+echo " Yes"
+
+# This function will execute the model and return a string representation of the results. This is the
+# first time the model will be executed. If one or more layers are unsupported it retries with
+# -c $BACKEND,CpuRef so the odd layer can fall back to the unaccelerated CpuRef backend.
+#
+# Parameters:
+# $1 Backend string like CpuRef.
+# $2 Additional ExecuteNetwork parameters.
+#
+function RunAccuracyOnBackendWithParameters {
+ BACKEND=$1
+ ADDITIONAL_PARAM=$2
+ # Reset the fallback marker so a previous backend's result doesn't leak into this one.
+ REQUIRES_CPUREF=""
+ # Run on BACKEND to check accuracy against TfLite runtime first. This will be a warning not a failure.
+ ACCURACY_RUN=`$EXECUTE_NETWORK -m $MODEL -c $BACKEND $ADDITIONAL_PARAM -A -N`
+ # Start by checking the return code.
+ if [ $? -ne 0 ]; then
+ # Maybe this backend isn't supported.
+ if [[ $ACCURACY_RUN =~ "None of the preferred backends [$BACKEND ] are supported" ]]; then
+ echo -e "\t\t***Is not supported***"
+ return 1
+ elif [[ $ACCURACY_RUN =~ "is not supported on requested backend" ]]; then
+ # One or more layers require a fall back. Run again with CpuRef fall back.
+ ACCURACY_RUN=`$EXECUTE_NETWORK -m $MODEL -c $BACKEND,CpuRef $ADDITIONAL_PARAM -A -N`
+ REQUIRES_CPUREF="*"
+ else
+ # In the case of a general failure against this backend tell the user what we tried and then
+ # ignore this backend.
+ echo -e "\t***Execution failed. Ignoring this backend. Command was: \"$EXECUTE_NETWORK -m $MODEL -c $BACKEND -A -N\""
+ return 1
+ fi
+ fi
+ # Now check the RMS value. If it isn't 0 then mark this as questionable accuracy.
+ ACCURACY_VALUE=`echo "$ACCURACY_RUN" | grep 'Byte level'`
+ if [[ ! $ACCURACY_VALUE == *0 ]]; then
+ ACCURACY=!`echo $ACCURACY_VALUE | sed 's/[a-zA-Z:]*//g'`
+ else
+ ACCURACY="OK"
+ fi
+ # Add on the * if we needed to add CpuRef.
+ if [ -n "$REQUIRES_CPUREF" ]; then
+ echo -e "$ACCURACY $REQUIRES_CPUREF\t\t"
+ else
+ echo -e "$ACCURACY\t\t"
+ fi
+}
+
+# This function will execute the model and return a string representation of the results. The execution
+# is done with -c $BACKEND,CpuRef to allow the odd layer not supported by the accelerated backend to fall back to CpuRef.
+#
+# Parameters:
+# $1 Backend string like CpuRef.
+# $2 Additional ExecuteNetwork parameters.
+#
+function RunPerformanceOnBackendWithParameters {
+ BACKEND=$1
+ ADDITIONAL_PARAM=$2
+ # Execute with 6 inferences. Mark the first as initial inference. Average the rest.
+ SPEED_RUN=`$EXECUTE_NETWORK -m $MODEL -c $BACKEND,CpuRef -I 6 -N $ADDITIONAL_PARAM`
+
+ # Extract the model load time
+ MODEL_LOAD_TIME=`echo "$SPEED_RUN" | grep "Initialization time" | sed 's/[a-zA-Z:]*//g'`
+ MODEL_LOAD_TIME=`echo ${MODEL_LOAD_TIME::-2}` # Remove the trailing space and full stop.
+ # and the optimization time.
+ OPTIMIZATION_TIME=`echo "$SPEED_RUN" | grep "Optimization time" | sed 's/[a-zA-Z:]*//g'`
+ OPTIMIZATION_TIME=`echo ${OPTIMIZATION_TIME::-1}` # Remove the trailing space.
+
+ # All 6 inference times.
+ RAW_INFERENCE=`echo "$SPEED_RUN" | grep "Inference time"`
+ # This will take "Info: Inference time: 0.03 ms Info:..." and transform to "0.03 0.01 0.01"
+ INFERENCE_TIMES=`echo $RAW_INFERENCE | sed 's/[a-zA-Z:]*//g'`
+ INITIAL_INFERENCE_TIME=`echo $INFERENCE_TIMES | cut -d ' ' -f 1`
+ # Now remove the initial inference time as it will skew the average.
+ INFERENCE_TIMES=`echo $INFERENCE_TIMES | sed 's/[^ ]* //'`
+ # Use awk to sum and average the remaining 5 numbers.
+ AVERAGE_INFERENCE_TIME=`echo $INFERENCE_TIMES | awk '{s+=$1}END{print s/NR}' RS=" "`
+
+ # Result format is: MODEL LOAD | OPTIMIZATION | INITIAL INFERENCE | AVERAGE INFERENCE
+ echo -e "$MODEL_LOAD_TIME\t\t$OPTIMIZATION_TIME\t\t\t$INITIAL_INFERENCE_TIME\t\t\t$AVERAGE_INFERENCE_TIME\t"
+}
+
+
+# Check execution in all available backends.
+echo "==================================================================================="
+echo -e "BACKEND\t\tACCURACY\tMODEL LOAD(ms)\tOPTIMIZATION(ms)\tINITIAL INFERENCE(ms)\tAVERAGE INFERENCE(ms)"
+for backend in $BACKENDS
+do
+ echo -n -e "$backend\t\t"
+ RESULT=$(RunAccuracyOnBackendWithParameters $backend)
+ echo -n -e "$RESULT"
+ if [[ $RESULT =~ "*" ]]; then
+ REQUIRED_CPU_REF=1
+ fi
+ # It's possible the backend wasn't supported.
+ if [[ ! "$RESULT" =~ "not supported" ]]; then
+ # It was, continue.
+ RESULT=$(RunPerformanceOnBackendWithParameters $backend)
+ echo -n -e "$RESULT"
+ # Save some specific values for use later.
+ if [ $backend == "CpuAcc" ]; then
+ # In the case of CpuAcc we save the average inference time.
+ CPUACC_AVERAGE_INFERENCE_TIME=`echo $RESULT | cut -d ' ' -f 4`
+ fi
+ if [ $backend == "GpuAcc" ]; then
+ # In the case of GpuAcc we save the average inference time.
+ GPUACC_AVERAGE_INFERENCE_TIME=`echo $RESULT | cut -d ' ' -f 4`
+ fi
+ else
+ # Remove this backend from future tests.
+ BACKENDS=`echo $BACKENDS | sed "s/$backend//"`
+ fi
+ echo
+done
+# Only print this if it was required.
+if [ ! -z $REQUIRED_CPU_REF ]; then
+ echo "* denotes this backend required fallback to CpuRef."
+ echo
+fi
+
+# Now it's time to look at backend-specific parameters.
+
+# This function first runs the accuracy test and then the performance test. It uses the average inference
+# time measured earlier as the baseline to compare against.
+function RunAccuracyAndPerformanceWithExtraParameter
+{
+ BACKEND=$1
+ EXTRA_PARAM=$2
+ AVERAGE_INFERENCE_TIME=$3
+ echo -e "ACCURACY\tMODEL LOAD(ms)\tOPTIMIZATION(ms)\tINITIAL INFERENCE(ms)\tAVERAGE INFERENCE(ms)\t\tDELTA(ms)"
+ RESULT=$(RunAccuracyOnBackendWithParameters $BACKEND,CpuRef $EXTRA_PARAM)
+ echo -n "$RESULT"
+ RESULT=$(RunPerformanceOnBackendWithParameters $BACKEND,CpuRef $EXTRA_PARAM)
+ PARAM_AVERAGE_INFERENCE_TIME=`echo $RESULT | cut -d ' ' -f 4`
+ # If adding the parameter was faster then include by how much.
+ if (( $(echo "$PARAM_AVERAGE_INFERENCE_TIME < $AVERAGE_INFERENCE_TIME" | bc -l) )); then
+ DELTA=`echo $AVERAGE_INFERENCE_TIME - $PARAM_AVERAGE_INFERENCE_TIME | bc`
+ echo -e "$RESULT\t\t\t$DELTA ($PARAM_AVERAGE_INFERENCE_TIME v $AVERAGE_INFERENCE_TIME)"
+ else
+ echo -e "$RESULT\t\t\t**No improvment**"
+ fi
+}
+
+
+# Start with CpuAcc. Three knobs to twiddle, threads, fast-math and fp16.
+if [[ $BACKENDS =~ "CpuAcc" ]]; then
+ echo
+ echo "CpuAcc optimizations."
+ echo "============================"
+ echo "The value of \"number-of-threads\" parameter by default is decided on by the backend."
+ echo "Cycle through number-of-threads=1 -> 12 and see if any are faster than the default."
+ echo
+ for i in {1..12}
+ do
+ RESULT=$(RunPerformanceOnBackendWithParameters "CpuAcc,CpuRef" "--number-of-threads $i")
+ AVERAGE_INFERENCE_TIME=`echo $RESULT | cut -d ' ' -f 4`
+ # Print something out if the returned average is less than the previously saved average.
+ if (( $(echo "$AVERAGE_INFERENCE_TIME < $CPUACC_AVERAGE_INFERENCE_TIME" | bc -l) )); then
+ DELTA=`echo $CPUACC_AVERAGE_INFERENCE_TIME - $AVERAGE_INFERENCE_TIME | bc`
+ echo " \"--number-of-threads $i\" resulted in a faster average inference by $DELTA ms. ($AVERAGE_INFERENCE_TIME v $CPUACC_AVERAGE_INFERENCE_TIME)"
+ FASTER=1
+ fi
+ done
+ if [ -z $FASTER ]; then
+ echo "No value of \"number-of-threads\" was faster than the default."
+ fi
+ # Next is fp16-turbo-mode. We do both accuracy and speed on this one.
+ echo
+ echo -n "Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models."
+ echo
+ RunAccuracyAndPerformanceWithExtraParameter CpuAcc "--fp16-turbo-mode" $CPUACC_AVERAGE_INFERENCE_TIME
+
+ # Next is enable-fast-math. Again both accuracy and speed on this one.
+ echo
+ echo -n "Now trying \"enable-fast-math\"."
+ echo
+ RunAccuracyAndPerformanceWithExtraParameter CpuAcc "--enable-fast-math" $CPUACC_AVERAGE_INFERENCE_TIME
+fi
+
+# GpuAcc.
+# Options to check enable-fast-math, fp16-turbo-mode, and tuning-level/tuning-path.
+if [[ $BACKENDS =~ "GpuAcc" ]]; then
+ echo
+ echo "GpuAcc optimizations."
+ echo "============================"
+
+ # fp16-turbo-mode. We do both accuracy and speed on this one.
+ echo
+ echo -n "Now trying to enable fp16-turbo-mode. This will only have positive results with fp32 models."
+ echo
+ RunAccuracyAndPerformanceWithExtraParameter GpuAcc "--fp16-turbo-mode" $GPUACC_AVERAGE_INFERENCE_TIME
+
+ # Next is enable-fast-math. Again both accuracy and speed on this one.
+ echo
+ echo -n "Now trying \"enable-fast-math\"."
+ echo
+ RunAccuracyAndPerformanceWithExtraParameter GpuAcc "--enable-fast-math" $GPUACC_AVERAGE_INFERENCE_TIME
+
+ # Next is tuning levels. Just speed on this one.
+ echo
+ echo -n "Now trying \"tuning-level/tuning-path\"."
+ echo
+ for i in {1..3}
+ do
+ touch ./tuned-network.bin
+ # Create tuned network file with the first run.
+ OUTPUT=`$EXECUTE_NETWORK -m $MODEL -c GpuAcc,CpuRef --tuning-path ./tuned-network.bin --tuning-level $i -N`
+ AssertZeroExitCode
+ # Now run the performance test reusing that saved network.
+ RESULT=$(RunPerformanceOnBackendWithParameters "GpuAcc,CpuRef" "--tuning-path ./tuned-network.bin")
+ AVERAGE_INFERENCE_TIME=`echo $RESULT | cut -d ' ' -f 4`
+ if (( $(echo "$AVERAGE_INFERENCE_TIME < $GPUACC_AVERAGE_INFERENCE_TIME" | bc -l) )); then
+ DELTA=`echo $GPUACC_AVERAGE_INFERENCE_TIME - $AVERAGE_INFERENCE_TIME | bc`
+ echo " \"--tuning-level $i\" resulted in a faster average inference by $DELTA ms. ($AVERAGE_INFERENCE_TIME v $GPUACC_AVERAGE_INFERENCE_TIME)"
+ else
+ echo " \"--tuning-level $i\" did not result in a faster average inference time. ($AVERAGE_INFERENCE_TIME v $GPUACC_AVERAGE_INFERENCE_TIME)"
+ fi
+ rm ./tuned-network.bin
+ done
+fi