//
// This confidential and proprietary software may be used only as
// authorised by a licensing agreement from ARM Limited
// (C) COPYRIGHT 2020 ARM Limited
// ALL RIGHTS RESERVED
// The entire notice above must be reproduced on all authorised
// copies and copies may only be made to the extent permitted
// by a licensing agreement from ARM Limited.

== Introduction

=== Overview

Tensor Operator Set Architecture (TOSA) provides a set of operations that Arm expects to be implemented on its NPUs. Each NPU may implement the operators with a different microarchitecture; however, the result at the TOSA level must be consistent. Applications or frameworks which target TOSA can also be deployed on a wide variety of IP, such as CPUs or GPUs, with defined accuracy and compatibility constraints.

Most operators from the common ML frameworks (TensorFlow, PyTorch, etc.) should be expressible in TOSA. It is expected that there will be tools to lower from the ML frameworks into TOSA. TOSA is focused on inference, leaving training to the original frameworks.

=== Profiles

TOSA supports three profiles that enable efficient implementation on different classes of device. The Base-Inference profile is intended for embedded integer/fixed-point designs performing inference only. The Main-Inference profile is intended for general inference functionality, including integer and floating-point data types. The Main-Training profile adds training operators in addition to inference operators.

This version of the specification covers the Base and Main inference profiles. The Main Training profile is expected in a later version of the specification.
The following table summarizes the three profiles:

.Profiles
|===
|Profile|Name|Integer Inference|Floating-point Inference|Training

|Base Inference|TOSA-BI|Yes|No|No
|Main Inference|TOSA-MI|Yes|Yes|No
|Main Training|TOSA-MT|Yes|Yes|Yes
|===

=== Operator Selection

TOSA defines a set of primitive operators to which higher level operators can be lowered in a consistent way. To remain effective and efficient to implement, the set of operators must be constrained to a reasonably small set of primitive operations out of which others can be constructed. The following principles govern the selection of operators within TOSA.

.Principles
[cols="1,5,5"]
|===
|ID|Principle|Reason for this

|P0
|An operator shall be a primitive operation or building block that cannot be broken down into simpler whole tensor operations.
|If the operator can be broken down, then we should look at the component operators.

|P1
|An operator shall be usable as a component out of which more complex operations can be constructed.
|Single-use operators have a high architectural cost, and a more reusable version should be considered instead.

|P2
|Precision should be appropriate for the input and output data types.
|Precision higher than that needed to calculate the result leads to extra implementation cost.

|P3
|Numerical definition of common sub-operations should be consistent between operators (for example: value scaling).
|Consistent sub-operation definition reduces the operator implementation cost.

|P4
|The valid input and output ranges for all operands shall be specified.
|Ranges are required to make consistent (numerically agreeing) implementations possible.

|P5
|Integer operators shall be implementable in a bit-exact form with good efficiency on CPU, GPU and hardware targets.
|This reduces implementation cost and gives a consistent inference result.
|===

=== Supported Features

==== Data Layouts

The following data layouts are supported in TOSA.
Data layouts are specified such that the rightmost dimension is the fastest changing.

.Data Layouts
[cols="1,4,4"]
|===
|Name|Description of dimensions|Usage

|NHWC|Batch, Height, Width, Channels|Feature maps
|NDHWC|Batch, Depth, Height, Width, Channels|Feature maps for 3D convolution
|OHWI|Output channels, Filter Height, Filter Width, Input channels|Weights
|HWIM|Filter Height, Filter Width, Input channels, Channel Multiplier|Weights for depthwise convolutions
|DOHWI|Depth, Output Channels, Filter Height, Filter Width, Input Channels|Weights for 3D convolution
|===

==== Floating point

The Base Inference profile of TOSA requires support for the quantized integer operations. Floating point support is included in the Main Inference profile.

==== Number formats

The following number formats are defined in TOSA. See section 2.3 for details on quantization within TOSA. The number formats supported by an operator are listed in a per-operator table of supported types.

.Number formats
[cols="1,1,1,6"]
|===
|Format|Minimum|Maximum|Description

|bool
| -
| -
|Boolean value. Size implementation defined.

|aint8
| -128
| +127
|Asymmetric 8-bit quantized values. Operators using this data type will require a zero point value and a scale factor. See <> for details on quantization parameters and their use in operators.

|int4
| -7
| +7
|Signed 4-bit values. These values are symmetrically quantized, with values from -7 to 7 as the range. These are quantized per-channel. No zero point is used; a scale factor is provided as part of the operation.

|int8
| -128
| +127
|Signed 8-bit two's-complement values. These values are quantized. Symmetric per-channel or per-tensor quantization. No zero point is used; a scale factor is provided in the operation.

|uint8
| 0
| 255
|Unsigned 8-bit quantized value with zero point. This data type is only used for input/output conversion by the RESCALE operator and is not supported by other operators.
|int16
| -32768
| +32767
|Signed 16-bit two's-complement values. Symmetric per-tensor quantization. No zero point is used; a scale factor is provided in the operation.

|int32
| -(1<<31)
| (1<<31)-1
|32-bit two's-complement value. No scale factor used.

|float
| -infinity
| +infinity
|Floating point number. Must have features defined in the section <>. (Main inference profile)
|===

Note: In this specification, minimum and maximum will denote the minimum and maximum values of the data as stored in memory (ignoring the zero point). The minimum and maximum values for each type are given in the preceding table.

Note: Integer number formats smaller than 8 bits may be used provided that the numerical result is the same as using a sequence of 8-bit TOSA operations. For example, a convolution with low precision data must equal that of running the convolution at 8 bits and then clipping the result to the permitted output range. This ensures that a Base Inference profile TOSA implementation can calculate the same result.

==== Tensor Metadata

Tensors have an associated tensorinfo that contains information about the tensor including:

* Data Type
* Shape

The following pseudocode represents the operations that will happen to data elements as they are read in to be processed, or have their results written out.

*Functionality of tensor read*

If in_t is 8-bit then out_t=int16_t. Otherwise out_t is set to the same as in_t.

....
out_t tensor_read(in_t *address, dim_t shape, dim_t index, in_t zero_point=0, dim_t pad=NULL) {
    assert(in_t==aint8_t || zero_point==0)
    unsigned offset = 0;
    for (i=0; i<rank(shape); i++) {
        if (index[i]<0) { assert(pad && pad[2*i]+index[i]>=0); return 0; }
        if (index[i]>=shape[i]) { assert(pad && index[i]<shape[i]+pad[2*i+1]); return 0; }
        offset = offset*shape[i] + index[i];
    }
    return address[offset] - zero_point;
}
....

*Functionality of tensor write*

....
tensor_write(<type> *address, dim_t shape, dim_t index, <type> value) {
    unsigned offset = 0;
    for (i=0; i<rank(shape); i++) {
        assert(index[i]>=0 && index[i]<shape[i]);
        offset = offset*shape[i] + index[i];
    }
    address[offset] = value;
}
....

==== Quantization scaling

The apply_scale functions provide a scaling of approximately (multiplier * 2^-shift^).

....
int32_t apply_scale_32(int32_t value, int32_t multiplier, uint6_t shift, bool double_round=false) {
    assert(multiplier >= 0);
    assert(2 <= shift && shift <= 62);
    int64_t round = 1<<(shift-1);
    if (double_round) {
        if (shift>31 && value>=0) round += 1<<30;
        if (shift>31 && value<0)  round -= 1<<30;
    }
    int64_t result = (int64_t)value * multiplier + round;
    result = result >> shift;
    assert(result >= minimum && result <= maximum);
    return (int32_t)result;
}

int32_t apply_scale_16(int48_t value, int16_t multiplier, uint6_t shift) {
    assert(multiplier >= 0);
    assert(2 <= shift && shift <= 62);
    int64_t round = 1<<(shift-1);
    int64_t result = (int64_t)value * multiplier + round;
    result = result >> shift;
    assert(result >= minimum && result <= maximum);
    return (int32_t)result;
}
....

In some functions, the multiplier and shift are combined into a scale_t structure:

....
typedef struct {
    int32_t multiplier;
    uint6_t shift;
} scale_t;
....

In places where a divide is required, we also use the function below to calculate an appropriate scaling value.

....
scale_t reciprocal_scale(uint32_t value) {
    assert(value > 0);
    scale_t scale;
    int k = 32 - count_leading_zeros(value - 1); // (1<<k)/2 < value <= (1<<k)
    int64_t numerator = ((1<<30) + 1) << k;
    scale.multiplier = numerator / value;        // (1<<30) <= multiplier < (1<<31)
    scale.shift = 30 + k;
    return scale;
}
....

The following functions provide basic arithmetic, with asserts that the result stays within the valid range of the accumulator type:

....
acc_t apply_add(acc_t a, acc_t b) {
    if (acc_t == float) return a + b;
    int64_t c = (int64_t)a + (int64_t)b;
    assert(c >= minimum && c <= maximum);
    return (acc_t)c;
}

acc_t apply_sub(acc_t a, acc_t b) {
    if (acc_t == float) return a - b;
    int64_t c = (int64_t)a - (int64_t)b;
    assert(c >= minimum && c <= maximum);
    return (acc_t)c;
}
....

The following functions are used in the pseudocode to take maximum, minimum or clip values to a range.

....
apply_max(a, b) {
    if (a >= b) return a; else return b;
}

apply_min(a, b) {
    if (a < b) return a; else return b;
}

apply_clip(value, min_val, max_val) {
    assert(min_val <= max_val);
    value = apply_max(value, min_val);
    value = apply_min(value, max_val);
    return value;
}
....

==== Quantized Convolutions

For convolution, the input is not required to be scaled before the convolution occurs. The convolution produces an accumulator output of type int32_t or int48_t. This accumulator output is then scaled to the final output range using the RESCALE operator. The scale applied in the RESCALE operator should be set to multiplier and shift values such that: multiplier * 2^-shift^ = (input_scale * weight_scale) / output_scale. Here, input_scale, weight_scale and output_scale are the conversion factors from integer to floating point for the input, weight and output tensor values respectively. If per-channel scaling is needed then the per-channel option of the RESCALE operation should be used.

==== Elementwise operators

When two quantized tensors are used in an operation, they must use the same scaling factor for the result to be valid. If the scaling factor for both tensors is equal, implementations are allowed to optionally skip the scaling process. If the scaling factors are different, then the input with the smaller scaling factor is scaled to match the scaling factor of the input with the larger scaling factor.

For each input, then, the scaled result = (result * scale + round) >> shift. For 8-bit and 16-bit activations, the scale will be calculated during compilation of the network and provided as a 16-bit scale factor and corresponding shift value. The value for round is 1 << (shift - 1). The scaled result should be 32 bits.

Once each input has been scaled, the elementwise operation will occur. Then the result must be scaled into the proper output scaling range. The output scaling range will be supplied as a 16-bit scale factor and a 6-bit shift value (except for the comparison operators).
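As a non-normative illustration, the per-input scaling step, scaled result = (result * scale + round) >> shift, can be written in C as follows. The function name and the test values are ours, not part of the specification; the pseudocode in this document remains the reference.

```c
#include <assert.h>
#include <stdint.h>

// Sketch of the elementwise input-scaling step: multiplies value by
// approximately (scale * 2^-shift), rounding to nearest.
// scale is the 16-bit factor computed during network compilation.
int32_t scale_elementwise_input(int32_t value, int16_t scale, int shift) {
    int64_t round = (int64_t)1 << (shift - 1);   // round-to-nearest offset
    return (int32_t)(((int64_t)value * scale + round) >> shift);
}
```

For example, with scale = 1 << 14 and shift = 15 this scales an input by one half, as might be needed to match the other input's larger scaling factor before an ADD.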
This applies to the following operations: ADD, MAX, MIN, SUB, EQUAL, GREATER, GREATER_EQUAL.

MUL is a special case, where the inputs do not need to be scaled; all the scaling can be done during the output scaling process.

==== General unary functions

General unary functions such as sigmoid(), tanh() and exp() are expressed using a lookup table and interpolation. This enables efficient implementation and extension to other operations with the addition of user supplied tables (the TABLE operation). All table lookups are based on the following reference lookup function, which takes as input a table of 513 entries of 16 bits each.

....
int32_t apply_lookup(int16_t *table, int value) {
    value = apply_clip(value, -32768, +32767)
    index = (value + 32768) >> 7
    fraction = value & 0x7f
    base = table[index]
    next = table[index+1]
    value = (base << 7) + (next - base) * fraction
    return value; // return interpolated value of 16+7=23 bits
}
....

Note that although the table lookup defined here has 16-bit precision, for 8-bit only operations an 8-bit table can be derived by applying the reference function to each of the possible 256 input values.

The following code constructs a 513-entry table based on a reference function.

....
void generate_lookup_table(int16_t *table, int (*reference)(int)) {
    for (int i = -256; i <= 256; i++) {
        value = (*reference)(i);
        table[i + 256] = clip(value, -32768, +32767)
    }
}
....

=== Floating Point

TOSA does not define bit-exact behaviour of the floating point type, since floating point operation results can vary according to operation order (floating point addition is not associative in general) and rounding behaviour. If a bit-exact answer is required then integer operations should be used.

TOSA does define that the floating point type must support the following list of features. These features ensure that detection of overflow and other exceptional conditions can be handled consistently.
* The floating point type must have at least 16 total bits including the sign bit
* The floating point type must support positive and negative infinity values
* The floating point type must support at least one Not-a-Number (NaN) encoding
* The floating point type must support signed zero
* The floating point type must support handling of infinities, NaNs and zeros as in the following table

.Floating point behaviour
|===
|Case|Result

|Any input operand is a NaN | a NaN
|(± 0) × (± infinity), (± infinity) × (± 0) | a NaN
|(± 0) / (± 0), (± infinity) / (± infinity) | a NaN
|(+infinity) - (+infinity), (+infinity) + (-infinity) | a NaN
|Any positive overflow | + infinity
|Any negative overflow | - infinity
|Any positive underflow | + 0
|Any negative underflow | - 0
|===
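The table above matches standard IEEE-754 special-case behaviour. As a quick non-normative check, C float arithmetic (IEEE-754 binary32 on typical targets) reproduces these cases; the helper names below are ours:

```c
#include <assert.h>
#include <float.h>
#include <math.h>

// Helpers routed through a volatile so the checks exercise actual float
// arithmetic rather than compile-time constant folding.
static float f_add(float a, float b) { volatile float r = a + b; return r; }
static float f_sub(float a, float b) { volatile float r = a - b; return r; }
static float f_mul(float a, float b) { volatile float r = a * b; return r; }
```

Checks such as isnan(f_mul(0.0f, INFINITY)), isnan(f_sub(INFINITY, INFINITY)) and isinf(f_mul(FLT_MAX, 2.0f)) then correspond directly to the NaN and overflow rows of the table.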
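Returning to the General unary functions section, the 513-entry interpolated table lookup can be sketched in C as below. This is a non-normative illustration: the function names are ours, and ref_identity is a made-up reference function used only to exercise the code (a real table would sample sigmoid, tanh, etc.).

```c
#include <stdint.h>

// Clip helper mirroring apply_clip from the pseudocode
static int32_t clip32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

// Interpolated lookup over a 513-entry, 16-bit table (23-bit result).
int32_t lookup_interp(const int16_t *table, int32_t value) {
    value = clip32(value, -32768, 32767);
    int32_t index = (value + 32768) >> 7;  // table segment, 0..511
    int32_t fraction = value & 0x7f;       // position within the segment
    int32_t base = table[index];
    int32_t next = table[index + 1];
    // base * 128 is the pseudocode's (base << 7); multiplication avoids
    // left-shifting negative values, which C leaves undefined.
    return base * 128 + (next - base) * fraction;
}

// Build the table by sampling a reference function at 513 points
void build_table(int16_t *table, int32_t (*reference)(int32_t)) {
    for (int32_t i = -256; i <= 256; i++) {
        table[i + 256] = (int16_t)clip32(reference(i), -32768, 32767);
    }
}

// Example reference function (ours): identity scaled by 128
static int32_t ref_identity(int32_t i) { return i * 128; }
```

Because ref_identity is linear, the interpolation is exact and lookup_interp returns the input value shifted up by 7 bits.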