//
// This confidential and proprietary software may be used only as
// authorised by a licensing agreement from ARM Limited
// (C) COPYRIGHT 2020-2022 ARM Limited
// ALL RIGHTS RESERVED
// The entire notice above must be reproduced on all authorised
// copies and copies may only be made to the extent permitted
// by a licensing agreement from ARM Limited.

== Introduction

=== Overview

Tensor Operator Set Architecture (TOSA) provides a set of whole-tensor
operations commonly employed by Deep Neural Networks. The intent is to enable a
variety of implementations running on a diverse range of processors, with the
results at the TOSA level consistent across those implementations. Applications
or frameworks which target TOSA can therefore be deployed on a wide range of
different processors, such as SIMD CPUs, GPUs and custom hardware such as
NPUs/TPUs, with defined accuracy and compatibility constraints. Most operators
from the common ML frameworks (TensorFlow, PyTorch, etc.) should be expressible
in TOSA. It is expected that there will be tools to lower from ML frameworks
into TOSA.

=== Goals

The goals of TOSA include the following:

* A minimal and stable set of tensor-level operators to which machine learning
framework operators can be reduced.

* Full support for both quantized integer and floating-point content.

* Precise functional description of the behavior of every operator, including
the treatment of their numerical behavior in the case of precision, saturation,
scaling, and range as required by quantized datatypes.

* Agnostic to any single high-level framework, compiler backend stack or
particular target.

* The detailed functional and numerical description enables precise code
construction for a diverse range of targets – SIMD CPUs, GPUs and custom
hardware such as NPUs/TPUs.

=== Specification

The TOSA Specification is written as AsciiDoc mark-up and developed in its raw
mark-up form, managed through a git repository here:
https://git.mlplatform.org/tosa/specification.git/. The specification is
developed and versioned much like software. While the mark-up is legible and can
be read fairly easily in its raw form, it’s recommended to build or “render” the
mark-up into a PDF document, or similar. To do this, please follow the
instructions in the README.md in the root of the specification repository.

=== Profiles

TOSA supports three profiles that enable efficient implementation on different classes of device. The Base-Inference profile is intended for embedded integer/fixed-point designs performing inference only.  The Main-Inference profile is intended for general inference functionality including integer and floating-point data types.  The Main-Training profile adds training operators in addition to inference operators.
This version of the specification covers the Base and Main inference profiles. The Main Training profile is expected in a later version of the specification.
The following table summarizes the three profiles:

.Profiles
|===
|Profile|Name|Integer Inference|Floating-point Inference|Training

|Base Inference|TOSA-BI|Yes|No|No
|Main Inference|TOSA-MI|Yes|Yes|No
|Main Training|TOSA-MT|Yes|Yes|Yes
|===

=== Compliance

This section defines when a TOSA implementation is compliant to a given TOSA specification profile.
The term conformant will mean the same as compliant.

==== Baseline Inference Profile

The <<Operator Graphs>> section of this specification defines a TOSA graph and the behavior defined for a TOSA graph.
This behavior is captured in the pseudo-code function tosa_execute_graph().
For a given input graph (with attributes) and input tensors there are three possible tosa_graph_result values after executing the graph:

* tosa_unpredictable: The result of the graph on the given inputs cannot be relied upon.
* tosa_error: The graph does not meet the specification and is recognised as an illegal graph.
* tosa_valid: The result is defined and predictable and the list of output tensors defines the result.

An implementation is compliant to the TOSA Baseline Inference Profile if it matches the above results as follows:

* For tosa_unpredictable, the implementation can return whatever result it chooses (including error)
* For tosa_error, the implementation must return an error result (and there is no requirement on how much of the graph is executed, if any)
* For tosa_valid, the implementation must execute the entire graph without error and return the result defined by this specification.

In terms of pseudo-code, if *graph* is a TOSA graph consisting of Baseline Inference Profile operators and *input_list* is a list of input tensors then the following test must pass.

[source,c++]
----
bool tosa_test_compliance(tosa_graph_t graph, tosa_list_t input_list) {
    shape_list_t output_list_spec = tosa_allocate_list(tosa_output_shape(graph));
    shape_list_t output_list_test = tosa_allocate_list(tosa_output_shape(graph));
    tosa_graph_result = tosa_valid;    // result starts as valid
    tosa_execute_graph(graph, input_list, output_list_spec);
    if (tosa_graph_result == tosa_unpredictable) {
        return true;    // No requirement to match an unpredictable result
    }
    result_test = execute_implementation_under_test(graph, input_list, output_list_test);
    if (tosa_graph_result == tosa_error) {
        return result_test == tosa_error;   // result must be an error
    }
    if (exact_tensor_match(output_list_spec, output_list_test)) {
       // Predictable bit-exact value match required
       return true;
    }
    return false;
}
----

==== Main Inference and Main Training Profile

An implementation is compliant to the Main Inference or Main Training profiles if all of the following hold for that respective profile:

* For a graph returning tosa_error the implementation must also return an error
* For a graph returning tosa_valid the implementation must execute the entire graph without error
* For a graph returning tosa_valid and consisting only of integer operators the results must match exactly
* The implementation must report the maximum relative error on a set of standard graphs that contain floating point operators. These graphs will be provided as a future appendix to this specification.

Note that for graphs containing floating point there is no strict precision requirement that must be met, but that the precision achieved must be reported.
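
The specification text above does not define how the maximum relative error is computed. As a rough sketch only, the following assumes the metric is the largest value of |test - reference| / |reference| over all output elements, and that elements whose reference value is zero are skipped. Both the helper name max_relative_error and these choices are assumptions made for illustration, not requirements of TOSA.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical metric (not defined by TOSA): the largest
// |test - reference| / |reference| over all elements.
double max_relative_error(const std::vector<double>& reference,
                          const std::vector<double>& test) {
    assert(reference.size() == test.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < reference.size(); i++) {
        // Assumption: relative error is undefined for a zero reference,
        // so such elements are skipped in this sketch.
        if (reference[i] == 0.0) continue;
        double err = std::fabs(test[i] - reference[i]) / std::fabs(reference[i]);
        worst = std::max(worst, err);
    }
    return worst;
}
```

For example, comparing a reference output of {1.0, 2.0, 4.0} with a test output of {1.0, 2.5, 4.0} would report 0.25 under this definition.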

=== Operator Selection

TOSA defines a set of primitive operators to which higher level operators can be lowered in a consistent way. To remain effective and efficient to implement, the set of operators must be constrained to a reasonably small set of primitive operations out of which others can be constructed. The following principles govern the selection of operators within TOSA.

.Principles
[cols="1,5,5"]
|===
|ID|Principle|Reason for this

|P0
|An operator shall be a primitive operation or building block that cannot be broken down into simpler whole tensor operations
|If the operator can be broken down, then we should look at the component operators.

|P1
|An operator shall be usable as a component out of which more complex operations can be constructed
|Single use operators have a high architectural cost and a more reusable version should be considered instead.

|P2
|Precision should be appropriate for the input and output data types
|Precision higher than that needed to calculate the result leads to extra implementation cost

|P3
|Numerical definition of common sub-operations should be consistent between operators (for example: value scaling)
|Consistent sub-operation definition reduces the operator implementation cost

|P4
|The valid input and output ranges for all operands shall be specified
|Ranges are required to make consistent (numerically agreeing) implementations possible

|P5
|Integer operators shall be implementable in a bit-exact form with good efficiency on CPU, GPU and hardware targets.
|Reduces implementation cost and gives consistent inference result
|===

=== Supported Features

==== Data Layouts

The following data layouts are supported in TOSA. Data layouts are specified such that the rightmost dimension is the fastest changing.

.Data Layouts
[cols="1,4,4"]
|===
|Name|Description of dimensions|Usage

|NHWC|Batch, Height, Width, Channels|Feature maps
|NDHWC|Batch, Depth, Height, Width, Channels|Feature maps for 3D convolution
|OHWI|Output channels, Filter Height, Filter Width, Input channels|Weights
|HWIM|Filter Height, Filter Width, Input channels, Channel Multiplier|Weights for depthwise convolutions
|DOHWI|Depth, Output Channels, Filter Height, Filter Width, Input Channels|Weights for 3D convolution
|===
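
To illustrate what "the rightmost dimension is the fastest changing" means in practice, the sketch below computes the linear element offset of a coordinate in an NHWC feature map. The helper name nhwc_offset and the use of std::array for the shape are assumptions made for this example and are not part of TOSA.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not defined by TOSA): linear element offset of
// coordinate (n, h, w, c) in an NHWC tensor, where the rightmost
// dimension (channels) changes fastest in memory.
std::size_t nhwc_offset(const std::array<std::size_t, 4>& shape,
                        std::size_t n, std::size_t h,
                        std::size_t w, std::size_t c) {
    assert(n < shape[0] && h < shape[1] && w < shape[2] && c < shape[3]);
    return ((n * shape[1] + h) * shape[2] + w) * shape[3] + c;
}
```

With a shape of [2, 4, 5, 3], stepping the channel index by one moves the offset by 1, stepping the width index moves it by 3, the height index by 15, and the batch index by 60.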

==== Floating-point

The base inference profile of TOSA requires support for the quantized integer operations. Floating-point support is included in the main inference profile.

==== Number Formats

The following number formats are defined in TOSA.
The number formats supported by an operator are listed in a per-operator table of supported types.
The integer types may be used to represent quantized data.
For details of interpreting the quantized data, see the <<Quantization Scaling>> section.

.Number formats
[cols="1,1,1,5"]
|===
|Format|Minimum|Maximum|Description

|bool_t
| -
| -
|Boolean value. Size implementation defined. The TOSA reference model implements this as int8_t with 0 for false and 1 for true. All non-zero values are accepted on input as true.

|int4_t
| -7
| +7
|Signed 4-bit two's-complement value. Excludes -8 to maintain a range that is symmetric about zero for weights.

|int8_t
| -128
| +127
|Signed 8-bit two's-complement value.

|uint8_t
| 0
| 255
|Unsigned 8-bit value.

|int16_t
| -32768
| +32767
|Signed 16-bit two's-complement value.

|uint16_t
| 0
| 65535
|Unsigned 16-bit value.

|int32_t
| -(1<<31)
| (1<<31)-1
|Signed 32-bit two's-complement value.

|int48_t
| -(1<<47)
| (1<<47)-1
|Signed 48-bit two's-complement value.

|float_t
| -infinity
| +infinity
|Floating-point number. Must have features defined in the section <<Floating-point>>.
|===

Note: In this specification minimum<type> and maximum<type> will denote the minimum and maximum values of the data as stored in memory (ignoring the zero point). The minimum and maximum values for each type are given in the preceding table.

Note: Integer number formats smaller than 8 bits may be used provided that the numerical result is the same as using a sequence of 8-bit TOSA operations. For example, a convolution with low precision data must equal that of running the convolution at 8 bits and then clipping the result to the permitted output range. This ensures that a Base Inference profile TOSA implementation can calculate the same result.

==== Tensor Metadata

Tensors have an associated tensorinfo that contains information about the tensor including:

* Data Type
* Shape

The number of dimensions in a shape is called the rank.
Thus a tensor shape is an array of integers of size rank(shape) with shape[i] giving the number of elements for dimension i.
The tensor shape in each dimension must be greater than or equal to 1.
The following pseudocode represents the operations that will happen to data elements as they are read in to be processed, or have their results written out.

*Functionality of tensor read*

tensor_read reads a single data value out of the given tensor.
The shape argument contains the shape of the tensor.
Index is the coordinates within the tensor of the value to be read.

[source,c++]
----
in_t tensor_read<in_t>(in_t *address, dim_t shape, dim_t index) {
    // Ensure this is a proper tensor with each dimension having size >= 1
    for_each(dimension_size in shape) {
        REQUIRE(dimension_size >= 1);
    }
    unsigned offset = 0;
    for (i = 0; i < rank(shape); i++) {
        REQUIRE(index[i] >= 0 && index[i] < shape[i]);
        offset = offset * shape[i] + index[i];
    }
    return address[offset];
}
----

*Functionality of tensor write*

tensor_write writes a single data value into the given tensor.
The shape argument contains the shape of the tensor.
Index is the coordinates within the tensor of the value to be written.
value is the value to be written to the given coordinate.

[source,c++]
----
tensor_write<type>(<type> *address, dim_t shape, dim_t index, <type> value) {
    unsigned offset = 0;
    // Ensure this is a proper tensor with each dimension having size >= 1
    for_each(dimension_size in shape) {
        REQUIRE(dimension_size >= 1);
    }
    for (i = 0; i < rank(shape); i++) {
        REQUIRE(index[i] >= 0 && index[i] < shape[i]);
        offset = offset * shape[i] + index[i];
    }
    address[offset] = value;
}
----

==== Broadcasting

In operations where broadcasting is supported, an input shape dimension can be broadcast to an output shape dimension if the dimensions are equal or the input shape dimension is 1. TOSA broadcast requires the rank of both tensors
to be the same. A RESHAPE can be done to create a compatible tensor with appropriate dimensions of size 1.

*Functionality of broadcast*

The following function maps an index in the output tensor to an index in the input tensor.

[source,c++]
----
// The index argument should be a valid location within out_shape.
// The function returns the location within in_shape that contributes
// to the output based on broadcasting rules.

dim_t apply_broadcast(dim_t out_shape, dim_t in_shape, dim_t index) {
    ERROR_IF(rank(out_shape) != rank(in_shape));
    ERROR_IF(rank(out_shape) != rank(index));
    for (i = 0; i < rank(out_shape); i++) {
        if (out_shape[i] != in_shape[i]) {
            ERROR_IF(in_shape[i] != 1);
            index[i] = 0;
        }
    }
    return index;
}
----
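
The following is a minimal runnable sketch of the broadcast rule, assuming shapes and indices are held in std::vector and that ERROR_IF is modelled by throwing an exception. The helpers linear_offset and broadcast_add_2d are hypothetical names introduced only for this example; they show a rank-2 element-wise addition where the second input is broadcast along its first dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using dim_t = std::vector<std::size_t>;  // assumption: shapes and indices as vectors

// Runnable rendering of apply_broadcast; ERROR_IF is modelled as an exception.
dim_t apply_broadcast(const dim_t& out_shape, const dim_t& in_shape, dim_t index) {
    if (out_shape.size() != in_shape.size() || out_shape.size() != index.size())
        throw std::invalid_argument("rank mismatch");
    for (std::size_t i = 0; i < out_shape.size(); i++) {
        if (out_shape[i] != in_shape[i]) {
            if (in_shape[i] != 1) throw std::invalid_argument("not broadcastable");
            index[i] = 0;  // a size-1 input dimension always reads element 0
        }
    }
    return index;
}

// Hypothetical helper: row-major offset, rightmost dimension fastest.
std::size_t linear_offset(const dim_t& shape, const dim_t& index) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < shape.size(); i++) {
        assert(index[i] < shape[i]);
        offset = offset * shape[i] + index[i];
    }
    return offset;
}

// Hypothetical helper: element-wise add of two rank-2 tensors with broadcasting.
std::vector<int> broadcast_add_2d(const std::vector<int>& a, const dim_t& a_shape,
                                  const std::vector<int>& b, const dim_t& b_shape,
                                  const dim_t& out_shape) {
    std::vector<int> out(out_shape[0] * out_shape[1]);
    for (std::size_t y = 0; y < out_shape[0]; y++) {
        for (std::size_t x = 0; x < out_shape[1]; x++) {
            dim_t index = {y, x};
            dim_t ia = apply_broadcast(out_shape, a_shape, index);
            dim_t ib = apply_broadcast(out_shape, b_shape, index);
            out[linear_offset(out_shape, index)] =
                a[linear_offset(a_shape, ia)] + b[linear_offset(b_shape, ib)];
        }
    }
    return out;
}
```

Adding a [2,3] tensor {1,2,3,4,5,6} to a [1,3] tensor {10,20,30} reuses the single row of the second input for both output rows. A rank-1 input of shape [3] would first need a RESHAPE to [1,3], since the ranks must match.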

=== Quantization

==== Quantization Basics

When converting the floating-point values used in training to quantized integer values used on devices for inference, we need to know the range of values to be represented by the integers. The frameworks use slightly different parameters and data types to do this conversion. For example, TensorFlow passes min and max floating-point values for quantization. TensorFlow Lite and PyTorch use a floating-point scale factor and an integer zero point. TensorFlow Lite and PyTorch also allow for symmetric quantization where the zero point value is not used.
In the initial quantization work, tensors were quantized with a single set of parameters for the entire tensor. Recently, frameworks have added support for different quantization parameters on a per channel basis. This per channel quantization thus carries a vector of scales and zero points to be used on each channel. TOSA will support per channel quantization, but only for the weight tensor used in convolution operators.
Quantization parameters in floating-point cause imprecision.
In some instances, the software may need to calculate post-op scaling values on hardware that does not have a floating-point unit.
Arm NPUs have fixed output scaling hardware that uses fixed point arithmetic to calculate the output values.
When calculating these multiplicands and shift amounts, different floating-point precisions may cause results to differ.
To remove this dependency on floating-point values, there are two design choices being made:

* Quantization parameters will be associated with operations rather than tensors. The operations are where the scaling is taking place, and thus can be specified such that the hardware fixed point calculations can be represented exactly, such that any hardware designed to the TOSA specification will return the same quantized values.
* Quantization parameters will be given in integer values, as multiplicands and shifts. Specific bit widths and signed/unsignedness will be provided with each operator.

When compiling a network to TOSA, we expect that a compiler would lower all possible subgraphs to TOSA, keeping the quantization parameters with the tensors, and then do an additional pass where the quantization values for the operators are calculated based on the input and output tensors for the operation.

TOSA currently supports signed 8-bit quantization, unsigned 8-bit quantization, and
signed 16-bit quantization. 8-bit values support an optional zero point, denoting
which value in the 8-bit range represents the value zero. Unsigned 8-bit values
are only allowed in the RESCALE operation, to allow for compatibility with
networks which expect unsigned 8-bit input tensors.
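
As background for the scale and zero point parameters mentioned above, the sketch below shows the usual affine mapping between real values and signed 8-bit values, real = scale * (quantized - zero_point). This is framework-side arithmetic in floating-point, shown only to make the terms concrete; TOSA operators themselves take integer multipliers and shifts. The QuantParams structure and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical parameter bundle for the affine scheme:
//   real_value = scale * (quantized_value - zero_point)
struct QuantParams {
    float scale;
    int32_t zero_point;
};

// Map a real value to a signed 8-bit value, saturating at the int8_t range.
int8_t quantize(float real, QuantParams p) {
    int32_t q = static_cast<int32_t>(std::lround(real / p.scale)) + p.zero_point;
    q = std::max<int32_t>(-128, std::min<int32_t>(127, q));
    return static_cast<int8_t>(q);
}

// Map a signed 8-bit value back to the real value it represents.
float dequantize(int8_t q, QuantParams p) {
    return p.scale * static_cast<float>(static_cast<int32_t>(q) - p.zero_point);
}
```

With scale 0.5 and zero point 10, the real value 3.0 maps to 16 and the quantized value 10 represents 0.0. Values outside the representable range saturate to -128 or 127.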

==== Quantization Scaling

Most operations in TOSA do not contain quantization scaling in the operation, but in a separate RESCALE node that performs change in scale using a multiplier and shift value. This TOSA specification supports two precisions of multiplier: 16-bit and 32-bit. The 32-bit multiplier version supports two rounding modes to enable simpler lowering of existing frameworks that use two stage rounding. All arithmetic is designed so that it does not overflow a 64-bit accumulator and that the final result fits in 32 bits. In particular a 48-bit value can only be scaled with the 16-bit multiplier.

The apply_scale functions provide a scaling of approximately (multiplier * 2^-shift^).
The shift and value range is limited to allow a variety of implementations.
The limit of 62 on shift allows the shift to be decomposed as two right shifts of 31.
The limit on value allows implementations that left shift the value before the multiply in the case of shifts of 32 or less.
For example, in the case shift=30 an implementation of the form ((value\<<2) * multiplier + round)>>32 can be used.
A scaling range of 2^+12^ down to 2^-32^ is supported for both functions with a normalized multiplier.

For example, in typical usage a scaling of m*2^-n^ where m is a fraction in the
range 1.0 \<= m < 2.0 can be represented using multiplier=(1<<30)*m, shift=(30+n) for
apply_scale_32() and multiplier=(1<<14)*m, shift=(14+n) for apply_scale_16().
The values to achieve a scaling of 1.0 are shift=30, multiplier=1<<30 for apply_scale_32 and shift=14, multiplier=1<<14 for apply_scale_16.

[source,c++]
----
int32_t apply_scale_32(int32_t value, int32_t multiplier, uint6_t shift, bool_t double_round=false) {
    REQUIRE(multiplier >= 0);
    REQUIRE(2 <= shift && shift <= 62);
    REQUIRE(value >= -((int64_t)1 << (shift - 2)) && value < ((int64_t)1 << (shift - 2)));
    int64_t round = (int64_t)1 << (shift - 1);
    if (double_round) {
        if (shift > 31 && value >= 0) round += 1<<30;
        if (shift > 31 && value < 0)  round -= 1<<30;
    }
    int64_t result = (int64_t)value * multiplier + round;
    result = result >> shift;
    // result will fit a 32-bit range due to the REQUIRE on value
    return (int32_t)result;
}

int32_t apply_scale_16(int48_t value, int16_t multiplier, uint6_t shift) {
    REQUIRE(multiplier >= 0);
    REQUIRE(2 <= shift && shift <= 62);
    int64_t round = ((int64_t)1 << (shift - 1));
    int64_t result = (int64_t)value * multiplier + round;
    result = result >> shift;
    REQUIRE(result >= minimum<int32_t> && result <= maximum<int32_t>);
    return (int32_t)result;
}
----
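
The pseudo-code uses types such as uint6_t that do not exist in standard C++. The sketch below is one possible runnable rendering of apply_scale_32, assuming that uint6_t can be modelled as int32_t, that REQUIRE can be modelled as assert, and that right-shifting a negative int64_t is an arithmetic shift (guaranteed from C++20 and the behavior of mainstream compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// Runnable rendering of apply_scale_32 under the assumptions stated above.
int32_t apply_scale_32(int32_t value, int32_t multiplier, int32_t shift,
                       bool double_round = false) {
    assert(multiplier >= 0);
    assert(2 <= shift && shift <= 62);
    assert(value >= -(INT64_C(1) << (shift - 2)) &&
           value < (INT64_C(1) << (shift - 2)));
    int64_t round = INT64_C(1) << (shift - 1);
    if (double_round) {
        if (shift > 31 && value >= 0) round += INT64_C(1) << 30;
        if (shift > 31 && value < 0)  round -= INT64_C(1) << 30;
    }
    int64_t result = static_cast<int64_t>(value) * multiplier + round;
    result = result >> shift;  // arithmetic shift: rounds toward minus infinity
    return static_cast<int32_t>(result);
}
```

A scaling of 1.0 uses multiplier=1<<30 and shift=30, so apply_scale_32(1000, 1<<30, 30) gives 1000. A scaling of 0.75 = 1.5 * 2^-1^ uses multiplier=1610612736 and shift=31, mapping 1000 to 750. With a scaling of 0.5 (multiplier=1<<30, shift=31), the inputs 3 and -3 give 2 and -1 respectively, which shows that ties round toward plus infinity when double_round is false.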

In some functions, the multiplier and shift are combined into a scale_t structure:

[source,c++]
----
typedef struct {
    int32_t multiplier;
    uint6_t shift;
} scale_t;
----

In places where a divide is required, we also use the function below to calculate an appropriate scaling value.

[source,c++]
----
scale_t reciprocal_scale(uint32_t value) {
    REQUIRE(value > 0);
    scale_t scale;
    int32_t k = 32 - count_leading_zeros(value - 1); // (1 << k) / 2 < value <= (1 << k)
    int64_t numerator = (((int64_t)1 << 30) + 1) << k;
    scale.multiplier = numerator / value; // (1 << 30) <= multiplier < (1 << 31)
    scale.shift = 30 + k;
    return scale;
}
----
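
As a hedged illustration of how reciprocal_scale can replace a divide, the sketch below renders it in standard C++ and applies the resulting multiplier and shift with round-to-nearest. It assumes uint6_t is modelled as int32_t, REQUIRE as assert, and a portable loop for count_leading_zeros that returns 32 for a zero input. The helper divide_by_reciprocal is a hypothetical name introduced for this example.

```cpp
#include <cassert>
#include <cstdint>

struct scale_t {
    int32_t multiplier;
    int32_t shift;  // assumption: uint6_t modelled as int32_t
};

// Portable count of leading zero bits in a 32-bit value; returns 32 for zero.
int32_t count_leading_zeros(uint32_t x) {
    int32_t n = 0;
    for (uint32_t mask = UINT32_C(1) << 31; mask != 0 && (x & mask) == 0; mask >>= 1) {
        n++;
    }
    return n;
}

scale_t reciprocal_scale(uint32_t value) {
    assert(value > 0);
    scale_t scale;
    int32_t k = 32 - count_leading_zeros(value - 1);  // (1 << k) / 2 < value <= (1 << k)
    int64_t numerator = ((INT64_C(1) << 30) + 1) << k;
    scale.multiplier = static_cast<int32_t>(numerator / value);
    scale.shift = 30 + k;
    return scale;
}

// Hypothetical helper: approximate x / value, rounded to nearest,
// using only a multiply, an add, and a right shift.
int32_t divide_by_reciprocal(int32_t x, uint32_t value) {
    scale_t s = reciprocal_scale(value);
    int64_t round = INT64_C(1) << (s.shift - 1);
    return static_cast<int32_t>((static_cast<int64_t>(x) * s.multiplier + round) >> s.shift);
}
```

For value=3 this yields multiplier=1431655766 and shift=32, so multiplier * 2^-32^ is approximately 1/3. Dividing 9, 10 and 11 by 3 this way gives 3, 3 and 4.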

==== Quantized Convolutions

For convolution, the input is not required to be scaled before the convolution occurs.
The convolution produces an accumulator output of type int32_t or int48_t.
This accumulator output is then scaled to the final output range using the RESCALE operator.
The scale applied in the RESCALE operator should be set to multiplier and shift values such that: multiplier * 2^-shift^ = (input_scale * weight_scale) / output_scale.
Here, input_scale, weight_scale and output_scale are the conversion factors from integer to floating-point for the input, weight and output tensor values respectively.
If per-channel scaling is needed then the per-channel option of the RESCALE operation should be used.
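
One way a compiler might derive the RESCALE parameters for a convolution is sketched below. It assumes the floating-point scales are available at compile time and uses std::frexp to split the combined scale into a normalized 32-bit multiplier in the range [1<<30, 1<<31) and a shift. The names compute_rescale_32 and conv_rescale are hypothetical and are not defined by TOSA.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct scale_t {
    int32_t multiplier;
    int32_t shift;  // assumption: uint6_t modelled as int32_t
};

// Hypothetical helper: approximate a positive real scale as
// multiplier * 2^-shift with a normalized 32-bit multiplier.
scale_t compute_rescale_32(double scale) {
    assert(scale > 0.0);
    int exponent = 0;
    // scale == mantissa * 2^exponent with mantissa in [0.5, 1)
    double mantissa = std::frexp(scale, &exponent);
    int64_t multiplier = std::llround(mantissa * static_cast<double>(INT64_C(1) << 31));
    int32_t shift = 31 - exponent;
    if (multiplier == (INT64_C(1) << 31)) {
        // Rounding pushed the mantissa up to 1.0; renormalize.
        multiplier >>= 1;
        shift -= 1;
    }
    assert(2 <= shift && shift <= 62);  // range accepted by apply_scale_32
    return {static_cast<int32_t>(multiplier), shift};
}

// Hypothetical helper: RESCALE parameters for a quantized convolution output.
scale_t conv_rescale(double input_scale, double weight_scale, double output_scale) {
    return compute_rescale_32((input_scale * weight_scale) / output_scale);
}
```

A combined scale of 1.0 produces multiplier=1<<30 and shift=30, matching the values quoted for apply_scale_32. With input_scale=0.5, weight_scale=0.25 and output_scale=0.5 the combined scale is 0.25, giving multiplier=1<<30 and shift=32. For per-channel weights the same calculation would be repeated with each channel's weight scale.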

==== Quantized Elementwise Operators

When two quantized tensors are used in an operation, they must represent the same numeric range for the result to be valid.
In this case, TOSA expects that RESCALE operators will be used as necessary to generate 32-bit integer values in a common range.
There are many valid choices for scale factors and options for the common range.
TOSA does not impose a requirement on which scale factors and range should be used.
Compilers generating TOSA sequences should choose a range that allows the operation to be computed without overflow, while allowing the highest possible accuracy of the output.

==== General Unary Functions
General unary functions such as sigmoid(), tanh(), exp() for integer inputs are expressed using a lookup table and interpolation to enable efficient implementation.
This also allows for other operations with the addition of user-supplied tables (the TABLE operation).
All table lookups are based on the following reference lookup function that takes as input a table of 513 entries of 16 bits each.

[source,c++]
----
int32_t apply_lookup(int16_t *table, int32_t value)
{
    int16_t clipped_value = (int16_t)apply_clip<int32_t>(value, -32768, +32767);
    int32_t index = (clipped_value + 32768) >> 7;
    int32_t fraction = clipped_value & 0x7f;
    int16_t base = table[index];
    int16_t next = table[index+1];
    int32_t slope = next - base;
    REQUIRE(slope >= minimum<int16_t> && slope <= maximum<int16_t>);
    int32_t return_value = (base << 7) + slope * fraction;
    return return_value;	// return interpolated value of 16 + 7 = 23 bits
}
----

Note that although the table lookup defined here has 16-bit precision, for 8-bit only operations an 8-bit table can be derived by applying the reference function to each of the possible 256 input values.
The following code constructs a 513-entry table based on a reference function.

[source,c++]
----
void generate_lookup_table(int16_t *table, int32_t (*reference)(int32_t))
{
    for (int i = -256; i <= 256; i++) {
        int32_t value = (*reference)(i);
        table[i + 256] = (int16_t)apply_clip<int32_t>(value, -32768, +32767);
    }
}
----
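
The two functions above can be exercised together with a small runnable sketch. It assumes apply_clip is a simple clamp, models REQUIRE as assert, and writes base << 7 as a multiplication by 128 so that the expression stays well defined for negative base values in older C++ standards. The reference function linear_reference is hypothetical: it maps table position i to i * 128, so the interpolated result should equal the input scaled by 2^7^.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumption: apply_clip clamps value to the inclusive range [lo, hi].
int32_t apply_clip(int32_t value, int32_t lo, int32_t hi) {
    return std::min(std::max(value, lo), hi);
}

int32_t apply_lookup(const int16_t* table, int32_t value) {
    int16_t clipped_value = static_cast<int16_t>(apply_clip(value, -32768, 32767));
    int32_t index = (clipped_value + 32768) >> 7;  // 0..511
    int32_t fraction = clipped_value & 0x7f;       // low 7 bits, 0..127
    int16_t base = table[index];
    int16_t next = table[index + 1];
    int32_t slope = next - base;
    assert(slope >= INT16_MIN && slope <= INT16_MAX);
    return base * 128 + slope * fraction;          // 16 + 7 = 23-bit result
}

void generate_lookup_table(int16_t* table, int32_t (*reference)(int32_t)) {
    for (int i = -256; i <= 256; i++) {
        table[i + 256] = static_cast<int16_t>(apply_clip(reference(i), -32768, 32767));
    }
}

// Hypothetical reference function used only for this example.
int32_t linear_reference(int32_t i) {
    return i * 128;
}
```

After filling a 513-entry table with linear_reference, apply_lookup returns 128000 for an input of 1000 and -128000 for an input of -1000. The last table entry saturates at 32767 because 256 * 128 exceeds the int16_t range.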

=== Floating-point

TOSA does not define bit-exact behavior of the floating-point type, since floating-point operation results can vary according to operation order (floating-point addition is not associative in general) and rounding behavior.
If a bit-exact answer is required then integer operations should be used.
TOSA does define that the floating-point type must support the following list of features.
These features ensure that detection of overflow and other exceptional conditions can be handled consistently.

* The floating-point type must have at least 16 total bits including the sign bit
* The floating-point type must support positive and negative infinity values
* The floating-point type must support at least one Not-a-Number encoding (NaN)
* The floating-point type must support signed zero
* The floating-point type must support handling of infinities, NaNs, zeros as in the following table

.Floating-point behavior
|===
|Case|Result

|Operators other than those explicitly mentioned by other rules: Any input operand is a NaN | a NaN

|Comparisons (EQUAL, GREATER, GREATER_EQUAL), where either or both operands is NaN | False

|Comparisons ignore the sign of 0|

|RSQRT (reciprocal square root) of negative numbers | a NaN
|(&#177; 0) &#215; (&#177; infinity), (&#177; infinity) &#215; (&#177; 0) | a NaN

|LOG of negative numbers | a NaN

|nonzero numbers / (&#177; 0) | (&#177; infinity)

|(&#177; 0) / (&#177; 0), (&#177; infinity) / (&#177; infinity) | a NaN

|(&#177; infinity) * 0 | a NaN

| (+infinity) - (+infinity),  (+infinity) + (-infinity) | a NaN

| Any positive overflow | + infinity

| Any negative overflow | - infinity

| Any positive underflow | + 0

| Any negative underflow | - 0

|===
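
Several rows of the table can be spot-checked on a host machine. The sketch below assumes the host float type follows IEEE 754 (enforced with a static_assert) and that the code is compiled without value-changing options such as fast-math. The function name host_float_matches_table is hypothetical, and the check covers only a subset of the rows; it says nothing about a particular TOSA implementation.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical self-check: returns true if the host float type shows the
// behavior listed in the table for a subset of the cases.
bool host_float_matches_table() {
    static_assert(std::numeric_limits<float>::is_iec559,
                  "this sketch assumes an IEEE 754 float type");
    // volatile discourages the compiler from folding the checks away
    volatile float zero = 0.0f;
    volatile float one = 1.0f;
    volatile float nan = std::numeric_limits<float>::quiet_NaN();
    const float inf = std::numeric_limits<float>::infinity();

    bool ok = true;
    ok = ok && std::isnan(nan + one);                           // NaN operand gives NaN
    ok = ok && !(nan == nan) && !(nan > one) && !(nan >= one);  // NaN comparisons are false
    ok = ok && (zero == -zero);                                 // sign of 0 is ignored
    ok = ok && std::isnan(1.0f / std::sqrt(-one));              // RSQRT of a negative number
    ok = ok && std::isnan(zero * inf);                          // 0 x infinity
    ok = ok && std::isnan(std::log(-one));                      // LOG of a negative number
    ok = ok && std::isinf(one / zero);                          // nonzero / 0
    ok = ok && std::isnan(zero / zero);                         // 0 / 0
    ok = ok && std::isnan(inf / inf);                           // infinity / infinity
    ok = ok && std::isnan(inf - inf);                           // (+infinity) - (+infinity)
    return ok;
}
```

On a conforming host the function is expected to return true. The overflow and underflow rows are left out of this sketch because their outcome can depend on the intermediate precision the compiler uses.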