path: root/chapters/introduction.adoc
author: Dominic Symes <dominic.symes@arm.com> 2020-10-22 10:56:36 +0100
committer: Dominic Symes <dominic.symes@arm.com> 2020-10-28 16:42:06 +0000
commit: f7179b5c3f42fa27835f286c78ca943772c867d6
tree: 4ba379472f169de93f3bc2f078a1706bff386ada /chapters/introduction.adoc
parent: 3da62dfa0da290dbbb39e411ca2187703429f916
MUL: Add right shift on 32x32 multiply
The result of a 32x32 elementwise multiply exceeds the int32_t result type range. This change adds a right scaling shift argument to shift down the result.

Change-Id: I6ae17e6dc3fe342d052304533158ad2d0e7bb7be
Signed-off-by: Dominic Symes <dominic.symes@arm.com>
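As a rough illustration of the problem the commit message describes, the sketch below multiplies two int32_t values in 64-bit arithmetic and shifts the product down before narrowing. The helper name `mul_i32_shift`, the round-to-nearest step, and the 62-bit shift limit are assumptions made for this example; they are not taken from the changed specification text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the specification: multiply two int32_t
   values in 64-bit arithmetic, then scale the product down by `shift` bits.
   Rounding to nearest when shift > 0 is an assumption of this sketch. */
static int32_t mul_i32_shift(int32_t a, int32_t b, unsigned shift) {
    assert(shift <= 62);
    /* The full product needs up to 63 bits, so widen before multiplying. */
    int64_t product = (int64_t)a * (int64_t)b;
    if (shift > 0) {
        product += (int64_t)1 << (shift - 1); /* add half of the divisor */
        product >>= shift; /* assumes an arithmetic shift for negative values */
    }
    /* The caller must pick a shift large enough for the result to fit. */
    assert(product >= INT32_MIN && product <= INT32_MAX);
    return (int32_t)product;
}
```

For example, `mul_i32_shift(1 << 20, 1 << 20, 30)` yields 1024, whereas the unshifted product 2^40 would not fit in int32_t.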
Diffstat (limited to 'chapters/introduction.adoc')
-rw-r--r-- chapters/introduction.adoc 2
1 file changed, 1 insertion, 1 deletion
diff --git a/chapters/introduction.adoc b/chapters/introduction.adoc
index 09a21dd..5134330 100644
--- a/chapters/introduction.adoc
+++ b/chapters/introduction.adoc
@@ -220,7 +220,7 @@ Most operations in TOSA do not contain quantization scaling in the operation, bu
The apply_scale functions provide a scaling of approximately (multiplier * 2^-shift^). The shift range is limited to allow a variety of implementations. The upper limit of 62 allows it to be decomposed as two right shifts of 31. The lower limit removes special cases in the rounding. These restrictions have little practical impact since the shift value to achieve a scaling of 1.0 is 30 for apply_scale_32 with multiplier=1<<30 and 14 for apply_scale_16 with scale=1<<14. It follows that a scaling range of 2^+12^ down to 2^-32^ is supported for both functions with normalized multiplier. (Smaller scales can be obtained by denormalizing the multiplier).
....
-int32_t apply_scale_32(int32_t value, int32_t multipler, uint6_t shift, bool double_round) {
+int32_t apply_scale_32(int32_t value, int32_t multipler, uint6_t shift, bool double_round=false) {
assert(multiplier >= 0);
assert(2 <= shift && shift <= 62);
int64_t round = 1 << (shift - 1);
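
The hunk above is cut off partway through the function. To make the scaling of approximately (multiplier * 2^-shift^) concrete, here is a minimal C sketch of the single-rounding path only. The name `apply_scale_32_sketch` is hypothetical, the `double_round` option is deliberately omitted, and the rounding constant is widened to int64_t because plain C would overflow a 32-bit `1 << (shift - 1)` for larger shifts. It should be read as an illustration under those assumptions, not as the specification's definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-rounding sketch; double_round is not modelled here. */
static int32_t apply_scale_32_sketch(int32_t value, int32_t multiplier,
                                     unsigned shift) {
    assert(multiplier >= 0);
    assert(2 <= shift && shift <= 62);
    /* Half of 2^shift, computed in 64 bits so that shift > 31 is safe. */
    int64_t round = (int64_t)1 << (shift - 1);
    /* |value * multiplier| < 2^62 and round <= 2^61, so the sum fits. */
    int64_t result = (int64_t)value * multiplier + round;
    result >>= shift; /* assumes an arithmetic shift for negative values */
    assert(result >= INT32_MIN && result <= INT32_MAX);
    return (int32_t)result;
}
```

With multiplier = 1 << 30 and shift = 30 the scaling is 1.0, so `apply_scale_32_sketch(1000, 1 << 30, 30)` returns 1000; raising the shift to 31 halves the scale and returns 500.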