author: Eric Kunze <eric.kunze@arm.com> 2023-10-20 15:58:55 -0700
committer: Eric Kunze <eric.kunze@arm.com> 2024-02-14 16:36:04 -0800
commit: 74e2ceba954ed6111b3e3ce40c5ff88fe79ff043
tree: 7e1967b073313d7df4885693eda931230d401eb0 /chapters
parent: 9fe5e964e2193f0e345670f7f4098beecd7fd6eb
Initial FP8 support

Adds support for Open Compute Project (OCP) 8-bit floating point operations to the TOSA specification. Both E4M3 and E5M2 types are supported for profiles as indicated in the Supported Data Types table for each operator.

FP8 operator list
ARGMAX
AVGPOOL
CONV2D
CONV3D
DEPTHWISE_CONV2D
MATMUL
MAX_POOL2D
TRANSPOSE_CONV2D
CONST
CAST
CONCAT
PAD
DIM
RESHAPE
REVERSE
SLICE
TILE
TRANSPOSE
GATHER
SCATTER

Signed-off-by: Eric Kunze <eric.kunze@arm.com>
Change-Id: I3dd83f48afcc3c880c5c88039337ff4f1fd95b1b
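
For readers unfamiliar with the two formats, the following is a minimal Python sketch of how a single FP8 byte could be decoded. It is not part of the commit or of the TOSA pseudocode. The function names (`decode_fp8`, `decode_e4m3`, `decode_e5m2`) are hypothetical, and the exponent biases (7 for E4M3, 15 for E5M2) and the single E4M3 NaN bit pattern are taken from the OCP OFP8 format description rather than from this diff.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, has_inf: bool) -> float:
    """Decode one FP8 byte into a Python float (hypothetical helper).

    has_inf=True models E5M2 (infinities and several NaN encodings);
    has_inf=False models E4M3 (no infinity, one NaN pattern per sign).
    """
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1      # 7 for E4M3, 15 for E5M2
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if has_inf and exp == exp_max:
        # E5M2: all-ones exponent encodes infinity (mantissa 0) or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not has_inf and exp == exp_max and man == man_max:
        # E4M3: only the all-ones exponent and mantissa pattern is NaN.
        return math.nan
    if exp == 0:
        # Zero and denormal values: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal values: implicit leading one.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, has_inf=False)


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, has_inf=True)
```

Under these assumptions, `decode_e4m3(0x7E)` evaluates to 448.0, matching the fp8e4m3_t maximum added to the number formats table, and `decode_e5m2(0x7C)` evaluates to positive infinity.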
Diffstat (limited to 'chapters')
-rw-r--r-- chapters/introduction.adoc | 33
-rw-r--r-- chapters/pseudocode.adoc | 3
2 files changed, 29 insertions, 7 deletions
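
The CAST hunk in this commit requires saturating behaviour when a wider floating-point value is converted to fp8e4m3_t or fp8e5m2_t. The sketch below illustrates one reading of that rule for finite inputs only: round to the nearest representable value with ties to even, then clamp any overflow to the largest finite magnitude instead of producing infinity. The function name `saturating_round_fp8` and its parameters are hypothetical, the maximum 57344 for E5M2 comes from the OFP8 format rather than from this diff, and the handling of infinite inputs is deliberately left out because the diff does not spell it out.

```python
import math


def saturating_round_fp8(x: float, man_bits: int, min_exp: int, max_finite: float) -> float:
    """Round a finite float onto an FP8 grid, saturating on overflow.

    Returns the rounded value as a Python float, not as an encoded byte.
    min_exp is the exponent of the smallest normal value (1 - bias).
    """
    if math.isnan(x) or x == 0.0:
        return x                      # NaN and signed zero pass through
    if math.isinf(x):
        raise ValueError("infinite inputs are outside the scope of this sketch")

    _, e = math.frexp(abs(x))         # abs(x) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, min_exp)         # denormals share the minimum exponent
    quantum = 2.0 ** (exp - man_bits) # spacing of representable values near x
    steps = round(abs(x) / quantum)   # Python's round() is ties-to-even
    magnitude = min(steps * quantum, max_finite)  # saturate instead of overflowing
    return math.copysign(magnitude, x)


def to_e4m3(x: float) -> float:
    return saturating_round_fp8(x, man_bits=3, min_exp=-6, max_finite=448.0)


def to_e5m2(x: float) -> float:
    return saturating_round_fp8(x, man_bits=2, min_exp=-14, max_finite=57344.0)
```

With these assumptions, `to_e4m3(1000.0)` yields 448.0 rather than infinity, while a tie such as `to_e4m3(17.0)` (halfway between 16 and 18) rounds to 16.0.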
diff --git a/chapters/introduction.adoc b/chapters/introduction.adoc
index 9d53510..17c16a8 100644
--- a/chapters/introduction.adoc
+++ b/chapters/introduction.adoc
@@ -245,7 +245,8 @@ Multiplication of an infinity by a zero must produce a NaN. +
Otherwise the result must be within 0.5 ulp of the mathematical result.
| <<CAST>>
-| Floating-point result overflows must be set to infinity of the correct sign. +
+| Result overflows when converting between fp32_t, bf16_t and fp16_t must be set to infinity of the correct sign. +
+fp8e4m3_t and fp8e5m2_t must use the saturation mode rules defined in <<IEEE-754,IEEE-754>> when converting from the wider floating-point types. +
Floating-point result underflows must be set to zero of the correct sign. +
Cast from floating-point to integer result overflows must be saturated. +
Cast from floating-point to integer must be rounded using round to nearest, ties to even, rounding mode. +
@@ -339,7 +340,7 @@ This may be, for example, a convolution.
This section defines the accuracy required for these operations.
In this section:
-* "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by IEEE 754 (<<Other publications>>[1])
+* "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by <<IEEE-754,IEEE-754>>
* `operation_fp64()` is an fp64 reference implementation of the operation
* `operation_imp()` is the implementation under test
* `local_bound` is defined as follows:
@@ -537,10 +538,29 @@ The number formats supported by a given operator are listed in its table of supp
| (1<<47)-1
|Signed 48-bit two's-complement value.
+|fp8e4m3_t
+| -448
+| 448
+| 8-bit floating-point defined by <<OCP-OFP8,OCP-OFP8>> with four bits of exponent and three bits of mantissa. +
+Normal values must be supported. +
+Denormal values must be supported. +
+The NaN encoding must be supported. +
+Signed zero must be supported.
+
+|fp8e5m2_t
+| -infinity
+| +infinity
+| 8-bit floating-point defined by <<OCP-OFP8,OCP-OFP8>> with five bits of exponent and two bits of mantissa. +
+Normal values must be supported. +
+Denormal values must be supported. +
+Positive and negative infinity must be supported. +
+NaN encodings must be supported. +
+Signed zero must be supported.
+
|fp16_t
| -infinity
| +infinity
-| 16-bit half-precision floating-point defined by <<Other publications>>[1]. +
+| 16-bit half-precision floating-point defined by <<IEEE-754,IEEE-754>> . +
Normal values must be supported. +
Denormal values must either be supported or flushed to zero. +
Positive and negative infinity must be supported. +
@@ -560,7 +580,7 @@ Signed zero must be supported.
|fp32_t
| -infinity
| +infinity
-| 32-bit single-precision floating-point defined by <<Other publications>>[1]. +
+| 32-bit single-precision floating-point defined by <<IEEE-754,IEEE-754>> . +
Normal values must be supported. +
Denormal values must either be supported or flushed to zero. +
Positive and negative infinity must be supported. +
@@ -570,7 +590,7 @@ Signed zero must be supported.
|fp64_t
| -infinity
| + infinity
-| 64-bit double-precision floating-point defined by <<Other publications>>[1]. +
+| 64-bit double-precision floating-point defined by <<IEEE-754,IEEE-754>>. +
Normal values must be supported. +
Denormal values must either be supported or flushed to zero. +
Positive and negative infinity must be supported. +
@@ -744,4 +764,5 @@ void generate_lookup_table(int16_t *table, int32_t (*reference)(int32_t))
The following publications are referred to in this specification, or provide more information:
-. IEEE Std 754-2008, _IEEE Standard for Floating-point Arithmetic_, August 2008.
+. [[IEEE-754]]IEEE Std 754-2008, _IEEE Standard for Floating-point Arithmetic_, August 2008.
+. [[OCP-OFP8]]Open Compute Project OCP 8-bit Floating Point Specification (OFP8) Revision 1.0
diff --git a/chapters/pseudocode.adoc b/chapters/pseudocode.adoc
index acce9c9..53b1142 100644
--- a/chapters/pseudocode.adoc
+++ b/chapters/pseudocode.adoc
@@ -1,7 +1,7 @@
//
// This confidential and proprietary software may be used only as
// authorised by a licensing agreement from ARM Limited
-// (C) COPYRIGHT 2021-2023 ARM Limited
+// (C) COPYRIGHT 2021-2024 ARM Limited
// ALL RIGHTS RESERVED
// The entire notice above must be reproduced on all authorised
// copies and copies may only be made to the extent permitted
@@ -142,6 +142,7 @@ include::{pseudocode}/library/arithmetic_helpers.tosac[lines=10..-1]
The following definitions indicate the type to be used when the given parameters are provided.
+
[source,c++]
----
include::{pseudocode}/library/type_conversion_helpers.tosac[lines=10..-1]