Diffstat (limited to 'chapters')
-rw-r--r-- chapters/introduction.adoc | 33
-rw-r--r-- chapters/pseudocode.adoc | 3
2 files changed, 29 insertions, 7 deletions
diff --git a/chapters/introduction.adoc b/chapters/introduction.adoc
index 9d53510..17c16a8 100644
--- a/chapters/introduction.adoc
+++ b/chapters/introduction.adoc
@@ -245,7 +245,8 @@ Multiplication of an infinity by a zero must produce a NaN. +
Otherwise the result must be within 0.5 ulp of the mathematical result.
| <<CAST>>
-| Floating-point result overflows must be set to infinity of the correct sign. +
+| Result overflows when converting between fp32_t, bf16_t and fp16_t must be set to infinity of the correct sign. +
+fp8e4m3_t and fp8e5m2_t must use the saturation mode rules defined in <<IEEE-754,IEEE-754>> when converting from the wider floating-point types. +
Floating-point result underflows must be set to zero of the correct sign. +
Cast from floating-point to integer result overflows must be saturated. +
Cast from floating-point to integer must be rounded using round to nearest, ties to even, rounding mode. +
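Reviewer note on the hunk above: the overflow handling that the new fp8 sentence asks for can be sketched as a clamp to the largest finite value of the target format. This is a minimal sketch, not the specification's reference code. The helper name `saturate_to_fp8_range` and both constants are hypothetical names introduced here. The fp8e5m2 maximum of 57344 and the mapping of infinities to the largest finite value are one reading of the OFP8 saturating mode and should be checked against the cited specification. Rounding onto the fp8 grid is a separate step that is not shown.

```cpp
#include <cmath>
#include <limits>

// Assumed largest finite magnitudes of the two OFP8 formats.
constexpr float FP8E4M3_MAX = 448.0f;    // encoding S.1111.110
constexpr float FP8E5M2_MAX = 57344.0f;  // encoding S.11110.11

// Hypothetical helper: applies only the overflow part of a saturating
// conversion from a wider float type. It does not round to fp8 precision.
inline float saturate_to_fp8_range(float x, float max_finite) {
    if (std::isnan(x)) return x;               // NaN is passed through unchanged
    if (x > max_finite) return max_finite;     // also catches +infinity
    if (x < -max_finite) return -max_finite;   // also catches -infinity
    return x;                                  // in range; the sign of zero is kept
}
```

Clamping before rounding gives the same result as rounding first and saturating afterwards, because any input above the largest finite value ends at that value either way.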
@@ -339,7 +340,7 @@ This may be, for example, a convolution.
This section defines the accuracy required for these operations.
In this section:
-* "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by IEEE 754 (<<Other publications>>[1])
+* "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by <<IEEE-754,IEEE-754>>
* `operation_fp64()` is an fp64 reference implementation of the operation
* `operation_imp()` is the implementation under test
* `local_bound` is defined as follows:
@@ -537,10 +538,29 @@ The number formats supported by a given operator are listed in its table of supp
| (1<<47)-1
|Signed 48-bit two's-complement value.
+|fp8e4m3_t
+| -448
+| 448
+| 8-bit floating-point defined by <<OCP-OFP8,OCP-OFP8>> with four bits of exponent and three bits of mantissa. +
+Normal values must be supported. +
+Denormal values must be supported. +
+The NaN encoding must be supported. +
+Signed zero must be supported.
+
+|fp8e5m2_t
+| -infinity
+| +infinity
+| 8-bit floating-point defined by <<OCP-OFP8,OCP-OFP8>> with five bits of exponent and two bits of mantissa. +
+Normal values must be supported. +
+Denormal values must be supported. +
+Positive and negative infinity must be supported. +
+NaN encodings must be supported. +
+Signed zero must be supported.
+
|fp16_t
| -infinity
| +infinity
-| 16-bit half-precision floating-point defined by <<Other publications>>[1]. +
+| 16-bit half-precision floating-point defined by <<IEEE-754,IEEE-754>> . +
Normal values must be supported. +
Denormal values must either be supported or flushed to zero. +
Positive and negative infinity must be supported. +
@@ -560,7 +580,7 @@ Signed zero must be supported.
|fp32_t
| -infinity
| +infinity
-| 32-bit single-precision floating-point defined by <<Other publications>>[1]. +
+| 32-bit single-precision floating-point defined by <<IEEE-754,IEEE-754>> . +
Normal values must be supported. +
Denormal values must either be supported or flushed to zero. +
Positive and negative infinity must be supported. +
@@ -570,7 +590,7 @@ Signed zero must be supported.
|fp64_t
| -infinity
| + infinity
-| 64-bit double-precision floating-point defined by <<Other publications>>[1]. +
+| 64-bit double-precision floating-point defined by <<IEEE-754,IEEE-754>>. +
Normal values must be supported. +
Denormal values must either be supported or flushed to zero. +
Positive and negative infinity must be supported. +
@@ -744,4 +764,5 @@ void generate_lookup_table(int16_t *table, int32_t (*reference)(int32_t))
The following publications are referred to in this specification, or provide more information:
-. IEEE Std 754-2008, _IEEE Standard for Floating-point Arithmetic_, August 2008.
+. [[IEEE-754]]IEEE Std 754-2008, _IEEE Standard for Floating-point Arithmetic_, August 2008.
+. [[OCP-OFP8]]Open Compute Project OCP 8-bit Floating Point Specification (OFP8) Revision 1.0
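Reviewer note on the number format rows added in this file: the -448 and 448 bounds for fp8e4m3_t, and the infinities listed for fp8e5m2_t, follow from how the two encodings are laid out. The sketch below decodes a raw byte of each format to `float`. It assumes the OFP8 layouts as commonly described: bias 7 with a single NaN mantissa pattern and no infinities for e4m3, and an IEEE-style layout with bias 15 for e5m2. The function names `decode_fp8e4m3` and `decode_fp8e5m2` are hypothetical and do not come from the TOSA pseudocode.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode fp8e4m3: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
inline float decode_fp8e4m3(uint8_t bits) {
    const float sign = (bits & 0x80) ? -1.0f : 1.0f;
    const int exp = (bits >> 3) & 0xF;
    const int man = bits & 0x7;
    if (exp == 0xF && man == 0x7)   // S.1111.111 is NaN; there are no infinities
        return std::numeric_limits<float>::quiet_NaN();
    if (exp == 0)                   // signed zero or denormal: man * 2^-9
        return sign * std::ldexp(static_cast<float>(man), -9);
    return sign * std::ldexp(1.0f + man / 8.0f, exp - 7);
}

// Decode fp8e5m2: 1 sign bit, 5 exponent bits (bias 15), 2 mantissa bits.
inline float decode_fp8e5m2(uint8_t bits) {
    const float sign = (bits & 0x80) ? -1.0f : 1.0f;
    const int exp = (bits >> 2) & 0x1F;
    const int man = bits & 0x3;
    if (exp == 0x1F) {              // all-ones exponent: infinity or NaN
        if (man == 0) return sign * std::numeric_limits<float>::infinity();
        return std::numeric_limits<float>::quiet_NaN();
    }
    if (exp == 0)                   // signed zero or denormal: man * 2^-16
        return sign * std::ldexp(static_cast<float>(man), -16);
    return sign * std::ldexp(1.0f + man / 4.0f, exp - 15);
}
```

Under these assumptions `decode_fp8e4m3(0x7E)` gives 448, the largest finite fp8e4m3_t value, because the all-ones exponent is still available for normal numbers in that format. In fp8e5m2_t the all-ones exponent is reserved, so `decode_fp8e5m2(0x7C)` gives +infinity.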
diff --git a/chapters/pseudocode.adoc b/chapters/pseudocode.adoc
index acce9c9..53b1142 100644
--- a/chapters/pseudocode.adoc
+++ b/chapters/pseudocode.adoc
@@ -1,7 +1,7 @@
//
// This confidential and proprietary software may be used only as
// authorised by a licensing agreement from ARM Limited
-// (C) COPYRIGHT 2021-2023 ARM Limited
+// (C) COPYRIGHT 2021-2024 ARM Limited
// ALL RIGHTS RESERVED
// The entire notice above must be reproduced on all authorised
// copies and copies may only be made to the extent permitted
@@ -142,6 +142,7 @@ include::{pseudocode}/library/arithmetic_helpers.tosac[lines=10..-1]
The following definitions indicate the type to be used when the given parameters are provided.
+
[source,c++]
----
include::{pseudocode}/library/type_conversion_helpers.tosac[lines=10..-1]