Diffstat (limited to 'chapters/introduction.adoc')
-rw-r--r-- | chapters/introduction.adoc | 33 |
1 files changed, 27 insertions, 6 deletions
diff --git a/chapters/introduction.adoc b/chapters/introduction.adoc
index 9d53510..17c16a8 100644
--- a/chapters/introduction.adoc
+++ b/chapters/introduction.adoc
@@ -245,7 +245,8 @@ Multiplication of an infinity by a zero must produce a NaN. +
 Otherwise the result must be within 0.5 ulp of the mathematical result.
 
 | <<CAST>>
-| Floating-point result overflows must be set to infinity of the correct sign. +
+| Result overflows when converting between fp32_t, bf16_t and fp16_t must be set to infinity of the correct sign. +
+fp8e4m3_t and fp8e5m2_t must use the saturation mode rules defined in <<IEEE-754,IEEE-754>> when converting from the wider floating-point types. +
 Floating-point result underflows must be set to zero of the correct sign. +
 Cast from floating-point to integer result overflows must be saturated. +
 Cast from floating-point to integer must be rounded using round to nearest, ties to even, rounding mode. +
@@ -339,7 +340,7 @@ This may be, for example, a convolution.
 This section defines the accuracy required for these operations.
 In this section:
 
-* "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by IEEE 754 (<<Other publications>>[1])
+* "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by <<IEEE-754,IEEE-754>>
 * `operation_fp64()` is an fp64 reference implementation of the operation
 * `operation_imp()` is the implementation under test
 * `local_bound` is defined as follows:
@@ -537,10 +538,29 @@ The number formats supported by a given operator are listed in its table of supp
 | (1<<47)-1
 |Signed 48-bit two's-complement value.
 
+|fp8e4m3_t
+| -448
+| 448
+| 8-bit floating-point defined by <<OCP-OFP8,OCP-OFP8>> with four bits of exponent and three bits of mantissa. +
+Normal values must be supported. +
+Denormal values must be supported. +
+The NaN encoding must be supported. +
+Signed zero must be supported.
+
+|fp8e5m2_t
+| -infinity
+| +infinity
+| 8-bit floating-point defined by <<OCP-OFP8,OCP-OFP8>> with five bits of exponent and two bits of mantissa. +
+Normal values must be supported. +
+Denormal values must be supported. +
+Positive and negative infinity must be supported. +
+NaN encodings must be supported. +
+Signed zero must be supported.
+
 |fp16_t
 | -infinity
 | +infinity
-| 16-bit half-precision floating-point defined by <<Other publications>>[1]. +
+| 16-bit half-precision floating-point defined by <<IEEE-754,IEEE-754>> . +
 Normal values must be supported. +
 Denormal values must either be supported or flushed to zero. +
 Positive and negative infinity must be supported. +
@@ -560,7 +580,7 @@ Signed zero must be supported.
 |fp32_t
 | -infinity
 | +infinity
-| 32-bit single-precision floating-point defined by <<Other publications>>[1]. +
+| 32-bit single-precision floating-point defined by <<IEEE-754,IEEE-754>> . +
 Normal values must be supported. +
 Denormal values must either be supported or flushed to zero. +
 Positive and negative infinity must be supported. +
@@ -570,7 +590,7 @@ Signed zero must be supported.
 |fp64_t
 | -infinity
 | + infinity
-| 64-bit double-precision floating-point defined by <<Other publications>>[1]. +
+| 64-bit double-precision floating-point defined by <<IEEE-754,IEEE-754>>. +
 Normal values must be supported. +
 Denormal values must either be supported or flushed to zero. +
 Positive and negative infinity must be supported. +
@@ -744,4 +764,5 @@ void generate_lookup_table(int16_t *table, int32_t (*reference)(int32_t))
 
 The following publications are referred to in this specification, or provide more information:
 
-. IEEE Std 754-2008, _IEEE Standard for Floating-point Arithmetic_, August 2008.
+. [[IEEE-754]]IEEE Std 754-2008, _IEEE Standard for Floating-point Arithmetic_, August 2008.
+. [[OCP-OFP8]]Open Compute Project OCP 8-bit Floating Point Specification (OFP8) Revision 1.0
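
The two new table rows give fp8e4m3_t a finite range of -448 to 448 and fp8e5m2_t a range of -infinity to +infinity. The following is a minimal sketch, not part of the commit, of how those limits follow from the bit layouts. It assumes the OFP8 encodings: exponent bias 7 for E4M3 and 15 for E5M2, a single NaN mantissa pattern per sign (S.1111.111) and no infinities in E4M3, and IEEE-style infinities and NaNs in E5M2. The function names `decode_fp8`, `decode_fp8e4m3` and `decode_fp8e5m2` are hypothetical and do not appear in the specification.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, ieee_specials):
    """Decode one 8-bit pattern into a Python float (illustrative only).

    ieee_specials=True  -> E5M2-style: an all-ones exponent encodes inf/NaN.
    ieee_specials=False -> E4M3-style: only S.1111.111 is NaN, no infinities.
    """
    bias = (1 << (exp_bits - 1)) - 1          # assumed bias: 7 (E4M3), 15 (E5M2)
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1

    if ieee_specials and exp == exp_all_ones:
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_all_ones and man == man_all_ones:
        return math.nan
    if exp == 0:
        # Denormal (or zero): no implicit leading one, exponent is 1 - bias.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    # Normal value: implicit leading one.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_fp8e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, ieee_specials=False)


def decode_fp8e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, ieee_specials=True)


if __name__ == "__main__":
    print(decode_fp8e4m3(0x7E), decode_fp8e4m3(0xFE))   # largest magnitudes in E4M3
    print(decode_fp8e5m2(0x7C), decode_fp8e5m2(0xFC))   # infinities in E5M2
```

Under these assumptions the largest E4M3 pattern below the NaN encoding, 0x7E, decodes to 1.75 * 2^8 = 448, which matches the minimum and maximum columns added for fp8e4m3_t. E5M2 keeps the all-ones exponent for infinities and NaNs, which matches the -infinity and +infinity bounds listed for fp8e5m2_t.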