From aa162aa6d2287bcc7bfb7b976b3daabc84b62af4 Mon Sep 17 00:00:00 2001
From: Eric Kunze
Date: Fri, 12 Apr 2024 16:19:55 -0700
Subject: Switch fp8 to use non-saturating mode when converting

Implementations should use non-saturating mode and call CLAMP if
saturation is needed.

Signed-off-by: Eric Kunze
Change-Id: I7a79931552dd6c3ab5fc247a963e3e7ba1e38ae2
---
 chapters/introduction.adoc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'chapters')

diff --git a/chapters/introduction.adoc b/chapters/introduction.adoc
index 64d34e9..0030757 100644
--- a/chapters/introduction.adoc
+++ b/chapters/introduction.adoc
@@ -254,7 +254,8 @@ Otherwise the result must be within 0.5 ulp of the mathematical result.
 
 | <>
 | Result overflows when converting between fp32_t, bf16_t and fp16_t must be set to infinity of the correct sign. +
-fp8e4m3_t and fp8e5m2_t must use the saturation mode rules defined in <> when converting from the wider floating-point types. +
+fp8e4m3_t and fp8e5m2_t must use the non-saturating mode defined in <> when converting from the wider floating-point types. +
+If saturation of the fp8 types is desired, a <> operation with the appropriate parameters should be used before the cast. +
 Floating-point result underflows must be set to zero of the correct sign. +
 Cast from floating-point to integer result overflows must be saturated. +
 Cast from floating-point to integer must be rounded using round to nearest, ties to even, rounding mode. +
-- 
cgit v1.2.1
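
Reviewer note: the clamp-before-cast pattern that the new text recommends could look roughly like the sketch below. It is illustrative only and not part of the patch. The helper names `clamp` and `saturate_before_cast` are hypothetical, and the limits 448 and 57344 are assumed to be the largest finite magnitudes of the two fp8 formats (as in the OCP FP8 definitions); the patch itself does not state them.

```python
import math

# Assumed largest finite magnitudes of the fp8 formats (not stated in the patch).
FP8_MAX = {"fp8e4m3_t": 448.0, "fp8e5m2_t": 57344.0}


def clamp(x, min_val, max_val):
    """Hypothetical scalar stand-in for a CLAMP operation; NaN passes through."""
    if math.isnan(x):
        return x
    return max(min_val, min(x, max_val))


def saturate_before_cast(x, fp8_type):
    """Clamp a wider floating-point value into the finite range of fp8_type.

    The result would then go through the (non-saturating) cast, which can no
    longer overflow because the input is already within range.
    """
    limit = FP8_MAX[fp8_type]
    return clamp(x, -limit, limit)
```

With this ordering, saturation is an explicit choice by whoever builds the graph, and a cast without the preceding clamp keeps the non-saturating overflow behaviour.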