author | Eric Kunze <eric.kunze@arm.com> | 2024-04-12 16:19:55 -0700 |
---|---|---|
committer | Eric Kunze <eric.kunze@arm.com> | 2024-04-17 23:56:37 +0000 |
commit | aa162aa6d2287bcc7bfb7b976b3daabc84b62af4 | |
tree | 2bcb24fe65343dd6cb43c16dbed518eeb19d3141 /pseudocode/library/numeric_conversion_helpers.tosac | |
parent | 7ad78d37a51f8b333367effe62d596ac89cdcdb5 | |
download | specification-aa162aa6d2287bcc7bfb7b976b3daabc84b62af4.tar.gz |
Switch fp8 to use non-saturating mode when converting
Implementations should use non-saturating mode and call CLAMP
if saturation is needed.
Signed-off-by: Eric Kunze <eric.kunze@arm.com>
Change-Id: I7a79931552dd6c3ab5fc247a963e3e7ba1e38ae2
Diffstat (limited to 'pseudocode/library/numeric_conversion_helpers.tosac')
-rw-r--r-- | pseudocode/library/numeric_conversion_helpers.tosac | 8 |
1 files changed, 2 insertions, 6 deletions
diff --git a/pseudocode/library/numeric_conversion_helpers.tosac b/pseudocode/library/numeric_conversion_helpers.tosac
index 0073a66..ae5d9fb 100644
--- a/pseudocode/library/numeric_conversion_helpers.tosac
+++ b/pseudocode/library/numeric_conversion_helpers.tosac
@@ -13,13 +13,9 @@ int round_to_nearest_int(float_t f);
 
 // Converts the input value into floating-point, rounding to the nearest representable value.
 // Values that are not NaN outside of the representable range of the destination type must be set to infinity of the correct sign.
+// If the destination floating point type does not have an infinity representation, values outside of the representable range must be set to NaN.
 // For the required precision see the section: Main inference precision requirements.
-float_t round_to_nearest_float_nonsaturating(in_t f);
-
-// Converts the input value into floating-point, rounding to the nearest representable normal value.
-// Values that are not NaN outside of the representable range must return the maximum representable normal value of the correct sign.
-// For the required precision see the section: Main inference precision requirements.
-float_t round_to_nearest_float_saturating(in_t f);
+float_t round_to_nearest_float(in_t f);
 
 // Floating point values are unchanged.
 // For two's complement integer values where out_t has more bits than in_t, replicate the top bit of input for all bits between the top bit of input and the top bit of output.
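The behavior described by the new comment can be illustrated with a small Python model. This is only a sketch, not the specification's reference implementation: the parameter names (`mant_bits`, `min_exp`, `max_finite`, `has_inf`), the `clamp` helper standing in for the CLAMP operator, and the choice to clamp the input before converting are all assumptions made for illustration. The format constants assume the OCP-style fp8 encodings (E4M3 with no infinity and a maximum of 448, E5M2 with infinity and a maximum of 57344), and ties are assumed to round to even.

```python
import math

def round_to_nearest_float(x, mant_bits, min_exp, max_finite, has_inf):
    """Non-saturating round-to-nearest into a small float format (illustrative model)."""
    if math.isnan(x) or x == 0.0:
        return x
    if math.isinf(x):
        # Formats without an infinity encoding map out-of-range values to NaN.
        return x if has_inf else math.nan
    _, e = math.frexp(abs(x))          # abs(x) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, min_exp)          # exponent of the leading bit; pinned for subnormals
    ulp = math.ldexp(1.0, exp - mant_bits)
    rounded = round(abs(x) / ulp) * ulp  # Python's round() is round-half-to-even
    if rounded > max_finite:
        rounded = math.inf if has_inf else math.nan
    return math.copysign(rounded, x)

def clamp(x, lo, hi):
    """Hypothetical stand-in for CLAMP; NaN is passed through unchanged."""
    if math.isnan(x):
        return x
    return max(lo, min(hi, x))

# Assumed format parameters (OCP-style fp8).
E4M3 = dict(mant_bits=3, min_exp=-6, max_finite=448.0, has_inf=False)
E5M2 = dict(mant_bits=2, min_exp=-14, max_finite=57344.0, has_inf=True)

def to_fp8(x, fmt):
    return round_to_nearest_float(x, **fmt)

def to_fp8_saturating(x, fmt):
    # Saturation is obtained by clamping first, then using the single
    # non-saturating conversion (assumed ordering).
    return to_fp8(clamp(x, -fmt["max_finite"], fmt["max_finite"]), fmt)
```

With this model, `to_fp8(1000.0, E4M3)` yields NaN and `to_fp8(1e6, E5M2)` yields infinity, while `to_fp8_saturating` returns the largest finite value of the correct sign in both cases. That is why a separate saturating helper is not needed once CLAMP is available.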