What are the smallest values that can no longer be represented exactly in C decimal point datatypes?

Question

For example: What is the smallest uint32_t value that can no longer be represented exactly as a double? etc.

And how to calculate these values?

Answer 1

Score: 3

> What is the smallest uint32_t value that can no longer be represented exactly as a double?

Every uint32_t value can be represented exactly as a double. C requires DBL_DIG >= 10, which means a double exactly encodes at least all consecutive integer values in [-10^10 to +10^10]. That range covers every uint32_t, whose maximum is 2^32 - 1 = 4,294,967,295.


> how to calculate these values?

Refer to the C spec.

C specifies the minimum contiguous range of integer values that a floating-point type must encode exactly, primarily through these minimum values:

FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10

In the case of float, all 6 decimal digit integers are representable [-999,999.0 to +999,999.0] including +/- 1,000,000. Starting from zero and going down, the first negative integer value that may not be representable as a float is -1,000,001. This is the minimum range when FLT_RADIX == 10, which hardly exists anymore.

When FLT_RADIX == 2 (very common), the number of binary digits p in the significand satisfies:

(p-1)*log10(2) >= xxx_DIG

Solving this for float (FLT_DIG >= 6) gives p - 1 >= 6/log10(2) ≈ 19.93, so p is at least 21. With p == 21, float can exactly encode the integers [-2^21 to +2^21] or [-2,097,152 to +2,097,152].

So much for spec minimums.


Common limit

A typical float, with its sign bit and 24 binary digits, has a wider range than the C spec minimum. It exactly encodes the integers [-2^24 to +2^24]. The first negative integer value that is not representable is -2^24 - 1, or -16,777,217.

Answer 2

Score: 2

What float and double actually are depends on the compiler, which chooses based on the target architecture. On x86 or x86-64, float is most likely an IEEE single-precision floating-point number and double an IEEE double-precision floating-point number.

| Type | Range of continuously representable integers (inclusive bounds) | Size of this range |
|---|---|---|
| uint16_t | 0 .. 2^16-1 | 2^16 |
| uint32_t | 0 .. 2^32-1 | 2^32 |
| uint64_t | 0 .. 2^64-1 | 2^64 |
| IEEE single-precision | -2^24 .. 2^24 | 2^25+1 |
| IEEE double-precision | -2^53 .. 2^53 | 2^54+1 |

An IEEE single-precision float has 24 bits of precision.[1] It can exactly represent far larger numbers than those mentioned above, but it can't represent 2^24+1 or -2^24-1 exactly.

An IEEE double-precision float has 53 bits of precision.[1] It can exactly represent far larger numbers than those mentioned above, but it can't represent 2^53+1 or -2^53-1 exactly.


[1] Subnormals (extremely small numbers) have less precision.

Answer 3

Score: 1

1. There is a fundamental discrepancy in your question: you asked about "decimal point" datatypes but then mentioned "double".

double, in the vast majority of available implementations, is not a decimal type but a binary one (the IEEE 754 64-bit binary format). The literature on floating-point formats covers how the two differ in detail.

For the following I'll assume a binary double and not a true decimal type (which is rarely present or used).

2. That's, well, simple but a tiny bit tricky :)

I will assume that the platform you refer to natively implements IEEE 754, that float is the 32-bit binary IEEE format, and that double is the 64-bit binary IEEE format.
In this case, float has a 24-bit significand (mantissa) counting the leading 1 (denormals are not considered here), and we need the smallest value whose binary representation, ignoring trailing zeros, does not fit into 24 bits. (There is no need to check the exponent range for this question.)

In what follows, "0b" is the binary prefix and "**" is the power operator.

16777215 = 2**24-1 is 0b111111111111111111111111 (24 consecutive ones). It fits.

16777216 = 2**24 is 0b1000000000000000000000000 (one and 24 consecutive zeros). It fits.

16777217 = 2**24+1 is 0b1000000000000000000000001 (one, 23 consecutive zeros, and one). It does not fit.

A float can represent ..., 16777214, 16777215, 16777216, 16777218, 16777220, 16777222, ... so starting at 16777216 = 2**24, the step between representable values is 2.

So the smallest unsigned integer that is not representable in a float is 16777217.

There is no need to write out these long bit strings for the other cases; for double, with its 53-bit significand, it would just be cumbersome. I hope the principle is clear. For your concrete example, it means that every uint32_t value can be represented exactly in a double, but not every uint32_t value can be represented exactly in a float.

3. Also, for uint32_t and double you could simply compare the declared precision. A uint32_t has up to 10 decimal digits (9 guaranteed). A double needs up to 17 decimal digits to represent any of its values exactly, and is guaranteed to preserve 15. The gap between the two is obvious, so no more precise boundary check is needed.

huangapple
  • Posted on 2023-05-22 02:30:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76301365.html