What are the smallest values that can no longer be represented exactly in C decimal point datatypes?
Question
For example: what is the smallest `uint32_t` value that can no longer be represented exactly as a `double`? And so on.
And how can these values be calculated?
Answer 1
Score: 3
> What is the smallest `uint32_t` value that can no longer be represented exactly as a `double`?
All `uint32_t` values can be exactly represented as a `double`: C specifies `DBL_DIG >= 10`, which means a `double` encodes exactly at least all consecutive integer values [-10<sup>10</sup> to +10<sup>10</sup>]. That range encompasses every `uint32_t` value.
> How to calculate these values?
Refer to the C spec.
The minimum contiguous range of integer values exactly encodable in a floating-point type is specified by C primarily through these minimum values:
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
In the case of `float`, all 6-decimal-digit integers are representable, [-999,999.0 to +999,999.0], as are +/-1,000,000. Starting from zero and going down, the first negative integer value that may not be representable as a `float` is -1,000,001. This is the minimum range when `FLT_RADIX == 10`, which hardly exists anymore.
When `FLT_RADIX == 2` (very common), the number of binary digits `p` in the significand satisfies:
(p-1)*log10(2) >= xxx_DIG
Solving this for `float` (where `FLT_DIG` is at least 6) gives p-1 >= 6/log10(2), which is about 19.93, so `p` is at least 21. With `p == 21`, `float` can exactly encode integers [-2<sup>21</sup> to +2<sup>21</sup>] or [-2,097,152 to +2,097,152].
So much for spec minimums.
Common limit
A typical `float`, with its sign bit and 24 binary digits of significand, has a wider range than the C spec minimum. It exactly encodes integers [-2<sup>24</sup> to +2<sup>24</sup>]. The first negative integer value that is not representable is -2<sup>24</sup> - 1, or -16,777,217.
Answer 2
Score: 2
What `float` and `double` are varies by compiler, which chooses based on the target architecture. On x86 or x86-64, a `float` is most likely an IEEE single-precision floating-point number, and a `double` is most likely an IEEE double-precision floating-point number.
| Type | Range of continuously representable integers (inclusive bounds) | Size of this range |
|---|---|---|
| `uint16_t` | 0 .. 2^16-1 | 2^16 |
| `uint32_t` | 0 .. 2^32-1 | 2^32 |
| `uint64_t` | 0 .. 2^64-1 | 2^64 |
| IEEE single-precision | -2^24 .. 2^24 | 2^25+1 |
| IEEE double-precision | -2^53 .. 2^53 | 2^54+1 |
An IEEE single-precision float has 24 bits of precision.<sup>[1]</sup> It can exactly represent far larger numbers than those mentioned above, but it can't represent 2^24+1 or -2^24-1 exactly.
An IEEE double-precision float has 53 bits of precision.<sup>[1]</sup> It can exactly represent far larger numbers than those mentioned above, but it can't represent 2^53+1 or -2^53-1 exactly.
[1] Subnormals (extremely small numbers) have less precision.
Answer 3
Score: 1
1. Your question contains a fundamental discrepancy: you ask about "decimal point" datatypes but then mention `double`. In the vast majority of available implementations, `double` is not a decimal floating-point type; it is a binary floating-point type (the IEEE 754 64-bit binary format). You may want to read up on how the two differ.
For the following I'll assume a binary `double` and not the (rarely present and rarely used) truly decimal types.
2. That's, well, simple but a tiny bit tricky :)
I will assume that the platform you refer to natively implements IEEE 754, so that `float` is the 32-bit binary IEEE format and `double` is the 64-bit binary IEEE format.
In this case, `float` has a 24-bit mantissa with a leading 1 (we don't count denormals here), and we need the smallest value whose binary form, not counting leading and trailing zeros, does not fit into 24 bits. (There is no need to check the exponent range for this question.)
For the following, "0b" is the binary prefix and "**" is the power operator.
16777215 = 2**24-1 is 0b111111111111111111111111 (24 consecutive ones). It fits.
16777216 = 2**24 is 0b1000000000000000000000000 (a one and 24 consecutive zeros). It fits.
16777217 = 2**24+1 is 0b1000000000000000000000001 (a one, 23 consecutive zeros, and a one). It does not fit.
A `float` can represent ..., 16777214, 16777215, 16777216, 16777218, 16777220, 16777222, ... so starting at 16777216 = 2**24, the step between representable values is 2.
So the smallest unsigned integer that is not representable in a `float` is 16777217.
There is no need to write out these long bit strings for the other case: for `double`, with its 53-bit mantissa, it would just be cumbersome. I hope the principle is clear. For your concrete example, it means that every `uint32_t` value can be represented exactly in a `double`, but not every one can be represented exactly in a `float`.
3. Also, for `uint32_t` and `double` you could simply have compared the declared precision. A `uint32_t` holds up to 10 decimal digits (9 guaranteed). A `double` needs up to 17 decimal digits to represent any of its values exactly, and is guaranteed to hold 15 decimal digits. The gap between these ranges is obvious, so no more precise boundary check is needed.