How is addition of doubles done here, and what is a valid way to detect that the result could not be represented?

Question

I am trying to understand double a bit better. In the following code snippet min and max are double:

double min = 3.472727272727276;
double max = 3.4727272727272767;
System.out.println(max - min);  
System.out.println((max - min)/2);
double mid = min + ((max - min)/2);
if(min == mid) {
    System.out.println("equal");
}
System.out.println(mid);

The first 2 print statements print:

4.440892098500626E-16
2.220446049250313E-16

Which basically is:
0.0000000000000004440892098500626 and 0.0000000000000002220446049250313

Then the conditional check is true, i.e. it prints equal, and the last print statement outputs: 3.472727272727276

So from my understanding the (max - min)/2 gave a value that could be represented by a double.
What is not clear to me is what is happening during the addition.

  1. Is the addition creating a number that cannot be represented by a double, so that digits are dropped and the original min is left as is? Or is the number effectively treated as 0 before the addition actually happens? How exactly is this done?
  2. Is min == mid a valid check to detect such issues with doubles? With integers we can detect overflow/underflow by checking whether the result is less than what we started with. Is an equality check after the addition a sane/reasonable way to detect the equivalent problem with double, i.e. that the number added had no real effect because of rounding error (or whatever the actual term for this is)?

Answer 1

Score: 2

For this example, it is easy to see what is happening by viewing the numbers in a hexadecimal floating-point format. The result of converting the source text 3.472727272727276 to double is 3.47272727272727621539161191321909427642822265625, which, using hexadecimal, is:

1.BC8253C8253D0₁₆•2¹

Observe there are exactly 53 bits in the significand—one before the “.” and 52 in 13 hexadecimal digits after it. The double format has one bit for the sign, 11 for the exponent, and 53 for the significand. (52 are stored explicitly; one is encoded via the exponent.)

Converting the source text 3.4727272727272767 to double yields 3.472727272727276659480821763281710445880889892578125, which is:

1.BC8253C8253D1₁₆•2¹

Now we can easily see what happens with arithmetic on them. Their difference is:

0.0000000000001₁₆•2¹

When we normalize that, it is 1.₁₆•2¹⁻⁵² = 1.₁₆•2⁻⁵¹ ≈ 4.44•10⁻¹⁶, and the double format can easily represent half of that simply by adjusting the exponent. Then we have 1.₁₆•2⁻⁵² ≈ 2.22•10⁻¹⁶.

However, when we try to add that halved difference to the first number, the result with real-number arithmetic is:

1.BC8253C8253D08₁₆•2¹

Observe this has 54 bits: one before the “.”, then 52 in 13 hexadecimal digits, and a final one in the high bit of that 14th digit, the 8. The double format does not have 54 bits in its significand, so addition in double format cannot produce this result. Instead, the sum is rounded to the nearest representable value or, in case of a tie, to the nearest representable value with an even low bit. Here the exact sum lies exactly halfway between 1.BC8253C8253D0₁₆•2¹ and 1.BC8253C8253D1₁₆•2¹, so the tie rule applies. So the result is 1.BC8253C8253D0₁₆•2¹, which is the same as min.
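
The hexadecimal view and the tie rule described above can be checked directly in Java. The following is only a minimal sketch: the class name HexRoundingDemo and the helper midpoint are made up for illustration, while Double.toHexString, Math.nextUp and Math.ulp are standard library methods.

```java
public class HexRoundingDemo {
    // Hypothetical helper: adds half of (max - min) to min, as in the question.
    static double midpoint(double min, double max) {
        return min + (max - min) / 2;
    }

    public static void main(String[] args) {
        double min = 3.472727272727276;
        double max = 3.4727272727272767;
        double half = (max - min) / 2;

        // Hexadecimal view of the stored values (trailing zero digits are omitted).
        System.out.println(Double.toHexString(min));
        System.out.println(Double.toHexString(max));
        System.out.println(Double.toHexString(half));

        // The two inputs are adjacent doubles, and half is exactly half their spacing.
        System.out.println(Math.nextUp(min) == max);
        System.out.println(half == Math.ulp(min) / 2);

        // A tie rounds to the neighbour with an even low bit:
        // min (low bit 0) stays where it is, max (low bit 1) moves up.
        System.out.println(min + half == min);
        System.out.println(max + half == Math.nextUp(max));
    }
}
```

All four comparisons should print true. The last one is worth noting: adding the same small value to max does change the result, which suggests the addend is not simply treated as 0 before the addition.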

Answer 2

Score: 1

> 1. Is the addition creating a number that could not be represented by a double

As a first step, the algorithm for adding two floating-point numbers brings the two numbers to the same exponent. Effectively this is done by shifting the bits of the smaller number to the right, and the bits shifted out past the end of the significand cannot be kept in the result (they are rounded away).

If the calculation is done with 64-bit precision,

3.472727272727276 + 2.220446049250313E-16     or in hex:
0x1.bc8253c8253dp1 + 0x1.0p-52

in effect becomes the calculation

3.472727272727276 + 0.0     or in hex:
0x1.bc8253c8253dp1 + 0x0.0p1

and this happens in hardware, so the intermediate 0.0 value is not stored anywhere or visible as a separate step.

But: it's possible that the calculation is done with higher precision than 64 bits. For example, if 80-bit precision floating-point CPU instructions are available, the JVM is allowed to use them. In that case the intermediate results will be different, but the end result is still going to be the same, because it has to be stored as a 64-bit double.

> 2. Is the min == mid a valid check to detect such issues with doubles?

Depends on what you need to do. The == operator checks if the two numbers are exactly equal, for better or for worse. In many places people don't want exact equality because it's difficult or impossible to achieve: for example Math.sin(Math.PI) is not going to be exactly 0 but you may prefer to pretend it's "close enough" to 0.
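
To illustrate the difference between exact equality and "close enough", here is a small sketch. The helper nearlyEqual and the tolerance 1e-12 are assumptions chosen for this example only; a suitable tolerance depends on the application.

```java
public class ToleranceDemo {
    // Hypothetical helper: true when a and b differ by at most eps.
    static boolean nearlyEqual(double a, double b, double eps) {
        return Math.abs(a - b) <= eps;
    }

    public static void main(String[] args) {
        double s = Math.sin(Math.PI);
        System.out.println(s);                            // a tiny non-zero value
        System.out.println(s == 0.0);                     // exact comparison
        System.out.println(nearlyEqual(s, 0.0, 1e-12));   // tolerance comparison
    }
}
```

The exact comparison reports false because Math.PI is only an approximation of π, while the tolerance comparison reports true.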

Answer 3

Score: 0

The following code may demonstrate the issue:

double num = 1;
while (!Double.isInfinite(num)) {
	num *= 2;
	System.out.println(num);
}
System.out.println("-----------------------");
System.out.println("-- now the opposite----");
System.out.println("-----------------------");
num = 1;
while (num > 0) {
	num /= 2;
	System.out.println(num);
}

The space in the memory is limited by the number of bits. Thus it is inevitable that at some point a very small number will be exactly zero.

In your calculation, the operators act on doubles and create temporary doubles in the CPU, which are subject to the same precision limit, so in your case the small value contributes nothing to the sum.

And of course, the == operator must be used with diligence on doubles, but that was not the problem here.

To answer your second question, you need to use BigDecimal instead of double to be on the safe side.

The problem with checking is that the values a double can assume are not evenly distributed. Between 0 and 1 there are about as many representable values as between 1 and Infinity.
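
The uneven spacing can be made visible with Math.ulp, which returns the distance from a value to the next larger double. This is only a sketch; the class name UlpSpacingDemo and the sample values are arbitrary choices.

```java
public class UlpSpacingDemo {
    public static void main(String[] args) {
        double[] samples = {1e-10, 1.0, 1e10, 1e16};
        for (double x : samples) {
            // The gap to the next representable double grows with the magnitude of x.
            System.out.println(x + " -> ulp " + Math.ulp(x));
        }
        // Near 1e16 neighbouring doubles are 2.0 apart, so adding 1.0 changes nothing.
        System.out.println(1e16 + 1.0 == 1e16);
    }
}
```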

EDIT: yes, the result mid == min is of course proof that the double precision limit has been reached. But the converse, mid != min, does not prove that the limit was not reached in another step.
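
A sketch of such a check is shown below. The helper name absorbed is hypothetical. The last line illustrates the caveat: a sum that differs from both operands can still be a rounded value.

```java
public class AbsorptionCheck {
    // Hypothetical helper: true when adding a non-zero b leaves a unchanged.
    static boolean absorbed(double a, double b) {
        return b != 0.0 && a + b == a;
    }

    public static void main(String[] args) {
        double min = 3.472727272727276;
        double max = 3.4727272727272767;
        System.out.println(absorbed(min, (max - min) / 2)); // true: the addend had no effect
        System.out.println(absorbed(1.0, 0.1));             // false: the sum changed
        // Not absorbed, yet the stored sum is still not the exact decimal result.
        System.out.println(0.1 + 0.2 == 0.3);               // false
    }
}
```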

In a general program that operates on arbitrary input doubles, you would need to do that sort of check on every intermediate calculation result. I think it is not worth the effort compared to using BigDecimal, and you also run the risk of forgetting some checks.
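
For comparison, here is a minimal sketch of the same midpoint calculation with BigDecimal. The class name BigDecimalMidpoint and the helper midpoint are made up for this example. The values are built from strings so that the decimal digits are taken exactly as written.

```java
import java.math.BigDecimal;

public class BigDecimalMidpoint {
    // Hypothetical helper: min + (max - min) / 2 in exact decimal arithmetic.
    static BigDecimal midpoint(BigDecimal min, BigDecimal max) {
        // Dividing by 2 always terminates in decimal, so no rounding mode is needed.
        return min.add(max.subtract(min).divide(BigDecimal.valueOf(2)));
    }

    public static void main(String[] args) {
        BigDecimal min = new BigDecimal("3.472727272727276");
        BigDecimal max = new BigDecimal("3.4727272727272767");
        BigDecimal mid = midpoint(min, max);
        System.out.println(mid);
        // The midpoint is strictly between the two inputs.
        System.out.println(mid.compareTo(min) > 0 && mid.compareTo(max) < 0);
    }
}
```

Note that BigDecimal is exact only for operations with a terminating decimal result. A division such as 1/3 still requires choosing a scale or a MathContext.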

huangapple
  • Posted on August 3, 2020, 22:06:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63230958.html