Issue related to double precision floating point division in C++

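Question

In C++, we can query the smallest positive normalized double with std::numeric_limits<double>::min(). The value prints as 2.22507e-308.

Now suppose this minimum value is subtracted from a given double (say val) and the result is divided by that same value: (val - minval) / val. I was expecting floor((val - minval) / val) to round the quotient down to 0.

To my surprise, the answer comes back as 1. Can someone please explain this anomalous behavior?

Consider the following code:

#include <cmath>
#include <iostream>
#include <limits>
using namespace std;

int main()
{
  double minval = numeric_limits<double>::min(), wg = 8038,
         ans = floor((wg - minval) / wg); // expecting the answer to round to 0
  cout << ans; // but the answer actually resulted as 1!
}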

Answer 1

Score: 1

A double typically has around 16 digits of precision.

You're starting with 8038. For simplicity, I'm going to call that 8.038e3. Since we have around 16 digits of precision, the smallest number we can subtract from that and still get a result different from 8038 is on the order of 8.038e(3-16) = 8.038e-13.

8038 - 2.2e-308 is like reducing the mass of the universe by one electron, and expecting that to affect the mass of the universe by a significant amount.

Actually, relatively speaking, 8038-2.2e-308 is a much smaller change than removing a whole electron from the universe--more like removing a minuscule fraction of a single electron from the universe, if that were possible. Even if we were to assume that string theory were correct, even removing one string from the universe would still be a huge change compared to subtracting 2.2e-308 from 8038.

Answer 2

Score: 0

The comments and the previous answer correctly attribute the cause to floating-point precision, but more detail is needed to explain the behavior fully. Even when the exact result of a subtraction cannot be represented in the finite precision of a double, the subtraction is not simply discarded: the exact result is still rounded to a representable value.

As an example, consider the code below.

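#include <cmath>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    vector<double> a{0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7};
    const double d = 1e-17;  // amount subtracted from every element

    cout << "Subtraction Possible?" << "\t" << "Floor Result" << "\n";

    for (size_t i = 0; i < a.size(); i++) {
        double b = nextafter(a[i], 0.0);  // closest double below a[i]
        double c = a[i] - b;              // gap between a[i] and b

        // Prints "True" only when d is larger than that gap
        cout << (d > c ? "True" : "False") << "\t";

        cout << setprecision(52) << floor((a[i] - d) / a[i]) << "\n";
    }

    return 0;
}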

The code takes several double values in the vector a and subtracts 1e-17 from each. Using std::nextafter, the gap between 0.07 and the closest double below it comes out as 1.387778780781445675529539585113525390625e-17, and the gap for the larger values in a is at least as large. So 1e-17 is smaller than the gap between any of these numbers and its lower neighbor, which is why the first column always prints False. If the subtraction were simply discarded in that case, the floor result would always be 1, but it turns out to be 0 for some values and 1 for others.

This can be observed from the output of the C++ program as shown below:

Subtraction Possible?    Floor Result
False                    0
False                    0
False                    0
False                    0
False                    1
False                    1
False                    1
False                    1
False                    1

The reasons lie in the floating-point specification, IEEE 754. The standard states that even when the exact result of an operation cannot be represented, it must still be rounded. I quote Page 27, Section 4.3 of the IEEE 754-2019 document:

Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause

The statement is repeated in Section 5.1 on Page 29:

Unless otherwise specified, each of the computational operations specified by this standard that returns a numeric result shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, if necessary, to fit in the destination’s format (see Clause 4 and Clause 7).

On my platform, code compiled with g++ (the compiler I tested with) follows the standard closely and uses the round-to-nearest mode described in Section 4.3.1 of IEEE 754. This means that even when a[i] - d is not representable, the result is delivered as if the subtraction first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result. Hence a[i] - d == a[i] may or may not hold, and the answer may or may not be 1, depending on whether the exact value of a[i] - d is closer to a[i] or to the next representable value below a[i].

For the question's numbers, 8038 - 2.22507e-308 is far closer to 8038 than to the next representable value below it, so round-to-nearest returns 8038 and the final answer is 1. The point is that this behavior follows from the rounding rules of the standard and is not arbitrary.

I found the references below on floating-point numbers very useful. I would recommend reading the article by Cleve Moler (the creator of MATLAB) before going through the IEEE specification, for a quick and easy understanding of how floating-point numbers behave.

  1. "IEEE Standard for Floating-Point Arithmetic," IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
  2. Moler, Cleve. "Floating Points." MATLAB News and Notes, Fall 1996.

huangapple
  • Posted on February 10, 2023, 10:57:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75406504.html