Why do IEEE 754 floating-point numbers use a sign bit of 1 for negative numbers?

Question

The typical reason given for using a biased exponent (also known as offset binary) in floating-point numbers is that it makes comparisons easier.

> By arranging the fields such that the sign bit takes the most significant bit position, the biased exponent takes the middle position, then the significand will be the least significant bits and the resulting value will be ordered properly. This is the case whether or not it is interpreted as a floating-point or integer value. The purpose of this is to enable high speed comparisons between floating-point numbers using fixed-point hardware.

However, because the sign bit of IEEE 754 floating-point numbers is set to 1 for negative numbers and 0 for positive numbers, the encoding of any negative floating-point number, interpreted as an unsigned integer, is greater than that of any positive floating-point number. If the convention were reversed, this would not be the case: every positive floating-point number interpreted as an unsigned integer would be greater than every negative one.
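To make the ordering problem concrete, here is a minimal C sketch. It assumes `float` is IEEE 754 binary32; `float_bits` and `naive_less` are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a float's bit pattern into a uint32_t.
   Assumes float is IEEE 754 binary32. */
static uint32_t float_bits(float x)
{
	uint32_t u;
	assert(sizeof u == sizeof x);
	memcpy(&u, &x, sizeof u);
	return u;
}

/* Naive comparison: treat both encodings as plain unsigned integers. */
static int naive_less(float a, float b)
{
	return float_bits(a) < float_bits(b);
}
```

Under that assumption, `naive_less(1.0f, 2.0f)` is 1 as expected, but `naive_less(-1.0f, 1.0f)` is 0 because 0xBF800000 is greater than 0x3F800000. Note that `naive_less(-2.0f, -1.0f)` is also 0, so the negative numbers are ordered backwards among themselves as well.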

I understand this wouldn't completely trivialize comparisons because NaN != NaN, which must be handled separately (although whether or not this is even desirable is questionable as discussed in that question). Regardless, it's strange that this is the reason given for using a biased exponent representation when it is seemingly defeated by the specified values of the sign and magnitude representation.

There is more discussion on the questions "Why do we bias the exponent of a floating-point number?" and "Why IEEE floating point number calculate exponent using a biased form?" From the first, the accepted answer even mentions this (emphasis mine):

> The IEEE 754 encodings have a convenient property that an order comparison can be performed between two positive non-NaN numbers by simply comparing the corresponding bit strings lexicographically, or equivalently, by interpreting those bit strings as unsigned integers and comparing those integers. This works across the entire floating-point range from +0.0 to +Infinity (and then it's a simple matter to extend the comparison to take sign into account).

I can imagine two reasons: first, using a sign bit of 1 for negative values allows the definition of IEEE 754 floating-point numbers in the form (-1)<sup>s</sup> × 1.f × 2<sup>e-b</sup>; and second, the floating-point number corresponding to a bit string of all 0s is equal to +0 instead of -0.
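As a rough illustration of that formula, the sketch below rebuilds a normal single-precision value from its three fields. It assumes the binary32 layout (1 sign bit, 8 exponent bits with bias b = 127, 23 fraction bits); `decode_normal` is a hypothetical name, and zeros, subnormals, infinities, and NaNs are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: evaluate (-1)^s * 1.f * 2^(e-b) for a normal
   binary32 encoding, with bias b = 127. */
static double decode_normal(uint32_t bits)
{
	uint32_t s = bits >> 31;            /* sign bit */
	uint32_t e = (bits >> 23) & 0xFFu;  /* biased exponent */
	uint32_t f = bits & 0x7FFFFFu;      /* fraction field */
	assert(e != 0 && e != 0xFFu);       /* normal numbers only */

	double value = 1.0 + (double)f / 8388608.0; /* 1.f, with 2^23 = 8388608 */
	int k = (int)e - 127;                       /* unbiased exponent e - b */
	for (; k > 0; k--) value *= 2.0;
	for (; k < 0; k++) value /= 2.0;
	return s ? -value : value;
}
```

For example, the encoding 0x3FC00000 has s = 0, e = 127, and f = 0x400000, which gives 1.5 × 2<sup>0</sup> = 1.5.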

I don't see either of these as being meaningful especially considering the common rationale for using a biased exponent.

Answer 1

Score: 1

Back in the day, signed integers were encoded using two's complement (ubiquitous today), ones' complement, and signed magnitude, with some variations on -0 and trap values.

All 3 could be realized well enough in hardware with similar performance and hardware complexity. A sizeable amount of hardware and software designs exist for all 3.

IEEE Floating point can do compares quite easily when viewed as signed magnitude.
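A minimal sketch of what comparing the encodings as signed magnitude could look like in C, assuming 32-bit binary32 bit patterns and ignoring NaNs; `sm_less` is a hypothetical name, not something taken from the standard or from the answer.

```c
#include <stdint.h>

/* Hypothetical sign-magnitude "less than" on binary32 bit patterns.
   NaN encodings are not handled. */
static int sm_less(uint32_t a, uint32_t b)
{
	uint32_t sa = a >> 31, sb = b >> 31;   /* sign bits */
	uint32_t ma = a & 0x7FFFFFFFu;         /* magnitudes */
	uint32_t mb = b & 0x7FFFFFFFu;

	if ((ma | mb) == 0)
		return 0;                      /* +0 and -0 compare equal */
	if (sa != sb)
		return sa > sb;                /* the negative operand is smaller */
	return sa ? ma > mb : ma < mb;         /* same sign: reverse for negatives */
}
```

The magnitude comparison is a plain unsigned compare thanks to the biased exponent; the sign bit only decides which way that result is read.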

OP's suggested "If this were reversed" creates a 4th integer encoding.


> Why do IEEE 754 floating-point numbers use a sign bit of 1 for negative numbers?

To mimic the symmetry of signed magnitude integers and take advantage of prior art, rather than introduce yet another encoding.

Answer 2

Score: 0

I found the reference "Radix Tricks" on the Wikipedia article for the IEEE 754 standard, where in the section titled "Floating point support" the author describes the steps necessary to compare two floating-point numbers as unsigned integers (specifically, 32-bit IEEE 754 single-precision floating-point numbers).

In it, the author points out that simply flipping the sign bit is insufficient: the encoded significand of a negative number with larger magnitude, interpreted as an unsigned integer, is greater than that of a negative number with smaller magnitude, even though the former value is the lesser of the two. Similarly, a negative number with a larger biased exponent is less than one with a smaller biased exponent, so negative numbers with the unbiased exponent e<sub>max</sub> are less than those with the unbiased exponent e<sub>min</sub>.
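A small sketch of why the sign flip alone falls short, assuming binary32 bit patterns; `flip_sign_only` is a hypothetical name for the key the question proposes.

```c
#include <stdint.h>

/* Hypothetical key: flip only the sign bit, as the question proposes. */
static uint32_t flip_sign_only(uint32_t bits)
{
	return bits ^ 0x80000000u;
}
```

With this key, -1.0f (0xBF800000) maps to 0x3F800000 and -2.0f (0xC0000000) maps to 0x40000000, so -2.0f would sort above -1.0f. Positive values do land above all negative ones (1.0f maps to 0xBF800000), so only the order within the negatives is wrong.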

In order to correct for this, the sign bit should be flipped for positive numbers, and all bits should be flipped for negative numbers. The author presents the following algorithm:

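#include <stdint.h>

/* Returns 1 if the float encoded by f1 orders before the one encoded by f2 (NaNs not handled). */
uint32_t cmp(uint32_t f1, uint32_t f2)
{
	/* Flip every bit of a negative encoding, only the sign bit otherwise. */
	uint32_t k1 = f1 ^ (-(f1 >> 31) | 0x80000000u);
	uint32_t k2 = f2 ^ (-(f2 >> 31) | 0x80000000u);
	return k1 < k2;
}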
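As a usage sketch, the same transform can be applied directly to `float` values. This assumes `float` is IEEE 754 binary32; `order_key` and `key_less` are hypothetical names introduced here, and the mask is written as an explicit conditional rather than the negation trick for readability.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map a float to an unsigned key whose integer
   order matches the numeric order of non-NaN floats. */
static uint32_t order_key(float x)
{
	uint32_t u;
	assert(sizeof u == sizeof x);
	memcpy(&u, &x, sizeof u);
	/* Negative: flip all bits. Non-negative: flip the sign bit only. */
	uint32_t mask = (u >> 31) ? 0xFFFFFFFFu : 0x80000000u;
	return u ^ mask;
}

static int key_less(float a, float b)
{
	return order_key(a) < order_key(b);
}
```

One difference from an ordinary floating-point comparison is worth noting: under this key -0.0f sorts strictly before +0.0f, whereas IEEE 754 comparison treats the two zeros as equal.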

The point of explaining this is that inverting the sign bit does not make it possible to compare finite floating-point numbers directly as unsigned integers. By contrast, sign-and-magnitude hardware (which must interpret the sign bit as a sign bit, not as part of an unsigned integer) requires no additional bitwise operations and should therefore result in the simplest, smallest, and most efficient design.

It is possible to create a floating-point format encoding that uses 2's complement, and it has been studied as detailed in this paper. However, this is far beyond the scope of the question and involves many additional complexities and problems to be solved. Perhaps there is a better way, but the IEEE 754 design has the advantage that it is obviously satisfactory for all use cases.

  • Posted on March 1, 2023, 16:14:01
  • Please keep this link when reposting: https://go.coder-hub.com/75601037.html