What does the '%b' verb output when the operand is a floating-point number?
Question
According to the Go documentation, `%b` used with a floating-point number means:

> decimalless scientific notation with exponent a power of two, in the manner of strconv.FormatFloat with the 'b' format, e.g. -123456p-78

As the code below shows, the program output is

> 8444249301319680p-51

I'm a little confused about `%b` with floating-point numbers. Can anybody tell me how this result is calculated? Also, what does `p-` mean?

    f := 3.75
    fmt.Printf("%b\n", f)                            // 8444249301319680p-51
    fmt.Println(strconv.FormatFloat(f, 'b', -1, 64)) // same string
Answer 1
Score: 4
"Decimalless scientific notation with exponent a power of two" means the following:

    8444249301319680 * 2^-51 = 3.75, or equivalently 8444249301319680 / 2^51 = 3.75

`p-51` means 2^-51, which can also be computed as 1/(2^51). In other words, the number before the `p` is an integer, and the number after the `p` is the power of two it is multiplied by.

A nice article on floating-point arithmetic:
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Answer 2
Score: 1
The five rules of scientific notation are given below:
- The base is always 10
- The exponent must be an integer, which can be either positive or negative
- The absolute value of the coefficient is greater than or equal to 1 but it should be less than 10
- The coefficient carries the sign (+) or (-)
- The mantissa carries the rest of the significant digits
The `p` in `%b` output plays the role that `e` plays in `%e` output:
- `%b`: scientific notation with exponent a power of two (its `p`)
- `%e`: scientific notation
Answer 3
Score: 1
It is worth pointing out that `%b` output is also particularly easy for the runtime system to generate, because of the internal storage format of floating-point numbers.
If we ignore "denormalized" floating-point numbers (we can add them back later), a floating-point number is stored internally as 1.bbbbbb...bbb x 2<sup>exp</sup> for some set of bits ("b" here). For example, the value four is stored as `1.000...000 <exp> 2`. The value six is stored as `1.100...000 <exp> 2`, seven as `1.110...000 <exp> 2`, and eight as `1.000...000 <exp> 3`. Seven and a half is `1.111 <exp> 2`, seven and three quarters is `1.1111 <exp> 2`, and so on. Each bit after the point in 1.bbbb represents the next power of two below the exponent.
To print `1.111 <exp> 2` (seven and a half) with the `%b` format, we note that we need four `1` bits in a row, i.e., the value 15 decimal, `0xf`, or `1111` binary. Reading the bits as that integer moves the binary point three places, so the exponent must decrease by 3: instead of multiplying by 2<sup>2</sup> (4), we multiply by 2<sup>-1</sup> (½). So we take the actual exponent (2), subtract 3, and print the string `15p-1`.
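As a rough illustration of this "shortest" form, the sketch below strips trailing zero bits from the significand and bumps the exponent once per bit removed. `shortB` is a hypothetical helper written for this answer, not anything Go provides, and it assumes a positive, normalized, finite input.

```go
package main

import (
	"fmt"
	"math"
)

// shortB prints f as <integer>p<exponent> using the smallest possible
// integer. Hypothetical helper; assumes f is positive, normal and finite.
func shortB(f float64) string {
	bits := math.Float64bits(f)
	mant := bits&(1<<52-1) | 1<<52         // 52 stored bits plus the implied leading 1
	exp := int(bits>>52&0x7ff) - 1023 - 52 // remove the bias, then treat mant as an integer
	for mant&1 == 0 {                      // each dropped zero bit doubles the scale
		mant >>= 1
		exp++
	}
	return fmt.Sprintf("%dp%+d", mant, exp)
}

func main() {
	fmt.Println(shortB(7.5))  // 15p-1
	fmt.Println(shortB(3.75)) // 15p-2
}
```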
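That's not what Go's `%b` prints, though: for 7.5 it prints `8444249301319680p-50`. (The 3.75 in the question has an exponent of 1 rather than 2, which is why it shows `p-51`.) This is the same value as `15p-1`, so either one would be correct output. But why the long form?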
Well, `8444249301319680` is, in hexadecimal, `1E000000000000`. Expanded into full binary, this is `1 1110 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000`. That's 53 binary digits. Why 53 binary digits, when four would suffice?
The answer to that is found in the link in Nick's answer: the IEEE 754 floating-point format uses a 53-bit "mantissa" or "significand" (the latter is the better term and the one I usually try to use, but you'll see the former pop up very often). That is, the `1.bbb...bbb` has 52 `b`s, plus that forced-in leading `1`. So there are always exactly 53 binary digits (for IEEE "double precision").
If we treat this 53-binary-digit number as an integer and print it in decimal, we never need a decimal point. We only have to adjust the power-of-two exponent to compensate.
In IEEE 754 format, the exponent itself is already stored in "excess form", with 1023 added (for double precision again). That means that `1.111000...000 <exp> 2` is actually stored with an exponent value of 2+1023 = 1025. So to get the actual power of two, the machine code formatting the number already has to subtract 1023. We can just have it subtract 52 more at the same time.
Last, because the implied `1` is always there, the internal IEEE 754 number doesn't actually store that `1` bit. So to read out the value and convert it, the code internally does something like:

    decimalPart := machineDependentReinterpretation1(&doubleprec_value)
    expPart := machineDependentReinterpretation2(&doubleprec_value)

where the machine-dependent reinterpretation simply extracts the correct bits, puts in the implied `1` bit as needed in the decimal part, and subtracts the offset (1023+52) for the exponent part. Then it does:

    fmt.Sprintf("%dp%d", decimalPart, expPart)
When printing a floating-point number in decimal, the base conversion (from base 2 to base 10) is problematic, requiring a lot of code to get the rounding right. Printing it in binary like this is much easier.
Exercises for the reader, to help with understanding this:
- Compute 1.10<sub>2</sub> x 2<sup>2</sup>. Note: 1.1<sub>2</sub> is 1½ decimal.
- Compute 11.0<sub>2</sub> x 2<sup>1</sup>. (11.0<sub>2</sub> is 3.)
- Based on the above, what happens as you "slide the binary point" left and right?
- (more difficult) Why can we assume a leading 1? If necessary, read on.
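If you want to check your answers to the first three exercises, `math.Ldexp(frac, exp)` computes `frac * 2^exp`, so sliding the binary point can be tried directly. The wrapper name `scale` is only for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// scale returns sig * 2^exp. (Thin illustrative wrapper around math.Ldexp.)
func scale(sig float64, exp int) float64 {
	return math.Ldexp(sig, exp)
}

func main() {
	fmt.Println(scale(1.5, 2))  // 1.1 binary is 1.5 decimal: prints 6
	fmt.Println(scale(3, 1))    // 11.0 binary is 3 decimal: prints 6
	fmt.Println(scale(0.75, 3)) // 0.11 binary is 0.75 decimal: prints 6
}
```

Each step of the binary point in one direction is cancelled by a step of the exponent in the other, so the value stays the same.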
Why can we assume a leading 1?
Let's first note that when we use scientific notation in decimal, we can't assume a leading `1`. A number might be 1.7 x 10<sup>3</sup>, or 5.1 x 10<sup>5</sup>, or whatever. But when we use scientific notation "correctly", the first digit is never zero. That is, we do not write 0.3 x 10<sup>0</sup> but rather 3.0 x 10<sup>-1</sup>. In this kind of notation, the number of digits tells us about the precision, and the first digit never has to be zero and generally isn't supposed to be zero. If the first digit were zero, we would just move the decimal point and adjust the exponent (see exercises 1 and 2 above).
The same rules apply to floating-point numbers. Instead of storing `0.01`, for instance, we slide the binary point two positions to the right to get `1.00`, and decrease the exponent by 2. If we wanted to store `11.1`, we slide the binary point one position the other way and increase the exponent by 1. Whenever we do this, the first digit always winds up being a one, because in binary the only nonzero digit is 1.
There is one big exception here, which is: when we do this, we can't store zero! So we don't do this for the number `0.0`. In IEEE 754, we store `0.0` as all-zero bits (except for the sign, which we can set to store `-0.0`). This has an all-zero exponent, which the computer hardware handles as a special case.
Denormalized numbers: when we can't assume a leading 1
This system has one notable flaw (which isn't entirely fixed by denorms, but nonetheless, IEEE has denorms). That is: the smallest number we can store "abruptly underflows" to zero. Kahan has a 15 page "brief tutorial" on gradual underflow, which I am not going to attempt to summarize, but when we hit the minimum allowed exponent (2<sup>-1022</sup> for double precision) and want to "get smaller", IEEE lets us stop using these "normalized" numbers with the leading `1` bit.
This doesn't affect the way that Go itself formats floating-point numbers, because Go just takes the entire significand "as is". All we have to do is stop inserting the 2<sup>52</sup> "implied 1" when the input value is a denormalized number (and treat its all-zero stored exponent as if it were 1, since denormals share the scale of the smallest normalized numbers), and everything else Just Works. We can hide this magic inside the machine-dependent `float64` reinterpretation code, or do it explicitly in Go, whichever is more convenient.