Go扫描器 – 空白符的正确性?

huangapple go评论84阅读模式
英文:

Go scanner - correctness for whitespace?

问题

Go语言中的text/scanner/scanner.go包中的Go扫描器使用了一个技巧来查找空白字符:

const GoWhitespace = 1<<'\\t' | 1<<'\\n' | 1<<'\\r' | 1<<' '

然后:

// 跳过空白字符
for s.Whitespace&(1<<uint(ch)) != 0 {
    ch = s.next()
}

由于字符值左移超过31位,是否可能存在不唯一的情况?我的意思是,当某个字符与32取模后与制表符相同时,它会被识别为空白字符吗?

英文:

The Go scanner package in text/scanner/scanner.go uses trick to find whitespace:

const GoWhitespace = 1<<'\t' | 1<<'\n' | 1<<'\r' | 1<<' '

And then:

// skip white space
for s.Whitespace&(1<<uint(ch)) != 0 {
	ch = s.next()
}

Since character values shift left by more than 31, can there be cases where this is not unique? I mean, when some char is the same as tab modulo 32, it will be recognized as whitespace?

答案1

得分: 2

规范中所述,<< 是一个位移操作符:

位移操作符将左操作数按照右操作数指定的位数进行位移。如果左操作数是有符号整数,则实现算术位移;如果是无符号整数,则实现逻辑位移。位移计数没有上限。位移的行为就好像左操作数按照位移计数 n 进行 n 次位移。因此,x << 1 等同于 x*2,x >> 1 等同于 x/2,但结果向负无穷截断。

对于较大的 ch 值,1<<uint(ch)导致溢出

对于无符号整数值,操作 +、-、* 和 << 是模 2^n 运算,其中 n 是无符号整数类型的位宽。粗略地说,这些无符号整数操作在溢出时会丢弃高位,程序可以依赖于“环绕”。

对于有符号整数,操作 +、-、* 和 << 可以合法地溢出,结果值存在并由有符号整数表示、操作和操作数确定。溢出不会引发异常。编译器不能在假设不会发生溢出的情况下优化代码。例如,它不能假设 x < x + 1 总是为真。

因此,使用位旋转操作符(你所描述的)来实现 << 将违反规范。对于大于 int 类型大小的 ch 值,1<<uint(ch) 的结果将为零,因此不会导致任何错误判断。

英文:

As mentioned in the spec, the &lt;&lt; is a shift operation:

> The shift operators shift the left operand by the shift count specified by the right operand. They implement arithmetic shifts if the left operand is a signed integer and logical shifts if it is an unsigned integer. There is no upper limit on the shift count. Shifts behave as if the left operand is shifted n times by 1 for a shift count of n. As a result, x << 1 is the same as x*2 and x >> 1 is the same as x/2 but truncated towards negative infinity.

For large values of ch, 1&lt;&lt;uint(ch) will cause an overflow:

> For unsigned integer values, the operations +, -, *, and << are computed modulo 2<sup>n</sup>, where n is the bit width of the unsigned integer's type. Loosely speaking, these unsigned integer operations discard high bits upon overflow, and programs may rely on ``wrap around''.
>
> For signed integers, the operations +, -, *, and << may legally overflow and the resulting value exists and is deterministically defined by the signed integer representation, the operation, and its operands. No exception is raised as a result of overflow. A compiler may not optimize code under the assumption that overflow does not occur. For instance, it may not assume that x < x + 1 is always true.

So implementing &lt;&lt; with a bitwise rotation operator (what you seem to be describing) would violate the spec. 1&lt;&lt;uint(ch) will evaluate to zero for values of ch larger than the size of the int type, so won't cause any false positives.

答案2

得分: 0

规范明确指出,对于无符号操作,高位被屏蔽,因此低位实际上是“环绕”的。

它能够工作的原因是:

  1. Scanner.Whitespace 实际上是 uint64,所以 GoWhitespace 的值完全适合。
  2. 在运行时,对无符号整数进行的操作 s.Whitespace&amp;(1&lt;&lt;uint(ch)) 可能会产生任意大的中间值,并且会环绕。因此,如果字符是“a”(96),我们有 1 &lt; &lt; 96,它会溢出,所以模 64 位整数的大小就是 0。
英文:

Fully answer:

Spec explicitly say that for operations on unsigned, we get high bits masked out so the low bits are really "wrap around".

The reason it works is:

  1. Scanner.Whitespace is actually uint64 so value of GoWhitespace fully fit
  2. Operation s.Whitespace&amp;(1&lt;&lt;uint(ch)) on unsigned integer at runtime can have intermediate values arbitrary large and will wrap around. So if say char is "a" (96) we have 1 &lt;&lt; 96 which overflows so modulo size of 64-bit int it's 0.

huangapple
  • 本文由 发表于 2014年3月16日 22:50:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/22438387.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定