英文:
Go scanner - correctness for whitespace?
问题
Go语言中的text/scanner/scanner.go
包中的Go扫描器使用了一个技巧来查找空白字符:
const GoWhitespace = 1<<'\\t' | 1<<'\\n' | 1<<'\\r' | 1<<' '
然后:
// 跳过空白字符
for s.Whitespace&(1<<uint(ch)) != 0 {
ch = s.next()
}
由于字符值左移超过31位,是否可能存在不唯一的情况?我的意思是,当某个字符与32取模后与制表符相同时,它会被识别为空白字符吗?
英文:
The Go scanner package in text/scanner/scanner.go
uses trick to find whitespace:
const GoWhitespace = 1<<'\t' | 1<<'\n' | 1<<'\r' | 1<<' '
And then:
// skip white space
for s.Whitespace&(1<<uint(ch)) != 0 {
ch = s.next()
}
Since character values shift left by more than 31, can there be cases where this is not unique? I mean, when some char is the same as tab modulo 32, it will be recognized as whitespace?
答案1
得分: 2
如规范中所述,<<
是一个位移操作符:
位移操作符将左操作数按照右操作数指定的位数进行位移。如果左操作数是有符号整数,则实现算术位移;如果是无符号整数,则实现逻辑位移。位移计数没有上限。位移的行为就好像左操作数按照位移计数 n 进行 n 次位移。因此,x << 1 等同于 x*2,x >> 1 等同于 x/2,但结果向负无穷截断。
对于较大的 ch
值,1<<uint(ch)
会导致溢出:
对于无符号整数值,操作 +、-、* 和 << 是模 2^n 运算,其中 n 是无符号整数类型的位宽。粗略地说,这些无符号整数操作在溢出时会丢弃高位,程序可以依赖于“环绕”。
对于有符号整数,操作 +、-、* 和 << 可以合法地溢出,结果值存在并由有符号整数表示、操作和操作数确定。溢出不会引发异常。编译器不能在假设不会发生溢出的情况下优化代码。例如,它不能假设 x < x + 1 总是为真。
因此,使用位旋转操作符(你所描述的)来实现 <<
将违反规范。对于大于 int
类型大小的 ch
值,1<<uint(ch)
的结果将为零,因此不会导致任何错误判断。
英文:
As mentioned in the spec, the <<
is a shift operation:
> The shift operators shift the left operand by the shift count specified by the right operand. They implement arithmetic shifts if the left operand is a signed integer and logical shifts if it is an unsigned integer. There is no upper limit on the shift count. Shifts behave as if the left operand is shifted n times by 1 for a shift count of n. As a result, x << 1 is the same as x*2 and x >> 1 is the same as x/2 but truncated towards negative infinity.
For large values of ch
, 1<<uint(ch)
will cause an overflow:
> For unsigned integer values, the operations +, -, *, and << are computed modulo 2<sup>n</sup>, where n is the bit width of the unsigned integer's type. Loosely speaking, these unsigned integer operations discard high bits upon overflow, and programs may rely on ``wrap around''.
>
> For signed integers, the operations +, -, *, and << may legally overflow and the resulting value exists and is deterministically defined by the signed integer representation, the operation, and its operands. No exception is raised as a result of overflow. A compiler may not optimize code under the assumption that overflow does not occur. For instance, it may not assume that x < x + 1 is always true.
So implementing <<
with a bitwise rotation operator (what you seem to be describing) would violate the spec. 1<<uint(ch)
will evaluate to zero for values of ch
larger than the size of the int
type, so won't cause any false positives.
答案2
得分: 0
规范明确指出,对于无符号操作,高位被屏蔽,因此低位实际上是“环绕”的。
它能够工作的原因是:
Scanner.Whitespace
实际上是uint64
,所以GoWhitespace
的值完全适合。- 在运行时,对无符号整数进行的操作
s.Whitespace&(1<<uint(ch))
可能会产生任意大的中间值,并且会环绕。因此,如果字符是“a”(96),我们有1 < < 96
,它会溢出,所以模 64 位整数的大小就是 0。
英文:
Fully answer:
Spec explicitly say that for operations on unsigned, we get high bits masked out so the low bits are really "wrap around".
The reason it works is:
Scanner.Whitespace
is actuallyuint64
so value ofGoWhitespace
fully fit- Operation
s.Whitespace&(1<<uint(ch))
on unsigned integer at runtime can have intermediate values arbitrary large and will wrap around. So if say char is "a" (96) we have1 << 96
which overflows so modulo size of 64-bit int it's 0.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论