Golang strings.EqualFold gives unexpected results
Question
In golang (go1.17 windows/amd64) the program below gives the following result:
rune1 = U+0130 'İ'
rune2 = U+0131 'ı'
lower(rune1) = U+0069 'i'
upper(rune2) = U+0049 'I'
strings.EqualFold(İ, ı) = false
strings.EqualFold(i, I) = true
I thought that strings.EqualFold would check strings for equality under Unicode case folding; however, the above example seems to give a counter-example. Clearly both runes can be folded (by hand) into code points that are equal under case folding.
Question: is golang correct that strings.EqualFold(İ, ı) is false? I expected it to yield true. And if golang is correct, why would that be? Or is this behaviour according to some Unicode specification?
What am I missing here?
Program:
package main

import (
	"strings"
	"testing"
	"unicode"
)

func TestRune2(t *testing.T) {
	r1 := rune(0x0130)         // U+0130 'İ'
	r2 := rune(0x0131)         // U+0131 'ı'
	r1l := unicode.ToLower(r1) // lowercase of İ: U+0069 'i'
	r2u := unicode.ToUpper(r2) // uppercase of ı: U+0049 'I'
	t.Logf("\nrune1 = %#U\nrune2 = %#U\nlower(rune1) = %#U\nupper(rune2) = %#U\nstrings.EqualFold(%s, %s) = %v\nstrings.EqualFold(%s, %s) = %v",
		r1, r2, r1l, r2u, string(r1), string(r2), strings.EqualFold(string(r1), string(r2)), string(r1l), string(r2u), strings.EqualFold(string(r1l), string(r2u)))
}
Answer 1
Score: 4
Yes, this is "correct" behaviour. These letters do not behave normally under case folding. See:
http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
In that file, U+0130 has a full case folding (status "F") and a special Turkic one (status "T"); U+0131 appears only as the "T" folding of U+0049. The file describes the "T" status as follows:
    T: special case for uppercase I and dotted uppercase I
       - For non-Turkic languages, this mapping is normally not used.
       - For Turkic languages (tr, az), this mapping can be used instead
         of the normal mapping for these characters.
         Note that the Turkic mappings do not maintain canonical equivalence
         without additional processing.
         See the discussions of case mapping in the Unicode Standard for more information.
I don't think there is a way to force the strings package to use the tr or az mapping.
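For Turkic text, one possible workaround is to lowercase both strings with unicode.TurkishCase before comparing them. The sketch below is only an approximation: turkicEqualFold is a hypothetical helper name, not a standard library function, and lowercasing is not the same as full Unicode case folding. Note that even under the Turkic mappings İ pairs with i, and I pairs with ı, so İ and ı still do not compare equal.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// turkicEqualFold is a hypothetical helper (not part of the standard library).
// It approximates a Turkic caseless comparison by lowercasing both strings
// with unicode.TurkishCase. This is lowercasing, not true case folding.
func turkicEqualFold(a, b string) bool {
	return strings.ToLowerSpecial(unicode.TurkishCase, a) ==
		strings.ToLowerSpecial(unicode.TurkishCase, b)
}

func main() {
	fmt.Println(turkicEqualFold("\u0130", "i"))      // İ vs i: true
	fmt.Println(turkicEqualFold("I", "\u0131"))      // I vs ı: true
	fmt.Println(turkicEqualFold("\u0130", "\u0131")) // İ vs ı: false
	fmt.Println(turkicEqualFold("I", "i"))           // I vs i: false
}
```

Because this applies the Turkic mapping unconditionally, it is only appropriate when the input is known to be Turkish or Azerbaijani text.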
Answer 2
Score: 2
https://play.golang.org/p/105x0I714nS
From the strings.EqualFold source: unicode.ToLower and unicode.ToUpper are not used.
Instead, it uses unicode.SimpleFold to see if a particular rune is "foldable" and therefore potentially comparable:
	// General case. SimpleFold(x) returns the next equivalent rune > x
	// or wraps around to smaller values.
	r := unicode.SimpleFold(sr)
	for r != sr && r < tr {
		r = unicode.SimpleFold(r)
	}
The rune İ is not foldable, although its lowercase code point is:
	r := rune(0x0130)        // U+0130 'İ'
	lr := unicode.ToLower(r) // U+0069 'i'
	fmt.Printf("foldable? %v\n", r != unicode.SimpleFold(r))   // foldable? false
	fmt.Printf("foldable? %v\n", lr != unicode.SimpleFold(lr)) // foldable? true
If a rune is not foldable (i.e. SimpleFold returns the rune itself), then that rune can only match itself and no other code point.
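To make the idea of "foldable" concrete, the following sketch walks the cycle that unicode.SimpleFold defines for a rune and collects every member of it. The helper names foldOrbit and simpleFoldEqual are made up for illustration; they are a simplified model of the rune comparison, not the actual strings.EqualFold implementation.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit returns r followed by every other rune that SimpleFold
// considers equivalent to it. (Hypothetical helper for illustration.)
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

// simpleFoldEqual reports whether b is in the SimpleFold orbit of a.
func simpleFoldEqual(a, b rune) bool {
	for _, r := range foldOrbit(a) {
		if r == b {
			return true
		}
	}
	return false
}

func main() {
	// Print the orbit of a few runes, including the two from the question.
	for _, r := range []rune{'i', 'k', 0x0130, 0x0131} {
		fmt.Printf("%#U -> %U\n", r, foldOrbit(r))
	}
	fmt.Println(simpleFoldEqual('i', 'I'))       // true
	fmt.Println(simpleFoldEqual(0x0130, 0x0131)) // false
}
```

The orbits of U+0130 and U+0131 each contain only the rune itself, which is why neither can match any other code point under this kind of comparison.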