Golang strings.EqualFold gives unexpected results

Question

In Go (go1.17 windows/amd64), the program below gives the following result:

rune1 = U+0130 'İ'
rune2 = U+0131 'ı'
lower(rune1) = U+0069 'i'
upper(rune2) = U+0049 'I'
strings.EqualFold(İ, ı) = false
strings.EqualFold(i, I) = true

I thought that strings.EqualFold would check strings for equality under Unicode case folding; however, the above example seems to be a counter-example. Clearly both runes can be folded (by hand) into code points that are equal under case folding.

Question: is Go correct that strings.EqualFold(İ, ı) is false? I expected it to yield true. If Go is correct, why? Is this behaviour prescribed by some Unicode specification?

What am I missing here?


Program:

package main

import (
	"strings"
	"testing"
	"unicode"
)

func TestRune2(t *testing.T) {
	r1 := rune(0x0130)         // U+0130 'İ'
	r2 := rune(0x0131)         // U+0131 'ı'
	r1l := unicode.ToLower(r1) // lowercase of rune1
	r2u := unicode.ToUpper(r2) // uppercase of rune2

	t.Logf("\nrune1 = %#U\nrune2 = %#U\nlower(rune1) = %#U\nupper(rune2) = %#U\nstrings.EqualFold(%s, %s) = %v\nstrings.EqualFold(%s, %s) = %v",
		r1, r2, r1l, r2u, string(r1), string(r2), strings.EqualFold(string(r1), string(r2)), string(r1l), string(r2u), strings.EqualFold(string(r1l), string(r2u)))
}

Answer 1

Score: 4

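Yes, this is "correct" behaviour. These letters do not behave normally under case folding. See:
http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

U+0130 has only a full case folding "F" (to U+0069 U+0307) and a special "T" folding (to U+0069); it has no simple folding. U+0131 has no entry of its own and appears only as the "T" target of U+0049. The file describes "T" as follows: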

T: special case for uppercase I and dotted uppercase I
   - For non-Turkic languages, this mapping is normally not used.
   - For Turkic languages (tr, az), this mapping can be used instead
     of the normal mapping for these characters.
     Note that the Turkic mappings do not maintain canonical equivalence
     without additional processing.
     See the discussions of case mapping in the Unicode Standard for more information.

I think there is no way to force package strings to use the tr or az mapping in EqualFold.
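
strings.EqualFold itself takes no locale argument, but the standard library does ship unicode.TurkishCase, which strings.ToLowerSpecial accepts. Below is a rough sketch of a Turkic-aware comparison, assuming that lowercasing both operands with the Turkish rules is an acceptable stand-in for the "T" folding. It is not a full Unicode case fold, and the helper name turkicEqual is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// turkicEqual is a hypothetical helper (not part of package strings).
// It lowercases both strings with the Turkish special-case rules
// (I -> ı, İ -> i) and compares the results byte for byte.
func turkicEqual(a, b string) bool {
	return strings.ToLowerSpecial(unicode.TurkishCase, a) ==
		strings.ToLowerSpecial(unicode.TurkishCase, b)
}

func main() {
	fmt.Println(turkicEqual("İ", "i")) // true: dotted pair
	fmt.Println(turkicEqual("I", "ı")) // true: dotless pair
	fmt.Println(turkicEqual("İ", "ı")) // false: different pairs
	fmt.Println(turkicEqual("I", "i")) // false under Turkish rules
}
```

Note that even with the Turkish rules, İ and ı do not compare equal: İ pairs with i, and I pairs with ı.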

Answer 2

Score: 2

The strings.EqualFold source shows that unicode.ToLower and unicode.ToUpper are not used.

Instead, it uses unicode.SimpleFold to see if a particular rune is "foldable" and therefore potentially comparable:

// General case. SimpleFold(x) returns the next equivalent rune > x
// or wraps around to smaller values.
r := unicode.SimpleFold(sr)
for r != sr && r < tr {
	r = unicode.SimpleFold(r)
}
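
To see that loop in isolation, here is a minimal sketch that applies it to a single pair of runes. The function name runesEqualFold is hypothetical; the real EqualFold also decodes the strings and has an ASCII fast path, which this sketch leaves out.

```go
package main

import (
	"fmt"
	"unicode"
)

// runesEqualFold is a hypothetical helper that mirrors the quoted loop:
// it reports whether a and b lie on the same SimpleFold orbit.
func runesEqualFold(a, b rune) bool {
	if a == b {
		return true
	}
	// Order the pair so that sr < tr, as the quoted loop expects.
	sr, tr := a, b
	if tr < sr {
		sr, tr = tr, sr
	}
	// Walk the orbit of sr upward until it wraps or reaches tr.
	r := unicode.SimpleFold(sr)
	for r != sr && r < tr {
		r = unicode.SimpleFold(r)
	}
	return r == tr
}

func main() {
	fmt.Println(runesEqualFold('\u0130', '\u0131')) // İ vs ı: false
	fmt.Println(runesEqualFold('i', 'I'))           // true
	fmt.Println(runesEqualFold('K', '\u212A'))      // K vs Kelvin sign: true
}
```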

The rune İ is not foldable, although its lowercase code point is:

r := rune(0x0130)        // U+0130 'İ'
lr := unicode.ToLower(r) // U+0069 'i'

fmt.Printf("foldable? %v\n", r != unicode.SimpleFold(r))   // foldable? false
fmt.Printf("foldable? %v\n", lr != unicode.SimpleFold(lr)) // foldable? true

If a rune is not foldable (i.e. SimpleFold returns the rune itself), then that rune can only match itself and no other code point.

https://play.golang.org/p/105x0I714nS
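
Another way to look at "foldable" is to list every rune that SimpleFold links together. The sketch below collects that orbit for a few runes; foldOrbit is a made-up helper name, not a standard library function.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit is a hypothetical helper that returns r followed by every
// other rune reachable by repeatedly applying unicode.SimpleFold.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

func main() {
	fmt.Printf("%q\n", foldOrbit('\u0130')) // İ alone: an orbit of size 1
	fmt.Printf("%q\n", foldOrbit('i'))      // i and I
	fmt.Printf("%q\n", foldOrbit('K'))      // K, k and the Kelvin sign
	fmt.Printf("%q\n", foldOrbit('\u0131')) // orbit of ı
}
```

Two runes compare equal under EqualFold only when they share an orbit, so a rune whose orbit has size 1 matches nothing but itself.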

Posted by huangapple on November 3, 2021, 20:53:06
Source: https://go.coder-hub.com/69825197.html