November 3, 2021, 20:53:06 · go

Golang strings.EqualFold gives unexpected results

Question

In golang (go1.17 windows/amd64) the program below gives the following result:

rune1 = U+0130 'İ'
rune2 = U+0131 'ı'
lower(rune1) = U+0069 'i'
upper(rune2) = U+0049 'I'
strings.EqualFold(İ, ı) = false
strings.EqualFold(i, I) = true

I thought that strings.EqualFold would check strings for equality under Unicode case folding; however, the example above seems to be a counterexample. Clearly both runes can be folded (by hand) into code points that are equal under case folding.

Question: is golang correct that strings.EqualFold(İ, ı) is false? I expected it to yield true. And if golang is correct, why would that be? Or does this behaviour follow some Unicode specification?

What am I missing here?

Program:

package main

import (
	"strings"
	"testing"
	"unicode"
)

func TestRune2(t *testing.T) {
	r1 := rune(0x0130)         // U+0130 'İ'
	r2 := rune(0x0131)         // U+0131 'ı'
	r1l := unicode.ToLower(r1) // lowercase of r1
	r2u := unicode.ToUpper(r2) // uppercase of r2

	t.Logf("\nrune1 = %#U\nrune2 = %#U\nlower(rune1) = %#U\nupper(rune2) = %#U\nstrings.EqualFold(%s, %s) = %v\nstrings.EqualFold(%s, %s) = %v",
		r1, r2, r1l, r2u, string(r1), string(r2), strings.EqualFold(string(r1), string(r2)), string(r1l), string(r2u), strings.EqualFold(string(r1l), string(r2u)))
}

Answer 1

Score: 4

Yes, this is "correct" behaviour. These letters do not behave normally under case folding. See:
http://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

U+0130 has a full case folding "F" and a special "T" mapping; U+0131 appears only as the target of the "T" mapping for U+0049:

T: special case for uppercase I and dotted uppercase I
   - For non-Turkic languages, this mapping is normally not used.
   - For Turkic languages (tr, az), this mapping can be used instead
     of the normal mapping for these characters.
     Note that the Turkic mappings do not maintain canonical equivalence
     without additional processing.
     See the discussions of case mapping in the Unicode Standard for more information.

I think there is no way to force package strings to use the tr or az mapping.
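If a Turkic-aware comparison is needed, one possible workaround is to lowercase both strings with unicode.TurkishCase before comparing. This is a rough sketch, not a full implementation of Unicode case folding; equalFoldTurkic is a hypothetical helper name, not something package strings provides.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// equalFoldTurkic is a hypothetical helper (not part of package strings).
// It approximates a Turkic caseless comparison by lowercasing both strings
// with unicode.TurkishCase, which maps I -> ı and İ -> i.
func equalFoldTurkic(a, b string) bool {
	return strings.ToLowerSpecial(unicode.TurkishCase, a) ==
		strings.ToLowerSpecial(unicode.TurkishCase, b)
}

func main() {
	fmt.Println(equalFoldTurkic("İ", "i")) // true: dotted pair
	fmt.Println(equalFoldTurkic("I", "ı")) // true: dotless pair
	fmt.Println(equalFoldTurkic("İ", "ı")) // false: dotted vs dotless
}
```

Note that even under the Turkic mapping, İ and ı stay distinct: İ pairs with i, and I pairs with ı.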

Answer 2

Score: 2

According to the strings.EqualFold source, unicode.ToLower and unicode.ToUpper are not used.

Instead, it uses unicode.SimpleFold to see if a particular rune is "foldable" and therefore potentially comparable:

// General case. SimpleFold(x) returns the next equivalent rune > x
// or wraps around to smaller values.
r := unicode.SimpleFold(sr)
for r != sr && r < tr {
	r = unicode.SimpleFold(r)
}

The rune İ is not foldable, but its lowercase code point is:

r := rune(0x0130)        // U+0130 'İ'
lr := unicode.ToLower(r) // U+0069 'i'

fmt.Printf("foldable? %v\n", r != unicode.SimpleFold(r))   // foldable? false
fmt.Printf("foldable? %v\n", lr != unicode.SimpleFold(lr)) // foldable? true

If a rune is not foldable (i.e. SimpleFold returns the rune itself), then that rune can only match itself and no other code point.
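To make the orbit idea concrete, here is a small sketch that walks unicode.SimpleFold until it wraps back to the starting rune. foldOrbit and sameFoldClass are hypothetical helper names used only for illustration; they are not how strings.EqualFold is implemented, although they rely on the same SimpleFold behaviour.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit is a hypothetical helper that collects every rune reachable
// from r by repeatedly calling unicode.SimpleFold until it wraps back to r.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for f := unicode.SimpleFold(r); f != r; f = unicode.SimpleFold(f) {
		orbit = append(orbit, f)
	}
	return orbit
}

// sameFoldClass reports whether b lies in the SimpleFold orbit of a.
func sameFoldClass(a, b rune) bool {
	for _, f := range foldOrbit(a) {
		if f == b {
			return true
		}
	}
	return false
}

func main() {
	// Print the orbit of a few runes, including the two from the question.
	for _, r := range []rune{'i', 'k', 0x0130, 0x0131} {
		fmt.Printf("%#U -> %q\n", r, foldOrbit(r))
	}
	fmt.Println(sameFoldClass('i', 'I'))       // true
	fmt.Println(sameFoldClass(0x0130, 0x0131)) // false
}
```

The orbits of U+0130 and U+0131 each contain only the rune itself, which is why neither can match i, I, or the other.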

https://play.golang.org/p/105x0I714nS
