2023-03-23 08:56:54

Unicode normalization of homoglyphs to ASCII using Rust

Question

Given a homoglyph, I want a Rust function to convert it to the nearest ASCII character.

All of these Unicode "a"s

A Α А Ꭺ ᗅ ᴀ ꓮ Ａ 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐

should be converted to:

a a a a a a a a a a a a a a a a a a a a a a a a a a a a a

I tried this but it didn't work:

let input = "A Α А Ꭺ ᗅ ᴀ ꓮ Ａ 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
let normalized = input.nfc().collect::<String>(); // normalize using NFC
let result = normalized.to_lowercase(); // convert to lower case
println!("{}", result);

It output:

a α а ꭺ ᗅ ᴀ ꓮ ａ 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐

Answer 1

Score: 4

The right tool will depend on the purpose of the transformation, but the Unicode standard does indicate these are "confusable" with "A".

You can try using the unicode-security crate and its skeleton() function which follows the Unicode security mechanisms for Confusable Detection. Using it yields this result:

fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ Ａ 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
    let normalized = unicode_security::skeleton(input).collect::<String>();
    let result = normalized.to_lowercase(); // convert to lower case
    println!("{}", result);
}

a a a a a ᴀ a a a a a a a a a a a a a a a a a a a a a

The only outlier there is "ᴀ": U+1D00 (LATIN LETTER SMALL CAPITAL A). I don't know why it is treated as distinct, but I verified that this is consistent with the mappings in Unicode's confusables.txt. It is, however, listed as confusable with "ꭺ": U+AB7A (CHEROKEE SMALL LETTER GO).
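If a stray case like "ᴀ" matters to you, one option is a small lookup table of your own applied on top. Here is a minimal std-only sketch of that idea. The names `sample_table` and `to_ascii_lookalike` are hypothetical, and the table is a tiny hand-picked subset for illustration, not confusables.txt and not what unicode-security does internally.

```rust
use std::collections::HashMap;

/// Builds a tiny, hand-picked homoglyph table.
/// Hypothetical sample data for illustration; this is NOT confusables.txt.
fn sample_table() -> HashMap<char, char> {
    HashMap::from([
        ('\u{0391}', 'A'),  // GREEK CAPITAL LETTER ALPHA
        ('\u{0410}', 'A'),  // CYRILLIC CAPITAL LETTER A
        ('\u{13AA}', 'A'),  // CHEROKEE LETTER GO
        ('\u{FF21}', 'A'),  // FULLWIDTH LATIN CAPITAL LETTER A
        ('\u{1D400}', 'A'), // MATHEMATICAL BOLD CAPITAL A
        ('\u{1D00}', 'A'),  // LATIN LETTER SMALL CAPITAL A (added by hand)
    ])
}

/// Replaces every character found in the table; all others pass through unchanged.
fn to_ascii_lookalike(input: &str, table: &HashMap<char, char>) -> String {
    input
        .chars()
        .map(|c| *table.get(&c).unwrap_or(&c))
        .collect()
}

fn main() {
    let table = sample_table();
    let input = "A \u{0391} \u{0410} \u{13AA} \u{1D00} \u{FF21} \u{1D400}";
    let result = to_ascii_lookalike(input, &table).to_lowercase();
    println!("{}", result); // prints "a a a a a a a"
}
```

Characters missing from the table are left as they are, so in practice you would probably run something like this after skeleton() rather than instead of it.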

I have found the decancer crate that "removes common confusables from strings" and seems to use an expanded definition of "confusable". Here's how that would look:

fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ Ａ 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
    let normalized = decancer::cure(input).into_str();
    println!("{}", normalized);
}

a a a a a a a a a a a a a a a a a a a a a a a a a a a

Note that it seems to convert to lowercase automatically. So if your idea of a "homoglyph" treats "a" and "A" as the same, this may work for you.

Answer 2

Score: 1

I assume you use unicode_normalization::UnicodeNormalization; for .nfc()? (Always nice to mention these things.)

According to the relevant standard annex, that will only do "Canonical Decomposition, followed by Canonical Composition". From what I understand of the jargon, that means it will only change how grapheme clusters are represented by characters, but not how they're supposed to be rendered. What you want is probably the "Compatibility Decomposition", which, as indicated here, includes substitutions like ℌ → H. The Compatibility Decomposition is available through .nfkc() or .nfkd() in the unicode_normalization crate.
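As a rough illustration of what a compatibility mapping does, here is a std-only sketch that folds just the fullwidth ASCII variants (U+FF01 to U+FF5E) to ASCII by subtracting a fixed offset. The function name `fold_fullwidth` is hypothetical, and this is not how unicode_normalization implements .nfkc(); it covers a single block only. Keep in mind that compatibility mappings do not turn letters from other scripts, such as Greek Α or Cyrillic А, into Latin A, so those pass through unchanged here, as they also would under NFKC.

```rust
/// Folds fullwidth ASCII variants (U+FF01..=U+FF5E) to their ASCII counterparts.
/// Hand-rolled illustration of one compatibility mapping; not a full NFKC.
fn fold_fullwidth(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // Each fullwidth form sits exactly 0xFEE0 above its ASCII counterpart.
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            _ => c,
        })
        .collect()
}

fn main() {
    // Fullwidth A, Greek capital Alpha, Cyrillic capital A.
    let input = "\u{FF21} \u{0391} \u{0410}";
    let folded = fold_fullwidth(input);
    // Only the fullwidth letter becomes ASCII; the Greek and Cyrillic letters stay.
    println!("{}", folded);
}
```

So for the list in the question, compatibility normalization would likely help with the fullwidth and mathematical Latin letters but not with the look-alikes from other scripts.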
