Unicode normalization of homoglyphs to ASCII using Rust

Question

Given a homoglyph, I want a Rust function to convert it to the nearest ASCII character.

All of these Unicode "a"s

A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐

should be converted to:

a a a a a a a a a a a a a a a a a a a a a a a a a a a a a

I tried this but it didn't work:

let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
let normalized = input.nfc().collect::<String>(); // normalize using NFC
let result = normalized.to_lowercase(); // convert to lower case
println!("{}", result);

It output:

a α а ꭺ ᗅ ᴀ ꓮ a 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐

Answer 1

Score: 4

The right tool will depend on the purpose of the transformation, but the Unicode standard does indicate these are "confusable" with "A".

You can try using the unicode-security crate and its skeleton() function which follows the Unicode security mechanisms for Confusable Detection. Using it yields this result:

fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
    let normalized = unicode_security::skeleton(input).collect::<String>();
    let result = normalized.to_lowercase(); // convert to lower case
    println!("{}", result);
}
a a a a a ᴀ a a a a a a a a a a a a a a a a a a a a a

The only outlier there is "ᴀ": U+1D00 (LATIN LETTER SMALL CAPITAL A). I don't know why it is distinct but I verified it is consistent with Unicode's confusables.txt mappings. Though it is confusable with "ꭺ": U+AB7A (CHEROKEE SMALL LETTER GO).
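To make the idea behind skeleton() more concrete, here is a minimal standard-library sketch of a confusable lookup. The table is a tiny hand-picked subset covering a few of the characters from the question; it is an assumption for illustration, not the contents of confusables.txt, and the names prototype and fold_lookalikes are hypothetical. Real code should rely on the crate's data instead.

```rust
// Hypothetical helper: maps a handful of "A" look-alikes to ASCII 'A'.
// The table is a hand-picked illustrative subset, NOT Unicode's confusables.txt.
fn prototype(c: char) -> char {
    match c {
        '\u{0391}'    // GREEK CAPITAL LETTER ALPHA
        | '\u{0410}'  // CYRILLIC CAPITAL LETTER A
        | '\u{13AA}'  // CHEROKEE LETTER GO
        | '\u{15C5}'  // CANADIAN SYLLABICS CARRIER GHO
        | '\u{A4EE}'  // LISU LETTER A
        | '\u{FF21}'  // FULLWIDTH LATIN CAPITAL LETTER A
        | '\u{1D400}' // MATHEMATICAL BOLD CAPITAL A
        => 'A',
        // Anything not in the table passes through unchanged.
        other => other,
    }
}

// Replace known look-alikes, then lowercase the ASCII letters only.
fn fold_lookalikes(input: &str) -> String {
    input
        .chars()
        .map(prototype)
        .collect::<String>()
        .to_ascii_lowercase()
}

fn main() {
    // U+1D00 is deliberately absent from the table, mirroring the outlier above.
    let input = "\u{0391} \u{0410} \u{13AA} \u{1D00}";
    println!("{}", fold_lookalikes(input)); // prints "a a a ᴀ"
}
```

Any character missing from the table is left alone, which is the same kind of behavior that leaves "ᴀ" untouched in the skeleton() output.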


I have found the decancer crate that "removes common confusables from strings" and seems to use an expanded definition of "confusable". Here's how that would look:

fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
    let normalized = decancer::cure(input).into_str();
    println!("{}", normalized);
}
a a a a a a a a a a a a a a a a a a a a a a a a a a a

Note that it seems to automatically convert to lowercase. So if your idea of "homoglyph" is to treat "a" and "A" the same, this may work for you.

Answer 2

Score: 1

I assume you use unicode_normalization::UnicodeNormalization; for .nfc()? (Always nice to mention these things.)

According to the relevant standard annex, that will only do "Canonical Decomposition, followed by Canonical Composition". From what I understand of the jargon, that means it will only change how grapheme clusters are represented by characters, but not how they're supposed to be rendered. What you want is probably the "Compatibility Decomposition", which, as indicated here, includes substitutions like ℌ → H. The Compatibility Decomposition is available through .nfkc() or .nfkd() in the unicode_normalization crate.
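As a rough illustration of what compatibility decomposition does to some of the characters in the question, the following standard-library sketch folds just two ranges by arithmetic: fullwidth Latin letters and the Mathematical Bold Latin letters. The names fold_compat_subset and fold_str and the choice of ranges are assumptions made for this example; .nfkc() covers far more mappings. The sketch also hints at a limit of this approach: Greek Α and Cyrillic А have no compatibility mapping to Latin, so they pass through unchanged.

```rust
// Hypothetical helper: mimics the effect of compatibility decomposition
// for two small ranges only. This is NOT a replacement for .nfkc().
fn fold_compat_subset(c: char) -> char {
    let cp = c as u32;
    let folded = match cp {
        0xFF21..=0xFF3A => cp - 0xFF21 + 'A' as u32,   // fullwidth A-Z
        0xFF41..=0xFF5A => cp - 0xFF41 + 'a' as u32,   // fullwidth a-z
        0x1D400..=0x1D419 => cp - 0x1D400 + 'A' as u32, // mathematical bold A-Z
        0x1D41A..=0x1D433 => cp - 0x1D41A + 'a' as u32, // mathematical bold a-z
        _ => cp, // everything else is left as-is
    };
    // Every value produced above is a valid scalar; fall back to the input otherwise.
    char::from_u32(folded).unwrap_or(c)
}

fn fold_str(input: &str) -> String {
    input.chars().map(fold_compat_subset).collect()
}

fn main() {
    // Fullwidth A, mathematical bold A, Greek Alpha, Cyrillic A.
    let input = "\u{FF21} \u{1D400} \u{0391} \u{0410}";
    // The first two become ASCII 'A'; the Greek and Cyrillic letters are unchanged.
    println!("{}", fold_str(input));
}
```

So compatibility normalization may help with the styled mathematical and fullwidth variants, but cross-script look-alikes would still need a confusables-based approach.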

huangapple
  • Posted on March 23, 2023 at 08:56:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75818436.html