Unicode normalization of homoglyphs to ASCII using Rust
Question
Given a homoglyph, I want a Rust function to convert it to the nearest ASCII character.
All of these Unicode "a"s
A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐
should be converted to:
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
I tried this but it didn't work:
let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
let normalized = input.nfc().collect::<String>(); // normalize using NFC
let result = normalized.to_lowercase(); // convert to lower case
println!("{}", result);
It output:
a α а ꭺ ᗅ ᴀ ꓮ a 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐
Answer 1
Score: 4
The right tool will depend on the purpose of the transformation, but the Unicode standard does indicate these are "confusable" with "A".
You can try using the unicode-security crate and its skeleton() function, which follows the Unicode security mechanisms for Confusable Detection. Using it yields this result:
fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
    let normalized = unicode_security::skeleton(input).collect::<String>();
    let result = normalized.to_lowercase(); // convert to lower case
    println!("{}", result);
}
a a a a a ᴀ a a a a a a a a a a a a a a a a a a a a a
The only outlier there is "ᴀ": U+1D00 (LATIN LETTER SMALL CAPITAL A). I don't know why it is distinct, but I verified that it is consistent with the mappings in Unicode's confusables.txt. It is, however, confusable with "ꭺ": U+AB7A (CHEROKEE SMALL LETTER GO).
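If that one outlier matters for your use case, one option is to fold it by hand after the skeleton step. This is only a minimal std-only sketch: the helper name fold_small_capital_a is hypothetical (it is not part of unicode-security), and it handles U+1D00 and nothing else.

```rust
/// Hypothetical helper (not part of unicode-security): replaces U+1D00
/// (LATIN LETTER SMALL CAPITAL A) with ASCII 'a' and leaves every other
/// character untouched.
fn fold_small_capital_a(text: &str) -> String {
    text.chars()
        .map(|c| if c == '\u{1D00}' { 'a' } else { c })
        .collect()
}

fn main() {
    // Abbreviated stand-in for the lowercased skeleton output.
    let skeleton_output = "a a \u{1D00} a";
    println!("{}", fold_small_capital_a(skeleton_output)); // prints "a a a a"
}
```

Whether folding this character to "a" is appropriate depends on why you are normalizing in the first place.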
I also found the decancer crate, which "removes common confusables from strings" and seems to use an expanded definition of "confusable". Here's how that would look:
fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐";
    let normalized = decancer::cure(input).into_str();
    println!("{}", normalized);
}
a a a a a a a a a a a a a a a a a a a a a a a a a a a
Note that it seems to convert to lowercase automatically. So if your idea of a "homoglyph" treats "a" and "A" as the same, this may work for you.
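Whichever crate you pick, it can help to check what it left behind. Here is a small std-only sketch that lists any non-ASCII characters remaining in a result, using the same U+XXXX notation as above. The function name non_ascii_code_points is made up for illustration and does not come from either crate.

```rust
/// Hypothetical helper: returns each non-ASCII character in `text`
/// formatted as "U+XXXX", in order of appearance.
fn non_ascii_code_points(text: &str) -> Vec<String> {
    text.chars()
        .filter(|c| !c.is_ascii())
        .map(|c| format!("U+{:04X}", c as u32))
        .collect()
}

fn main() {
    // Abbreviated stand-in for a normalized result with one leftover.
    let leftovers = non_ascii_code_points("a a \u{1D00} a");
    println!("{:?}", leftovers); // prints ["U+1D00"]
}
```

An empty list means the string is plain ASCII; anything else tells you exactly which code points still need a decision.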
Answer 2
Score: 1
I assume you use unicode_normalization::UnicodeNormalization; for .nfc()? (Always nice to mention these things.)
According to the relevant standard annex, that will only do "Canonical Decomposition, followed by Canonical Composition". From what I understand of the jargon, that means it will only change how grapheme clusters are represented by characters, but not how they're supposed to be rendered. What you want is probably the "Compatibility Decomposition", which, as indicated here, includes substitutions like ℌ → H. The Compatibility Decomposition is available through .nfkc() or .nfkd() in the unicode_normalization crate.