JavaScript function to convert Unicode pseudo-alphabets to regular characters?

Question

I am trying to write a function that takes any string containing characters from the Unicode pseudo-alphabets and returns an equivalent string in which those characters have been replaced with the regular characters found in ASCII.

const toRegularCharacters = s => {
  // ?
};

toRegularCharacters('ⓗⓔⓛⓛⓞ, ⓦⓞⓡⓛⓓ'); // "hello, world"
toRegularCharacters('𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭'); // "hello, world"
toRegularCharacters('ん乇レレo, wo尺レd'); // "hello, world"

I don't want to write a look-up table myself. I have looked at various "slugify" libraries, but they only remove accents etc. Ideally the function should work in Node and the browser.

Of course, not every special character will have a regular equivalent. The solution should make a reasonable guess in these cases (e.g. "尺" -> "R"). It should work flawlessly for the pseudo-alphabets with "true transforms":

> Current true transforms: circled, negative circled, Asian fullwidth, math bold, math bold Fraktur, math bold italic, math bold script, math double-struck, math monospace, math sans, math sans-serif bold, math sans-serif bold italic, math sans-serif italic, parenthesized, regional indicator symbols, squared, negative squared, and tagging text (invisible for hidden metadata tagging).

(Quoted from https://qaz.wtf/u/convert.cgi)

How should I go about this?


Going from a "regular" string to a pseudo-alphabet one is implemented here: https://qaz.wtf/u/convert.cgi?text=hello%2C+world

Answer 1

Score: 3

You could write your code to query the Unicode database, which you can download from the Unicode consortium (or query via the character utility, but that's presumably rate-limited). The database includes things like what glyphs are "confusables" for other glyphs.

For instance, your 𝓱 from 𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭 is U+1D4F1, which has lots of confusables, one of which is of course the standard Latin lowercase h (U+0068). So you could go through each character in the input string, look it up, and if it has a Latin a-z confusable (perhaps 0-9 as well), replace it with that.

It won't be perfect. As deceze pointed out, ん doesn't list any confusables, even if it does look vaguely like an "h" to an English reader. Neither do some of the other characters in that string. So you may need to supplement with your own lookup even though you've said you don't want to (or just live with the imperfection).
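
A minimal sketch of that lookup, assuming the confusables data has already been downloaded as text. The `sample` string below only imitates the `source ; target ; type # comment` layout of confusables.txt with a few illustrative lines, and `parseConfusables` and `replaceConfusables` are hypothetical helper names, not part of any library:

```javascript
// Illustrative excerpt in the confusables.txt layout (assumed entries):
//   source code point ; target code point(s) ; type # comment
const sample = [
  '# comment lines and blank lines are skipped',
  '',
  '1D4F1 ;\t0068 ;\tMA\t# MATHEMATICAL BOLD SCRIPT SMALL H -> LATIN SMALL LETTER H',
  '1D4EE ;\t0065 ;\tMA\t# MATHEMATICAL BOLD SCRIPT SMALL E -> LATIN SMALL LETTER E',
  '1D4F5 ;\t006C ;\tMA\t# MATHEMATICAL BOLD SCRIPT SMALL L -> LATIN SMALL LETTER L',
  '1D4F8 ;\t006F ;\tMA\t# MATHEMATICAL BOLD SCRIPT SMALL O -> LATIN SMALL LETTER O',
].join('\n');

// Turn a field such as "0044 017D" into the string it encodes.
const hexToString = field =>
  String.fromCodePoint(
    ...field.trim().split(/\s+/).map(hex => parseInt(hex, 16))
  );

// Build a Map from each source character to its listed target string.
const parseConfusables = text => {
  const table = new Map();

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop trailing comment

    if (!line) {
      continue;
    }

    const [source, target] = line.split(';');

    if (!source || !target || !source.trim() || !target.trim()) {
      continue; // not a data line
    }

    table.set(hexToString(source), hexToString(target));
  }

  return table;
};

// Replace a character only when its target is a single ASCII letter or digit.
const replaceConfusables = (s, table) =>
  [...s].map(ch => {
    const target = table.get(ch);

    return target !== undefined && /^[A-Za-z0-9]$/.test(target) ? target : ch;
  }).join('');

console.log(replaceConfusables('𝓱𝓮𝓵𝓵𝓸', parseConfusables(sample))); // "hello"
```

Spreading the string with `[...s]` iterates by code point, so characters outside the Basic Multilingual Plane such as 𝓱 are looked up whole rather than as two surrogate halves. Characters with no entry, such as ん, are returned unchanged, which is where a hand-written supplement would come in.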

Answer 2

Score: 1

Following the suggestion from this answer, this solution uses the unicode-12.1.0 NPM package:

const unicodeNames = require('unicode-12.1.0/Names');

const overrides = Object.freeze({
  'ん': 'h',
  '乇': 'E',
  'レ': 'l',
  '尺': 'r',
  // ...
});

const toRegularCharacters = xs => {
  if (typeof xs !== 'string') {
    throw new TypeError('xs must be a string');
  }

  return [ ...xs ].map(x => {
    const override = overrides[x];

    if (override) {
      return override;
    }

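    // The lookup may return nothing for a code point without an entry,
    // so fall back to the original character in that case.
    const name = unicodeNames.get(x.codePointAt(0));

    if (!name) {
      return x;
    }

    const names = name.split(/\s+/);

    const isCapital = names.includes('CAPITAL');

    const isLetter = isCapital || names.includes('SMALL');

    if (isLetter) {
      // e.g. "Ŧ" is named "LATIN CAPITAL LETTER T WITH STROKE", so take the
      // word just before "WITH"; otherwise take the last word of the name.
      const withIndex = names.indexOf('WITH');
      const c = withIndex > 0 ?
        names[withIndex - 1] :
        names[names.length - 1];

      return isCapital ?
        c :
        c.toLowerCase();
    }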

    return x;
  }).join('');
};

console.log(
  toRegularCharacters('𝕩𝕩.𝕒𝕝𝕖𝕤𝕙𝕪.𝕩𝕩')
);

console.log(
  toRegularCharacters('🅰🅱🅲🅳-🅴🅵🅷')
);

console.log(
  toRegularCharacters('ん乇レレo, wo尺レd')
);

console.log(
  toRegularCharacters('ŦɆSŦƗNǤ')
);

The Names data-table contains the required information, but not in the best form, so there is some hacky string manipulation to get the character out.

A map of overrides is used for cases such as '尺'.

A better solution would extract the idn_mapping property as mentioned by @Seth.
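
As a related shortcut (a swapped-in technique, not the idn_mapping extraction itself), the built-in `String.prototype.normalize` with the `'NFKC'` form applies Unicode compatibility mappings, which covers several of the pseudo-alphabets from the question, such as the circled and mathematical letters. A sketch, assuming a runtime with normalization support; `lookalikes` and `toRegularCharactersNfkc` are names made up here, and the lower-case 'e' for '乇' is a choice made to match the output the question asks for:

```javascript
// Hand-written fallbacks for look-alikes that have no compatibility mapping.
const lookalikes = new Map([
  ['ん', 'h'],
  ['乇', 'e'],
  ['レ', 'l'],
  ['尺', 'r'],
]);

const toRegularCharactersNfkc = s => {
  if (typeof s !== 'string') {
    throw new TypeError('s must be a string');
  }

  // NFKC folds compatibility variants (e.g. circled or mathematical
  // letters) into their plain forms; then apply the manual fallbacks.
  return [...s.normalize('NFKC')]
    .map(ch => (lookalikes.has(ch) ? lookalikes.get(ch) : ch))
    .join('');
};

console.log(toRegularCharactersNfkc('ⓗⓔⓛⓛⓞ, ⓦⓞⓡⓛⓓ')); // "hello, world"
console.log(toRegularCharactersNfkc('𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭')); // "hello, world"
console.log(toRegularCharactersNfkc('ん乇レレo, wo尺レd')); // "hello, world"
```

This does not cover every alphabet on the list: characters without a compatibility mapping, such as the negative squared '🅰', pass through unchanged, so it would need to be combined with one of the table-based approaches above.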

huangapple
  • Posted on January 6, 2020, 22:35:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59613915.html