January 6, 2020, 22:35:02

JavaScript function to convert Unicode pseudo-alphabet to regular characters?

Question

I am trying to write a function that takes any string containing characters from the Unicode pseudo-alphabets and returns an equivalent string in which those characters have been replaced with their regular ASCII counterparts.

const toRegularCharacters = s => {
  // ?
};

toRegularCharacters('ⓗⓔⓛⓛⓞ, ⓦⓞⓡⓛⓓ'); // "hello, world"
toRegularCharacters('𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭'); // "hello, world"
toRegularCharacters('ん乇ﾚﾚo, wo尺ﾚd'); // "hello, world"

I don't want to write a look-up table myself. I have looked at various "slugify" libraries, but they only remove accents etc. Ideally the function should work in Node and the browser.

Of course, not every special character will have a regular equivalent. The solution should make a reasonable guess in these cases (e.g. "尺" -> "R"). It should work flawlessly for the pseudo-alphabets with "true transforms":

> Current true transforms:
circled, negative circled, Asian fullwidth, math bold, math bold Fraktur, math bold italic, math bold script, math double-struck, math monospace, math sans, math sans-serif bold, math sans-serif bold italic, math sans-serif italic, parenthesized, regional indicator symbols, squared, negative squared, and tagging text (invisible for hidden metadata tagging).

From https://qaz.wtf/u/convert.cgi

How should I go about this?

Going from a "regular" string to a pseudo-alphabet one is implemented here: https://qaz.wtf/u/convert.cgi?text=hello%2C+world

Answer 1

Score: 3

You could write your code to query the Unicode database, which you can download from the Unicode consortium (or query via the character utility, but that's presumably rate-limited). The database includes things like what glyphs are "confusables" for other glyphs.

For instance, your 𝓱 from 𝓱𝓮𝓵𝓵𝓸, 𝔀𝓸𝓻𝓵𝓭 is U+1D4F1, which has lots of confusables, one of which is of course the standard Latin lowercase h (U+0068). So you could go through each character in the input string, look it up, and if it has a Latin a-z confusable (perhaps 0-9 as well), replace it with that.

It won't be perfect. As deceze pointed out, ん doesn't list any confusables, even if it does look vaguely like an "h" to an English reader. Neither does ⓗ. So you may need to supplement with your own lookup even though you've said you don't want to (or just live with the imperfection).
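The lookup described above can be sketched roughly as follows. This is a minimal illustration, not a complete implementation: it assumes the confusables data has already been downloaded as text in the "source ; target ; type # comment" layout used by the Unicode confusables.txt file. The three sample lines are hand-written for illustration, and the helper names (fromHex, parseConfusables, replaceConfusables) are hypothetical.

```javascript
// Hand-written excerpt in the "source ; target ; type # comment" layout of
// confusables.txt; these three lines are illustrative, not the real file.
const sampleConfusables = `
1D4F1 ;	0068 ;	MA	# MATHEMATICAL BOLD SCRIPT SMALL H -> LATIN SMALL LETTER H
1D4EE ;	0065 ;	MA	# MATHEMATICAL BOLD SCRIPT SMALL E -> LATIN SMALL LETTER E
0430 ;	0061 ;	MA	# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
`;

// Convert a space-separated list of hex code points to a string.
const fromHex = field =>
  String.fromCodePoint(...field.trim().split(/\s+/).map(h => parseInt(h, 16)));

// Build a Map from each source character to its confusable target.
const parseConfusables = text => {
  const table = new Map();
  for (const line of text.split('\n')) {
    const data = line.split('#')[0]; // drop the trailing comment
    const fields = data.split(';');
    if (fields.length < 2 || !fields[0].trim()) continue; // blank or comment-only line
    table.set(fromHex(fields[0]), fromHex(fields[1]));
  }
  return table;
};

// Replace a character only when its target is plain ASCII a-z, A-Z or 0-9.
const replaceConfusables = (s, table) =>
  [...s].map(ch => {
    const target = table.get(ch);
    return target && /^[a-z0-9]+$/i.test(target) ? target : ch;
  }).join('');

const table = parseConfusables(sampleConfusables);
console.log(replaceConfusables('\u{1D4F1}\u{1D4EE}y', table)); // "hey"
```

Characters with no entry, such as ん, pass through unchanged, which is where a hand-maintained supplement would come in.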


Answer 2

Score: 1

Following the suggestion from this answer, this solution uses the unicode-12.1.0 NPM package:

const unicodeNames = require('unicode-12.1.0/Names');

// Manual overrides for look-alike characters whose Unicode names do not
// contain a Latin letter.
const overrides = Object.freeze({
  'ん': 'h',
  '乇': 'E',
  'ﾚ': 'l',
  '尺': 'r',
  // ...
});

const toRegularCharacters = xs => {
  if (typeof xs !== 'string') {
    throw new TypeError('xs must be a string');
  }

  // Spread the string so that astral characters stay whole code points.
  return [ ...xs ].map(x => {
    const override = overrides[x];

    if (override) {
      return override;
    }

    // Fall back to an empty name when the table has no entry for this code point.
    const names = (unicodeNames.get(x.codePointAt(0)) || '')
      .split(/\s+/);

    // console.log({
    //   x,
    //   names,
    // });

    const isCapital = names.includes('CAPITAL');

    const isLetter = isCapital || names.includes('SMALL');

    if (isLetter) {
      // e.g. "Ŧ" is named "LATIN CAPITAL LETTER T WITH STROKE", so take the
      // word just before "WITH"; otherwise the letter is the last word.
      const withIndex = names.indexOf('WITH');
      const c = withIndex > 0 ?
        names[withIndex - 1] :
        names[names.length - 1];

      return isCapital ?
        c :
        c.toLowerCase();
    }

    return x;
  }).join('');
};

console.log(
  toRegularCharacters('𝕩𝕩.𝕒𝕝𝕖𝕤𝕙𝕪.𝕩𝕩')
);

console.log(
  toRegularCharacters('🅰🅱🅲🅳-🅴🅵🅷')
);

console.log(
  toRegularCharacters('ん乇ﾚﾚo, wo尺ﾚd')
);

console.log(
  toRegularCharacters('ŦɆSŦƗNǤ')
);

The Names data-table contains the required information, but not in the best form, so there is some hacky string manipulation to get the character out.

A map of overrides is used for cases such as '尺'.

A better solution would extract the idn_mapping property as mentioned by @Seth.
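As a related aside that is not part of the answer above: several of the "true transform" alphabets carry compatibility mappings in the Unicode data, and the built-in String.prototype.normalize('NFKC') applies those mappings without any extra package, in both Node and browsers. The sketch below is only an illustration of that built-in. The function name foldCompatibility is hypothetical, and NFKC is a different mechanism from the idn_mapping property, so results may differ.

```javascript
// Sketch: fold compatibility characters with the built-in NFKC normalization.
// foldCompatibility is a hypothetical helper name used for illustration only.
const foldCompatibility = s => s.normalize('NFKC');

// Circled small letters (U+24D7 ...) decompose to plain Latin letters.
console.log(foldCompatibility('\u24D7\u24D4\u24DB\u24DB\u24DE')); // "hello"

// Mathematical bold script letters (U+1D4F1 ...) do as well.
console.log(
  foldCompatibility('\u{1D4F1}\u{1D4EE}\u{1D4F5}\u{1D4F5}\u{1D4F8}')
); // "hello"

// NEGATIVE SQUARED LATIN CAPITAL LETTER A (U+1F170) has no compatibility
// mapping, so it comes back unchanged.
console.log(foldCompatibility('\u{1F170}') === '\u{1F170}'); // true
```

NFKC does not cover every alphabet (negative squared letters pass through untouched, for example) and it also rewrites other compatibility characters such as ligatures and fullwidth punctuation, so it would complement the name-based lookup and the overrides map rather than replace them.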
