XPATH内的IMPORTXML:如何查找所有语言的文本?

huangapple go评论65阅读模式
英文:

XPATH inside IMPORTXML: how to find text in all languages?

问题

我试图在某个网页上找到对Google Sheets的单元格值的所有提及。单元格值可以用任何语言编写。网页也可以是任何语言。

现在我有这个公式:

=COUNTA(IFERROR(IMPORTXML(A2;"//*[contains(translate(text(),'ABCDEFGHJIKLMNOPQRSTUVWXYZАБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЭЮЯ', 'abcdefghjiklmnopqrstuvwxyzабвгдеёжзиклмнопрстуфхцчшщэюя'),'"&LOWER(B2)&"')]")))

但它工作不正确。它无法找到拉丁字母和西里尔字母中的所有单词,也没有考虑葡萄牙语、德语和其他语言的字符。如何使这个公式通用?或者如何编写适用于Google Sheets的适当脚本?

我的公式工作不正确。我想要使用Google Sheets的公式或脚本在网页上找到任何语言的任何文本。

英文:

I'm trying to find all mentions of a Google Sheets' cell value on some website page. The cell value can be written in any language. The website page can also be in any language.

Now I have this formula:

=COUNTA(IFERROR(IMPORTXML(A2;"//*[contains(translate(text(),'ABCDEFGHJIKLMNOPQRSTUVWXYZАБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЭЮЯ', 'abcdefghjiklmnopqrstuvwxyzабвгдеёжзиклмнопрстуфхцчшщэюя'),'"&LOWER(B2)&"')]")))

But it works incorrectly. It's not possible to find all words in Latin and Cyrillic, characters from Portuguese, German and other languages ​​are not taken into account. How to make the formula universal? Or how can I write the appropriate script for Google Sheets?

My formula works incorrectly. I want to find any text in any language on website page with Google Sheets' formula or script.

答案1

得分: 0

我们可以更加优雅地解决这个问题,如果Google Docs支持XPath 2.0,并且我们可以使用fn:lower-case()。在XPath 1.0中,我们被限制使用translate(),需要自己提供大写和小写的转换。然而,这种方式很难涵盖所有可能的(Unicode)字符和所有语言。<sup>1</sup> 我已经将拉丁字母的重音符号添加到您的公式中,这应该涵盖大多数西方语言。

=COUNTA(IFERROR(IMPORTXML(A2;"//*[contains(translate(text(),
'ABCDEFGHJIKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞŸŽŠŒАБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЭЮЯ',
'abcdefghjiklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿžšœабвгдеёжзиклмнопрстуфхцчшщэюя'))&LOWER(B2)&"')]"))

<hr>

<sup>1</sup>: 此类转换对于所有语言都不完美,例如,德语的 'ß'(小写)在大写时可能写作 'SS',这使得查找所有匹配变得困难。

英文:

We could solve this more elegantly if Google Docs would support Xpath 2.0 and we could use the fn:lower-case(). In Xpath 1.0 we are stuck with translate() and need to provide the upper-case and lower-case translation ourselves. However, it is difficult to cover all possible (Unicode) characters in all languages this way.<sup>1</sup> I have added the diacritics from the Latin alphabet to your formula, with should cover most Western languages.

=COUNTA(IFERROR(IMPORTXML(A2;&quot;//*[contains(translate(text(),
&#39;ABCDEFGHJIKLMNOPQRSTUVWXYZ&#192;&#193;&#194;&#195;&#196;&#197;&#198;&#199;&#200;&#201;&#202;&#203;&#204;&#205;&#206;&#207;&#208;&#209;&#210;&#211;&#212;&#213;&#214;&#216;&#217;&#218;&#219;&#220;&#221;&#222;ŸŽŠŒАБВГДЕЁЖЗИКЛМНОПРСТУФХЦЧШЩЭЮЯ&#39;, 
&#39;abcdefghjiklmnopqrstuvwxyz&#224;&#225;&#226;&#227;&#228;&#229;&#230;&#231;&#232;&#233;&#234;&#235;&#236;&#237;&#238;&#239;&#240;&#241;&#242;&#243;&#244;&#245;&#246;&#248;&#249;&#250;&#251;&#252;&#253;&#254;&#255;žšœабвгдеёжзиклмнопрстуфхцчшщэюя&#39;),&#39;&quot;&amp;LOWER(B2)&amp;&quot;&#39;)]&quot;)))

<hr>

<sup>1</sup>: Furthermore, such a transformation does not work perfectly for all languages, e.g., the German 'ß' (lowercase) may be written as 'SS' when in uppercase which makes it difficult to find all matches.

huangapple
  • 本文由 发表于 2023年6月22日 15:06:16
  • 转载请务必保留本文链接:https://go.coder-hub.com/76529345.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定