How to convert a long character vector containing non-ASCII Unicode characters to their escaped version?

Question

I have an R package that contains a list of university names that I want to match against user input. The names contain special characters, which triggers a warning in R CMD check:

    checking data for non-ASCII characters (855ms)
    Warning: found non-ASCII strings

Ideally, I would like to convert these non-ASCII Unicode characters to their ASCII-compliant escaped versions to get rid of this warning. Rather than doing it by hand for almost 10k rows, I would like to automate the process from the data-generating script in the data-raw folder.

I think I am really close with stringi::stri_escape_unicode(), but it adds an extra backslash that is hard to get rid of. Here is a reprex with my attempts:

    uni <- c("Université d'Abobo-Adjamé",
             "Université de Bouaké",
             "Universidad Católica Cardenal Raúl Silva Henríquez")
    uni
    #> [1] "Université d'Abobo-Adjamé"
    #> [2] "Université de Bouaké"
    #> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"
    uni2 <- stringi::stri_escape_unicode(uni)
    uni2
    #> [1] "Universit\\u00e9 d\\'Abobo-Adjam\\u00e9"
    #> [2] "Universit\\u00e9 de Bouak\\u00e9"
    #> [3] "Universidad Cat\\u00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"
    # gsub removes too many and the special characters are lost
    gsub("\\\\", "", uni2)
    #> [1] "Universitu00e9 d'Abobo-Adjamu00e9"
    #> [2] "Universitu00e9 de Bouaku00e9"
    #> [3] "Universidad Catu00f3lica Cardenal Rau00fal Silva Henru00edquez"
    # sub removes only the first one so would not work... unless we make it a list!
    uni3 <- as.list(uni2)
    # But sometimes there are more than one non-ASCII characters and those get missed...
    lapply(uni3, \(x) {
      sub("\\\\", "", x)
    }) |> unlist()
    #> [1] "Universitu00e9 d\\'Abobo-Adjam\\u00e9"
    #> [2] "Universitu00e9 de Bouak\\u00e9"
    #> [3] "Universidad Catu00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"

<sup>Created on 2023-07-19 with reprex v2.0.2</sup>

I feel like the correct approach must be with stringi::stri_encode(), but I did not find the right way to use it yet:

    uni <- c("Université d'Abobo-Adjamé",
             "Université de Bouaké",
             "Universidad Católica Cardenal Raúl Silva Henríquez")
    # Not the expected result
    stringi::stri_encode(uni, from = "UTF-8", to = "latin2")
    #> [1] "Universit\\xe9 d'Abobo-Adjam\\xe9"
    #> [2] "Universit\\xe9 de Bouak\\xe9"
    #> [3] "Universidad Cat\\xf3lica Cardenal Ra\\xfal Silva Henr\\xedquez"
    stringi::stri_encode(uni, from = "UTF-8", to = "latin1")
    #> [1] "Université d'Abobo-Adjamé"
    #> [2] "Université de Bouaké"
    #> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"

<sup>Created on 2023-07-19 with reprex v2.0.2</sup>

Surely there is a better way to do this? If only stringi::stri_escape_unicode() had an argument to specify a single backslash, that would work.

Answer 1

Score: 2

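The doubled backslashes are not part of the data. When R prints a character vector, it shows each string the way it would be typed as an R literal, and in a literal a backslash has to be escaped with a second backslash. The string itself holds a single backslash, so what you have is already how it is meant to be. If we write the escaped vector to a CSV file, we can see the doubled backslashes go away: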

    library(magrittr)  # provides %>%
    library(readr)     # provides write_csv()

    uni %>%
      stringi::stri_escape_unicode() %>%
      as.data.frame() %>%
      write_csv("test.csv", col_names = FALSE)

    # Contents of test.csv:
    #> Universit\u00e9 d\'Abobo-Adjam\u00e9
    #> Universit\u00e9 de Bouak\u00e9
    #> Universidad Cat\u00f3lica Cardenal Ra\u00fal Silva Henr\u00edquez
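
The same point can be checked without writing a file. The following is a minimal sketch, assuming the stringi package is installed; the two-element vector and the variable name esc are only for illustration. It compares what print() displays with what the string actually contains, and shows that stringi::stri_unescape_unicode() restores the original names where the accented characters are needed again (for example, when matching user input).

```r
# Illustrative input; "\u00e9" in R source code is the single character é,
# written this way only to keep the example itself ASCII.
uni <- c("Universit\u00e9 d'Abobo-Adjam\u00e9",
         "Universit\u00e9 de Bouak\u00e9")

esc <- stringi::stri_escape_unicode(uni)

# print() displays an R literal, so every backslash appears doubled
print(esc[2])
#> [1] "Universit\\u00e9 de Bouak\\u00e9"

# writeLines() displays the actual characters: one backslash per escape
writeLines(esc[2])
#> Universit\u00e9 de Bouak\u00e9

# Each é (1 character) became \u00e9 (6 characters): 20 - 2 + 12 = 30
nchar(uni[2])
#> [1] 20
nchar(esc[2])
#> [1] 30

# The escaped vector is pure ASCII
all(stringi::stri_enc_isascii(esc))
#> [1] TRUE

# Unescaping gives back the original strings
identical(stringi::stri_unescape_unicode(esc), uni)
#> [1] TRUE
```

If the escaped strings are what gets stored in the package data, they would need to be unescaped like this before being compared with user input, since the stored text literally contains a backslash followed by u00e9 rather than the accented letter.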

Posted by huangapple on 2023-07-20 08:18:14
Source: https://go.coder-hub.com/76725934.html