How to convert a long vector of class character containing non-ASCII unicode characters to their escaped version?

Question

I have an R package in which I have a list of university names that I want to match to the user input. The list of names contains special characters and this is generating a warning in R CMD check:

checking data for non-ASCII characters (855ms)
     Warning: found non-ASCII strings

Ideally, I would like to convert these non-ASCII Unicode characters to their ASCII-compliant escaped versions to get rid of this warning. Rather than doing this by hand for almost 10k rows, I would like to automate the process from the data-generating script in the data-raw folder.

I think I am really close using stringi::stri_escape_unicode(), but it adds an extra backslash which is hard to get rid of. Here is a reprex with my attempts:

uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké",
         "Universidad Católica Cardenal Raúl Silva Henríquez")
uni
#> [1] "Université d'Abobo-Adjamé"
#> [2] "Université de Bouaké"
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"

uni2 <- stringi::stri_escape_unicode(uni)
uni2
#> [1] "Universit\\u00e9 d\\'Abobo-Adjam\\u00e9"
#> [2] "Universit\\u00e9 de Bouak\\u00e9"
#> [3] "Universidad Cat\\u00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"

# gsub removes too many and the special characters are lost
gsub("\\\\", "", uni2)
#> [1] "Universitu00e9 d'Abobo-Adjamu00e9"
#> [2] "Universitu00e9 de Bouaku00e9"
#> [3] "Universidad Catu00f3lica Cardenal Rau00fal Silva Henru00edquez"

# sub removes only the first one so would not work... unless we make it a list!
uni3 <- as.list(uni2)

# But sometimes there are more than one non-ASCII characters and those get missed...
lapply(uni3, \(x) {
  sub("\\\\", "", x)
}) |> unlist()
#> [1] "Universitu00e9 d\\'Abobo-Adjam\\u00e9"
#> [2] "Universitu00e9 de Bouak\\u00e9"
#> [3] "Universidad Catu00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"

Created on 2023-07-19 with reprex v2.0.2

I feel like the correct approach must be with stringi::stri_encode(), but I did not find the right way to use it yet:

uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké",
         "Universidad Católica Cardenal Raúl Silva Henríquez")

# Not the expected result
stringi::stri_encode(uni, from = "UTF-8", to = "latin2")
#> [1] "Universit\\xe9 d'Abobo-Adjam\\xe9"
#> [2] "Universit\\xe9 de Bouak\\xe9"
#> [3] "Universidad Cat\\xf3lica Cardenal Ra\\xfal Silva Henr\\xedquez"

stringi::stri_encode(uni, from = "UTF-8", to = "latin1")
#> [1] "Université d'Abobo-Adjamé"
#> [2] "Université de Bouaké"
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"

Created on 2023-07-19 with reprex v2.0.2

Surely, there is a better way to do this? If only stringi::stri_escape_unicode() had an argument to specify a single backslash, that would work.

Answer 1

Score: 2

The doubled backslashes appear only because R's print() escapes each backslash when it displays a string. The strings themselves contain a single backslash, so what you have is already correct. If we write the escaped vector to a CSV file, the doubled backslashes go away:

library(magrittr)  # provides %>%
library(readr)     # provides write_csv()

uni %>%
  stringi::stri_escape_unicode() %>%
  as.data.frame() %>%
  write_csv("test.csv", col_names = FALSE)

# test.csv
Universit\u00e9 d\'Abobo-Adjam\u00e9
Universit\u00e9 de Bouak\u00e9
Universidad Cat\u00f3lica Cardenal Ra\u00fal Silva Henr\u00edquez
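
The same point can be checked without writing a file. The following is a minimal sketch, assuming the stringi package is installed; the variable name `esc` and the shortened two-element `uni` vector are illustrative and not part of the original post. It compares print() with writeLines(), confirms the escaped strings are pure ASCII, and shows that stri_unescape_unicode() recovers the original names.

```r
library(stringi)

# Shortened version of the question's vector, written with \u escapes
# so this snippet is itself ASCII-only
uni <- c("Universit\u00e9 d'Abobo-Adjam\u00e9",
         "Universit\u00e9 de Bouak\u00e9")

esc <- stri_escape_unicode(uni)

# print() displays each backslash doubled ...
print(esc[2])
#> [1] "Universit\\u00e9 de Bouak\\u00e9"

# ... while writeLines() shows the characters actually stored
writeLines(esc[2])
#> Universit\u00e9 de Bouak\u00e9

# One escape is six characters long: a single backslash, "u", four hex digits
nchar("\\u00e9")
#> [1] 6

# The escaped vector contains only ASCII characters
all(stri_enc_isascii(esc))

# Unescaping gives back the original strings
identical(stri_unescape_unicode(esc), uni)
```

If the goal is to silence the R CMD check warning, one option under these assumptions is to store the escaped vector in the package data and call stri_unescape_unicode() when matching against user input.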

Posted by huangapple on 2023-07-20 08:18:14
Source: https://go.coder-hub.com/76725934.html