2023-07-20 08:18:14

How to convert a long vector of class character containing non-ASCII unicode characters to their escaped version?

Question

I have an R package in which I have a list of university names that I want to match to the user input. The list of names contains special characters and this is generating a warning in R CMD check:

checking data for non-ASCII characters (855ms)
     Warning: found non-ASCII strings

Ideally, I would like to convert these non-ASCII Unicode characters to their ASCII-compliant escaped version to get rid of this warning. Instead of doing it by hand on all of almost 10k rows, I would rather automate the process from the data-generating script in the data-raw folder.

I think I am really close using stringi::stri_escape_unicode(), but it adds an extra backslash which is hard to get rid of. Here is a reprex with my attempts:

uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké",
         "Universidad Católica Cardenal Raúl Silva Henríquez")
uni
#> [1] "Université d'Abobo-Adjamé"
#> [2] "Université de Bouaké"
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"
uni2 <- stringi::stri_escape_unicode(uni)
uni2
#> [1] "Universit\\u00e9 d\\'Abobo-Adjam\\u00e9"
#> [2] "Universit\\u00e9 de Bouak\\u00e9"
#> [3] "Universidad Cat\\u00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"
# gsub removes too many and the special characters are lost
gsub("\\\\", "", uni2)
#> [1] "Universitu00e9 d'Abobo-Adjamu00e9"
#> [2] "Universitu00e9 de Bouaku00e9"
#> [3] "Universidad Catu00f3lica Cardenal Rau00fal Silva Henru00edquez"
# sub removes only the first one so would not work... unless we make it a list!
uni3 <- as.list(uni2)
# But sometimes there is more than one non-ASCII character and those get missed...
lapply(uni3, \(x) {
  sub("\\\\", "", x)
}) |> unlist()
#> [1] "Universitu00e9 d\\'Abobo-Adjam\\u00e9"
#> [2] "Universitu00e9 de Bouak\\u00e9"
#> [3] "Universidad Catu00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"

Created on 2023-07-19 with reprex v2.0.2

I feel like the correct approach must be with stringi::stri_encode(), but I did not find the right way to use it yet:

uni <- c("Université d'Abobo-Adjamé",
         "Université de Bouaké",
         "Universidad Católica Cardenal Raúl Silva Henríquez")
# Not the expected result
stringi::stri_encode(uni, from = "UTF-8", to = "latin2")
#> [1] "Universit\\xe9 d'Abobo-Adjam\\xe9"
#> [2] "Universit\\xe9 de Bouak\\xe9"
#> [3] "Universidad Cat\\xf3lica Cardenal Ra\\xfal Silva Henr\\xedquez"
stringi::stri_encode(uni, from = "UTF-8", to = "latin1")
#> [1] "Université d'Abobo-Adjamé"
#> [2] "Université de Bouaké"
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"

Created on 2023-07-19 with reprex v2.0.2

Surely, there is a better way to do this? If only stringi::stri_escape_unicode() had an argument to specify a single backslash, that would work.

Answer 1

Score: 2

The doubled backslashes are there because R escapes backslashes when it prints a string: each \\ in the printed output is a single backslash in the stored string, so what you have is already what you want. If we write the escaped vector to a CSV file, we can see the doubled backslashes go away:

library(magrittr)  # provides %>%
library(readr)     # provides write_csv()

uni %>%
  stringi::stri_escape_unicode() %>%
  as.data.frame() %>%
  write_csv("test.csv", col_names = FALSE)
# test.csv
Universit\u00e9 d\'Abobo-Adjam\u00e9
Universit\u00e9 de Bouak\u00e9
Universidad Cat\u00f3lica Cardenal Ra\u00fal Silva Henr\u00edquez
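
The same point can be checked without writing a CSV. The base R sketch below compares print() with cat() and round-trips one escaped name through a temporary file. The string x is typed by hand to mirror what stri_escape_unicode() returns for the second name; it is an illustrative assumption, not output copied from the original session.

```r
# Assumed example value: the literal "\\" in R source is ONE backslash in the string.
x <- "Universit\\u00e9 de Bouak\\u00e9"

print(x)       # print() re-escapes the backslash, so it is displayed doubled
cat(x, "\n")   # cat() shows the raw characters: one backslash before u00e9

# nchar() counts each backslash once.
n_chars <- nchar(x)

# Writing the string to a file and reading it back shows what is actually stored.
tmp <- tempfile(fileext = ".txt")
writeLines(x, tmp)
stored <- readLines(tmp)

# Count the backslashes in the stored line (fixed = TRUE: match a literal backslash).
n_backslashes <- lengths(regmatches(stored, gregexpr("\\", stored, fixed = TRUE)))
```

If the file had really contained doubled backslashes, n_backslashes would be 4 rather than 2, so trying to strip them with gsub() or sub() is what breaks the escapes.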
