How to convert a long vector of class character containing non-ASCII unicode characters to their escaped version?
Question
I have an R package in which I have a list of university names that I want to match to the user input. The list of names contains special characters and this is generating a warning in R CMD check:
checking data for non-ASCII characters (855ms)
Warning: found non-ASCII strings
Ideally, I would like to convert these non-ASCII Unicode characters to their ASCII-compliant escaped versions to get rid of this warning. Rather than doing it by hand for almost 10k rows, I would like to automate the process from the data-generating script in the data-raw folder.
I think I am really close using stringi::stri_escape_unicode(), but it adds an extra backslash which is hard to get rid of. Here is a reprex with my attempts:
uni <- c("Université d'Abobo-Adjamé",
"Université de Bouaké",
"Universidad Católica Cardenal Raúl Silva Henríquez")
uni
#> [1] "Université d'Abobo-Adjamé"
#> [2] "Université de Bouaké"
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"
uni2 <- stringi::stri_escape_unicode(uni)
uni2
#> [1] "Universit\\u00e9 d\\'Abobo-Adjam\\u00e9"
#> [2] "Universit\\u00e9 de Bouak\\u00e9"
#> [3] "Universidad Cat\\u00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"
# gsub removes too many and the special characters are lost
gsub("\\\\", "", uni2)
#> [1] "Universitu00e9 d'Abobo-Adjamu00e9"
#> [2] "Universitu00e9 de Bouaku00e9"
#> [3] "Universidad Catu00f3lica Cardenal Rau00fal Silva Henru00edquez"
# sub removes only the first one so would not work... unless we make it a list!
uni3 <- as.list(uni2)
# But sometimes there are more than one non-ASCII characters and those get missed...
lapply(uni3, \(x) {
sub("\\\\", "", x)
}) |> unlist()
#> [1] "Universitu00e9 d\\'Abobo-Adjam\\u00e9"
#> [2] "Universitu00e9 de Bouak\\u00e9"
#> [3] "Universidad Catu00f3lica Cardenal Ra\\u00fal Silva Henr\\u00edquez"
<sup>Created on 2023-07-19 with reprex v2.0.2</sup>
I feel like the correct approach must be with stringi::stri_encode(), but I have not found the right way to use it yet:
uni <- c("Université d'Abobo-Adjamé",
"Université de Bouaké",
"Universidad Católica Cardenal Raúl Silva Henríquez")
# Not the expected result
stringi::stri_encode(uni, from = "UTF-8", to = "latin2")
#> [1] "Universit\\xe9 d'Abobo-Adjam\\xe9"
#> [2] "Universit\\xe9 de Bouak\\xe9"
#> [3] "Universidad Cat\\xf3lica Cardenal Ra\\xfal Silva Henr\\xedquez"
stringi::stri_encode(uni, from = "UTF-8", to = "latin1")
#> [1] "Université d'Abobo-Adjamé"
#> [2] "Université de Bouaké"
#> [3] "Universidad Católica Cardenal Raúl Silva Henríquez"
<sup>Created on 2023-07-19 with reprex v2.0.2</sup>
Surely there is a better way to do this? If only stringi::stri_escape_unicode() had an argument to specify a single backslash, that would work.
Answer 1
Score: 2
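The doubled backslashes are only how R prints the strings: a backslash inside a string literal has to be escaped, so print() displays each one as \\. Each escape sequence actually contains a single backslash, so what you have is already what it is meant to be. If we write the escaped vector to a CSV file, we can see the doubled backslashes go away: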
library(magrittr)  # provides %>%
library(readr)     # provides write_csv()

uni %>%
  stringi::stri_escape_unicode() %>%
  as.data.frame() %>%
  write_csv("test.csv", col_names = FALSE)
# test.csv
Universit\u00e9 d\'Abobo-Adjam\u00e9
Universit\u00e9 de Bouak\u00e9
Universidad Cat\u00f3lica Cardenal Ra\u00fal Silva Henr\u00edquez
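The same point can be checked without writing a file. The following is a minimal sketch, assuming the stringi package is installed; the variable names esc and back are illustrative and not from the original post. It contrasts print() with writeLines(), confirms that the escaped strings are pure ASCII, and shows that stri_unescape_unicode() recovers the original names when they are needed for matching.

```r
library(stringi)

# Two of the names from the question, written with \u escapes so that this
# snippet itself is ASCII-only
uni <- c("Universit\u00e9 d'Abobo-Adjam\u00e9",
         "Universit\u00e9 de Bouak\u00e9")

esc <- stri_escape_unicode(uni)

# print() shows the R string-literal form, so every backslash appears doubled
print(esc)

# writeLines() shows the characters actually stored: one backslash per escape
writeLines(esc)

# The backslash is a single character: "\u00e9" is stored as 6 characters
nchar("\\u00e9")

# The escaped strings contain only ASCII characters
all(stri_enc_isascii(esc))

# Unescaping restores the original names, e.g. before matching user input
back <- stri_unescape_unicode(esc)
all(back == uni)
```

Note that the escaped vector stores a literal backslash followed by u00e9, not the character é, so any code that compares these names with user input would need to unescape them first (or escape the input in the same way).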