R {quanteda}: remove accents in a dictionary
Question
I want to remove accents and punctuation from a dictionary. For example, I want to transform "à l'épreuve" into "a l epreuve". The dictionary is this one: https://www.poltext.org/fr/donnees-et-analyses/lexicoder (.cat).
There are explanations for data frames (https://stackoverflow.com/questions/39148759/remove-accents-from-a-dataframe-column-in-r), but I could not find a way to do the same for a dictionary.
My code so far:
    dict_lg <- dictionary(file = "frlsd/frlsd.cat", encoding = "UTF-8")
Any suggestions?
Answer 1
Score: 1
This should work:
    library(quanteda)
    library(stringi)
    library(stringr)
    dict_lg_ascii <-
      dict_lg |>
      rapply(f = \(term) term |>
               ## compose from string utilities as desired
               stri_trans_general(id = 'Latin-ASCII') |>
               str_replace_all(pattern = '[[:punct:]]', replacement = ' '),
             how = 'replace'
      )
Output:
    ## > dict_lg_ascii
    Dictionary object with 2 primary key entries and 2 nested levels.
    - [NEGATIVE]:
      - a cornes, a court de personnel , a l etroit, a peine , abais ,
    ## truncated
From the docs:
> Dictionaries can be subsetted using [ and [[, operating the same as the
> equivalent list operators.
Thus rapply (which recursively applies a function over nested lists) works. In this case, we apply stri_trans_general as suggested here.
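The recursion can be tried without the dictionary file. Below is a minimal sketch on a plain nested list standing in for the dictionary; the list toy, its entries, and the helper to_ascii are hypothetical names made up for illustration. The trailing spaces in the output above suggest that the original entries end in a glob wildcard (*), which [[:punct:]] also replaces. Assuming that is the case, this variant keeps the asterisk by replacing only characters that are not letters, digits, whitespace, or *.

```r
library(stringi)

# Hypothetical stand-in for the nested structure of the dictionary;
# the entries are illustrative and not copied from the .cat file.
toy <- list(
  NEGATIVE = list(c("\u00e0 l'\u00e9preuve*", "\u00e0 peine*")),  # "à l'épreuve*", "à peine*"
  POSITIVE = list(c("agr\u00e9able*"))                            # "agréable*"
)

# Hypothetical helper: transliterate accents to ASCII, then replace every
# character that is not a letter, digit, whitespace, or '*' with a space.
to_ascii <- function(term) {
  term <- stri_trans_general(term, id = "Latin-ASCII")
  stri_replace_all_regex(term, "[^\\p{L}\\p{N}\\s*]", " ")
}

# rapply() visits every character vector in the nested list and
# how = "replace" keeps the list structure intact.
toy_ascii <- rapply(toy, to_ascii, how = "replace")
print(toy_ascii$NEGATIVE[[1]])
# [1] "a l epreuve*" "a peine*"
```

If the wildcards should be dropped after all, the original pattern '[[:punct:]]' can be used instead. Whether to keep them depends on how the dictionary is matched later; quanteda's lookup functions default to valuetype = "glob", where * acts as a wildcard.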
Answer 2
Score: 0
This post might help: https://stackoverflow.com/questions/10294284/remove-all-special-characters-from-a-string-in-r
The stringr package, together with regular expressions, is probably a good way to deal with it.
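As a rough illustration on the example string from the question: a regular expression takes care of the punctuation, but the accents need a separate transliteration step, here borrowed from stringi. The variable names are placeholders, and str_squish is added only to tidy up any doubled or trailing spaces.

```r
library(stringi)
library(stringr)

x <- "\u00e0 l'\u00e9preuve"  # "à l'épreuve"

# Step 1: transliterate accented letters to their ASCII counterparts.
x_ascii <- stri_trans_general(x, id = "Latin-ASCII")

# Step 2: replace punctuation with a space, then collapse extra whitespace.
x_clean <- str_squish(str_replace_all(x_ascii, "[[:punct:]]", " "))

print(x_clean)
# [1] "a l epreuve"
```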