R {quanteda}: remove accents in a dictionary

Question

I want to remove accents and punctuation from a dictionary. For example, I want to transform "à l'épreuve" into "a l epreuve". The dictionary is the Lexicoder one (.cat format): https://www.poltext.org/fr/donnees-et-analyses/lexicoder.
There are explanations for data frames (https://stackoverflow.com/questions/39148759/remove-accents-from-a-dataframe-column-in-r), but I could not find a way to do the same for dictionaries.

My code so far:

dict_lg <- dictionary(file = "frlsd/frlsd.cat", encoding = "UTF-8")

Any suggestions?

Answer 1

Score: 1

This should work:

library(quanteda)
library(stringi)
library(stringr)

dict_lg_ascii <-
  dict_lg |>
  rapply(f = \(term) term |>
              ## compose from string utilities as desired
              stri_trans_general(id = 'Latin-ASCII') |>
              str_replace_all(pattern = '[[:punct:]]', replacement = ' '),
         how = 'replace'
         )

Output:

## > dict_lg_ascii
Dictionary object with 2 primary key entries and 2 nested levels.
- [NEGATIVE]:
  - a cornes, a court de personnel , a l etroit, a peine , abais , 
## truncated

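From the docs:
> Dictionaries can be subsetted using [ and [[, operating the same as the
> equivalent list operators.

Thus rapply (which recursively applies a function over nested lists) works on a dictionary. In this case, the function applied to each term is stri_trans_general with the 'Latin-ASCII' transform, followed by str_replace_all to turn punctuation into spaces.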
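One caveat worth checking: the trailing spaces in the output above (for example "abais ,") suggest that wildcard characters such as * were replaced as well, since * is matched by [[:punct:]]. That is an inference from the printed output, not something the answer states. If the wildcards matter for later matching, a variant along these lines might help. It is only a sketch: clean_term is a hypothetical helper name, the toy list and its entries are made up to stand in for the real dictionary, and base gsub is swapped in for str_replace_all so that a lookahead can skip the asterisk.

```r
library(stringi)

# Hypothetical helper (not from the answer): transliterate to ASCII, then
# replace every punctuation character except "*" with a space.
clean_term <- function(term) {
  ascii <- stri_trans_general(term, id = "Latin-ASCII")
  # (?![*]) is a negative lookahead, so the glob wildcard is left alone
  gsub("(?![*])[[:punct:]]", " ", ascii, perl = TRUE)
}

# Toy nested list standing in for the dictionary; the entries are invented.
# "\u00e0" is a-grave and "\u00e9" is e-acute, escaped so the script does
# not depend on the encoding of the source file.
toy <- list(
  NEGATIVE = c("\u00e0 l'\u00e9preuve", "abaiss*"),
  POSITIVE = list(CALM = "\u00e0 l'aise")
)

# how = "replace" keeps the nesting and only rewrites the leaf vectors
cleaned <- rapply(toy, clean_term, how = "replace")
str(cleaned)
```

The same helper could presumably be passed as f to the rapply call on dict_lg shown in the answer, though that has not been verified against the real .cat file.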

Answer 2

Score: 0

This post might help: https://stackoverflow.com/questions/10294284/remove-all-special-characters-from-a-string-in-r

The stringr package, together with regular expressions, is probably a good way to deal with it.

huangapple
  • Posted on July 20, 2023 at 19:47:59
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76729535.html