Is there a way to loop/iterate a series of str_extract calls in R?
Question
I feel like there should be an easy way to do this but I've hit a dead end.
I have a large text dataset, and I want to know which countries are mentioned in each document. Sometimes it will say "afghanistan", sometimes "afghan", but since those are referring to the same country I want to only str_extract the first mention of either of those words. I have a pattern vector that therefore looks like this:
pattern <- c("afghanistan|afghan", "algeria|algerian", "albania|albanian", "angola|angolan", "argentina|argentine")
text <- c("the first stop on the trip is afghanistan, where he will meet the afghan president", "then he will leave afghanistan and head to argentina", "meetings with the afghan president in afghanistan should last 1 hour, and meetings with the argentine president in argentina should last 2 hours")
The goal is a series of vectors/df column that looks like the following:
c("afghanistan")
c("afghanistan", "argentina")
c("afghan", "argentine")
I originally made a long match pattern for all of the countries and nationalities all together and used str_extract_all() + unique() - this worked perfectly except when a text used both "afghanistan" and "afghan", in which case that country would be double counted.
I've tried various versions of map(), mapply(), etc., and it usually results in a list filled with character(0).
The closest I've gotten is a for loop:
country <- as.character(1:length(pattern)) #placeholder vector
for(i in 1:length(pattern)){
  country[i] = str_extract(text, pattern[i])
}
This gives a vector of the correct length, but filled with NAs.
Any ideas on how to iterate a str_extract() call like this would be appreciated!
Answer 1
Score: 0
If I understand correctly, you should be able to just form a single regex alternation and use `str_extract_all()`:

<!-- language: r -->

    library(stringr)

    pattern <- c("afghanistan|afghan", "algeria|algerian", "albania|albanian", "angola|angolan", "argentina|argentine")
    # Join the per-country patterns into one alternation
    regex <- paste(pattern, collapse="|")
    text <- c("the first stop on the trip is afghanistan, where he will meet the afghan president", "then he will leave afghanistan and head to argentina", "meetings with the afghan president in afghanistan should last 1 hour, and meetings with the argentine president in argentina should last 2 hours")
    countries <- str_extract_all(text, regex)
    countries
    # [[1]]
    # [1] "afghanistan" "afghan"
    #
    # [[2]]
    # [1] "afghanistan" "argentina"
    #
    # [[3]]
    # [1] "afghan"      "afghanistan" "argentine"   "argentina"

Note that you are correct to place more specific matches ahead of less specific ones. For example, `afghanistan` appears before `afghan` in the alternation.
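The output above still lists both "afghanistan" and "afghan" for the same text, which is the double counting the question describes. One possible way to collapse them, offered as a sketch rather than as part of the original answer, is to map each match back to the pattern that produced it and keep only the first match per pattern. The helper name `group_of` is hypothetical and introduced only for illustration.

```r
library(stringr)

pattern <- c("afghanistan|afghan", "algeria|algerian", "albania|albanian",
             "angola|angolan", "argentina|argentine")
text <- c("the first stop on the trip is afghanistan, where he will meet the afghan president",
          "then he will leave afghanistan and head to argentina",
          "meetings with the afghan president in afghanistan should last 1 hour, and meetings with the argentine president in argentina should last 2 hours")

regex <- paste(pattern, collapse = "|")
matches <- str_extract_all(text, regex)

# Hypothetical helper: for each matched word, return the index of the first
# country pattern that matches it (e.g. "afghan" and "afghanistan" both give 1)
group_of <- function(m) {
  vapply(m, function(x) which(str_detect(x, pattern))[1], integer(1))
}

# Keep only the first match belonging to each country pattern
countries <- lapply(matches, function(m) unname(m[!duplicated(group_of(m))]))
countries
```

This assumes every matched word belongs to exactly one entry of `pattern`; if two entries could match the same word, the first one wins.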
Answer 2
Score: 0
You can just remove the NA values, for example using `map()`:

```R
library(purrr)
library(stringr)

# For each text, take the first match of every pattern, then drop the NAs
text |>
  map(function(t) map_chr(pattern, ~str_extract(t, .))) |>
  map(~.x[!is.na(.x)])
# [[1]]
# [1] "afghanistan"
#
# [[2]]
# [1] "afghanistan" "argentina"
#
# [[3]]
# [1] "afghan"    "argentine"
```
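For comparison, the same idea can be written as a plain `for` loop. The loop in the question iterates over the patterns and assigns a three-element result (one value per text) into a single slot, so, as far as I can tell, only the value for the first text is kept, with a warning. The sketch below loops over the texts instead and stores each result in a list; the data frame `df` at the end is an assumption added only to show the "df column" form mentioned in the question.

```r
library(stringr)

pattern <- c("afghanistan|afghan", "algeria|algerian", "albania|albanian",
             "angola|angolan", "argentina|argentine")
text <- c("the first stop on the trip is afghanistan, where he will meet the afghan president",
          "then he will leave afghanistan and head to argentina",
          "meetings with the afghan president in afghanistan should last 1 hour, and meetings with the argentine president in argentina should last 2 hours")

# One list element per text, since each text can mention several countries
country <- vector("list", length(text))
for (i in seq_along(text)) {
  # str_extract() is vectorised over pattern: one result (or NA) per country
  hits <- str_extract(text[i], pattern)
  country[[i]] <- hits[!is.na(hits)]
}
country

# Optional: keep the results next to the texts as a list column
df <- data.frame(text = text)
df$country <- country
```

A list is used rather than a character vector because the number of countries differs from one text to the next.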