英文:
How, in R, would I replace String Values in one column of a dataframe with string values from another dataframe using a fuzzy match on a 3rd column?
问题
我的Google搜索似乎失败了,因为我觉得之前一定有人遇到过这个问题,而我找不到解决方案。
假设我有以下数据:
df1 <- data.frame(Name = c("Banana", "Apple", "Sponge", "Donut", "Bleach", "Wine"),
Value = c("Fruit", "Fruit", "Cleaner", "Dessert", "Cleaner", "Party"))
df2 <- data.frame(Name = c("Ban", "Ap", "Do", "Wi"),
Value = c("F", "F", "D", "P"))
我需要能够对df1$Name与df2$Name进行模糊字符串匹配,如果它们匹配,将df1$Value替换为df2$Value,如果没有匹配,我想保留df1中的值。因此,我的输出应该如下所示:
Name | Value |
---|---|
Banana | F |
Apple | F |
Sponge | Cleaner |
Donut | D |
Bleach | Cleaner |
Wine | P |
我希望任何以"Ban"开头的都变成"F"。我不关心它是否说"BananaKin"或df2$Name之后的任何内容。df1中的行数是成千上万,而df2只有67行。
此外,df1还有其他不相关的列,但我想保留。
我找到的最接近我想要的代码是:
df3 <- df1 %>% mutate(across(c(Name), ~if_else(str_detect(Name, df2$Name), str_replace(Value, df2$Value))))
然而,我得到了以下错误:
Error in
mutate()
:
In argument:across(...)
.
Caused by error instr_detect()
:
Can't recyclestring
(size 1362) to matchpattern
(size 67).
Runrlang::last_trace()
to see where the error occurred.
我使用以下代码也得到了相同的错误:
df1$Value <- str_replace_all(df1$Value, df1$Name == df2$Name, df2$Value)
我还尝试过:
df1 <- df1 %>% mutate(across(everything(), ~deframe(df2[.])))
我漏掉了什么?我觉得可能是我没有看到的简单问题。我相对新手,非常感谢帮助!
英文:
My Google-fu must be failing as I feel someone must've had this problem before and I can't find a solution.
Let's say I have the following data:
df1 <- data.frame(Name = c("Banana", "Apple", "Sponge", "Donut", "Bleach", "Wine"),
Value = c("Fruit", "Fruit", "Cleaner", "Dessert", "Cleaner", "Party"))
df2 <- data.frame(Name = c("Ban", "Ap", "Do", "Wi"),
Value = c("F", "F", "D", "P"))
I need to be able to do a fuzzy string match on df1$Name with df2$Name, and where they match replace df1$Value with df2$Value where $Name fuzzy matches. If there is no match, I want to retain the value from df1. Such that, my output would look like this:
Name | Value |
---|---|
Banana | F |
Apple | F |
Sponge | Cleaner |
Donut | D |
Bleach | Cleaner |
Wine | P |
I want anything starting with "Ban" to be F. I don't care if it says Bananakin or anything after df2$Name. The number of rows in df1 is in the thousands, df2 is merely 67.
Further, df1 has additional columns that aren't relevant but I would like to keep.
The closest code I've found to what I want is this:
df3 <- df1 %>% mutate(across(c(Name), ~if_else(str_detect(Name, df2$Name), str_replace(Value .,df2$Name)))
However I get the following error:
> Error in mutate()
:
>ℹ In argument: across(...)
.
>Caused by error in str_detect()
:
>! Can't recycle string
(size 1362) to match pattern
(size 67).
>Run rlang::last_trace()
to see where the error occurred.
I get the same error for this bit of code as well:
>df1$Value <- str_replace_all(df1$Value, df1$Name == df2$Name, df2$Value)
I've also tried:
df1 <- df1 %>% mutate(across(everything(), ~deframe(df2[.])))
What am I missing? I have a feeling it's something simple I'm just not seeing. I'm relatively new to R, and would greatly appreciate the help!
答案1
得分: 0
df1 %>%
mutate(Name1 = Name %>%
str_replace_all(set_names(df2$Value, str_c(df2$Name, ".*")),
Value = ifelse(Name == Name1, Value, Name1),
Name1 = NULL)
Name Value
1 Banana F
2 Apple F
3 Sponge Cleaner
4 Donut D
5 Bleach Cleaner
6 Wine P
----
fuzzyjoin::regex_left_join(df1, df2, 'Name') %>%
reframe(Name = Name.x, Value = coalesce(Value.y, Value.x))
Name Value
1 Banana F
2 Apple F
3 Sponge Cleaner
4 Donut D
5 Bleach Cleaner
6 Wine P
英文:
df1 %>%
mutate(Name1 = Name %>%
str_replace_all(set_names(df2$Value, str_c(df2$Name, ".*"))),
Value = ifelse(Name == Name1, Value, Name1),
Name1 = NULL)
Name Value
1 Banana F
2 Apple F
3 Sponge Cleaner
4 Donut D
5 Bleach Cleaner
6 Wine P
fuzzyjoin::regex_left_join(df1, df2, 'Name') %>%
reframe(Name = Name.x, Value = coalesce(Value.y, Value.x))
Name Value
1 Banana F
2 Apple F
3 Sponge Cleaner
4 Donut D
5 Bleach Cleaner
6 Wine P
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论