How, in R, would I replace String Values in one column of a dataframe with string values from another dataframe using a fuzzy match on a 3rd column?

huangapple go评论68阅读模式
英文:

How, in R, would I replace String Values in one column of a dataframe with string values from another dataframe using a fuzzy match on a 3rd column?

问题

我的Google搜索似乎失败了,因为我觉得之前一定有人遇到过这个问题,而我找不到解决方案。

假设我有以下数据:

df1 <- data.frame(Name = c("Banana", "Apple", "Sponge", "Donut", "Bleach", "Wine"),
                 Value = c("Fruit", "Fruit", "Cleaner", "Dessert", "Cleaner", "Party"))

df2 <- data.frame(Name = c("Ban", "Ap", "Do", "Wi"),
                  Value = c("F", "F", "D", "P"))

我需要能够对df1$Name与df2$Name进行模糊字符串匹配,如果它们匹配,将df1$Value替换为df2$Value,如果没有匹配,我想保留df1中的值。因此,我的输出应该如下所示:

Name Value
Banana F
Apple F
Sponge Cleaner
Donut D
Bleach Cleaner
Wine P

我希望任何以"Ban"开头的都变成"F"。我不关心它是否说"BananaKin"或df2$Name之后的任何内容。df1中的行数是成千上万,而df2只有67行。

此外,df1还有其他不相关的列,但我想保留。

我找到的最接近我想要的代码是:

df3 <- df1 %>% mutate(across(c(Name), ~if_else(str_detect(Name, df2$Name), str_replace(Value, df2$Value))))

然而,我得到了以下错误:

Error in mutate():
In argument: across(...).
Caused by error in str_detect():
Can't recycle string (size 1362) to match pattern (size 67).
Run rlang::last_trace() to see where the error occurred.

我使用以下代码也得到了相同的错误:

df1$Value <- str_replace_all(df1$Value, df1$Name == df2$Name, df2$Value)

我还尝试过:

df1 <- df1 %>% mutate(across(everything(), ~deframe(df2[.])))

我漏掉了什么?我觉得可能是我没有看到的简单问题。我相对新手,非常感谢帮助!

英文:

My Google-fu must be failing as I feel someone must've had this problem before and I can't find a solution.

Let's say I have the following data:

df1 &lt;- data.frame(Name = c(&quot;Banana&quot;, &quot;Apple&quot;, &quot;Sponge&quot;, &quot;Donut&quot;, &quot;Bleach&quot;, &quot;Wine&quot;),
                 Value = c(&quot;Fruit&quot;, &quot;Fruit&quot;, &quot;Cleaner&quot;, &quot;Dessert&quot;, &quot;Cleaner&quot;, &quot;Party&quot;))

df2 &lt;- data.frame(Name = c(&quot;Ban&quot;, &quot;Ap&quot;, &quot;Do&quot;, &quot;Wi&quot;),
                  Value = c(&quot;F&quot;, &quot;F&quot;, &quot;D&quot;, &quot;P&quot;))

I need to be able to do a fuzzy string match on df1$Name with df2$Name, and where they match replace df1$Value with df2$Value where $Name fuzzy matches. If there is no match, I want to retain the value from df1. Such that, my output would look like this:

Name Value
Banana F
Apple F
Sponge Cleaner
Donut D
Bleach Cleaner
Wine P

I want anything starting with "Ban" to be F. I don't care if it says Bananakin or anything after df2$Name. The number of rows in df1 is in the thousands, df2 is merely 67.

Further, df1 has additional columns that aren't relevant but I would like to keep.

The closest code I've found to what I want is this:

df3 &lt;- df1 %&gt;% mutate(across(c(Name), ~if_else(str_detect(Name, df2$Name), str_replace(Value .,df2$Name)))

However I get the following error:

> Error in mutate():
>ℹ In argument: across(...).
>Caused by error in str_detect():
>! Can't recycle string (size 1362) to match pattern (size 67).
>Run rlang::last_trace() to see where the error occurred.

I get the same error for this bit of code as well:

>df1$Value <- str_replace_all(df1$Value, df1$Name == df2$Name, df2$Value)

I've also tried:

df1 &lt;- df1 %&gt;% mutate(across(everything(), ~deframe(df2[.])))

What am I missing? I have a feeling it's something simple I'm just not seeing. I'm relatively new to R, and would greatly appreciate the help!

答案1

得分: 0

df1 %>%
  mutate(Name1 = Name %>%
           str_replace_all(set_names(df2$Value, str_c(df2$Name, ".*")),
         Value = ifelse(Name == Name1, Value, Name1),
         Name1 = NULL)

   Name   Value
1 Banana       F
2  Apple       F
3 Sponge Cleaner
4  Donut       D
5 Bleach Cleaner
6   Wine       P

----

fuzzyjoin::regex_left_join(df1, df2, 'Name') %>%
   reframe(Name = Name.x, Value = coalesce(Value.y, Value.x))

    Name   Value
1 Banana       F
2  Apple       F
3 Sponge Cleaner
4  Donut       D
5 Bleach Cleaner
6   Wine       P
英文:
df1 %&gt;%
  mutate(Name1 = Name %&gt;%
           str_replace_all(set_names(df2$Value, str_c(df2$Name, &quot;.*&quot;))),
         Value = ifelse(Name == Name1, Value, Name1),
         Name1 = NULL)

   Name   Value
1 Banana       F
2  Apple       F
3 Sponge Cleaner
4  Donut       D
5 Bleach Cleaner
6   Wine       P

fuzzyjoin::regex_left_join(df1, df2, &#39;Name&#39;) %&gt;%
   reframe(Name = Name.x, Value = coalesce(Value.y, Value.x))

    Name   Value
1 Banana       F
2  Apple       F
3 Sponge Cleaner
4  Donut       D
5 Bleach Cleaner
6   Wine       P

huangapple
  • 本文由 发表于 2023年6月16日 08:11:43
  • 转载请务必保留本文链接:https://go.coder-hub.com/76486221.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定