How, in R, would I replace String Values in one column of a dataframe with string values from another dataframe using a fuzzy match on a 3rd column?

huangapple go评论97阅读模式
英文:

How, in R, would I replace String Values in one column of a dataframe with string values from another dataframe using a fuzzy match on a 3rd column?

问题

我的Google搜索似乎失败了,因为我觉得之前一定有人遇到过这个问题,而我找不到解决方案。

假设我有以下数据:

  1. df1 <- data.frame(Name = c("Banana", "Apple", "Sponge", "Donut", "Bleach", "Wine"),
  2. Value = c("Fruit", "Fruit", "Cleaner", "Dessert", "Cleaner", "Party"))
  3. df2 <- data.frame(Name = c("Ban", "Ap", "Do", "Wi"),
  4. Value = c("F", "F", "D", "P"))

我需要能够对df1$Name与df2$Name进行模糊字符串匹配,如果它们匹配,将df1$Value替换为df2$Value,如果没有匹配,我想保留df1中的值。因此,我的输出应该如下所示:

Name Value
Banana F
Apple F
Sponge Cleaner
Donut D
Bleach Cleaner
Wine P

我希望任何以"Ban"开头的都变成"F"。我不关心它是否说"BananaKin"或df2$Name之后的任何内容。df1中的行数是成千上万,而df2只有67行。

此外,df1还有其他不相关的列,但我想保留。

我找到的最接近我想要的代码是:

  1. df3 <- df1 %>% mutate(across(c(Name), ~if_else(str_detect(Name, df2$Name), str_replace(Value, df2$Value))))

然而,我得到了以下错误:

Error in mutate():
In argument: across(...).
Caused by error in str_detect():
Can't recycle string (size 1362) to match pattern (size 67).
Run rlang::last_trace() to see where the error occurred.

我使用以下代码也得到了相同的错误:

df1$Value <- str_replace_all(df1$Value, df1$Name == df2$Name, df2$Value)

我还尝试过:

  1. df1 <- df1 %>% mutate(across(everything(), ~deframe(df2[.])))

我漏掉了什么?我觉得可能是我没有看到的简单问题。我相对新手,非常感谢帮助!

英文:

My Google-fu must be failing as I feel someone must've had this problem before and I can't find a solution.

Let's say I have the following data:

  1. df1 &lt;- data.frame(Name = c(&quot;Banana&quot;, &quot;Apple&quot;, &quot;Sponge&quot;, &quot;Donut&quot;, &quot;Bleach&quot;, &quot;Wine&quot;),
  2. Value = c(&quot;Fruit&quot;, &quot;Fruit&quot;, &quot;Cleaner&quot;, &quot;Dessert&quot;, &quot;Cleaner&quot;, &quot;Party&quot;))
  3. df2 &lt;- data.frame(Name = c(&quot;Ban&quot;, &quot;Ap&quot;, &quot;Do&quot;, &quot;Wi&quot;),
  4. Value = c(&quot;F&quot;, &quot;F&quot;, &quot;D&quot;, &quot;P&quot;))

I need to be able to do a fuzzy string match on df1$Name with df2$Name, and where they match replace df1$Value with df2$Value where $Name fuzzy matches. If there is no match, I want to retain the value from df1. Such that, my output would look like this:

Name Value
Banana F
Apple F
Sponge Cleaner
Donut D
Bleach Cleaner
Wine P

I want anything starting with "Ban" to be F. I don't care if it says Bananakin or anything after df2$Name. The number of rows in df1 is in the thousands, df2 is merely 67.

Further, df1 has additional columns that aren't relevant but I would like to keep.

The closest code I've found to what I want is this:

  1. df3 &lt;- df1 %&gt;% mutate(across(c(Name), ~if_else(str_detect(Name, df2$Name), str_replace(Value .,df2$Name)))

However I get the following error:

> Error in mutate():
>ℹ In argument: across(...).
>Caused by error in str_detect():
>! Can't recycle string (size 1362) to match pattern (size 67).
>Run rlang::last_trace() to see where the error occurred.

I get the same error for this bit of code as well:

>df1$Value <- str_replace_all(df1$Value, df1$Name == df2$Name, df2$Value)

I've also tried:

  1. df1 &lt;- df1 %&gt;% mutate(across(everything(), ~deframe(df2[.])))

What am I missing? I have a feeling it's something simple I'm just not seeing. I'm relatively new to R, and would greatly appreciate the help!

答案1

得分: 0

  1. df1 %>%
  2. mutate(Name1 = Name %>%
  3. str_replace_all(set_names(df2$Value, str_c(df2$Name, ".*")),
  4. Value = ifelse(Name == Name1, Value, Name1),
  5. Name1 = NULL)
  6. Name Value
  7. 1 Banana F
  8. 2 Apple F
  9. 3 Sponge Cleaner
  10. 4 Donut D
  11. 5 Bleach Cleaner
  12. 6 Wine P
  13. ----
  14. fuzzyjoin::regex_left_join(df1, df2, 'Name') %>%
  15. reframe(Name = Name.x, Value = coalesce(Value.y, Value.x))
  16. Name Value
  17. 1 Banana F
  18. 2 Apple F
  19. 3 Sponge Cleaner
  20. 4 Donut D
  21. 5 Bleach Cleaner
  22. 6 Wine P
英文:
  1. df1 %&gt;%
  2. mutate(Name1 = Name %&gt;%
  3. str_replace_all(set_names(df2$Value, str_c(df2$Name, &quot;.*&quot;))),
  4. Value = ifelse(Name == Name1, Value, Name1),
  5. Name1 = NULL)
  6. Name Value
  7. 1 Banana F
  8. 2 Apple F
  9. 3 Sponge Cleaner
  10. 4 Donut D
  11. 5 Bleach Cleaner
  12. 6 Wine P

  1. fuzzyjoin::regex_left_join(df1, df2, &#39;Name&#39;) %&gt;%
  2. reframe(Name = Name.x, Value = coalesce(Value.y, Value.x))
  3. Name Value
  4. 1 Banana F
  5. 2 Apple F
  6. 3 Sponge Cleaner
  7. 4 Donut D
  8. 5 Bleach Cleaner
  9. 6 Wine P

huangapple
  • 本文由 发表于 2023年6月16日 08:11:43
  • 转载请务必保留本文链接:https://go.coder-hub.com/76486221.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定