dplyr solution to exact and partial-string join

Question


I need to join two datasets by two columns, one column by an exact match and the other one by a partial match:

    pf <- data.frame('exact' = c('s1', 's2', 's3', 's2', 's4'), 'id_part' = c('a', 'a', 'a', 'b', 'c'), 'value' = c(1, 2, 3, 4, 5))
    > pf
      exact id_part value
    1    s1       a     1
    2    s2       a     2
    3    s3       a     3
    4    s2       b     4
    5    s4       c     5

and

    cj <- data.frame('exact' = c('s1', 's1', 's4', 's2', 's4'), 'id_part' = c('saf', 'r2a@ff', 'k5-a', '6b4-d', 'ab1'))
    > cj
      exact id_part
    1    s1     saf
    2    s1  r2a@ff
    3    s4    k5-a
    4    s2   6b4-d
    5    s4     ab1

The desired outcome is a dataset with the same rows as cj plus an additional column with values from pf, such that if pf$exact == cj$exact and pf$id_part is a substring of cj$id_part, then output$value <- pf$value, else output$value <- 0:

    > output
      exact id_part value
    1    s1     saf     1
    2    s1  r2a@ff     1
    3    s4    k5-a     0
    4    s2   6b4-d     4
    5    s4     ab1     0

That is, an exact match on the exact column and a partial match on the id_part column. I tried stringdist_inner_join(cj, pf, by = c('exact', 'id_part'), method = 'lv') and similar, but got nowhere.

Any help appreciated.

Answer 1

Score: 0


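You can use regex_left_join() from the fuzzyjoin package. Called without a by argument, it joins on the columns the two data frames share (exact and id_part) and treats each pf value as a regular expression tested against the corresponding cj value, so 'a' matches 'saf'. Rows of cj with no match get NA, which replace_na() turns into 0. Note that the exact column is also matched as a regex here, so a key like 's1' would also match 's10'; that does not occur in this data.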

    library(fuzzyjoin)
    library(dplyr)
    library(tidyr)

    regex_left_join(cj, pf) |>
      mutate(value = replace_na(value, 0))
    #   exact.x id_part.x exact.y id_part.y value
    # 1      s1       saf      s1         a     1
    # 2      s1    r2a@ff      s1         a     1
    # 3      s4      k5-a    <NA>      <NA>     0
    # 4      s2     6b4-d      s2         b     4
    # 5      s4       ab1    <NA>      <NA>     0
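
If pulling in fuzzyjoin is not an option, the rule from the question can also be sketched in base R. This is not the answerer's method, just an illustration under two assumptions: exact is compared with strict equality, and pf$id_part is treated as a literal substring rather than a regex. The helper name lookup_value is hypothetical, and when several pf rows match, this sketch keeps only the first one.

```r
# Example data from the question.
pf <- data.frame(exact = c("s1", "s2", "s3", "s2", "s4"),
                 id_part = c("a", "a", "a", "b", "c"),
                 value = c(1, 2, 3, 4, 5),
                 stringsAsFactors = FALSE)
cj <- data.frame(exact = c("s1", "s1", "s4", "s2", "s4"),
                 id_part = c("saf", "r2a@ff", "k5-a", "6b4-d", "ab1"),
                 stringsAsFactors = FALSE)

# lookup_value() is a hypothetical helper, not part of any package.
# For one cj row it returns the value of the first pf row whose `exact`
# is identical and whose `id_part` occurs literally inside the cj `id_part`;
# it returns 0 when no pf row qualifies.
lookup_value <- function(exact, id_part) {
  same_key  <- pf$exact == exact
  contained <- vapply(pf$id_part, grepl, logical(1),
                      x = id_part, fixed = TRUE, USE.NAMES = FALSE)
  hit <- which(same_key & contained)
  if (length(hit) > 0) pf$value[hit[1]] else 0
}

# Keep cj as-is and add the looked-up column.
output <- cj
output$value <- mapply(lookup_value, cj$exact, cj$id_part, USE.NAMES = FALSE)
output$value
# [1] 1 1 0 4 0
```

Because the result keeps only the cj columns plus value, it already has the three-column shape asked for in the question, whereas the regex_left_join() result carries the .x/.y columns and would need a select() or rename step to get there.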

Posted by huangapple on June 5, 2023, 22:20:43
Source link: https://go.coder-hub.com/76407391.html