Creating new column after join based on presence/absence of variable in R

Question

I have two data frames: one with all patients (df1) and one with the subset of those patients who are getting additional treatment (df2).

I want to join the two data frames, keeping all rows in df1 and adding a new 0/1 column that indicates whether the patient is not getting the additional treatment (0) or is getting it (1). There shouldn't be any patients in df2 who are not in df1, but if there are, I want to catch them!

It's important that I keep all of the rows in df1, as I want to compare results between df1 and df2.

Simplified example:

df1

id    date
0001  2/2/22
0002  1/1/22
0002  3/2/22
0003  3/2/22
0003  5/2/22
0004  6/7/22
0004  9/2/22

df2

id
0002
0004

Desired output:

id    date    tx
0001  2/2/22  0
0002  1/1/22  1
0002  3/2/22  1
0003  3/2/22  0
0003  5/2/22  0
0004  6/7/22  1
0004  9/2/22  1

I know that I start with a join based on id, but I'm not sure of the most elegant way to create the new column based on the absence/presence of data in df2.

library(dplyr)

df1 %>%
  left_join(df2, by = "id")

I found this other question, which is trying to do something similar, but I don't see how they joined the two datasets:

https://stackoverflow.com/questions/68528437/create-a-presence-absence-column-based-on-presence-records

Answer 1

Score: 1

In this case you can simply create a logical vector in base R; there is no need to join or merge. Here df plays the role of df1 from the question (see Data below):

df$tx <- +(df$id %in% df2$id)

#    id   date tx
#1 0001 2/2/22  0
#2 0002 1/1/22  1
#3 0002 3/2/22  1
#4 0003 3/2/22  0
#5 0003 5/2/22  0
#6 0004 6/7/22  1
#7 0004 9/2/22  1

The +(...) converts the result to 1/0. Without it, df$tx <- df$id %in% df2$id returns TRUE/FALSE.
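To make the coercion step concrete, here is a small sketch on a standalone logical vector. The names flag, plus_version and int_version are made up for illustration; as.integer() is shown only as a more explicit alternative and is not part of the original answer.

```r
# A logical vector, like the result of df$id %in% df2$id
flag <- c(FALSE, TRUE, TRUE, FALSE)

# Unary plus coerces logical to integer: TRUE becomes 1L, FALSE becomes 0L
plus_version <- +flag

# as.integer() performs the same coercion and states the intent explicitly
int_version <- as.integer(flag)

print(plus_version)
print(identical(plus_version, int_version))
```

Either form should work for the tx column; the unary plus is just shorter.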

Data

df <- structure(list(id = c("0001", "0002", "0002", "0003", "0003", 
"0004", "0004"), date = c("2/2/22", "1/1/22", "3/2/22", "3/2/22", 
"5/2/22", "6/7/22", "9/2/22")), row.names = c(NA, -7L), class = "data.frame")

df2 <- structure(list(id = c("0002", "0004")), class = "data.frame", row.names = c(NA, 
-2L))
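The question also asks to catch any patients in df2 who are missing from the full data frame, which the %in% approach alone would silently ignore. One possible sketch uses setdiff() in base R. It rebuilds the same data as above, but adds a made-up id ("0009") to df2 purely so the check has something to find; the name unmatched is also illustrative.

```r
# Same patient data as in the answer
df <- data.frame(
  id   = c("0001", "0002", "0002", "0003", "0003", "0004", "0004"),
  date = c("2/2/22", "1/1/22", "3/2/22", "3/2/22", "5/2/22", "6/7/22", "9/2/22"),
  stringsAsFactors = FALSE
)

# "0009" is a hypothetical id, added only to demonstrate the check
df2 <- data.frame(id = c("0002", "0004", "0009"), stringsAsFactors = FALSE)

# Flag treated patients; every row of df is kept
df$tx <- +(df$id %in% df2$id)

# ids present in df2 but absent from df
unmatched <- setdiff(df2$id, df$id)

if (length(unmatched) > 0) {
  message("ids in df2 not found in df: ", paste(unmatched, collapse = ", "))
}
```

With the original df2 (without the extra id), unmatched would be empty and no message would be printed.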

Posted by huangapple on 2023-08-09 03:13:42
Source: https://go.coder-hub.com/76862586.html