Creating new column after join based on presence/absence of variable in R
Question
I have two dfs: one with all patients (df1) and one with a subset of those patients who are getting additional treatment (df2).
I want to join the two dfs, keeping all rows in df1 and adding a new column for “0/1” if the patient is not getting the additional treatment (0) or is getting the additional treatment (1). There shouldn’t be any issues of patients in df2 who are not in df1, but if there are, I want to catch them!
It's important that I keep all of the rows in df1, as I want to compare results between df1 and df2.
Simplified example:
df1
id date
0001 2/2/22
0002 1/1/22
0002 3/2/22
0003 3/2/22
0003 5/2/22
0004 6/7/22
0004 9/2/22
df2
id
0002
0004
Desired output:
id date tx
0001 2/2/22 0
0002 1/1/22 1
0002 3/2/22 1
0003 3/2/22 0
0003 5/2/22 0
0004 6/7/22 1
0004 9/2/22 1
I know that I start with a join based on ID, but I'm not sure of the most elegant way to create the new column based on the absence/presence of data in df2.
df1 %>%
  left_join(df2, by = "id")
I found this other question which is trying to do something similar, but I don't see how they joined the two datasets.
Answer 1
Score: 1
In this case, you can simply create a logical vector in base R; there is no need to join, merge, etc. (df below stands in for the question's df1):
df$tx <- +(df$id %in% df2$id)
# id date tx
#1 0001 2/2/22 0
#2 0002 1/1/22 1
#3 0002 3/2/22 1
#4 0003 3/2/22 0
#5 0003 5/2/22 0
#6 0004 6/7/22 1
#7 0004 9/2/22 1
The +(...) makes it 1/0; just doing df$tx <- df$id %in% df2$id will return TRUE/FALSE.
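The question also asks to catch any ids in df2 that are missing from the full patient list, which the one-liner above does not report. A minimal base R sketch of how that check might look follows, assuming the same toy data as in this answer; the names in_df2 and unmatched are made up for illustration, and as.integer() is used here as a more explicit stand-in for the unary +.

```r
# Toy data mirroring the answer's df and df2 (ids kept as character strings)
df <- data.frame(
  id   = c("0001", "0002", "0002", "0003", "0003", "0004", "0004"),
  date = c("2/2/22", "1/1/22", "3/2/22", "3/2/22", "5/2/22", "6/7/22", "9/2/22"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(id = c("0002", "0004"), stringsAsFactors = FALSE)

# Logical vector with one element per row of df: TRUE where the id occurs in df2
in_df2 <- df$id %in% df2$id

# as.integer() turns TRUE/FALSE into 1/0, the same values that +(in_df2) gives
df$tx <- as.integer(in_df2)

# ids present in df2 but absent from df (hypothetical helper variable)
unmatched <- setdiff(df2$id, df$id)
if (length(unmatched) > 0) {
  warning("ids in df2 not found in df: ", paste(unmatched, collapse = ", "))
}
```

With the toy data every id in df2 also occurs in df, so unmatched is empty and no warning is raised.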
Data
df <- structure(list(id = c("0001", "0002", "0002", "0003", "0003",
"0004", "0004"), date = c("2/2/22", "1/1/22", "3/2/22", "3/2/22",
"5/2/22", "6/7/22", "9/2/22")), row.names = c(NA, -7L), class = "data.frame")
df2 <- structure(list(id = c("0002", "0004")), class = "data.frame", row.names = c(NA,
-2L))
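For comparison, the join the question started with can also produce the flag. This is only a sketch, assuming dplyr is installed and using the same toy data; flagged is an illustrative name. The idea is to give df2 a tx column of 1s before joining, so unmatched rows come back as NA and can be filled with 0. The distinct() call is a precaution so that a repeated id in df2 would not duplicate rows of df.

```r
library(dplyr)

# Same toy data as above
df <- data.frame(
  id   = c("0001", "0002", "0002", "0003", "0003", "0004", "0004"),
  date = c("2/2/22", "1/1/22", "3/2/22", "3/2/22", "5/2/22", "6/7/22", "9/2/22"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(id = c("0002", "0004"), stringsAsFactors = FALSE)

flagged <- df %>%
  # Mark every treated id with tx = 1 before joining
  left_join(df2 %>% distinct(id) %>% mutate(tx = 1L), by = "id") %>%
  # Rows of df without a match get NA from the join; replace it with 0
  mutate(tx = coalesce(tx, 0L))
```

Because left_join() keeps every row of its left-hand table, flagged has the same seven rows as df, in the same order. The %in% approach gives the same tx column without the extra dependency.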