Merging two dataframes with the same variable names and (mostly) different NAs
Question
I have a situation where I have three dataframes. Using the dummy data below, the dataframes are set up as follows:
- df has an ID variable and a number of additional variables.
- df1 has an ID variable to match df and information for varX_J, where X is 00:19 (as characters) and J is a description for the variable name. The first three letters (var) stay the same for all variables.
- df2 is the same as df1, with different information.
I need to merge df1 and df2 with df, while merging the data in the columns. df1 and df2 have the same observations. They should have different information (e.g., if there is a value for ID 1 in var09_married in df1, then there shouldn't be information in that same cell in df2). However, the data is messy and there are probably places where this isn't true.
To create this dummy data, I have the following script:
library('dplyr')

# Base table: one row per id plus some unrelated variables
df <- data.frame(
  id      = c(1:20),
  og_var1 = sample(c(1:50), 20, replace = TRUE),
  state   = sample(c(1:52), 20, replace = TRUE),
  race    = sample(c(1:5), 20, replace = TRUE)
)

# df1: all 20 ids, values only for ids 3, 6, 9, 12
df1 <- left_join(data.frame(id = (1:20)), data.frame(
  id              = c(3, 6, 9, 12),
  var09_married   = c(1, NA, 2, 1),
  var09_happiness = c(1, NA, 3, 2),
  var10_married   = c(NA, 1, 2, 2),
  var10_happiness = c(NA, 5, 2, 5)), by = c("id"))

# df2: same columns, values only for ids 3, 6, 11, 15
df2 <- left_join(data.frame(id = (1:20)), data.frame(
  id              = c(3, 6, 11, 15),
  var09_married   = c(NA, 1, 1, 1),
  var09_happiness = c(NA, 3, 3, 2),
  var10_married   = c(1, NA, 2, 1),
  var10_happiness = c(2, NA, 4, 4)), by = c("id"))

# Joining both tables duplicates every var column (.x / .y suffixes)
df <- left_join(df, df1, by = c("id"))
df <- left_join(df, df2, by = c("id"))
What I want is to merge this information together without duplicating the columns. If there is information in df1 and df2 in the same place (e.g., id 3 has information for var10 in both df1 and df2), then I want to keep the information from df1 in the final dataframe. I'd also like to create a flag indicating whether any information was dropped this way. So the final dataframe should look like:
dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), og_var1 = c(6L, 4L, 33L, 7L,
37L, 16L, 34L, 42L, 37L, 37L, 39L, 41L, 24L, 33L, 30L, 2L, 20L,
29L, 33L, 47L), state = c(2L, 35L, 11L, 14L, 16L, 16L, 40L, 39L,
28L, 13L, 5L, 26L, 28L, 15L, 13L, 31L, 43L, 25L, 16L, 28L), race = c(5L,
4L, 2L, 1L, 1L, 2L, 3L, 2L, 2L, 4L, 2L, 3L, 5L, 2L, 3L, 2L, 5L,
1L, 5L, 5L), var09_married = c(NA, NA, 1, NA, NA, 1, NA, NA,
2, NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, NA), var09_happiness = c(NA,
NA, 1, NA, NA, 3, NA, NA, 3, NA, 3, 2, NA, NA, 2, NA, NA, NA,
NA, NA), var10_married = c(NA, NA, 1, NA, NA, 1, NA, NA, 2, NA,
2, 2, NA, NA, 1, NA, NA, NA, NA, NA), var10_happiness = c(NA,
NA, 2, NA, NA, 5, NA, NA, 2, NA, 4, 5, NA, NA, 4, NA, NA, NA,
NA, NA), flag = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)), row.names = c(NA, -20L), class = "data.frame")
Answer 1
Score: 2
Try rows_patch() instead of joining:
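library(dplyr)

# df here is the original data frame, before the two left_join() calls in the question
df1 |>
  rows_patch(df2, by = "id") |>  # fill NA cells in df1 with values from df2
  right_join(df, by = "id")      # attach the remaining columns of df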
From the documentation (?rows_patch):
> works like rows_update() but only overwrites NA values
This means that if there are values in df1 they will remain. Where there are NA in df1 and values in df2, those values will be "patched" (i.e. updated with values from df2).
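To make the patching behaviour concrete, here is a minimal sketch on toy data. The data frames x and y and their columns a and b are made up for illustration and are not the question's tables.

```r
library(dplyr)

# Hypothetical toy data (not the question's data frames)
x <- data.frame(id = 1:3, a = c(1, NA, NA), b = c(NA, 5, NA))
y <- data.frame(id = 1:3, a = c(9, 2, NA), b = c(7, NA, NA))

# Only NA cells of x are filled from y; existing values in x win
patched <- rows_patch(x, y, by = "id")

patched$a  # 1 (kept, y's 9 is ignored), 2 (filled from y), NA
patched$b  # 7 (filled from y), 5 (kept), NA
```

Note that by default rows_patch() expects every key in y to exist in x, which holds for df1 and df2 in the question because both contain all 20 ids.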
However, depending on your needs you might consider using rows_update(), which will update the entire row in df1 with a row from df2 if matched.
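Neither call produces the flag the question asks for. One possible way to get it, as a rough sketch rather than part of the original answer, is to compare the two tables cell by cell before patching. The toy data, the helper names value_cols, y_aligned and both_present, and the rule "flag a row when both tables hold a non-missing value in the same cell" are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data with one deliberate conflict (id 2, column a)
x <- data.frame(id = 1:3, a = c(1, 4, NA), b = c(NA, 5, NA))
y <- data.frame(id = 1:3, a = c(NA, 2, 3), b = c(7, NA, NA))

# Value columns shared by both tables (everything except the key)
value_cols <- setdiff(intersect(names(x), names(y)), "id")

# Align y's rows to x's ids so cells can be compared position by position
y_aligned <- y[match(x$id, y$id), , drop = FALSE]

# Assumed rule: a cell conflicts when both tables have a non-NA value there
both_present <- !is.na(x[value_cols]) & !is.na(y_aligned[value_cols])

# Patch as in the answer, then mark rows where a value from y was discarded
merged <- rows_patch(x, y, by = "id")
merged$flag <- as.integer(rowSums(both_present) > 0)

merged$flag  # 0 1 0
```

With the question's data the same idea would apply with x = df1 and y = df2, followed by the right_join() onto the original df. This version also flags cells where both tables agree on the value; comparing the values as well would narrow the flag to genuine disagreements.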