Merging two dataframes with the same variable names and (mostly) different NAs

Question

I have three dataframes. Using the dummy data below, they are set up as follows:

  • df has an ID variable and a number of additional variables

  • df1 has an ID variable that matches df, plus columns named varX_J, where X
    is a two-digit string from 00 to 19 and J describes the variable. Every
    column name starts with the same three letters (var)

  • df2 is the same as df1, with different information.

I need to merge df1 and df2 into df while combining the data in the shared columns. df1 and df2 contain the same observations, but they should hold different information (e.g., if var09_married has a value for ID 1 in df1, the same cell in df2 should be empty). However, the data is messy, and there are probably places where this isn't true.

To create this dummy data, I have the following script:

library('dplyr')

df <- data.frame(id = c(1:20),
                 og_var1 = sample(c(1:50), 20, replace=TRUE),
                 state = sample(c(1:52), 20, replace=TRUE),
                 race = sample(c(1:5), 20, replace=TRUE)
                 )

df1 <- left_join(data.frame(id = (1:20)), data.frame(
                  id = c(3,6,9,12),
                  var09_married = c(1,NA,2,1),
                  var09_happiness = c(1,NA,3,2),
                  var10_married = c(NA,1,2,2),
                  var10_happiness = c(NA,5,2,5)), by=c("id"))

df2 <- left_join(data.frame(id = (1:20)), data.frame(
                  id = c(3,6,11,15),
                  var09_married = c(NA,1,1,1),
                  var09_happiness = c(NA,3,3,2),
                  var10_married = c(1,NA,2,1),
                  var10_happiness = c(2,NA,4,4)), by=c("id"))


df <- left_join(df, df1, by=c("id"))
df <- left_join(df, df2, by=c("id"))

What I want is to merge this information together without duplicating the columns. If df1 and df2 both have information in the same place (e.g., id 3 has information for var10 in both df1 and df2), then I want to keep the information from df1 in the final dataframe. I'd also like to create a flag marking rows where information was dropped this way. So the final dataframe should look like:

dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20), og_var1 = c(6L, 4L, 33L, 7L, 
37L, 16L, 34L, 42L, 37L, 37L, 39L, 41L, 24L, 33L, 30L, 2L, 20L, 
29L, 33L, 47L), state = c(2L, 35L, 11L, 14L, 16L, 16L, 40L, 39L, 
28L, 13L, 5L, 26L, 28L, 15L, 13L, 31L, 43L, 25L, 16L, 28L), race = c(5L, 
4L, 2L, 1L, 1L, 2L, 3L, 2L, 2L, 4L, 2L, 3L, 5L, 2L, 3L, 2L, 5L, 
1L, 5L, 5L), var09_married = c(NA, NA, 1, NA, NA, 1, NA, NA, 
2, NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, NA), var09_happiness = c(NA, 
NA, 1, NA, NA, 3, NA, NA, 3, NA, 3, 2, NA, NA, 2, NA, NA, NA, 
NA, NA), var10_married = c(NA, NA, 1, NA, NA, 1, NA, NA, 2, NA, 
2, 2, NA, NA, 1, NA, NA, NA, NA, NA), var10_happiness = c(NA, 
NA, 2, NA, NA, 5, NA, NA, 2, NA, 4, 5, NA, NA, 4, NA, NA, NA, 
NA, NA), flag = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0)), row.names = c(NA, -20L), class = "data.frame")

Answer 1

Score: 2

Try rows_patch instead of joining:

library(dplyr)

# Assumes df is the original frame, i.e. before the two left_join() calls
# at the end of the question's script
df1 |>
  rows_patch(df2, by = "id") |>
  right_join(df, by = "id")

From the documentation ?rows_patch:

> works like rows_update() but only overwrites NA values

This means that any values already in df1 are kept. Where df1 has NA and df2 has a value, the NA is "patched" (i.e. replaced with the value from df2).
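To make the patching behaviour concrete, here is a minimal sketch on toy data. The frames a and b and the column v are hypothetical and not part of the question's data; the sketch assumes dplyr 1.0.0 or later, where rows_patch is available.

```r
library(dplyr)

# Hypothetical toy frames (not the question's data); both share the same ids
a <- data.frame(id = 1:3, v = c(10, NA, NA))
b <- data.frame(id = 1:3, v = c(99, 20, NA))

# Only the NA cells of `a` are filled from `b`; existing values in `a` win
patched <- rows_patch(a, b, by = "id")

# id 1 keeps 10 (b's 99 is ignored), id 2 becomes 20, id 3 stays NA
print(patched)
```

Note that by default rows_patch expects every key in the second frame to exist in the first. That holds in the question because df1 and df2 both contain ids 1 to 20.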

However, depending on your needs, you might consider rows_update instead, which overwrites the values of every matched row in df1 with the values from df2, including any NA values in df2.
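The two functions only disagree in cells where both frames hold a value, and those are also the cells the question wants flagged. The sketch below contrasts the two on hypothetical toy frames (a, b and v are made up for illustration) and adds a flag column. The flag definition used here, "both frames have a non-NA value in at least one shared column", is an assumption about what the question means by dropped information, not something the answer itself provides.

```r
library(dplyr)

# Hypothetical toy frames (not the question's data); both share the same ids
a <- data.frame(id = 1:3, v = c(10, NA, 30))
b <- data.frame(id = 1:3, v = c(99, 20, NA))

# rows_update(): every value from b overwrites a, including b's NA
updated <- rows_update(a, b, by = "id")

# rows_patch(): only a's NA cells are filled from b
patched <- rows_patch(a, b, by = "id")

# Flag rows where both frames hold a non-NA value in a shared column,
# i.e. rows where a value from b was not used by rows_patch()
value_cols <- setdiff(intersect(names(a), names(b)), "id")
b_aligned <- b[match(a$id, b$id), , drop = FALSE]  # align b's rows to a's ids
both_present <- !is.na(a[value_cols]) & !is.na(b_aligned[value_cols])
patched$flag <- as.integer(rowSums(both_present) > 0)

print(updated)
print(patched)
```

If only genuinely conflicting values should be flagged, the both_present test could additionally require that the two values differ.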

huangapple
  • Posted on May 24, 2023, 22:56:34
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76324874.html