May 24, 2023 22:56:34

merging two dataframes with same variable names and (mostly) different na's

Question

I have a situation where I have three dataframes. Using the dummy data below, the dataframes are set up as follows:

- df has an ID variable and a number of additional variables.
- df1 has an ID variable to match df and information for varX_J, where X is 00:19 (as characters) and J is a description for the variable name. The first three letters stay the same (var) for all variables.
- df2 is the same as df1, with different information.

I need to merge df1 and df2 with df, while merging the data in the columns. df1 and df2 have the same observations. They should have different information (e.g., if there is a value for ID 1 in var09_married in df1, then there shouldn't be information in that same cell in df2). However, the data is messy and there are probably places where this isn't true.

To create this dummy data, I have the following script:

library('dplyr')
df <- data.frame(id = c(1:20),
                 og_var1 = sample(c(1:50), 20, replace=TRUE),
                 state = sample(c(1:52), 20, replace=TRUE),
                 race = sample(c(1:5), 20, replace=TRUE)
                 )
df1 <- left_join(data.frame(id = (1:20)), data.frame(
                  id = c(3,6,9,12),
                  var09_married = c(1,NA,2,1),
                  var09_happiness = c(1,NA,3,2),
                  var10_married = c(NA,1,2,2),
                  var10_happiness = c(NA,5,2,5)), by=c("id"))
df2 <- left_join(data.frame(id = (1:20)), data.frame(
                  id = c(3,6,11,15),
                  var09_married = c(NA,1,1,1),
                  var09_happiness = c(NA,3,3,2),
                  var10_married = c(1,NA,2,1),
                  var10_happiness = c(2,NA,4,4)), by=c("id"))
df <- left_join(df, df1, by=c("id"))
df <- left_join(df, df2, by=c("id"))

What I want is to merge this information together without duplicating the columns. If there is information in df1 and df2 in the same place (e.g., id3 has information for var10 in both df1 and df2), then I want to have the information from df1 in the final dataframe. But I'd also like to create a flag if this information is dropped. So the final dataframe should look like:

dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20), og_var1 = c(6L, 4L, 33L, 7L, 
37L, 16L, 34L, 42L, 37L, 37L, 39L, 41L, 24L, 33L, 30L, 2L, 20L, 
29L, 33L, 47L), state = c(2L, 35L, 11L, 14L, 16L, 16L, 40L, 39L, 
28L, 13L, 5L, 26L, 28L, 15L, 13L, 31L, 43L, 25L, 16L, 28L), race = c(5L, 
4L, 2L, 1L, 1L, 2L, 3L, 2L, 2L, 4L, 2L, 3L, 5L, 2L, 3L, 2L, 5L, 
1L, 5L, 5L), var09_married = c(NA, NA, 1, NA, NA, 1, NA, NA, 
2, NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, NA), var09_happiness = c(NA, 
NA, 1, NA, NA, 3, NA, NA, 3, NA, 3, 2, NA, NA, 2, NA, NA, NA, 
NA, NA), var10_married = c(NA, NA, 1, NA, NA, 1, NA, NA, 2, NA, 
2, 2, NA, NA, 1, NA, NA, NA, NA, NA), var10_happiness = c(NA, 
NA, 2, NA, NA, 5, NA, NA, 2, NA, 4, 5, NA, NA, 4, NA, NA, NA, 
NA, NA), flag = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0)), row.names = c(NA, -20L), class = "data.frame")

Answer 1

Score: 2

Try rows_patch instead of joining:

library(dplyr)
df1 |>
  rows_patch(df2, by = "id") |>
  right_join(df, by = "id")

From the documentation ?rows_patch:

> works like rows_update() but only overwrites NA values

This means that values already present in df1 are kept. Where df1 has NA and df2 has a value, that value is "patched" in (i.e. filled from df2).
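To make the patching behaviour concrete, here is a minimal sketch on toy data. The frames x and y and the column score are hypothetical, chosen only for illustration; they are not the question's data.

```r
library(dplyr)

# Hypothetical toy frames (not the question's data); ids match one-to-one.
x <- data.frame(id = 1:3, score = c(10, NA, 30))
y <- data.frame(id = 1:3, score = c(99, 20, NA))

# Only the NA in x (id 2) is filled from y; the 10 and 30 already in x are kept.
patched <- rows_patch(x, y, by = "id")
print(patched)
```

Note that rows_patch expects every key in y to exist in x by default, which holds here and in the question's dummy data because both frames carry all ids.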

However, depending on your needs, you might consider using rows_update instead, which overwrites the values in a matched row of df1 with the values from the corresponding row of df2.
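The question also asks for a flag when a value from df2 is dropped, which rows_patch does not produce on its own. One possible sketch, under the assumption that "dropped" means both frames hold a non-NA value in the same cell, and that both frames list the same ids in the same order. The toy data and the names value_cols and both_filled are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: id 2 has var09_married filled in both frames.
df1 <- data.frame(id = 1:4,
                  var09_married = c(1, 2, NA, NA),
                  var10_married = c(NA, 1, NA, 2))
df2 <- data.frame(id = 1:4,
                  var09_married = c(NA, 1, 1, NA),
                  var10_married = c(2, NA, NA, NA))

# Assumption: rows line up by position, so cells can be compared directly.
stopifnot(identical(df1$id, df2$id))

# TRUE wherever both frames have a value in the same cell.
value_cols <- setdiff(names(df1), "id")
both_filled <- !is.na(df1[value_cols]) & !is.na(df2[value_cols])

# Keep df1's values, fill its NAs from df2, then flag rows with any overlap.
merged <- rows_patch(df1, df2, by = "id")
merged$flag <- as.integer(rowSums(both_filled) > 0)
print(merged)
```

If the flag should only mark cells where the two values actually differ, the comparison would need to be tightened accordingly. The result can then be joined onto the original df by "id" as in the answer above.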
