April 11, 2023, 08:48:29

Trying to compare two dataframes with different rows and columns in R

Question

I am trying to compare two data frames in R that have different columns and different numbers of rows.
I need the rows that are the same in both to go into df3, and any rows that differ in some row or column value to go into df4. In my example, the row with id F has the same col1 and col2 in both tables, but the other columns do not match.

Below is what my dataset looks like:

set.seed(22)

df1 <- data.frame(id=sample(LETTERS, 9, FALSE), col1=sample(0:2, 9, TRUE),
                  col2 = sample(0:2, 9, TRUE))
df2 <- data.frame(id=sample(LETTERS, 17, FALSE), col1=sample(0:2, 17, TRUE),
                  col2 = sample(0:2, 17, TRUE),
                  col6 = sample(0:2, 17, TRUE))

df1

df2

I've read many solutions but haven't found a concise one yet. Any suggestions? Any help is much appreciated.

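Answer 1

Score: 1

You can use generics::intersect() to find the common values and generics::setdiff() to find the different values. Note that you need to specify the generics package to get the result in the format you want.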

df3 <- generics::intersect(df1, df2[,1:3])
#   id col1 col2
# 1  F    1    0
# 2  K    0    2

df4 <- generics::setdiff(df1, df2[,1:3])
#   id col1 col2
# 1  I    1    2
# 2  X    2    0
# 3  J    0    2
# 4  L    0    1
# 5  Q    1    0
# 6  E    1    0
# 7  C    1    0
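
For readers who would rather not depend on an extra package, the same idea can be sketched in base R by building one string key per row and comparing the keys with %in%. This is a different technique from the set functions used in this answer. The data frames a and b, their values, and the "|" separator are illustrative assumptions, not the question's random data.

```r
# Small hand-made data frames (illustrative values, not the question's data)
a <- data.frame(id = c("F", "K", "I"), col1 = c(1, 0, 1), col2 = c(0, 2, 2))
b <- data.frame(id = c("F", "K", "Y"), col1 = c(1, 0, 1), col2 = c(0, 2, 0),
                col6 = c(2, 2, 2))

# Compare only on the columns that a has: id, col1, col2
key_cols <- names(a)

# One string key per row; "|" is an assumed separator that does not
# occur in the data, so distinct rows cannot collapse to the same key
key_a <- do.call(paste, c(a[key_cols], sep = "|"))
key_b <- do.call(paste, c(b[key_cols], sep = "|"))

common <- a[key_a %in% key_b, ]    # rows of a that also appear in b
only_a <- a[!key_a %in% key_b, ]   # rows of a with no match in b
```

Here common holds the F and K rows and only_a holds the I row. Swapping the roles of a and b gives the rows that exist only in b.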

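Answer 2

Score: 1

If you're using tidyverse or dplyr, you can use semi_join and anti_join. Specify the columns to compare on with the by parameter: by = c("id", "col1", "col2"). (You can leave by unspecified, and *_join will compare using all matching column names, but this is better avoided.)

semi_join returns all rows from the first data frame that have a match in the second data frame: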

library(dplyr)
# Or:
# library(tidyverse)

# We'll pass df2 as the first argument, to preserve the values in `col6`.
# Swap the order of `df2` and `df1` to drop column `col6`.
df3 <- semi_join(df2, df1, by = c("id", "col1", "col2"))

df3
#   id col1 col2 col6
# 1  K    0    2    2
# 2  F    1    0    2
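
The effect of the argument order can be checked on a tiny example. This is a minimal sketch, assuming dplyr is installed; the data frames a and b and the keys vector are made up for illustration.

```r
library(dplyr)

# Small hand-made data frames (illustrative values only)
a <- data.frame(id = c("F", "K", "I"), col1 = c(1, 0, 1), col2 = c(0, 2, 2))
b <- data.frame(id = c("F", "K", "Y"), col1 = c(1, 0, 1), col2 = c(0, 2, 0),
                col6 = c(2, 2, 2))

keys <- c("id", "col1", "col2")

# semi_join keeps the columns of its first argument only
with_col6    <- semi_join(b, a, by = keys)  # matching rows, col6 kept
without_col6 <- semi_join(a, b, by = keys)  # matching rows, no col6
```

Both results contain the F and K rows; only with_col6 carries the col6 column.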

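anti_join returns all rows from the first data frame that have no match in the second data frame. This one is a bit trickier, because we only get the rows in the first data frame that are missing from the second. To get the rows that are present in either data frame but missing from the other, we need to perform the join twice: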

library(dplyr)
# Or:
# library(tidyverse)

df4_a <- anti_join(df1, df2, by = c("id", "col1", "col2"))
df4_b <- anti_join(df2, df1, by = c("id", "col1", "col2"))

df4 <- bind_rows(df4_a, df4_b)

df4
#    id col1 col2 col6
# 1   I    1    2   NA
# 2   X    2    0   NA
# 3   J    0    2   NA
# 4   L    0    1   NA
# 5   Q    1    0   NA
# 6   E    1    0   NA
# 7   C    1    0   NA
# 8   Y    1    0    2
# 9   T    2    1    0
# 10  P    2    0    1
# 11  A    1    2    1
# 12  R    2    0    0
# 13  V    2    1    1
# 14  M    0    0    1
# 15  S    1    2    2
# 16  O    0    0    2
# 17  B    2    0    0
# 18  U    0    1    0
# 19  W    1    1    2
# 20  G    1    2    1
# 21  H    2    1    0
# 22  C    2    0    1

Also, you can get df4 more concisely if you don't store the intermediate results:

library(dplyr)
# Or:
# library(tidyverse)

df4 <- bind_rows(
  anti_join(df1, df2, by = c("id", "col1", "col2")),
  anti_join(df2, df1, by = c("id", "col1", "col2"))
)

df4
#    id col1 col2 col6
# 1   I    1    2   NA
# 2   X    2    0   NA
# 3   J    0    2   NA
# 4   L    0    1   NA
# 5   Q    1    0   NA
# 6   E    1    0   NA
# 7   C    1    0   NA
# 8   Y    1    0    2
# 9   T    2    1    0
# 10  P    2    0    1
# 11  A    1    2    1
# 12  R    2    0    0
# 13  V    2    1    1
# 14  M    0    0    1
# 15  S    1    2    2
# 16  O    0    0    2
# 17  B    2    0    0
# 18  U    0    1    0
# 19  W    1    1    2
# 20  G    1    2    1
# 21  H    2    1    0
# 22  C    2    0    1
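
In the combined result it is not obvious which data frame each unmatched row came from, other than by looking for NA in col6. One possible variation, sketched here on made-up data, is to name the arguments to bind_rows and use its .id parameter to record the origin. The names only_in_a, only_in_b, and source are arbitrary choices for this illustration.

```r
library(dplyr)

# Small hand-made data frames (illustrative values only)
a <- data.frame(id = c("F", "K", "I"), col1 = c(1, 0, 1), col2 = c(0, 2, 2))
b <- data.frame(id = c("F", "K", "Y"), col1 = c(1, 0, 1), col2 = c(0, 2, 0),
                col6 = c(2, 2, 2))

keys <- c("id", "col1", "col2")

# Rows found in only one of the two data frames; the argument names
# become the values of the new "source" column
different <- bind_rows(
  only_in_a = anti_join(a, b, by = keys),
  only_in_b = anti_join(b, a, by = keys),
  .id = "source"
)
```

With this data, different has two rows: I labelled only_in_a (with NA in col6) and Y labelled only_in_b.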
