尝试在R中比较两个具有不同行和列的数据框。

huangapple go评论56阅读模式
英文:

Trying to compare two dataframes with different rows and columns in R

问题

I am trying to compare two different dataframes which have different columns and rows in R.
Need to get the same data be df3, any row or column are different data be df4. In my example, id F, col1 and col2 in both two tables is the same, but other columns are not.

Below is what my dataset looks like:

set.seed(22)

df1 <- data.frame(id=sample(LETTERS, 9, FALSE), col1=sample(0:2, 9, TRUE),
                  col2 = sample(0:2, 9, TRUE))
df2 <- data.frame(id=sample(LETTERS, 17, FALSE), col1=sample(0:2, 17, TRUE),
                  col2 = sample(0:2, 17, TRUE),
                  col6 = sample(0:2, 17, TRUE))

df1
尝试在R中比较两个具有不同行和列的数据框。

df2
尝试在R中比较两个具有不同行和列的数据框。

I've read many solutions but have not yet found a concise solution, any suggestions out there? Any help is much appreciated.

英文:

I am trying to compare two different dataframes which have different columns and rows in R.
Need to get the same data be df3, any row or column are different data be df4.In my example, id F, col1 and col2 in both two tables is the same.but other cols are not.

Below is what my dataset looks like:

set.seed(22)

df1 &lt;- data.frame(id=sample(LETTERS, 9, FALSE), col1=sample(0:2, 9, TRUE),
                  col2 = sample(0:2, 9, TRUE))
df2 &lt;- data.frame(id=sample(LETTERS, 17, FALSE), col1=sample(0:2, 17, TRUE),
                  col2 = sample(0:2, 17, TRUE),
                  col6 = sample(0:2, 17, TRUE))

df1
尝试在R中比较两个具有不同行和列的数据框。

df2
尝试在R中比较两个具有不同行和列的数据框。

I've read many solutions but have not yet found a concise solution, any suggestions out there? Any help is much appreciated.

答案1

得分: 1

你可以使用 generics::intersect() 来查找共同的数值,以及 generics::setdiff() 来查找不同的数值。请注意,你需要指定 generics 包以获取所需的格式。

df3 &lt;- generics::intersect(df1, df2[,1:3])
  #    id col1 col2
  # 1  F    1    0
  # 2  K    0    2

df4 &lt;- generics::setdiff(df1, df2[,1:3])
  # id col1 col2
  #1  I    1    2
  #2  X    2    0
  #3  J    0    2
  #4  L    0    1
  #5  Q    1    0
  #6  E    1    0
  #7  C    1    0
英文:

You can use generics::intersect() to find the common values and generics::setdiff() to find the different values. Note you need to specify the generics package to get it in the format you want.

df3 &lt;- generics::intersect(df1, df2[,1:3])
  #    id col1 col2
  # 1  F    1    0
  # 2  K    0    2

df4 &lt;- generics::setdiff(df1, df2[,1:3])
  # id col1 col2
  #1  I    1    2
  #2  X    2    0
  #3  J    0    2
  #4  L    0    1
  #5  Q    1    0
  #6  E    1    0
  #7  C    1    0

答案2

得分: 1

如果您使用 tidyversedplyr,您可以使用 semi_joinanti_join。您需要指定要用于执行比较的列,使用参数 byby = c("id", "col1", "col2")。(您可以将 by 不指定,*_join 将使用所有匹配的列名执行比较,但最好避免这样做。)

semi_join 返回来自第一个数据框与第二个数据框中匹配的所有行:

library(dplyr)
# 或:
# library(tidyverse)

# 我们将将 df2 作为第一个参数传递,以保留"col6"中的值。
# 交换 "df2" 和 "df1" 的顺序以删除 "col6" 列。
df3 <- semi_join(df2, df1, by = c("id", "col1", "col2"))

df3
#   id col1 col2 col6
# 1  K    0    2    2
# 2  F    1    0    2

anti_join 返回来自第一个数据框中没有与第二个数据框中匹配的所有行。这个比较复杂一些,因为我们只会得到第一个数据框中在第二个数据框中缺失的行。要获取在任何数据框中存在但在另一个数据框中缺失的行,我们需要执行两次连接:

library(dplyr)
# 或:
# library(tidyverse)

df4_a <- anti_join(df1, df2, by = c("id", "col1", "col2"))
df4_b <- anti_join(df2, df1, by = c("id", "col1", "col2"))

df4 <- bind_rows(df4_a, df4_b)

df4
#    id col1 col2 col6
# 1   I    1    2   NA
# 2   X    2    0   NA
# 3   J    0    2   NA
# 4   L    0    1   NA
# 5   Q    1    0   NA
# 6   E    1    0   NA
# 7   C    1    0   NA
# 8   Y    1    0    2
# 9   T    2    1    0
# 10  P    2    0    1
# 11  A    1    2    1
# 12  R    2    0    0
# 13  V    2    1    1
# 14  M    0    0    1
# 15  S    1    2    2
# 16  O    0    0    2
# 17  B    2    0    0
# 18  U    0    1    0
# 19  W    1    1    2
# 20  G    1    2    1
# 21  H    2    1    0
# 22  C    2    0    1

此外,如果不存储中间结果,您还可以更简洁地获得 df4

library(dplyr)
# 或:
# library(tidyverse)

df4 <- bind_rows(
  anti_join(df1, df2, by = c("id", "col1", "col2")),
  anti_join(df2, df1, by = c("id", "col1", "col2"))
)

df4
#    id col1 col2 col6
# 1   I    1    2   NA
# 2   X    2    0   NA
# 3   J    0    2   NA
# 4   L    0    1   NA
# 5   Q    1    0   NA
# 6   E    1    0   NA
# 7   C    1    0   NA
# 8   Y    1    0    2
# 9   T    2    1    0
# 10  P    2    0    1
# 11  A    1    2    1
# 12  R    2    0    0
# 13  V    2    1    1
# 14  M    0    0    1
# 15  S    1    2    2
# 16  O    0    0    2
# 17  B    2    0    0
# 18  U    0    1    0
# 19  W    1    1    2
# 20  G    1    2    1
# 21  H    2    1    0
# 22  C    2    0    1

(Note: The code parts in the original text are not translated, as you requested.)

英文:

If you're using tidyverse or dplyr you can use semi_join and anti_join. You will need to specify the columns you want to use to perform the comparison using parameter by: by = c(&quot;id&quot;, &quot;col1&quot;, &quot;col2&quot;). (You can leave by unspecified and *_join will perform the comparison using all matching colnames, but this is better avoided.)

semi_join returns all rows from the first data.frame with a match in the second data.frame:

library(dplyr)
# Or:
# library(tidyverse)

# We&#39;ll pass df2 as the first argument, to preserve the values in `col6`.
# Swap the order of `df2` and `df1` to drop column `col6`.
df3 &lt;- semi_join(df2, df1, by = c(&quot;id&quot;, &quot;col1&quot;, &quot;col2&quot;))

df3
#   id col1 col2 col6
# 1  K    0    2    2
# 2  F    1    0    2

anti_join returns all rows from the first data.frame without a match in the second data.frame. This one is a bit trickier because we'll only get the rows in the first data.frame that are missing in the second. To get the rows that are present in any of the data.frames but missing in the other, we need to perform the join twice:

library(dplyr)
# Or:
# library(tidyverse)

df4_a &lt;- anti_join(df1, df2, by = c(&quot;id&quot;, &quot;col1&quot;, &quot;col2&quot;))
df4_b &lt;- anti_join(df2, df1, by = c(&quot;id&quot;, &quot;col1&quot;, &quot;col2&quot;))

df4 &lt;- bind_rows(df4_a, df4_b)

df4
#    id col1 col2 col6
# 1   I    1    2   NA
# 2   X    2    0   NA
# 3   J    0    2   NA
# 4   L    0    1   NA
# 5   Q    1    0   NA
# 6   E    1    0   NA
# 7   C    1    0   NA
# 8   Y    1    0    2
# 9   T    2    1    0
# 10  P    2    0    1
# 11  A    1    2    1
# 12  R    2    0    0
# 13  V    2    1    1
# 14  M    0    0    1
# 15  S    1    2    2
# 16  O    0    0    2
# 17  B    2    0    0
# 18  U    0    1    0
# 19  W    1    1    2
# 20  G    1    2    1
# 21  H    2    1    0
# 22  C    2    0    1

Also, you can get df4 more concisely if you don't store the intermediate results:

library(dplyr)
# Or:
# library(tidyverse)

df4 &lt;- bind_rows(
  anti_join(df1, df2, by = c(&quot;id&quot;, &quot;col1&quot;, &quot;col2&quot;)),
  anti_join(df2, df1, by = c(&quot;id&quot;, &quot;col1&quot;, &quot;col2&quot;))
)

df4
#    id col1 col2 col6
# 1   I    1    2   NA
# 2   X    2    0   NA
# 3   J    0    2   NA
# 4   L    0    1   NA
# 5   Q    1    0   NA
# 6   E    1    0   NA
# 7   C    1    0   NA
# 8   Y    1    0    2
# 9   T    2    1    0
# 10  P    2    0    1
# 11  A    1    2    1
# 12  R    2    0    0
# 13  V    2    1    1
# 14  M    0    0    1
# 15  S    1    2    2
# 16  O    0    0    2
# 17  B    2    0    0
# 18  U    0    1    0
# 19  W    1    1    2
# 20  G    1    2    1
# 21  H    2    1    0
# 22  C    2    0    1

huangapple
  • 本文由 发表于 2023年4月11日 08:48:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/75981688.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定