英文:
Trying to compare two dataframes with different rows and columns in R
问题
I am trying to compare two different dataframes which have different columns and rows in R.
Need to get the same data be df3, any row or column are different data be df4. In my example, id F, col1 and col2 in both two tables is the same, but other columns are not.
Below is what my dataset looks like:
set.seed(22)
df1 <- data.frame(id=sample(LETTERS, 9, FALSE), col1=sample(0:2, 9, TRUE),
col2 = sample(0:2, 9, TRUE))
df2 <- data.frame(id=sample(LETTERS, 17, FALSE), col1=sample(0:2, 17, TRUE),
col2 = sample(0:2, 17, TRUE),
col6 = sample(0:2, 17, TRUE))
I've read many solutions but have not yet found a concise solution, any suggestions out there? Any help is much appreciated.
英文:
I am trying to compare two different dataframes which have different columns and rows in R.
Need to get the same data be df3, any row or column are different data be df4.In my example, id F, col1 and col2 in both two tables is the same.but other cols are not.
Below is what my dataset looks like:
set.seed(22)
df1 <- data.frame(id=sample(LETTERS, 9, FALSE), col1=sample(0:2, 9, TRUE),
col2 = sample(0:2, 9, TRUE))
df2 <- data.frame(id=sample(LETTERS, 17, FALSE), col1=sample(0:2, 17, TRUE),
col2 = sample(0:2, 17, TRUE),
col6 = sample(0:2, 17, TRUE))
I've read many solutions but have not yet found a concise solution, any suggestions out there? Any help is much appreciated.
答案1
得分: 1
你可以使用 generics::intersect()
来查找共同的数值,以及 generics::setdiff()
来查找不同的数值。请注意,你需要指定 generics
包以获取所需的格式。
df3 <- generics::intersect(df1, df2[,1:3])
# id col1 col2
# 1 F 1 0
# 2 K 0 2
df4 <- generics::setdiff(df1, df2[,1:3])
# id col1 col2
#1 I 1 2
#2 X 2 0
#3 J 0 2
#4 L 0 1
#5 Q 1 0
#6 E 1 0
#7 C 1 0
英文:
You can use generics::intersect()
to find the common values and generics::setdiff()
to find the different values. Note you need to specify the generics
package to get it in the format you want.
df3 <- generics::intersect(df1, df2[,1:3])
# id col1 col2
# 1 F 1 0
# 2 K 0 2
df4 <- generics::setdiff(df1, df2[,1:3])
# id col1 col2
#1 I 1 2
#2 X 2 0
#3 J 0 2
#4 L 0 1
#5 Q 1 0
#6 E 1 0
#7 C 1 0
答案2
得分: 1
如果您使用 tidyverse
或 dplyr
,您可以使用 semi_join
和 anti_join
。您需要指定要用于执行比较的列,使用参数 by
:by = c("id", "col1", "col2")
。(您可以将 by
不指定,*_join
将使用所有匹配的列名执行比较,但最好避免这样做。)
semi_join
返回来自第一个数据框与第二个数据框中匹配的所有行:
library(dplyr)
# 或:
# library(tidyverse)
# 我们将将 df2 作为第一个参数传递,以保留"col6"中的值。
# 交换 "df2" 和 "df1" 的顺序以删除 "col6" 列。
df3 <- semi_join(df2, df1, by = c("id", "col1", "col2"))
df3
# id col1 col2 col6
# 1 K 0 2 2
# 2 F 1 0 2
anti_join
返回来自第一个数据框中没有与第二个数据框中匹配的所有行。这个比较复杂一些,因为我们只会得到第一个数据框中在第二个数据框中缺失的行。要获取在任何数据框中存在但在另一个数据框中缺失的行,我们需要执行两次连接:
library(dplyr)
# 或:
# library(tidyverse)
df4_a <- anti_join(df1, df2, by = c("id", "col1", "col2"))
df4_b <- anti_join(df2, df1, by = c("id", "col1", "col2"))
df4 <- bind_rows(df4_a, df4_b)
df4
# id col1 col2 col6
# 1 I 1 2 NA
# 2 X 2 0 NA
# 3 J 0 2 NA
# 4 L 0 1 NA
# 5 Q 1 0 NA
# 6 E 1 0 NA
# 7 C 1 0 NA
# 8 Y 1 0 2
# 9 T 2 1 0
# 10 P 2 0 1
# 11 A 1 2 1
# 12 R 2 0 0
# 13 V 2 1 1
# 14 M 0 0 1
# 15 S 1 2 2
# 16 O 0 0 2
# 17 B 2 0 0
# 18 U 0 1 0
# 19 W 1 1 2
# 20 G 1 2 1
# 21 H 2 1 0
# 22 C 2 0 1
此外,如果不存储中间结果,您还可以更简洁地获得 df4
:
library(dplyr)
# 或:
# library(tidyverse)
df4 <- bind_rows(
anti_join(df1, df2, by = c("id", "col1", "col2")),
anti_join(df2, df1, by = c("id", "col1", "col2"))
)
df4
# id col1 col2 col6
# 1 I 1 2 NA
# 2 X 2 0 NA
# 3 J 0 2 NA
# 4 L 0 1 NA
# 5 Q 1 0 NA
# 6 E 1 0 NA
# 7 C 1 0 NA
# 8 Y 1 0 2
# 9 T 2 1 0
# 10 P 2 0 1
# 11 A 1 2 1
# 12 R 2 0 0
# 13 V 2 1 1
# 14 M 0 0 1
# 15 S 1 2 2
# 16 O 0 0 2
# 17 B 2 0 0
# 18 U 0 1 0
# 19 W 1 1 2
# 20 G 1 2 1
# 21 H 2 1 0
# 22 C 2 0 1
(Note: The code parts in the original text are not translated, as you requested.)
英文:
If you're using tidyverse
or dplyr
you can use semi_join
and anti_join
. You will need to specify the columns you want to use to perform the comparison using parameter by
: by = c("id", "col1", "col2")
. (You can leave by
unspecified and *_join
will perform the comparison using all matching colnames, but this is better avoided.)
semi_join
returns all rows from the first data.frame with a match in the second data.frame:
library(dplyr)
# Or:
# library(tidyverse)
# We'll pass df2 as the first argument, to preserve the values in `col6`.
# Swap the order of `df2` and `df1` to drop column `col6`.
df3 <- semi_join(df2, df1, by = c("id", "col1", "col2"))
df3
# id col1 col2 col6
# 1 K 0 2 2
# 2 F 1 0 2
anti_join
returns all rows from the first data.frame without a match in the second data.frame. This one is a bit trickier because we'll only get the rows in the first data.frame that are missing in the second. To get the rows that are present in any of the data.frames but missing in the other, we need to perform the join twice:
library(dplyr)
# Or:
# library(tidyverse)
df4_a <- anti_join(df1, df2, by = c("id", "col1", "col2"))
df4_b <- anti_join(df2, df1, by = c("id", "col1", "col2"))
df4 <- bind_rows(df4_a, df4_b)
df4
# id col1 col2 col6
# 1 I 1 2 NA
# 2 X 2 0 NA
# 3 J 0 2 NA
# 4 L 0 1 NA
# 5 Q 1 0 NA
# 6 E 1 0 NA
# 7 C 1 0 NA
# 8 Y 1 0 2
# 9 T 2 1 0
# 10 P 2 0 1
# 11 A 1 2 1
# 12 R 2 0 0
# 13 V 2 1 1
# 14 M 0 0 1
# 15 S 1 2 2
# 16 O 0 0 2
# 17 B 2 0 0
# 18 U 0 1 0
# 19 W 1 1 2
# 20 G 1 2 1
# 21 H 2 1 0
# 22 C 2 0 1
Also, you can get df4
more concisely if you don't store the intermediate results:
library(dplyr)
# Or:
# library(tidyverse)
df4 <- bind_rows(
anti_join(df1, df2, by = c("id", "col1", "col2")),
anti_join(df2, df1, by = c("id", "col1", "col2"))
)
df4
# id col1 col2 col6
# 1 I 1 2 NA
# 2 X 2 0 NA
# 3 J 0 2 NA
# 4 L 0 1 NA
# 5 Q 1 0 NA
# 6 E 1 0 NA
# 7 C 1 0 NA
# 8 Y 1 0 2
# 9 T 2 1 0
# 10 P 2 0 1
# 11 A 1 2 1
# 12 R 2 0 0
# 13 V 2 1 1
# 14 M 0 0 1
# 15 S 1 2 2
# 16 O 0 0 2
# 17 B 2 0 0
# 18 U 0 1 0
# 19 W 1 1 2
# 20 G 1 2 1
# 21 H 2 1 0
# 22 C 2 0 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论