Compare three (or more) dataframes
Question
I have three dataframes that I want to compare with dplyr.

```r
df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("Smith", "Winter", "Summer"),
  zip = c(12345, 23456, 34567),
  value = c(1, 2, 3)
)
df2 <- data.frame(
  id = c(1, 2, 3, 5),
  name = c("Smith", "Winter", "Summer", "Taylor"),
  zip = c(12345, 23456, 34567, 56789),
  value = c(4, 5, 6, 0)
)
df3 <- data.frame(
  id = c(1, 2, 4),
  name = c("Smith", "Winter", "Miller"),
  zip = c(12345, 23456, 45678),
  value = c(7, 8, 9)
)
```

The dataframes have columns with similar values (i.e. `id`, `name`, `zip`) and a column with a random number (`value`).

What I would like to achieve is a dataframe that shows which rows of the columns with the similar values (`id`, `name`, `zip`) are present in which dataframes. (I am aware that I can remove the `value` column with `select`; I just left it in to show that the dataset also contains variable elements.)

I am looking for something like this in the end:

| id | name   | zip   | present_in_df1 | present_in_df2 | present_in_df3 |
|----|--------|-------|----------------|----------------|----------------|
| 1  | Smith  | 12345 | TRUE           | TRUE           | TRUE           |
| 2  | Winter | 23456 | TRUE           | TRUE           | TRUE           |
| 3  | Summer | 34567 | TRUE           | TRUE           | FALSE          |
| 4  | Miller | 45678 | FALSE          | FALSE          | TRUE           |
| 5  | Taylor | 56789 | FALSE          | TRUE           | FALSE          |

Of course, I am also open to other solutions if there is a better way of doing this than the representation above.

Thank you!
Answer 1
Score: 4
You could bind your data frames by row, then use e.g. `pivot_wider`:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

dplyr::lst(df1, df2, df3) |>
  bind_rows(.id = "df") |>
  mutate(value = TRUE) |>
  pivot_wider(
    names_from = df, values_from = value,
    names_prefix = "present_in_", values_fill = FALSE
  )
#> # A tibble: 5 × 6
#>      id name     zip present_in_df1 present_in_df2 present_in_df3
#>   <dbl> <chr>  <dbl> <lgl>          <lgl>          <lgl>
#> 1     1 Smith  12345 TRUE           TRUE           TRUE
#> 2     2 Winter 23456 TRUE           TRUE           TRUE
#> 3     3 Summer 34567 TRUE           TRUE           FALSE
#> 4     5 Taylor 56789 FALSE          TRUE           FALSE
#> 5     4 Miller 45678 FALSE          FALSE          TRUE
```
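
One caveat with the row-bind and pivot approach: if the same `id`/`name`/`zip` combination occurred more than once within a single data frame, `pivot_wider` would warn and return list-columns. A minimal sketch of one way around that, assuming a repeated row should simply count as "present", is to call `distinct()` on the key columns before pivoting. The duplicated row in `df1_dup` below is made up purely for illustration and is not part of the question's data.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

# Hypothetical input: df1 with its first row repeated
df1_dup <- rbind(df1, df1[1, ])

presence <- list(df1 = df1_dup, df2 = df2, df3 = df3) |>
  bind_rows(.id = "df") |>
  # keep one row per data frame and key combination, dropping `value`
  distinct(df, id, name, zip) |>
  mutate(present = TRUE) |>
  pivot_wider(
    names_from = df, values_from = present,
    names_prefix = "present_in_", values_fill = FALSE
  )

presence
```

Because `distinct()` keeps only the listed columns, the varying `value` column is dropped at that step and cannot interfere with the reshape.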
Answer 2
Score: 3
With `reduce` and joins:

```r
library(purrr)
library(dplyr)

lst(df1, df2, df3) %>%
  # rename the 4th column (value) to present_in_<name of the list element>
  imap(\(x, y) {colnames(x)[4] <- glue::glue("present_in_{y}"); x}) %>%
  # full-join all data frames on the shared key columns
  reduce(full_join, by = c("id", "name", "zip")) %>%
  # a non-missing value means the row was present in that data frame
  mutate(across(contains("present"), complete.cases))
#>   id   name   zip present_in_df1 present_in_df2 present_in_df3
#> 1  1  Smith 12345           TRUE           TRUE           TRUE
#> 2  2 Winter 23456           TRUE           TRUE           TRUE
#> 3  3 Summer 34567           TRUE           TRUE          FALSE
#> 4  5 Taylor 56789          FALSE           TRUE          FALSE
#> 5  4 Miller 45678          FALSE          FALSE           TRUE
```
Answer 3
Score: 2
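```r
library(dplyr)

list(df1, df2, df3) |>
  # full-join on the key columns; the value columns get the suffixes .x and .y
  purrr::reduce(full_join, by = c("id", "name", "zip")) |>
  # a non-missing value means the row was present in that data frame
  mutate(across(contains("value"), ~ !is.na(.x))) |>
  rename(present_in_df1 = value.x,
         present_in_df2 = value.y,
         present_in_df3 = value)
```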
Answer 4
Score: 2
Rowbind them, then reshape long-to-wide:

```r
library(data.table)

l <- rbindlist(mget(ls(pattern = "^df")), idcol = "df")
dcast(l, id + name + zip ~ df)
#    id   name   zip  1  2  3
# 1:  1  Smith 12345  1  4  7
# 2:  2 Winter 23456  2  5  8
# 3:  3 Summer 34567  3  6 NA
# 4:  4 Miller 45678 NA NA  9
# 5:  5 Taylor 56789 NA  0 NA
```
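
The table above holds the original `value` entries, with `NA` marking absence. To get logical flags like the ones asked for in the question, one option is to count rows per cell and compare the count against zero. The following is a sketch rather than part of the original answer; it builds a named list by hand instead of using `mget()`, so the intermediate column names `df1`, `df2`, `df3` are an assumption of this example.

```r
library(data.table)

# Data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

# Named list, so the id column holds "df1", "df2", "df3"
long <- rbindlist(list(df1 = df1, df2 = df2, df3 = df3), idcol = "df")

# Count rows per key combination and data frame; absent combinations get 0
wide <- dcast(long, id + name + zip ~ df,
              value.var = "value", fun.aggregate = length)

# Turn the counts into TRUE/FALSE and rename the columns
cols <- c("df1", "df2", "df3")
wide[, (cols) := lapply(.SD, function(x) x > 0), .SDcols = cols]
setnames(wide, cols, paste0("present_in_", cols))

wide
```

Counting with `length` also copes with a `value` of `0` (as in the Taylor row), which a test on the value itself would misread as absent.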
Answer 5
Score: 2
You can bind the three dfs together, `group_by` the relevant columns, then use `summarise` to output which dfs contain each row.

```r
library(tidyverse)

bind_rows(df1, df2, df3, .id = "df") %>%
  group_by(id, name, zip) %>%
  summarize(df = paste(df, collapse = ","))
#> # A tibble: 5 × 4
#>      id name     zip df
#>   <dbl> <chr>  <dbl> <chr>
#> 1     1 Smith  12345 1,2,3
#> 2     2 Winter 23456 1,2,3
#> 3     3 Summer 34567 1,2
#> 4     4 Miller 45678 3
#> 5     5 Taylor 56789 2
```

This could be your endpoint if you find the above format useful. To extract them into three different columns, we can `grepl` on the df number.

```r
bind_rows(df1, df2, df3, .id = "df") %>%
  group_by(id, name, zip) %>%
  summarize(df = paste(df, collapse = ","), .groups = "drop") %>%
  mutate(present_in_df1 = grepl("1", df),
         present_in_df2 = grepl("2", df),
         present_in_df3 = grepl("3", df), .keep = "unused")
#> # A tibble: 5 × 6
#>      id name     zip present_in_df1 present_in_df2 present_in_df3
#>   <dbl> <chr>  <dbl> <lgl>          <lgl>          <lgl>
#> 1     1 Smith  12345 TRUE           TRUE           TRUE
#> 2     2 Winter 23456 TRUE           TRUE           TRUE
#> 3     3 Summer 34567 TRUE           TRUE           FALSE
#> 4     4 Miller 45678 FALSE          FALSE          TRUE
#> 5     5 Taylor 56789 FALSE          TRUE           FALSE
```
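
Note that `grepl("1", df)` is a substring match, so with ten or more data frames it would also match ids such as "10" or "11". A possible variant, offered only as a sketch, tests exact membership with `%in%` inside `summarise()` and skips the intermediate comma-separated string:

```r
library(dplyr, warn.conflicts = FALSE)

# Data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

result <- bind_rows(df1, df2, df3, .id = "df") %>%
  group_by(id, name, zip) %>%
  # `df` holds the position of the source data frame as a string ("1", "2", "3")
  summarise(
    present_in_df1 = "1" %in% df,
    present_in_df2 = "2" %in% df,
    present_in_df3 = "3" %in% df,
    .groups = "drop"
  )

result
```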