Compare three (or more) dataframes

Question

I have three dataframes that I want to compare with dplyr.

df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("Smith", "Winter", "Summer"),
  zip = c(12345, 23456, 34567),
  value = c(1, 2, 3)
)

df2 <- data.frame(
  id = c(1, 2, 3, 5),
  name = c("Smith", "Winter", "Summer", "Taylor"),
  zip = c(12345, 23456, 34567, 56789),
  value = c(4, 5, 6, 0)
)

df3 <- data.frame(
  id = c(1, 2, 4),
  name = c("Smith", "Winter", "Miller"),
  zip = c(12345, 23456, 45678),
  value = c(7, 8, 9)
)

The dataframes share columns with similar values (id, name, zip) and each has a column with a random number (value).

What I would like to achieve is a dataframe that shows which rows of the columns with the similar values (id, name, zip) are present in which dataframes (I am aware that I can remove the value column with select, I just wanted to leave it in to show that the dataset also contains variable elements).

I am looking for something like this in the end.

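| id | name   | zip   | present_in_df1 | present_in_df2 | present_in_df3 |
|----|--------|-------|----------------|----------------|----------------|
| 1  | Smith  | 12345 | TRUE           | TRUE           | TRUE           |
| 2  | Winter | 23456 | TRUE           | TRUE           | TRUE           |
| 3  | Summer | 34567 | TRUE           | TRUE           | FALSE          |
| 4  | Miller | 45678 | FALSE          | FALSE          | TRUE           |
| 5  | Taylor | 56789 | FALSE          | TRUE           | FALSE          |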

Of course, I am also open to other solutions if there is a better way of doing this than the representation above.

Thank you!

Answer 1

Score: 4

You could bind your data frames by row, then use e.g. pivot_wider:

library(dplyr, warn=FALSE)
library(tidyr)

dplyr::lst(df1, df2, df3) |> 
  bind_rows(.id = "df") |> 
  mutate(value = TRUE) |> 
  pivot_wider(names_from = df, values_from = value, names_prefix = "present_in_", values_fill = FALSE)
#> # A tibble: 5 × 6
#>      id name     zip present_in_df1 present_in_df2 present_in_df3
#>   <dbl> <chr>  <dbl> <lgl>          <lgl>          <lgl>         
#> 1     1 Smith  12345 TRUE           TRUE           TRUE          
#> 2     2 Winter 23456 TRUE           TRUE           TRUE          
#> 3     3 Summer 34567 TRUE           TRUE           FALSE         
#> 4     5 Taylor 56789 FALSE          TRUE           FALSE         
#> 5     4 Miller 45678 FALSE          FALSE          TRUE
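
The approach above assumes each key combination appears at most once per data frame; if a data frame repeated an `id`/`name`/`zip` row, `pivot_wider` would warn and return list-columns. A minimal sketch of one way to guard against that, assuming the key columns are `id`, `name`, and `zip` (the names `key_cols` and `presence` are only illustrative):

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

key_cols <- c("id", "name", "zip")  # assumed key columns

presence <- dplyr::lst(df1, df2, df3) |>
  bind_rows(.id = "df") |>
  # Keep only the source label and the keys, then drop repeated rows
  select(df, all_of(key_cols)) |>
  distinct() |>
  mutate(present = TRUE) |>
  pivot_wider(names_from = df, values_from = present,
              names_prefix = "present_in_", values_fill = FALSE)

presence
```

Dropping the `value` column before reshaping also avoids overwriting it, which keeps the intent of the flag column explicit.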

Answer 2

Score: 3

With `reduce` and joins:
```r
library(purrr)
library(dplyr)

lst(df1, df2, df3) %>%
  # Rename the 4th column (value) to present_in_<name of the data frame>
  imap(\(x, y){colnames(x)[4] <- glue::glue("present_in_{y}"); x}) %>%
  # Full-join all data frames on the shared key columns
  reduce(full_join, by = c("id", "name", "zip")) %>%
  # A non-missing value means the row was present in that data frame
  mutate(across(contains("present"), complete.cases))

#>   id   name   zip present_in_df1 present_in_df2 present_in_df3
#> 1  1  Smith 12345           TRUE           TRUE           TRUE
#> 2  2 Winter 23456           TRUE           TRUE           TRUE
#> 3  3 Summer 34567           TRUE           TRUE          FALSE
#> 4  5 Taylor 56789          FALSE           TRUE          FALSE
#> 5  4 Miller 45678          FALSE          FALSE           TRUE
```

Answer 3

Score: 2

library(dplyr)

# Full-join all three data frames on the key columns; the value columns
# end up as value.x, value.y, and value
list(df1, df2, df3) |>
  purrr::reduce(full_join, by = c("id", "name", "zip")) |>
  # A missing value means the row was absent from that data frame
  mutate(across(contains("value"), ~ifelse(is.na(.x), FALSE, TRUE))) |>
  rename(present_in_df1 = value.x,
         present_in_df2 = value.y,
         present_in_df3 = value)
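
The `rename()` step above hard-codes the suffixes that `full_join` produces for exactly three data frames. A rough sketch of how the same join idea might extend to any number of data frames, assuming they are collected in a named list (the names `dfs`, `key_cols`, `flagged`, and `result` are hypothetical):

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

dfs <- list(df1 = df1, df2 = df2, df3 = df3)  # assumed: a named list
key_cols <- c("id", "name", "zip")            # assumed key columns

# Reduce each data frame to its unique keys plus a TRUE flag named after it
flagged <- imap(dfs, \(d, nm) {
  keys <- unique(d[key_cols])
  keys[[paste0("present_in_", nm)]] <- TRUE
  keys
})

# Full-join everything on the keys; flags that are NA after the join
# mean the row was absent, so they become FALSE
result <- reduce(flagged, full_join, by = key_cols) |>
  mutate(across(starts_with("present_in_"), \(x) coalesce(x, FALSE)))

result
```

Because the flag columns are named before joining, no suffix handling is needed and adding a fourth data frame only means adding it to the list.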

Answer 4

Score: 2

Rowbind them, then reshape long-to-wide:

library(data.table)

l <- rbindlist(mget(ls(pattern = "^df")), idcol = "df")

dcast(l, id + name + zip ~ df)
#    id   name   zip  1  2  3
# 1:  1  Smith 12345  1  4  7
# 2:  2 Winter 23456  2  5  8
# 3:  3 Summer 34567  3  6 NA
# 4:  4 Miller 45678 NA NA  9
# 5:  5 Taylor 56789 NA  0 NA
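
The reshaped table above holds the original values rather than TRUE/FALSE flags. One possible way to get the flags with data.table, sketched under the assumption that the data frames are passed as an explicit named list instead of being collected with `mget` (the names `dfs`, `long`, `wide`, and `flag_cols` are illustrative):

```r
library(data.table)

# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

dfs <- list(df1 = df1, df2 = df2, df3 = df3)  # assumed: explicit named list

# Stack the data frames; the list names become the "df" column
long <- rbindlist(dfs, idcol = "df")

# Count rows per key and data frame; missing combinations get a count of 0
wide <- dcast(long, id + name + zip ~ df, value.var = "value",
              fun.aggregate = length)

# Turn the counts into logical flags and give the columns descriptive names
flag_cols <- setdiff(names(wide), c("id", "name", "zip"))
wide[, (flag_cols) := lapply(.SD, function(x) x > 0L), .SDcols = flag_cols]
setnames(wide, flag_cols, paste0("present_in_", flag_cols))

wide
```

Counting rows instead of reshaping the values means a legitimate value of 0 (as for Taylor in df2) is still reported as present.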

Answer 5

Score: 2

You can bind the three data frames together, group_by the key columns, then use summarise to list which data frames each row appears in.

library(tidyverse)

bind_rows(df1, df2, df3, .id = "df") %>%
  group_by(id, name, zip) %>%
  summarize(df = paste(df, collapse = ","))

# A tibble: 5 × 4
     id name     zip df   
  <dbl> <chr>  <dbl> <chr>
1     1 Smith  12345 1,2,3
2     2 Winter 23456 1,2,3
3     3 Summer 34567 1,2  
4     4 Miller 45678 3    
5     5 Taylor 56789 2

This could be your endpoint if you find the above format useful. To split it into three separate columns, we can use grepl on the data frame number.

bind_rows(df1, df2, df3, .id = "df") %>%
  group_by(id, name, zip) %>%
  summarize(df = paste(df, collapse = ","), .groups = "drop") %>%
  mutate(present_in_df1 = grepl("1", df),
         present_in_df2 = grepl("2", df),
         present_in_df3 = grepl("3", df), .keep = "unused")

# A tibble: 5 × 6
     id name     zip present_in_df1 present_in_df2 present_in_df3
  <dbl> <chr>  <dbl> <lgl>          <lgl>          <lgl>         
1     1 Smith  12345 TRUE           TRUE           TRUE          
2     2 Winter 23456 TRUE           TRUE           TRUE          
3     3 Summer 34567 TRUE           TRUE           FALSE         
4     4 Miller 45678 FALSE          FALSE          TRUE          
5     5 Taylor 56789 FALSE          TRUE           FALSE 
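
One caveat with `grepl("1", df)` is that it matches substrings, so with ten or more data frames the pattern "1" would also match "10". A small sketch of an alternative that tests exact membership with `%in%` inside `summarise`, skipping the pasted string (the name `presence` is only illustrative):

```r
library(dplyr, warn.conflicts = FALSE)

# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

presence <- bind_rows(df1, df2, df3, .id = "df") %>%
  group_by(id, name, zip) %>%
  # .id labels unnamed inputs "1", "2", "3"; %in% compares whole labels
  summarise(present_in_df1 = "1" %in% df,
            present_in_df2 = "2" %in% df,
            present_in_df3 = "3" %in% df,
            .groups = "drop")

presence
```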

huangapple
  • Posted on February 16, 2023 at 16:44:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75469717.html