2023-02-16 16:44:29

Compare three (or more) dataframes

Question

I have three dataframes that I want to compare with dplyr.

df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("Smith", "Winter", "Summer"),
  zip = c(12345, 23456, 34567),
  value = c(1, 2, 3)
)

df2 <- data.frame(
  id = c(1, 2, 3, 5),
  name = c("Smith", "Winter", "Summer", "Taylor"),
  zip = c(12345, 23456, 34567, 56789),
  value = c(4, 5, 6, 0)
)

df3 <- data.frame(
  id = c(1, 2, 4),
  name = c("Smith", "Winter", "Miller"),
  zip = c(12345, 23456, 45678),
  value = c(7, 8, 9)
)

The dataframes have columns with similar values (i.e. id, name, zip) and a column with a random number (value).

What I would like to achieve is a dataframe that shows which rows of the columns with the similar values (id, name, zip) are present in which dataframes (I am aware that I can remove the value column with select, I just wanted to leave it in to show that the dataset also contains variable elements).

I am looking for something like this in the end.

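| id | name   | zip   | present_in_df1 | present_in_df2 | present_in_df3 |
|----|--------|-------|----------------|----------------|----------------|
| 1  | Smith  | 12345 | TRUE           | TRUE           | TRUE           |
| 2  | Winter | 23456 | TRUE           | TRUE           | TRUE           |
| 3  | Summer | 34567 | TRUE           | TRUE           | FALSE          |
| 4  | Miller | 45678 | FALSE          | FALSE          | TRUE           |
| 5  | Taylor | 56789 | FALSE          | TRUE           | FALSE          |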

Of course, I am also open to other solutions if there is a better way of doing this than the representation above.

Thank you!

Answer 1

Score: 4

You could bind your data frames by row, then use e.g. pivot_wider:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

dplyr::lst(df1, df2, df3) |>
  bind_rows(.id = "df") |>
  mutate(value = TRUE) |>
  pivot_wider(names_from = df, values_from = value, names_prefix = "present_in_", values_fill = FALSE)
#> # A tibble: 5 × 6
#>      id name     zip present_in_df1 present_in_df2 present_in_df3
#>   <dbl> <chr>  <dbl> <lgl>          <lgl>          <lgl>
#> 1     1 Smith  12345 TRUE           TRUE           TRUE
#> 2     2 Winter 23456 TRUE           TRUE           TRUE
#> 3     3 Summer 34567 TRUE           TRUE           FALSE
#> 4     5 Taylor 56789 FALSE          TRUE           FALSE
#> 5     4 Miller 45678 FALSE          FALSE          TRUE
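
The approach above overwrites `value` with `TRUE`, which works because `value` is the only non-key column. As a rough sketch of how the same idea might be generalized, the helper below keeps only the key columns and drops duplicate rows before pivoting, so extra varying columns or repeated rows should not split the result. `presence_table`, its arguments, and the `source` column name are hypothetical names chosen for illustration; they are not part of dplyr or tidyr.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

# Hypothetical helper: `dfs` is a named list of data frames,
# `keys` is a character vector of the columns that identify a row.
presence_table <- function(dfs, keys) {
  dfs |>
    bind_rows(.id = "source") |>             # "source" holds the list names
    select(all_of(c("source", keys))) |>     # drop every non-key column
    distinct() |>                            # one row per key combination and source
    mutate(present = TRUE) |>
    pivot_wider(names_from = source, values_from = present,
                names_prefix = "present_in_", values_fill = FALSE)
}

result <- presence_table(list(df1 = df1, df2 = df2, df3 = df3),
                         keys = c("id", "name", "zip"))
print(result)
```

Passing a named list means a fourth data frame only needs one more list entry; the column names follow the list names.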

Answer 2

Score: 3

With `reduce` and joins:
```r
library(purrr)
library(dplyr)

lst(df1, df2, df3) %>%
  # rename the 4th column (value) to present_in_<name of the data frame>
  imap(\(x, y) {colnames(x)[4] <- glue::glue("present_in_{y}"); x}) %>%
  reduce(full_join, by = c("id", "name", "zip")) %>%
  # a non-missing value means the row was present in that data frame
  mutate(across(contains("present"), complete.cases))

#>   id   name   zip present_in_df1 present_in_df2 present_in_df3
#> 1  1  Smith 12345           TRUE           TRUE           TRUE
#> 2  2 Winter 23456           TRUE           TRUE           TRUE
#> 3  3 Summer 34567           TRUE           TRUE          FALSE
#> 4  5 Taylor 56789          FALSE           TRUE          FALSE
#> 5  4 Miller 45678          FALSE          FALSE           TRUE
```

Answer 3

Score: 2

library(dplyr)

list(df1, df2, df3) |>
  purrr::reduce(full_join, by = c("id", "name", "zip")) |>
  # a non-missing value means the row exists in that data frame
  mutate(across(contains("value"), ~ !is.na(.x))) |>
  # full_join suffixes the first two value columns with .x and .y
  rename(present_in_df1 = value.x,
         present_in_df2 = value.y,
         present_in_df3 = value)
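
The `rename` step above depends on the `.x`/`.y` suffixes that `full_join` produces for exactly three data frames, and the `is.na` check would misreport a row whose `value` is genuinely missing. One possible way around both, sketched here under the assumption that the data frames are supplied as a named list, is to attach a `TRUE` flag column to each data frame before joining. The names `dfs`, `keys`, and `flagged` are illustrative only.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

dfs  <- list(df1 = df1, df2 = df2, df3 = df3)  # assumed: a named list
keys <- c("id", "name", "zip")

# Reduce each data frame to its unique key rows plus a flag column
# named after the list entry, e.g. present_in_df1.
flagged <- imap(dfs, function(d, nm) {
  d <- unique(d[keys])
  d[[paste0("present_in_", nm)]] <- TRUE
  d
})

result <- flagged |>
  reduce(full_join, by = keys) |>
  # rows missing from a data frame get NA in its flag column; turn that into FALSE
  mutate(across(starts_with("present_in_"), ~ coalesce(.x, FALSE)))

print(result)
```

Because the flag columns already have distinct names, no suffix handling or renaming is needed, and the same code should carry over to more than three data frames.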

Answer 4

Score: 2

Rowbind them, then reshape long-to-wide:

library(data.table)

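# mget() returns a named list, so the id column holds "df1", "df2", "df3"
l <- rbindlist(mget(ls(pattern = "^df")), idcol = "df")

# value.var is named explicitly; dcast would otherwise guess it with a message
dcast(l, id + name + zip ~ df, value.var = "value")
#    id   name   zip df1 df2 df3
# 1:  1  Smith 12345   1   4   7
# 2:  2 Winter 23456   2   5   8
# 3:  3 Summer 34567   3   6  NA
# 4:  4 Miller 45678  NA  NA   9
# 5:  5 Taylor 56789  NA   0  NA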
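
The wide table above shows the original values, with `NA` marking absence, rather than the `TRUE`/`FALSE` flags asked for in the question. A minimal sketch of one way to get the flags with data.table is to count rows per cell with `fun.aggregate = length` (empty cells become 0) and then compare the counts to zero. The explicit named list `dfs` is an assumption made here in place of `mget(ls(pattern = "^df"))`, which would also pick up any other object whose name starts with `df`.

```r
library(data.table)

# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

dfs  <- list(df1 = df1, df2 = df2, df3 = df3)  # assumed: explicit named list
long <- rbindlist(dfs, idcol = "df")           # "df" holds the list names

# Count how many rows each key combination has in each data frame
counts <- dcast(long, id + name + zip ~ df,
                value.var = "value", fun.aggregate = length)

# Turn the counts into logical flags and rename the columns
flag_cols <- names(dfs)
counts[, (flag_cols) := lapply(.SD, function(x) x > 0L), .SDcols = flag_cols]
setnames(counts, flag_cols, paste0("present_in_", flag_cols))

print(counts)
```

Counting rows rather than testing for `NA` means a row whose `value` is itself missing is still reported as present.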

Answer 5

Score: 2

You can bind the three data frames together, group_by the key columns, then use summarise to record which data frames each row appears in.

library(tidyverse)

bind_rows(df1, df2, df3, .id = "df") %>%
  group_by(id, name, zip) %>%
  summarize(df = paste(df, collapse = ","))

# A tibble: 5 × 4
     id name     zip df   
  <dbl> <chr>  <dbl> <chr>
1     1 Smith  12345 1,2,3
2     2 Winter 23456 1,2,3
3     3 Summer 34567 1,2  
4     4 Miller 45678 3    
5     5 Taylor 56789 2

This could be your endpoint if you find the above format useful. To extract them into three different columns, we can grepl on the df number.

bind_rows(df1, df2, df3, .id = "df") %>%
  group_by(id, name, zip) %>%
  summarize(df = paste(df, collapse = ","), .groups = "drop") %>%
  mutate(present_in_df1 = grepl("1", df),
         present_in_df2 = grepl("2", df),
         present_in_df3 = grepl("3", df), .keep = "unused")

# A tibble: 5 × 6
     id name     zip present_in_df1 present_in_df2 present_in_df3
  <dbl> <chr>  <dbl> <lgl>          <lgl>          <lgl>         
1     1 Smith  12345 TRUE           TRUE           TRUE          
2     2 Winter 23456 TRUE           TRUE           TRUE          
3     3 Summer 34567 TRUE           TRUE           FALSE         
4     4 Miller 45678 FALSE          FALSE          TRUE          
5     5 Taylor 56789 FALSE          TRUE           FALSE
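
One caveat with matching on the pasted string: `grepl("1", df)` is a substring match, so with ten or more data frames the pattern `"1"` would also match `"10"`. A small sketch of an alternative, assuming the same unnamed `bind_rows` call (which labels the inputs `"1"`, `"2"`, `"3"`), tests membership inside `summarise` with `%in%` and skips the intermediate string.

```r
library(dplyr, warn.conflicts = FALSE)

# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), name = c("Smith", "Winter", "Summer"),
                  zip = c(12345, 23456, 34567), value = c(1, 2, 3))
df2 <- data.frame(id = c(1, 2, 3, 5), name = c("Smith", "Winter", "Summer", "Taylor"),
                  zip = c(12345, 23456, 34567, 56789), value = c(4, 5, 6, 0))
df3 <- data.frame(id = c(1, 2, 4), name = c("Smith", "Winter", "Miller"),
                  zip = c(12345, 23456, 45678), value = c(7, 8, 9))

# Why the substring match is fragile: "1" is found inside "10"
substring_hit <- grepl("1", "10")

result <- bind_rows(df1, df2, df3, .id = "df") |>
  group_by(id, name, zip) |>
  summarise(
    # exact membership test on the per-group vector of source labels
    present_in_df1 = "1" %in% df,
    present_in_df2 = "2" %in% df,
    present_in_df3 = "3" %in% df,
    .groups = "drop"
  )

print(result)
```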
