比较三个(或更多)数据框。

huangapple go评论91阅读模式
英文:

Compare three (or more) dataframes

问题

  1. 我有三个数据框,想要用 dplyr 进行比较。
  2. 数据框 df1
  3. | id | name | zip | value |
  4. |----|--------|-------|-------|
  5. | 1 | Smith | 12345 | 1 |
  6. | 2 | Winter | 23456 | 2 |
  7. | 3 | Summer | 34567 | 3 |
  8. 数据框 df2
  9. | id | name | zip | value |
  10. |----|--------|-------|-------|
  11. | 1 | Smith | 12345 | 4 |
  12. | 2 | Winter | 23456 | 5 |
  13. | 3 | Summer | 34567 | 6 |
  14. | 5 | Taylor | 56789 | 0 |
  15. 数据框 df3
  16. | id | name | zip | value |
  17. |----|--------|-------|-------|
  18. | 1 | Smith | 12345 | 7 |
  19. | 2 | Winter | 23456 | 8 |
  20. | 4 | Miller | 45678 | 9 |
  21. 这些数据框有相似的列(例如 `id``name``zip`)和一个包含随机数字的列(`value`)。
  22. 我想要的结果是一个数据框,显示具有相似值的列(`id``name``zip`)的哪些行存在于哪些数据框中(我知道可以使用 `select` 删除 `value` 列,我只是想保留它以显示数据集还包含可变元素)。
  23. 最终我想要类似下面的结果:
  24. | id | name | zip | present_in_df1 | present_in_df2 | present_in_df3 |
  25. |----|--------|-------|----------------|----------------|----------------|
  26. | 1 | Smith | 12345 | TRUE | TRUE | TRUE |
  27. | 2 | Winter | 23456 | TRUE | TRUE | TRUE |
  28. | 3 | Summer | 34567 | TRUE | TRUE | FALSE |
  29. | 4 | Miller | 45678 | FALSE | FALSE | TRUE |
  30. | 5 | Taylor | 56789 | FALSE | TRUE | FALSE |
  31. 当然,如果有比最终结果更好的解决方案,我也愿意尝试。
英文:

I have three dataframes that I want to compare with dplyr.

  1. df1 <- data.frame(
  2. id = c(1, 2, 3),
  3. name = c("Smith", "Winter", "Summer"),
  4. zip = c(12345, 23456, 34567),
  5. value = c(1, 2, 3)
  6. )
  7. df2 <- data.frame(
  8. id = c(1, 2, 3, 5),
  9. name = c("Smith", "Winter", "Summer", "Taylor"),
  10. zip = c(12345, 23456, 34567, 56789),
  11. value = c(4, 5, 6, 0)
  12. )
  13. df3 <- data.frame(
  14. id = c(1, 2, 4),
  15. name = c("Smith", "Winter", "Miller"),
  16. zip = c(12345, 23456, 45678),
  17. value = c(7, 8, 9)
  18. )

The dataframes have columns with similar values (i.e. id, name, zip) and a column with a random number (value).

What I would like to achieve is a dataframe that shows which rows of the columns with the similar values (id, name, zip) are present in which dataframes (I am aware that I can remove the value column with select, I just wanted to leave it in to show that the dataset also contains variable elements).

I am looking for something like this in the end.

id name zip present_in_df1 present_in_df2 present_in_df3
1 Smith 12345 TRUE TRUE TRUE
2 Winter 23456 TRUE TRUE TRUE
3 Summer 34567 TRUE TRUE FALSE
4 Miller 45678 FALSE FALSE TRUE
5 Taylor 56789 FALSE TRUE FALSE

Of course, I also open for other solutions, if there is a better way of doing that instead of this representation in the end.

Thank you!

答案1

得分: 4

你可以将你的数据框按行绑定,然后使用例如 pivot_wider

  1. library(dplyr, warn=FALSE)
  2. library(tidyr)
  3. dplyr::lst(df1, df2, df3) |>
  4. bind_rows(.id = "df") |>
  5. mutate(value = TRUE) |>
  6. pivot_wider(names_from = df, values_from = value, names_prefix = "present_in_", values_fill = FALSE)
  7. #> # A tibble: 5 × 6
  8. #> id name zip present_in_df1 present_in_df2 present_in_df3
  9. #> <dbl> <chr> <dbl> <lgl> <lgl> <lgl>
  10. #> 1 1 Smith 12345 TRUE TRUE TRUE
  11. #> 2 2 Winter 23456 TRUE TRUE TRUE
  12. #> 3 3 Summer 34567 TRUE TRUE FALSE
  13. #> 4 5 Taylor 56789 FALSE TRUE FALSE
  14. #> 5 4 Miller 45678 FALSE FALSE TRUE
英文:

You could bind your data frames by row, then use e.g. pivot_wider:

  1. library(dplyr, warn=FALSE)
  2. library(tidyr)
  3. dplyr::lst(df1, df2, df3) |>
  4. bind_rows(.id = "df") |>
  5. mutate(value = TRUE) |>
  6. pivot_wider(names_from = df, values_from = value, names_prefix = "present_in_", values_fill = FALSE)
  7. #> # A tibble: 5 × 6
  8. #> id name zip present_in_df1 present_in_df2 present_in_df3
  9. #> <dbl> <chr> <dbl> <lgl> <lgl> <lgl>
  10. #> 1 1 Smith 12345 TRUE TRUE TRUE
  11. #> 2 2 Winter 23456 TRUE TRUE TRUE
  12. #> 3 3 Summer 34567 TRUE TRUE FALSE
  13. #> 4 5 Taylor 56789 FALSE TRUE FALSE
  14. #> 5 4 Miller 45678 FALSE FALSE TRUE

答案2

得分: 3

  1. 使用`reduce``joins`
  2. ```r
  3. 库(purrr)
  4. 库(dplyr)
  5. lst(df1, df2, df3) %>%
  6. imap(\(x, y){colnames(x)[4] <- glue::glue("present_in_{y}"); x}) %>%
  7. reduce(full_join, by = c("id", "name", "zip")) %>%
  8. mutate(across(contains("present"), complete.cases))
  9. id name zip present_in_df1 present_in_df2 present_in_df3
  10. 1 1 Smith 12345 TRUE TRUE TRUE
  11. 2 2 Winter 23456 TRUE TRUE TRUE
  12. 3 3 Summer 34567 TRUE TRUE FALSE
  13. 4 5 Taylor 56789 FALSE TRUE FALSE
  14. 5 4 Miller 45678 FALSE FALSE TRUE
  1. <details>
  2. <summary>英文:</summary>
  3. With `reduce` and `joins`:
  4. ```r
  5. library(purrr)
  6. library(dplyr)
  7. lst(df1, df2, df3) %&gt;%
  8. imap(\(x, y){colnames(x)[4] &lt;- glue::glue(&quot;present_in_{y}&quot;); x}) %&gt;%
  9. reduce(full_join, by = c(&quot;id&quot;, &quot;name&quot;, &quot;zip&quot;)) %&gt;%
  10. mutate(across(contains(&quot;present&quot;), complete.cases))
  11. id name zip present_in_df1 present_in_df2 present_in_df3
  12. 1 1 Smith 12345 TRUE TRUE TRUE
  13. 2 2 Winter 23456 TRUE TRUE TRUE
  14. 3 3 Summer 34567 TRUE TRUE FALSE
  15. 4 5 Taylor 56789 FALSE TRUE FALSE
  16. 5 4 Miller 45678 FALSE FALSE TRUE

答案3

得分: 2

  1. library(dplyr)
  2. list(df1, df2, df3) |&gt; purrr::reduce(full_join, by = c("id", "name", "zip")) |&gt;
  3. mutate(across(contains("value"), ~ifelse(is.na(.x), FALSE, TRUE))) |&gt;
  4. rename(present_in_df1 = value.x,
  5. present_in_df2 = value.y,
  6. present_in_df3 = value)
英文:
  1. library(dplyr)
  2. list(df1,df2,df3) |&gt; purrr::reduce(full_join, by = c(&quot;id&quot;, &quot;name&quot;, &quot;zip&quot;), ) |&gt;
  3. mutate(across(contains(&quot;value&quot;), ~ifelse(is.na(.x), FALSE, TRUE))) |&gt;
  4. rename(present_in_df1 = value.x,
  5. present_in_df2 = value.y,
  6. present_in_df3 = value)

答案4

得分: 2

将它们行绑定,然后重塑为宽格式:

  1. library(data.table)
  2. l <- rbindlist(mget(ls(pattern = "^df")), idcol = "df")
  3. dcast(l, id + name + zip ~ df)
  4. # id name zip 1 2 3
  5. # 1: 1 Smith 12345 1 4 7
  6. # 2: 2 Winter 23456 2 5 8
  7. # 3: 3 Summer 34567 3 6 NA
  8. # 4: 4 Miller 45678 NA NA 9
  9. # 5: 5 Taylor 56789 NA 0 NA
英文:

Rowbind them, then reshape long-to-wide:

  1. library(data.table)
  2. l &lt;- rbindlist(mget(ls(pattern = &quot;^df&quot;)), idcol = &quot;df&quot;)
  3. dcast(l, id + name + zip ~ df)
  4. # id name zip 1 2 3
  5. # 1: 1 Smith 12345 1 4 7
  6. # 2: 2 Winter 23456 2 5 8
  7. # 3: 3 Summer 34567 3 6 NA
  8. # 4: 4 Miller 45678 NA NA 9
  9. # 5: 5 Taylor 56789 NA 0 NA

答案5

得分: 2

你可以将这三个数据框绑定在一起,通过对相关列进行group_by,然后使用summarise来输出包含必要信息的数据框。

  1. library(tidyverse)
  2. bind_rows(df1, df2, df3, .id = "df") %>%
  3. group_by(id, name, zip) %>%
  4. summarize(df = paste(df, collapse = ","))
  5. # A tibble: 5 × 4
  6. id name zip df
  7. <dbl> <chr> <dbl> <chr>
  8. 1 1 Smith 12345 1,2,3
  9. 2 2 Winter 23456 1,2,3
  10. 3 3 Summer 34567 1,2
  11. 4 4 Miller 45678 3
  12. 5 5 Taylor 56789 2

如果你认为上述格式有用,这可以是你的终点。要将它们提取到三个不同的列中,我们可以使用grepl函数来检查数据框编号。

  1. bind_rows(df1, df2, df3, .id = "df") %>%
  2. group_by(id, name, zip) %>%
  3. summarize(df = paste(df, collapse = ","), .groups = "drop") %>%
  4. mutate(present_in_df1 = grepl("1", df),
  5. present_in_df2 = grepl("2", df),
  6. present_in_df3 = grepl("3", df), .keep = "unused")
  7. # A tibble: 5 × 6
  8. id name zip present_in_df1 present_in_df2 present_in_df3
  9. <dbl> <chr> <dbl> <lgl> <lgl> <lgl>
  10. 1 1 Smith 12345 TRUE TRUE TRUE
  11. 2 2 Winter 23456 TRUE TRUE TRUE
  12. 3 3 Summer 34567 TRUE TRUE FALSE
  13. 4 4 Miller 45678 FALSE FALSE TRUE
  14. 5 5 Taylor 56789 FALSE TRUE FALSE
英文:

You can bind the three dfs together, group_by the relevant columns, then use summarise to output what df contains the necessary information.

  1. library(tidyverse)
  2. bind_rows(df1, df2, df3, .id = &quot;df&quot;) %&gt;%
  3. group_by(id, name, zip) %&gt;%
  4. summarize(df = paste(df, collapse = &quot;,&quot;))
  5. # A tibble: 5 &#215; 4
  6. id name zip df
  7. &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
  8. 1 1 Smith 12345 1,2,3
  9. 2 2 Winter 23456 1,2,3
  10. 3 3 Summer 34567 1,2
  11. 4 4 Miller 45678 3
  12. 5 5 Taylor 56789 2

This could be your endpoint if you find the above format useful. To extract them into three different columns, we can grepl on the df number.

  1. bind_rows(df1, df2, df3, .id = &quot;df&quot;) %&gt;%
  2. group_by(id, name, zip) %&gt;%
  3. summarize(df = paste(df, collapse = &quot;,&quot;), .groups = &quot;drop&quot;) %&gt;%
  4. mutate(present_in_df1 = grepl(&quot;1&quot;, df),
  5. present_in_df2 = grepl(&quot;2&quot;, df),
  6. present_in_df3 = grepl(&quot;3&quot;, df), .keep = &quot;unused&quot;)
  7. # A tibble: 5 &#215; 6
  8. id name zip present_in_df1 present_in_df2 present_in_df3
  9. &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;lgl&gt;
  10. 1 1 Smith 12345 TRUE TRUE TRUE
  11. 2 2 Winter 23456 TRUE TRUE TRUE
  12. 3 3 Summer 34567 TRUE TRUE FALSE
  13. 4 4 Miller 45678 FALSE FALSE TRUE
  14. 5 5 Taylor 56789 FALSE TRUE FALSE

huangapple
  • 本文由 发表于 2023年2月16日 16:44:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/75469717.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定