英文:
left join + anti join in R
问题
有没有一种方法可以执行left_join,排除匹配项?
A <- sample(1:100, 10)
B <- sample(1:100, 10)
C <- sample(1:100, 10)
year <- sample(1000:2000, 10)
x <- tibble(year, A, B) %>%
mutate(B = B * 12)
y <- tibble(year, B, C)
我想要的是 x + 来自 y 的列,这些列不在 x 中。
编辑:所有数据集的年份都相同,但我已经对 x 中的 B 列进行了转换,使其不再与集合 y 中的列匹配。
我希望最终得到一个框架:x(year, A, B, C),其中的 B 是最初在 x 中的 B 列。
我尝试过使用 coalesce(x, y),但没有成功,有人有关于这个的线索吗?
我是不是在尝试做一些非常愚蠢的事情或者漏掉了一些细节?
我知道我可以简单地从 y 中删除我不想要的列,或者只需通过= "year" 进行连接,然后删除带有 .y 后缀的列,但希望有一种更好的方法来实现这一点!
英文:
Is there a way to perform a left_join that excludes matches?
A<-sample(1:100, 10)
B<-sample(1:100, 10)
C<-sample(1:100, 10)
year<-sample(1000:2000,10)
x<-tibble(year,A,B)%>%
mutate(B=B*12)
y<-tibble(year,B,C)
I want x + the cols from y that are not in x
EDIT: Year is the same across all the datasets, but I have transformed the B column in x so that it no longer matches the one in set y.
I want to finish with a frame: x(year,A,B,C) where the B is the one that was in x originally.
I tried with coalesce(x,y) but to no avail, anyone have any clues about this?
Am I trying to do something really stupid or missing some detail?
I know that I could just drop the columns I don''t want from y, or that I could just join by = "year" and then drop the ones with the .y suffix, but hoping that there a nicer way to accomplish this!
答案1
得分: 2
如果我理解正确,一个潜在的解决方案是使用powerjoin包执行power_left_join()
,如果存在冲突则从x中进行选择,例如:
library(tidyverse)
# install.packages("powerjoin")
library(powerjoin)
set.seed(42)
A <- sample(1:100, 10)
B <- sample(1:100, 10)
C <- sample(1:100, 10)
year <- sample(1000:2000, 10)
y <- tibble(year, B, C)
x <- tibble(year, A, B)
# 修改"x"中的第一个"B"值
x$B[1] <- 1000
x %>%
power_left_join(y, by = "year", conflict = ~.x)
#> # A tibble: 10 × 4
#> year A C B
#> <int> <int> <int> <dbl>
#> 1 1313 49 34 1000
#> 2 1647 65 92 20
#> 3 1291 25 3 26
#> 4 1835 74 58 3
#> 5 1981 18 42 41
#> 6 1145 100 24 89
#> 7 1620 47 30 27
#> 8 1347 24 43 36
#> 9 1196 71 15 5
#> 10 1515 89 22 84
然而,另一种方法是放弃“.y”列并将“.x”列重命名,这并不更加“困难”(这是我个人使用的方法)。
x %>%
left_join(y, by = "year") %>%
select(-ends_with(".y")) %>%
rename_with(~gsub(".x", "", .x))
#> # A tibble: 10 × 4
#> year A B C
#> <int> <int> <dbl> <int>
#> 1 1313 49 1000 34
#> 2 1647 65 20 92
#> 3 1291 25 26 3
#> 4 1835 74 3 58
#> 5 1981 18 41 42
#> 6 1145 100 89 24
#> 7 1620 47 27 30
#> 8 1347 24 36 43
#> 9 1196 71 5 15
#> 10 1515 89 84 22
在2023-07-06使用reprex v2.0.2创建
英文:
If I've understood correctly, one potential solution is to use the powerjoin package to do a power_left_join()
and 'select' the value from x if there are any conflicts, e.g.
library(tidyverse)
# install.packages("powerjoin")
library(powerjoin)
set.seed(42)
A <- sample(1:100, 10)
B <- sample(1:100, 10)
C <- sample(1:100, 10)
year <- sample(1000:2000, 10)
y <- tibble(year, B, C)
x <- tibble(year, A, B)
# change the first "B" value in "x"
x$B[1] <- 1000
x %>%
power_left_join(y, by = "year", conflict = ~.x)
#> # A tibble: 10 × 4
#> year A C B
#> <int> <int> <int> <dbl>
#> 1 1313 49 34 1000
#> 2 1647 65 92 20
#> 3 1291 25 3 26
#> 4 1835 74 58 3
#> 5 1981 18 42 41
#> 6 1145 100 24 89
#> 7 1620 47 30 27
#> 8 1347 24 43 36
#> 9 1196 71 15 5
#> 10 1515 89 22 84
However, the alternative approach of dropping the ".y" columns and renaming the ".x" columns isn't that much 'harder' (this is what I personally use).
x %>%
left_join(y, by = "year") %>%
select(-ends_with(".y")) %>%
rename_with(~gsub(".x", "", .x))
#> # A tibble: 10 × 4
#> year A B C
#> <int> <int> <dbl> <int>
#> 1 1313 49 1000 34
#> 2 1647 65 20 92
#> 3 1291 25 26 3
#> 4 1835 74 3 58
#> 5 1981 18 41 42
#> 6 1145 100 89 24
#> 7 1620 47 27 30
#> 8 1347 24 36 43
#> 9 1196 71 5 15
#> 10 1515 89 84 22
<sup>Created on 2023-07-06 with reprex v2.0.2</sup>
答案2
得分: 1
我们可以使用 setdiff
识别新列名(我使用 base::setdiff
来避免可能的冲突,如果有人加载了 lubridate
)。然后使用年份和来自 y
的这些列进行连接。
new_cols <- base::setdiff(colnames(y), colnames(x))
left_join(x, y[,c("year", new_cols)])
连接依据为 by = join_by(year)
。
英文:
We could identify the new column names with setdiff
(I'm using base::setdiff
to avoid the potential conflict if someone has loaded lubridate
.), and then join using year + those columns from y
.
new_cols <- base::setdiff(colnames(y), colnames(x))
left_join(x, y[,c("year", new_cols)])
Joining with `by = join_by(year)`
# A tibble: 10 × 4
year A B C
<int> <int> <int> <int>
1 1336 8 100 66
2 1643 1 57 76
3 1896 58 24 15
4 1965 65 66 69
5 1361 74 18 43
6 1526 92 38 78
7 1870 36 8 68
8 1151 15 86 92
9 1748 35 16 19
10 1624 82 30 42
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论