left join + anti join 在 R 中的用法

huangapple go评论68阅读模式
英文:

left join + anti join in R

问题

有没有一种方法可以执行left_join,排除匹配项?

A <- sample(1:100, 10)
B <- sample(1:100, 10)
C <- sample(1:100, 10)
year <- sample(1000:2000, 10)

x <- tibble(year, A, B) %>%
  mutate(B = B * 12)
y <- tibble(year, B, C)

我想要的是 x + 来自 y 的列,这些列不在 x 中

编辑:所有数据集的年份都相同,但我已经对 x 中的 B 列进行了转换,使其不再与集合 y 中的列匹配。

我希望最终得到一个框架:x(year, A, B, C),其中的 B 是最初在 x 中的 B 列。

我尝试过使用 coalesce(x, y),但没有成功,有人有关于这个的线索吗?
我是不是在尝试做一些非常愚蠢的事情或者漏掉了一些细节?

我知道我可以简单地从 y 中删除我不想要的列,或者只需通过= "year" 进行连接,然后删除带有 .y 后缀的列,但希望有一种更好的方法来实现这一点!

英文:

Is there a way to perform a left_join that excludes matches?

A&lt;-sample(1:100, 10)
B&lt;-sample(1:100, 10)
C&lt;-sample(1:100, 10)
year&lt;-sample(1000:2000,10)

x&lt;-tibble(year,A,B)%&gt;%
  mutate(B=B*12)
y&lt;-tibble(year,B,C)

I want x + the cols from y that are not in x

EDIT: Year is the same across all the datasets, but I have transformed the B column in x so that it no longer matches the one in set y.

I want to finish with a frame: x(year,A,B,C) where the B is the one that was in x originally.

I tried with coalesce(x,y) but to no avail, anyone have any clues about this?
Am I trying to do something really stupid or missing some detail?

I know that I could just drop the columns I don''t want from y, or that I could just join by = "year" and then drop the ones with the .y suffix, but hoping that there a nicer way to accomplish this!

答案1

得分: 2

如果我理解正确,一个潜在的解决方案是使用powerjoin包执行power_left_join(),如果存在冲突则从x中进行选择,例如:

library(tidyverse)
# install.packages("powerjoin")
library(powerjoin)

set.seed(42)

A <- sample(1:100, 10)
B <- sample(1:100, 10)
C <- sample(1:100, 10)
year <- sample(1000:2000, 10)

y <- tibble(year, B, C)
x <- tibble(year, A, B)

# 修改"x"中的第一个"B"值
x$B[1] <- 1000

x %>%
  power_left_join(y, by = "year", conflict = ~.x)
#> # A tibble: 10 × 4
#>     year     A     C     B
#>    <int> <int> <int> <dbl>
#>  1  1313    49    34  1000
#>  2  1647    65    92    20
#>  3  1291    25     3    26
#>  4  1835    74    58     3
#>  5  1981    18    42    41
#>  6  1145   100    24    89
#>  7  1620    47    30    27
#>  8  1347    24    43    36
#>  9  1196    71    15     5
#> 10  1515    89    22    84

然而,另一种方法是放弃“.y”列并将“.x”列重命名,这并不更加“困难”(这是我个人使用的方法)。

x %>%
  left_join(y, by = "year") %>%
  select(-ends_with(".y")) %>%
  rename_with(~gsub(".x", "", .x))
#> # A tibble: 10 × 4
#>     year     A     B     C
#>    <int> <int> <dbl> <int>
#>  1  1313    49  1000    34
#>  2  1647    65    20    92
#>  3  1291    25    26     3
#>  4  1835    74     3    58
#>  5  1981    18    41    42
#>  6  1145   100    89    24
#>  7  1620    47    27    30
#>  8  1347    24    36    43
#>  9  1196    71     5    15
#> 10  1515    89    84    22

在2023-07-06使用reprex v2.0.2创建

英文:

If I've understood correctly, one potential solution is to use the powerjoin package to do a power_left_join() and 'select' the value from x if there are any conflicts, e.g.

library(tidyverse)
# install.packages(&quot;powerjoin&quot;)
library(powerjoin)

set.seed(42)

A &lt;- sample(1:100, 10)
B &lt;- sample(1:100, 10)
C &lt;- sample(1:100, 10)
year &lt;- sample(1000:2000, 10)

y &lt;- tibble(year, B, C)
x &lt;- tibble(year, A, B)

# change the first &quot;B&quot; value in &quot;x&quot;
x$B[1] &lt;- 1000

x %&gt;%
  power_left_join(y, by = &quot;year&quot;, conflict = ~.x)
#&gt; # A tibble: 10 &#215; 4
#&gt;     year     A     C     B
#&gt;    &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt;  1  1313    49    34  1000
#&gt;  2  1647    65    92    20
#&gt;  3  1291    25     3    26
#&gt;  4  1835    74    58     3
#&gt;  5  1981    18    42    41
#&gt;  6  1145   100    24    89
#&gt;  7  1620    47    30    27
#&gt;  8  1347    24    43    36
#&gt;  9  1196    71    15     5
#&gt; 10  1515    89    22    84

However, the alternative approach of dropping the ".y" columns and renaming the ".x" columns isn't that much 'harder' (this is what I personally use).

x %&gt;%
  left_join(y, by = &quot;year&quot;) %&gt;%
  select(-ends_with(&quot;.y&quot;)) %&gt;%
  rename_with(~gsub(&quot;.x&quot;, &quot;&quot;, .x))
#&gt; # A tibble: 10 &#215; 4
#&gt;     year     A     B     C
#&gt;    &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt;  1  1313    49  1000    34
#&gt;  2  1647    65    20    92
#&gt;  3  1291    25    26     3
#&gt;  4  1835    74     3    58
#&gt;  5  1981    18    41    42
#&gt;  6  1145   100    89    24
#&gt;  7  1620    47    27    30
#&gt;  8  1347    24    36    43
#&gt;  9  1196    71     5    15
#&gt; 10  1515    89    84    22

<sup>Created on 2023-07-06 with reprex v2.0.2</sup>

答案2

得分: 1

我们可以使用 setdiff 识别新列名(我使用 base::setdiff 来避免可能的冲突,如果有人加载了 lubridate)。然后使用年份和来自 y 的这些列进行连接。

new_cols <- base::setdiff(colnames(y), colnames(x))

left_join(x, y[,c("year", new_cols)])

连接依据为 by = join_by(year)

英文:

We could identify the new column names with setdiff (I'm using base::setdiff to avoid the potential conflict if someone has loaded lubridate.), and then join using year + those columns from y.

new_cols &lt;- base::setdiff(colnames(y), colnames(x))

left_join(x, y[,c(&quot;year&quot;, new_cols)])


Joining with `by = join_by(year)`
# A tibble: 10 &#215; 4
    year     A     B     C
   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
 1  1336     8   100    66
 2  1643     1    57    76
 3  1896    58    24    15
 4  1965    65    66    69
 5  1361    74    18    43
 6  1526    92    38    78
 7  1870    36     8    68
 8  1151    15    86    92
 9  1748    35    16    19
10  1624    82    30    42

huangapple
  • 本文由 发表于 2023年7月6日 11:01:41
  • 转载请务必保留本文链接:https://go.coder-hub.com/76625220.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定