2023年7月6日 11:01:41go评论98阅读模式

英文:

left join + anti join in R

问题

有没有一种方法可以执行left_join，排除匹配项？

A <- sample(1:100, 10)
B <- sample(1:100, 10)
C <- sample(1:100, 10)
year <- sample(1000:2000, 10)
x <- tibble(year, A, B) %>%
  mutate(B = B * 12)
y <- tibble(year, B, C)

我想要的是 x + 来自 y 的列，这些列不在 x 中。

编辑：所有数据集的年份都相同，但我已经对 x 中的 B 列进行了转换，使其不再与集合 y 中的列匹配。

我希望最终得到一个框架：x(year, A, B, C)，其中的 B 是最初在 x 中的 B 列。

我尝试过使用 coalesce(x, y)，但没有成功，有人有关于这个的线索吗？
我是不是在尝试做一些非常愚蠢的事情或者漏掉了一些细节？

我知道我可以简单地从 y 中删除我不想要的列，或者只需通过= "year" 进行连接，然后删除带有 .y 后缀的列，但希望有一种更好的方法来实现这一点！

英文:

Is there a way to perform a left_join that excludes matches?

A&lt;-sample(1:100, 10)
B&lt;-sample(1:100, 10)
C&lt;-sample(1:100, 10)
year&lt;-sample(1000:2000,10)
x&lt;-tibble(year,A,B)%&gt;%
  mutate(B=B*12)
y&lt;-tibble(year,B,C)

I want x + the cols from y that are not in x

EDIT: Year is the same across all the datasets, but I have transformed the B column in x so that it no longer matches the one in set y.

I want to finish with a frame: x(year,A,B,C) where the B is the one that was in x originally.

I tried with coalesce(x,y) but to no avail, anyone have any clues about this?
Am I trying to do something really stupid or missing some detail?

I know that I could just drop the columns I don''t want from y, or that I could just join by = "year" and then drop the ones with the .y suffix, but hoping that there a nicer way to accomplish this!

答案1

得分: 2

如果我理解正确，一个潜在的解决方案是使用powerjoin包执行power_left_join()，如果存在冲突则从x中进行选择，例如：

library(tidyverse)
# install.packages("powerjoin")
library(powerjoin)
set.seed(42)
A <- sample(1:100, 10)
B <- sample(1:100, 10)
C <- sample(1:100, 10)
year <- sample(1000:2000, 10)
y <- tibble(year, B, C)
x <- tibble(year, A, B)
# 修改"x"中的第一个"B"值
x$B[1] <- 1000
x %>%
  power_left_join(y, by = "year", conflict = ~.x)
#> # A tibble: 10 × 4
#>     year     A     C     B
#>    <int> <int> <int> <dbl>
#>  1  1313    49    34  1000
#>  2  1647    65    92    20
#>  3  1291    25     3    26
#>  4  1835    74    58     3
#>  5  1981    18    42    41
#>  6  1145   100    24    89
#>  7  1620    47    30    27
#>  8  1347    24    43    36
#>  9  1196    71    15     5
#> 10  1515    89    22    84

然而，另一种方法是放弃“.y”列并将“.x”列重命名，这并不更加“困难”（这是我个人使用的方法）。

x %>%
  left_join(y, by = "year") %>%
  select(-ends_with(".y")) %>%
  rename_with(~gsub(".x", "", .x))
#> # A tibble: 10 × 4
#>     year     A     B     C
#>    <int> <int> <dbl> <int>
#>  1  1313    49  1000    34
#>  2  1647    65    20    92
#>  3  1291    25    26     3
#>  4  1835    74     3    58
#>  5  1981    18    41    42
#>  6  1145   100    89    24
#>  7  1620    47    27    30
#>  8  1347    24    36    43
#>  9  1196    71     5    15
#> 10  1515    89    84    22

^{在2023-07-06使用reprex v2.0.2创建}

英文:

If I've understood correctly, one potential solution is to use the powerjoin package to do a power_left_join() and 'select' the value from x if there are any conflicts, e.g.

library(tidyverse)
# install.packages(&quot;powerjoin&quot;)
library(powerjoin)
set.seed(42)
A &lt;- sample(1:100, 10)
B &lt;- sample(1:100, 10)
C &lt;- sample(1:100, 10)
year &lt;- sample(1000:2000, 10)
y &lt;- tibble(year, B, C)
x &lt;- tibble(year, A, B)
# change the first &quot;B&quot; value in &quot;x&quot;
x$B[1] &lt;- 1000
x %&gt;%
  power_left_join(y, by = &quot;year&quot;, conflict = ~.x)
#&gt; # A tibble: 10 &#215; 4
#&gt;     year     A     C     B
#&gt;    &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt;  1  1313    49    34  1000
#&gt;  2  1647    65    92    20
#&gt;  3  1291    25     3    26
#&gt;  4  1835    74    58     3
#&gt;  5  1981    18    42    41
#&gt;  6  1145   100    24    89
#&gt;  7  1620    47    30    27
#&gt;  8  1347    24    43    36
#&gt;  9  1196    71    15     5
#&gt; 10  1515    89    22    84

However, the alternative approach of dropping the ".y" columns and renaming the ".x" columns isn't that much 'harder' (this is what I personally use).

x %&gt;%
  left_join(y, by = &quot;year&quot;) %&gt;%
  select(-ends_with(&quot;.y&quot;)) %&gt;%
  rename_with(~gsub(&quot;.x&quot;, &quot;&quot;, .x))
#&gt; # A tibble: 10 &#215; 4
#&gt;     year     A     B     C
#&gt;    &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt;  1  1313    49  1000    34
#&gt;  2  1647    65    20    92
#&gt;  3  1291    25    26     3
#&gt;  4  1835    74     3    58
#&gt;  5  1981    18    41    42
#&gt;  6  1145   100    89    24
#&gt;  7  1620    47    27    30
#&gt;  8  1347    24    36    43
#&gt;  9  1196    71     5    15
#&gt; 10  1515    89    84    22

<sup>Created on 2023-07-06 with reprex v2.0.2</sup>

答案2

得分: 1

我们可以使用 setdiff 识别新列名（我使用 base::setdiff 来避免可能的冲突，如果有人加载了 lubridate）。然后使用年份和来自 y 的这些列进行连接。

new_cols <- base::setdiff(colnames(y), colnames(x))
left_join(x, y[,c("year", new_cols)])

连接依据为 by = join_by(year)。

英文:

We could identify the new column names with setdiff (I'm using base::setdiff to avoid the potential conflict if someone has loaded lubridate.), and then join using year + those columns from y.

new_cols &lt;- base::setdiff(colnames(y), colnames(x))
left_join(x, y[,c(&quot;year&quot;, new_cols)])
Joining with `by = join_by(year)`
# A tibble: 10 &#215; 4
    year     A     B     C
   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
 1  1336     8   100    66
 2  1643     1    57    76
 3  1896    58    24    15
 4  1965    65    66    69
 5  1361    74    18    43
 6  1526    92    38    78
 7  1870    36     8    68
 8  1151    15    86    92
 9  1748    35    16    19
10  1624    82    30    42

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

left join + anti join 在 R 中的用法

问题

答案1

答案2

使用rvest从网页上提取表格中的唯一ID。

Go Golang：合并排序 Stack Overflow

在R数据框中如何只分割列中的特定因子？

阻止 Visual Studio Code 中的 R 代码自动完成建议 “%>%” 时出现 “%in%”。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。