Return anti-join of two data frames with values outside a certain percentage difference

Question

One way to do this in R is to write a custom function for the percentage-based anti-join. The version below compares each row of tbl1 against every row of tbl2: numeric key columns match when the tbl2 value is within pct of the tbl1 value, and all other key columns must match exactly.

library(dplyr)

# Custom anti-join: return the rows of tbl1 that have no approximate match in tbl2.
# `by` is a named character vector: names are tbl1 columns, values are tbl2 columns.
antijoin_function <- function(tbl1, tbl2, by, pct) {
  x_cols <- names(by)
  y_cols <- unname(by)
  has_match <- vapply(seq_len(nrow(tbl1)), function(i) {
    ok <- rep(TRUE, nrow(tbl2))
    for (k in seq_along(by)) {
      x <- tbl1[[x_cols[k]]][i]
      y <- tbl2[[y_cols[k]]]
      # Numeric keys match within pct of the tbl1 value; other keys must be equal
      ok <- ok & (if (is.numeric(x)) abs(y - x) <= pct * abs(x) else y == x)
    }
    any(ok, na.rm = TRUE)
  }, logical(1))
  tbl1[!has_match, ]
}

# Define the data frames
tbl1 <- tibble(var1 = c('r1', 'r2', 'r3', 'r4', 'r5'),
               var2 = c('apple', 'orange', 'banana', 'strawberry', 'lime'),
               var3 = c(1, 2, 3, 4, 5),
               var4 = c('yes', 'no', 'yes', 'yes', 'no'))

tbl2 <- tibble(var1 = c('r6', 'r7', 'r8', 'r9', 'r10'),
               var2 = c('orange', 'banana', 'apple', 'lemon', 'strawberry'),
               var3 = c(2, 3, 1.5, 10, 4.1),
               var4 = c('no', 'yes', 'yes', 'no', 'yes'))

# Use the custom anti-join function
result <- antijoin_function(tbl1, tbl2, by = c('var2' = 'var2', 'var3' = 'var3', 'var4' = 'var4'), pct = 0.2)
result

antijoin_function returns the rows of tbl1 that have no counterpart in tbl2, where a counterpart must agree exactly on the non-numeric key columns and lie within the given percentage on the numeric ones. For the data above it returns rows r1 (apple) and r5 (lime). Every row of tbl1 is checked against every row of tbl2, so this is meant for small tables.

Original question:

I would like to compare two mixed-type data frames and return the rows that differ between them, but numeric values should count as different only when they differ by more than a certain percentage.

tbl1 <- tibble(var1 = c('r1', 'r2', 'r3', 'r4', 'r5'),
               var2 = c('apple', 'orange', 'banana', 'strawberry', 'lime'),
               var3 = c(1, 2, 3, 4, 5),
               var4 = c('yes', 'no', 'yes', 'yes', 'no'))

tbl2 <- tibble(var1 = c('r6', 'r7', 'r8', 'r9', 'r10'),
               var2 = c('orange', 'banana', 'apple', 'lemon', 'strawberry'),
               var3 = c(2, 3, 1.5, 10, 4.1),
               var4 = c('no', 'yes', 'yes', 'no', 'yes'))

I know there is dplyr::anti_join but that checks for exact matches. So if I was OK with numeric values that were within 20%, then the function would be something like:

tbl1 %>%
  antijoin_function(tbl2, by = c('var2' = 'var2', 'var3' = 'var3', 'var4' = 'var4'),
                    pct = 0.2)

And return

var1 var2 var3 var4
r1 apple 1 yes
r5 lime 5 no

The row with strawberry would not be returned because the single difference in var3 is less than 20%.
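
To make the 20% threshold concrete, here is a minimal sketch of the relative-difference arithmetic for the apple and strawberry rows. It assumes the tbl1 value is the baseline, which the question does not state explicitly; the helper name rel_diff is made up for illustration.

```r
# Hypothetical helper: relative difference of y from the baseline x
rel_diff <- function(x, y) abs(x - y) / abs(x)

pct <- 0.2

# apple: var3 is 1 in tbl1 and 1.5 in tbl2
apple_diff <- rel_diff(1, 1.5)        # 0.5, above the threshold
# strawberry: var3 is 4 in tbl1 and 4.1 in tbl2
strawberry_diff <- rel_diff(4, 4.1)   # about 0.025, below the threshold

apple_diff > pct        # the apple row counts as different
strawberry_diff > pct   # the strawberry row does not
```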

Are there any functions or packages that do this?

Answer 1

Score: 1

Join tbl1 and tbl2 on var2 with full_join, giving the tbl2 columns the suffix ".right". Then use filter to keep the rows where var3 differs by more than 20% or where every ".right" column is NA, and use select to drop the ".right" columns.

``` r
library(dplyr)

# Pair rows on var2; rows of tbl1 without a partner get NA in the ".right" columns
full_join(tbl1, tbl2, by = c("var2" = "var2"), suffix = c("", ".right")) %>% 
  # Keep rows whose var3 differs by more than 20%, or that have no partner in tbl2
  filter(abs(var3 - var3.right)/var3 > 0.2 | if_all(contains(".right"), ~ is.na(.))) %>% 
  select(-contains(".right"))

#> # A tibble: 2 × 4
#>   var1  var2   var3 var4 
#>   <chr> <chr> <dbl> <chr>
#> 1 r1    apple     1 yes  
#> 2 r5    lime      5 no
```

<sup>Created on 2023-05-22 with reprex v2.0.2</sup>
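
The answer above joins on var2 only, so a mismatch in var4 is never checked. As a hedged sketch rather than part of the original answer, the same idea can be extended to treat a differing var4 as a difference too. The names pct and result are chosen here for illustration, and the data frames are repeated from the question so the block runs on its own. It assumes each var2 value appears at most once in tbl2; with duplicates, a tbl1 row could be kept because of one non-matching partner even though another partner matches.

```r
library(dplyr)

tbl1 <- tibble(var1 = c("r1", "r2", "r3", "r4", "r5"),
               var2 = c("apple", "orange", "banana", "strawberry", "lime"),
               var3 = c(1, 2, 3, 4, 5),
               var4 = c("yes", "no", "yes", "yes", "no"))

tbl2 <- tibble(var1 = c("r6", "r7", "r8", "r9", "r10"),
               var2 = c("orange", "banana", "apple", "lemon", "strawberry"),
               var3 = c(2, 3, 1.5, 10, 4.1),
               var4 = c("no", "yes", "yes", "no", "yes"))

pct <- 0.2  # tolerated relative difference for the numeric column

result <- tbl1 %>%
  # left_join keeps every tbl1 row; unmatched rows get NA in the ".right" columns
  left_join(tbl2, by = "var2", suffix = c("", ".right")) %>%
  filter(
    is.na(var1.right) |                          # no row in tbl2 with this var2
      var4 != var4.right |                       # character column differs
      abs(var3 - var3.right) / abs(var3) > pct   # numeric column differs by more than pct
  ) %>%
  select(-ends_with(".right"))

result
```

For the example data this keeps the apple row (var3 differs by 50%) and the lime row (no partner in tbl2), and drops the strawberry row because 4 and 4.1 differ by only 2.5%.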

huangapple
  • Published on 2023-05-23 01:39:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76308695.html