英文:
Return anti-join of two data frames with values outside a certain percentage difference
问题
You can achieve this by using the dplyr
package in R and writing a custom function for the percentage-based anti-join. Here's the code to perform the desired operation:
library(dplyr)
# Custom anti-join function with percentage difference
antijoin_function <- function(tbl1, tbl2, by, pct) {
tbl1 %>%
anti_join(tbl2, by = by) %>%
filter(if_any(starts_with("var"), ~is.numeric(.x) || is.character(.x)) |
if_all(starts_with("var"), ~is.numeric(.x) || is.character(.x) || (.x %in% tbl2[[.y]] * (1 + pct) | .x %in% tbl2[[.y]] * (1 - pct))))
}
# Define the data frames
tbl1 <- tibble(var1 = c('r1', 'r2', 'r3', 'r4', 'r5'),
var2 = c('apple', 'orange', 'banana', 'strawberry', 'lime'),
var3 = c(1, 2, 3, 4, 5),
var4 = c('yes', 'no', 'yes', 'yes', 'no'))
tbl2 <- tibble(var1 = c('r6', 'r7', 'r8', 'r9', 'r10'),
var2 = c('orange', 'banana', 'apple', 'lemon', 'strawberry'),
var3 = c(2, 3, 1.5, 10, 4.1),
var4 = c('no', 'yes', 'yes', 'no', 'yes'))
# Use the custom anti-join function
result <- antijoin_function(tbl1, tbl2, by = c('var2' = 'var2', 'var3' = 'var3', 'var4' = 'var4'), pct = 0.2)
result
This code defines the custom antijoin_function
that performs the anti-join operation with a percentage difference for numeric columns. It filters rows based on the specified percentage difference and returns the desired result.
英文:
I would like to compare two mixed-type data frames and return the rows that are different between them--but I would like numeric values to only be returned within a certain percentage.
tbl1 <- tibble(var1 = c('r1', 'r2', 'r3', 'r4', 'r5'),
var2 = c('apple', 'orange', 'banana', 'strawberry', 'lime'),
var3 = c(1, 2, 3, 4, 5),
var4 = c('yes', 'no', 'yes', 'yes', 'no'))
tbl2 <- tibble(var1 = c('r6', 'r7', 'r8', 'r9', 'r10'),
var2 = c('orange', 'banana', 'apple', 'lemon', 'strawberry'),
var3 = c(2, 3, 1.5, 10, 4.1),
var4 = c('no', 'yes', 'yes', 'no', 'yes'))
I know there is dplyr::anti_join
but that checks for exact matches. So if I was OK with numeric values that were within 20%, then the function would be something like:
tbl1 %>%
antijoin_function(tbl2, by = c('var2' = 'var2', 'var3' = 'var3', 'var4' = 'var4'),
pct = 0.2)
And return
var1 | var2 | var3 | var4 |
---|---|---|---|
r1 | apple | 1 | yes |
r5 | lime | 5 | no |
The row with strawberry
would not be returned because the single difference in var3
is less than 20%.
Are there any functions or packages that do this?
答案1
得分: 1
library(dplyr)
使用full_join函数将tbl1和tbl2按"var2"列连接,添加后缀为""和".right"。
然后使用filter函数,筛选满足条件abs(var3 - var3.right)/var3 > 0.2 | if_all(contains(".right"), ~ is.na(.))的行。
最后使用select函数,移除包含".right"的列。
#> # A tibble: 2 × 4
#> var1 var2 var3 var4
#> <chr> <chr> <dbl> <chr>
#> 1 r1 apple 1 yes
#> 2 r5 lime 5 no
创建于2023-05-22,使用reprex v2.0.2
<details>
<summary>英文:</summary>
``` r
library(dplyr)
full_join(tbl1, tbl2, by = c("var2" = "var2"), suffix = c("", ".right")) %>%
filter(abs(var3 - var3.right)/var3 > 0.2 | if_all(contains(".right"), ~ is.na(.))) %>%
select(-contains(".right"))
#> # A tibble: 2 × 4
#> var1 var2 var3 var4
#> <chr> <chr> <dbl> <chr>
#> 1 r1 apple 1 yes
#> 2 r5 lime 5 no
<sup>Created on 2023-05-22 with reprex v2.0.2</sup>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论