How to retain rows in a data frame with similar values in two different columns in R
Question
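Hello! To keep the rows in which estimation1 and estimation2 are very similar, you can use the following R code:
# Compute the absolute difference between estimation1 and estimation2
df$diff <- abs(df$estimation1 - df$estimation2)
# Select the rows whose difference is below a threshold
threshold <- 0.01 # adjust the threshold as needed
result <- df[df$diff < threshold, c("ID", "estimation1", "estimation2")]
# Remove the helper column from the original data frame
df$diff <- NULL
# Print the result
print(result)
This code computes the difference between estimation1 and estimation2, keeps only the rows whose difference is below the specified threshold, and prints the result.
I hope this helps!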
I have the following dataframe:
ID estimation1 estimation2
A 0.0234 0.0220
A 0.0234 3
A 0.0234 0.034
B -0.005 -1.89
B -0.005 0.03
B -0.005 -0.0052
C 0.10 -0.00067
C 0.10 -0.98
C 0.10 0.11
df <- structure(list(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
estimation1 = c(0.0234, 0.0234, 0.0234, -0.005, -0.005, -0.005, 0.10, 0.10, 0.10),
estimation2 = c(0.022, 3, 0.034, -1.89, 0.03, -0.0052, -0.00067, -0.98, 0.11)),
class = "data.frame", row.names = c(NA, -9L))
I would like to retain, for each ID, only the row in which estimation1 and estimation2 are most similar (for ID A, only the first row), giving the following output:
ID estimation1 estimation2
A 0.0234 0.0220
B -0.005 -0.0052
C 0.10 0.11
Is there a function in R that can do something like this?
Thank you very much!
Answer 1
Score: 2
Update, after clarification:
One general approach is to group by ID, compute the absolute difference between the two estimates, and keep the row with the smallest difference in each group:
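library(dplyr)
df %>%
  group_by(ID) %>%
  # absolute difference between the two estimates
  mutate(diff = abs(estimation2 - estimation1)) %>%
  # keep the row(s) with the smallest difference in each group
  filter(diff == min(diff)) %>%
  select(-diff)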
ID estimation1 estimation2
<chr> <dbl> <dbl>
1 A 0.0234 0.022
2 B -0.005 -0.0052
3 C 0.1 0.11
First answer:
With base R we could subset by specifying a "similarity" threshold (here 0.02):
df[abs(df$estimation1 - df$estimation2) < 0.02, ]
ID estimation1 estimation2
1 A 0.0234 0.0220
3 A 0.0234 0.0340
6 B -0.0050 -0.0052
9 C 0.1000 0.1100
or with dplyr:
library(dplyr)
df %>% filter(abs(estimation1 - estimation2) < 0.02)
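The two ideas in this answer can be combined. A fixed threshold alone can keep several rows per ID (or none), while the per-group minimum alone keeps a row even when the best match is still far off. The base R sketch below keeps the closest row per ID only if it also falls within a tolerance. The helper name closest_within and its tol argument are made up for illustration and are not part of any package; the data frame is rebuilt from the values in the question.

```r
# Data from the question, rebuilt so the example is self-contained
df <- data.frame(
  ID = rep(c("A", "B", "C"), each = 3),
  estimation1 = rep(c(0.0234, -0.005, 0.10), each = 3),
  estimation2 = c(0.022, 3, 0.034, -1.89, 0.03, -0.0052, -0.00067, -0.98, 0.11)
)

# Hypothetical helper: closest row per ID, kept only if within `tol`
closest_within <- function(data, tol) {
  d <- abs(data$estimation1 - data$estimation2)
  # TRUE for the first row with the smallest difference in each ID group
  is_closest <- as.logical(
    ave(d, data$ID, FUN = function(x) seq_along(x) == which.min(x))
  )
  data[is_closest & d < tol, ]
}

res <- closest_within(df, tol = 0.02)
res
```

With a tolerance of 0.02 this should keep one row for each of A, B and C; a stricter tolerance such as 0.005 would drop C, whose best match differs by about 0.01.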
Answer 2
Score: 1
I guess you meant to use the Euclidean distance (here simply the absolute difference) to keep the "closest" pair of estimates within each ID. A base R option might be:
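subset(
  df,
  as.logical(
    ave(
      abs(estimation1 - estimation2),
      ID,
      # \(x) is the lambda shorthand available from R 4.1.0;
      # flags the first row with the smallest difference in each ID group
      FUN = \(x) seq_along(x) == which.min(x)
    )
  )
)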
which gives
ID estimation1 estimation2
1 A 0.0234 0.0220
6 B -0.0050 -0.0052
9 C 0.1000 0.1100
If you use dplyr, you can try slice_min:
library(dplyr)
df %>%
  group_by(ID) %>%
  slice_min(abs(estimation2 - estimation1)) %>%
  ungroup()
which gives
# A tibble: 3 × 3
ID estimation1 estimation2
<chr> <dbl> <dbl>
1 A 0.0234 0.022
2 B -0.005 -0.0052
3 C 0.1 0.11
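One difference between the approaches is worth knowing: filter(diff == min(diff)) keeps every row tied for the minimum, and slice_min() does the same because its with_ties argument defaults to TRUE, whereas which.min() returns only the first minimum. The base R sketch below illustrates this on a small made-up data frame (ties_df is hypothetical and not taken from the question), with values chosen so that the differences are exactly representable.

```r
# Hypothetical data with a tie inside group "A" (not from the question)
ties_df <- data.frame(
  ID = c("A", "A", "B"),
  estimation1 = c(1, 1, 5),
  estimation2 = c(1.5, 0.5, 5.25)
)
d <- abs(ties_df$estimation1 - ties_df$estimation2)

# Equality with the group minimum keeps all tied rows
keep_all_ties <- d == ave(d, ties_df$ID, FUN = min)

# which.min() keeps only the first minimal row per group
keep_first <- as.logical(
  ave(d, ties_df$ID, FUN = function(x) seq_along(x) == which.min(x))
)

ties_df[keep_all_ties, ]  # both "A" rows and the "B" row
ties_df[keep_first, ]     # one row per ID
```

If exactly one row per ID is required with dplyr, passing with_ties = FALSE to slice_min() is one way to get it.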