2023-05-20 22:47:45

How to retain the rows of a data frame that have similar values in two different columns in R

Question

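Hello! To keep only the rows where estimation1 and estimation2 are very similar, you can use the following R code:

# Absolute difference between estimation1 and estimation2
abs_diff <- abs(df$estimation1 - df$estimation2)
# Keep the rows whose difference is below a threshold
threshold <- 0.01  # adjust the threshold as needed
result <- df[abs_diff < threshold, c("ID", "estimation1", "estimation2")]
# Print the result
print(result)

This code computes the absolute difference between estimation1 and estimation2, keeps only the rows whose difference is below the specified threshold, and prints the result.

Hope this helps!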
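
As a hedged sketch, the same threshold filter can be wrapped in a small helper that leaves df untouched. The function name keep_similar and its tol argument are hypothetical (they are not part of base R), and the data frame is the nine-row sample from the question, rebuilt here so the snippet runs on its own.

```r
# Sample data from the question (nine rows, three per ID)
df <- data.frame(
  ID = rep(c("A", "B", "C"), each = 3),
  estimation1 = rep(c(0.0234, -0.005, 0.10), each = 3),
  estimation2 = c(0.022, 3, 0.034, -1.89, 0.03, -0.0052, -0.00067, -0.98, 0.11)
)

# Hypothetical helper: keep rows whose two estimates differ by less than `tol`
keep_similar <- function(data, tol = 0.01) {
  abs_diff <- abs(data$estimation1 - data$estimation2)
  data[abs_diff < tol, , drop = FALSE]
}

result <- keep_similar(df, tol = 0.005)
print(result)
```

With tol = 0.005 only the closest A and B rows pass; the best C row differs by about 0.01 and is dropped, so the result depends heavily on the chosen threshold.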

Original question:

I have the following dataframe:

ID  estimation1  estimation2
A   0.0234       0.0220
A   0.0234       3
A   0.0234       0.034
B   -0.005       -1.89
B   -0.005       0.03
B   -0.005       -0.0052
C   0.10         -0.00067
C   0.10         -0.98
C   0.10         0.11

df <- structure(list(
  ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  estimation1 = c(0.0234, 0.0234, 0.0234, -0.005, -0.005, -0.005, 0.10, 0.10, 0.10),
  estimation2 = c(0.022, 3, 0.034, -1.89, 0.03, -0.0052, -0.00067, -0.98, 0.11)
), class = "data.frame", row.names = c(NA, -9L))

I would like to retain only the rows in which estimation1 and estimation2 are quite similar, in this case the closest row for each ID, with the following output:

ID  estimation1  estimation2
A   0.0234       0.0220
B   -0.005       -0.0052
C   0.10         0.11

Is there a function in R that can do something like this?
Thank you very much!

Answer 1

Score: 2

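Update, after clarification:

One general approach is to group by ID, compute the absolute difference between the two columns, and keep the row with the smallest difference in each group:

library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(diff = abs(estimation2 - estimation1)) %>%
  filter(diff == min(diff)) %>%
  select(-diff)

  ID    estimation1 estimation2
  <chr>       <dbl>       <dbl>
1 A          0.0234      0.022
2 B         -0.005      -0.0052
3 C          0.1         0.11

First answer:
With base R, we can subset by specifying the "similarity" threshold (here 0.02):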

df[abs(df$estimation1 - df$estimation2) < 0.02, ]
  ID estimation1 estimation2
1  A      0.0234      0.0220
3  A      0.0234      0.0340
6  B     -0.0050     -0.0052
9  C      0.1000      0.1100

or with dplyr:

library(dplyr)
df %>% filter(abs(estimation1 - estimation2) < 0.02)
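
A fixed threshold does not guarantee one row per ID. As a minimal sketch on the nine-row sample data from the question (rebuilt here so the snippet runs on its own), counting the rows that pass the 0.02 cutoff per ID shows why the grouped minimum may be the safer choice. The names passes and kept_per_id are illustrative only.

```r
# Sample data from the question (nine rows, three per ID)
df <- data.frame(
  ID = rep(c("A", "B", "C"), each = 3),
  estimation1 = rep(c(0.0234, -0.005, 0.10), each = 3),
  estimation2 = c(0.022, 3, 0.034, -1.89, 0.03, -0.0052, -0.00067, -0.98, 0.11)
)

# Logical flag per row: is the absolute difference below the cutoff?
passes <- abs(df$estimation1 - df$estimation2) < 0.02

# How many rows survive for each ID
kept_per_id <- table(df$ID[passes])
print(kept_per_id)
```

ID A keeps two rows here, because both 0.022 and 0.034 lie within 0.02 of 0.0234, whereas B and C keep one row each.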

Answer 2

Score: 1

I guess you want to keep, for each ID, the row where the two estimations are closest, that is, the row with the smallest absolute difference (the Euclidean distance in one dimension). The base R code below might be one option:

subset(
    df,
    as.logical(
        ave(
            abs(estimation1 - estimation2),
            ID,
            FUN = \(x) seq_along(x) == which.min(x)
        )
    )
)

which gives

  ID estimation1 estimation2
1  A      0.0234      0.0220
6  B     -0.0050     -0.0052
9  C      0.1000      0.1100
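
To see what the base R call is doing, the sketch below (an illustration, not the answerer's code) splits it into steps on the sample data, rebuilt here so it runs on its own. ave() applies the function within each ID and returns a vector as long as its input; because that input is numeric, the logical flags come back as 0/1 numbers, which is why as.logical() is needed before subsetting. The names d, flag, and closest are illustrative.

```r
# Sample data from the question (nine rows, three per ID)
df <- data.frame(
  ID = rep(c("A", "B", "C"), each = 3),
  estimation1 = rep(c(0.0234, -0.005, 0.10), each = 3),
  estimation2 = c(0.022, 3, 0.034, -1.89, 0.03, -0.0052, -0.00067, -0.98, 0.11)
)

# Step 1: absolute difference for every row
d <- abs(df$estimation1 - df$estimation2)

# Step 2: within each ID, flag the position of the smallest difference
flag <- ave(d, df$ID, FUN = function(x) seq_along(x) == which.min(x))

# Step 3: convert the 0/1 flags back to logicals and subset
closest <- df[as.logical(flag), ]

print(flag)
print(closest)
```

Since which.min() returns only the first minimum, this approach keeps exactly one row per ID even when two rows are equally close.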

If you use dplyr, you can try slice_min:

library(dplyr)
df %>%
    group_by(ID) %>%
    slice_min(abs(estimation2 - estimation1)) %>%
    ungroup()

which gives:

# A tibble: 3 × 3
  ID    estimation1 estimation2
  <chr>       <dbl>       <dbl>
1 A          0.0234      0.022
2 B         -0.005      -0.0052
3 C          0.1         0.11
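
One detail worth checking is tie handling. In dplyr 1.0.0 and later, slice_min() keeps all tied rows by default (with_ties = TRUE), and setting with_ties = FALSE keeps a single row per group. The toy data frame below is made up purely to force a tie and is not part of the question; the names ties, with_ties_kept, and single_row are illustrative.

```r
library(dplyr)

# Made-up data: both rows are exactly 0.5 away from estimation1
ties <- data.frame(
  ID = c("A", "A"),
  estimation1 = c(1, 1),
  estimation2 = c(1.5, 0.5)
)

# Default behaviour: every row tied for the minimum is returned
with_ties_kept <- ties %>%
  group_by(ID) %>%
  slice_min(abs(estimation2 - estimation1)) %>%
  ungroup()

# with_ties = FALSE returns one row per group
single_row <- ties %>%
  group_by(ID) %>%
  slice_min(abs(estimation2 - estimation1), with_ties = FALSE) %>%
  ungroup()

print(nrow(with_ties_kept))
print(nrow(single_row))
```

If exactly one row per ID is required, with_ties = FALSE is probably the setting to use.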
