Filter rows based on conditions in R

Question

I need to filter rows from a data frame based on multiple conditions.
Within each ID group, rows where No is neither NA nor 0 take priority, and among those I want the row with the latest Date.

Here is my data:

df <- data.frame(ID = rep(c(479,147,643,228,501), c(3,3,1,2,2)),
           No = c("A009145","A009146","0","A146754",NA,"A156764",NA,NA,NA,NA,NA),
           Date = c("2012-06-28","2020-01-10","2020-01-10","2014-11-28",NA,NA,"2019-04-13","2017-10-27","2017-12-01",NA,NA),
           point = c(25.2,31.1,31.1,21.5,18.6,18.6,27.77,18.52,18.52,26.3,26.3))

    ID        No        Date   point
1  479   A009145  2012-06-28    25.2
2  479   A009146  2020-01-10    31.1
3  479         0  2020-01-10    31.1
4  147   A146754  2014-11-28    21.5
5  147        NA          NA    18.6
6  147   A156764          NA    18.6
7  643        NA  2019-04-13    27.77
8  228        NA  2017-10-27    18.52
9  228        NA  2017-12-01    18.52
10 501        NA          NA    26.3
11 501        NA          NA    26.3

I want to get a result like this:

    ID        No        Date   point
2  479   A009146  2020-01-10    31.1
4  147   A146754  2014-11-28    21.5
7  643        NA  2019-04-13    27.77
9  228        NA  2017-12-01    18.52
10 501        NA          NA    26.3
11 501        NA          NA    26.3

For example, in ID 479:

    ID        No        Date   point
1  479   A009145  2012-06-28    25.2 (remove because Date is not the latest)
2  479   A009146  2020-01-10    31.1
3  479         0  2020-01-10    31.1 (remove because No is 0)

For example, in ID 228, where No is NA in both rows, keep the row with the latest Date:

    ID        No        Date   point
8  228        NA  2017-10-27    18.52 (remove because Date is not the latest)
9  228        NA  2017-12-01    18.52

For example, in ID 501, both No and Date are NA in every row, so keep both rows.

Here is the code I have tried:

df %>%
  group_by(ID) %>%
  filter(!(No == "0" | is.na(No))) %>%
  slice(which.max(as.Date(Date, "%Y-%m-%d")))

But I got:

# A tibble: 2 × 4
# Groups:   ID [2]
     ID No      Date       point
  <dbl> <chr>   <chr>      <dbl>
1   147 A146754 2014-11-28  21.5
2   479 A009146 2020-01-10  31.1

Any help will be appreciated.

Answer 1

Score: 2

With dplyr, you can use slice_max(), ordering by the following two variables:

  1. !(is.na(No) | No == '0')
  2. Date

Note that to order by multiple variables, you need to wrap them in a data frame or tibble.

library(dplyr)

# The by argument of slice_max() requires dplyr 1.1.0 or later
df %>%
  slice_max(order_by = tibble(!(is.na(No) | No == '0'), Date),
            by = ID)

#     ID      No       Date point
# 2  479 A009146 2020-01-10 31.10
# 4  147 A146754 2014-11-28 21.50
# 7  643    <NA> 2019-04-13 27.77
# 9  228    <NA> 2017-12-01 18.52
# 10 501    <NA>       <NA> 26.30
# 11 501    <NA>       <NA> 26.30
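
For readers who prefer to see the two ordering keys applied one at a time, the same selection can be sketched with plain filter() calls. This is only an illustration under stated assumptions, not the answer's method: the helper is_latest() and the temporary column valid are hypothetical names introduced here, and the sample data is copied from the question.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  ID = rep(c(479, 147, 643, 228, 501), c(3, 3, 1, 2, 2)),
  No = c("A009145", "A009146", "0", "A146754", NA, "A156764", NA, NA, NA, NA, NA),
  Date = c("2012-06-28", "2020-01-10", "2020-01-10", "2014-11-28", NA, NA,
           "2019-04-13", "2017-10-27", "2017-12-01", NA, NA),
  point = c(25.2, 31.1, 31.1, 21.5, 18.6, 18.6, 27.77, 18.52, 18.52, 26.3, 26.3)
)

# Hypothetical helper (not part of dplyr): TRUE for rows holding the latest
# non-missing date; if every date is missing, TRUE for all rows.
is_latest <- function(d) {
  if (all(is.na(d))) {
    rep(TRUE, length(d))
  } else {
    !is.na(d) & d == max(d, na.rm = TRUE)
  }
}

result <- df %>%
  # Flag rows whose No is neither NA nor "0"
  mutate(valid = !(is.na(No) | No == "0")) %>%
  group_by(ID) %>%
  # Step 1: if a group has any valid No, keep only its valid rows;
  # otherwise keep every row of the group
  filter(valid == any(valid)) %>%
  # Step 2: among the remaining rows, keep the latest Date
  # (a separate filter() so the maximum is taken after step 1)
  filter(is_latest(as.Date(Date))) %>%
  ungroup() %>%
  select(-valid)

print(result)
```

On the sample data this keeps rows 2, 4, 7, 9, 10 and 11, matching the desired result in the question. The two filter() calls are kept separate on purpose: combining them in one call would compute the latest Date over all rows of a group, including rows with an invalid No.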

Answer 2

Score: -1

Using the dplyr package from the tidyverse:

library(dplyr)

# Drop rows where No is NA or "0"
df %>%
  dplyr::filter(!is.na(No)) %>%
  dplyr::filter(No != "0")

You can also put multiple conditions in a single filter() call, separated by commas; they are combined with AND.

More information is in the dplyr documentation: https://dplyr.tidyverse.org

I learnt this from the following resource, which is useful for working with spreadsheets in R: https://nacnudus.github.io/spreadsheet-munging-strategies/
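
As a small sketch of the comma form described in this answer, the snippet below compares comma-separated conditions with an explicit & on a four-row subset of the question's data. The subset and the names with_commas and with_and are chosen here for illustration only.

```r
library(dplyr)

# A small subset of the question's data (IDs 479 and 643)
df <- data.frame(
  ID = c(479, 479, 479, 643),
  No = c("A009145", "A009146", "0", NA),
  Date = c("2012-06-28", "2020-01-10", "2020-01-10", "2019-04-13")
)

# Comma-separated conditions inside one filter() call
with_commas <- df %>% filter(!is.na(No), No != "0")

# The same conditions joined explicitly with &
with_and <- df %>% filter(!is.na(No) & No != "0")

print(with_commas)
print(isTRUE(all.equal(with_commas, with_and)))
```

Both forms keep the same two rows of ID 479. Note that ID 643 disappears entirely, because its only row has a missing No; a filter of this kind on its own therefore does not keep groups without a valid No, which the question asks to retain.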

huangapple
  • Published on July 13, 2023 at 16:15:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76677276.html