Filter rows based on conditions in R
Question
I need to filter rows from a data frame based on multiple conditions.
The conditions apply after grouping by ID: No should not be NA or "0", and among the remaining rows, Date should be the latest.
Here is my data:
df <- data.frame(ID = rep(c(479,147,643,228,501), c(3,3,1,2,2)),
           No = c("A009145","A009146","0","A146754",NA,"A156764",NA,NA,NA,NA,NA),
           Date = c("2012-06-28","2020-01-10","2020-01-10","2014-11-28",NA,NA,"2019-04-13","2017-10-27","2017-12-01",NA,NA),
           point = c(25.2,31.1,31.1,21.5,18.6,18.6,27.77,18.52,18.52,26.3,26.3))
    ID        No        Date   point
1  479   A009145  2012-06-28    25.2
2  479   A009146  2020-01-10    31.1
3  479         0  2020-01-10    31.1
4  147   A146754  2014-11-28    21.5
5  147        NA          NA    18.6
6  147   A156764          NA    18.6
7  643        NA  2019-04-13    27.77
8  228        NA  2017-10-27    18.52
9  228        NA  2017-12-01    18.52
10 501        NA          NA    26.3
11 501        NA          NA    26.3
I want to get a result like this:
    ID        No        Date   point
2  479   A009146  2020-01-10    31.1
4  147   A146754  2014-11-28    21.5
7  643        NA  2019-04-13    27.77
9  228        NA  2017-12-01    18.52
10 501        NA          NA    26.3
11 501        NA          NA    26.3
For example, in ID 479:
    ID        No        Date   point
1  479   A009145  2012-06-28    25.2 (remove because Date is not the latest)
2  479   A009146  2020-01-10    31.1
3  479         0  2020-01-10    31.1 (remove because No is 0)
For example, in ID 228, where No is NA in both rows, the row with the latest Date is kept.
    ID        No        Date   point
8  228        NA  2017-10-27    18.52 (remove because Date is not the latest)
9  228        NA  2017-12-01    18.52
For example, in ID 501, both No and Date are NA in every row, so both rows are kept.
Here is the code I have tried:
df %>%
  group_by(ID) %>%
  filter(!(No == "0" | is.na(No))) %>%
  slice(which.max(as.Date(Date, "%Y-%m-%d")))
But I got:
# A tibble: 2 × 4
# Groups:   ID [2]
     ID No      Date       point
  <dbl> <chr>   <chr>      <dbl>
1   147 A146754 2014-11-28  21.5
2   479 A009146 2020-01-10  31.1
Any help would be appreciated.
Answer 1
Score: 2
With dplyr, you can use slice_max() ordered by the following two variables, in this order: !(is.na(No) | No == '0'), then Date.
Note that to order by multiple variables, you need to wrap them in a data frame or tibble.
library(dplyr)
df %>%
  slice_max(order_by = tibble(!(is.na(No) | No == '0'), Date),
                  by = ID)
#     ID      No       Date point
# 2  479 A009146 2020-01-10 31.10
# 4  147 A146754 2014-11-28 21.50
# 7  643    <NA> 2019-04-13 27.77
# 9  228    <NA> 2017-12-01 18.52
# 10 501    <NA>       <NA> 26.30
# 11 501    <NA>       <NA> 26.30
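For readers who prefer to see the two-level rule spelled out, here is a minimal sketch of the same logic written as two grouped filter() steps. It is an alternative formulation, not the answer's method, and it does not rely on the by argument of slice_max(), which needs dplyr 1.1.0 or later. The helper columns valid_no and date are names made up for this illustration, and the sketch assumes Date is always either NA or an ISO "YYYY-MM-DD" string.

```r
library(dplyr)

# Data frame from the question
df <- data.frame(
  ID = rep(c(479, 147, 643, 228, 501), c(3, 3, 1, 2, 2)),
  No = c("A009145", "A009146", "0", "A146754", NA, "A156764", NA, NA, NA, NA, NA),
  Date = c("2012-06-28", "2020-01-10", "2020-01-10", "2014-11-28", NA, NA,
           "2019-04-13", "2017-10-27", "2017-12-01", NA, NA),
  point = c(25.2, 31.1, 31.1, 21.5, 18.6, 18.6, 27.77, 18.52, 18.52, 26.3, 26.3)
)

result <- df %>%
  # Hypothetical helper columns: a validity flag for No and a parsed date
  mutate(valid_no = !(is.na(No) | No == "0"),
         date = as.Date(Date)) %>%
  group_by(ID) %>%
  # Step 1: if the group has any valid No, keep only those rows;
  # if it has none, any(valid_no) is FALSE and every row is kept
  filter(valid_no == any(valid_no)) %>%
  # Step 2: keep the latest date; if all dates are NA, keep every row
  filter(if (all(is.na(date))) rep(TRUE, n())
         else !is.na(date) & date == max(date, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-valid_no, -date)

print(result)
```

On the question's data this should keep the same six rows (original rows 2, 4, 7, 9, 10 and 11). It also shows why the attempt in the question loses IDs 643, 228 and 501: an unconditional filter on No removes every row of a group whose No values are all NA.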
Answer 2
Score: -1
Using the dplyr library from the tidyverse, filter by condition:
# df is the data frame from the question
df %>%
  dplyr::filter(!is.na(No)) %>%
  dplyr::filter(No != "0")
You can also put multiple conditions in a single filter(), separated by commas; they are combined with AND.
More info is in the [dplyr library][1] documentation.
It is also useful for working with spreadsheets in R, which is where I learnt this [2].
  [1]: https://dplyr.tidyverse.org
  [2]: https://nacnudus.github.io/spreadsheet-munging-strategies/
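As a small illustration of the comma form mentioned in this answer, the sketch below applies both conditions in one filter() call to the data frame from the question. The name kept is made up for this example.

```r
library(dplyr)

# Data frame from the question
df <- data.frame(
  ID = rep(c(479, 147, 643, 228, 501), c(3, 3, 1, 2, 2)),
  No = c("A009145", "A009146", "0", "A146754", NA, "A156764", NA, NA, NA, NA, NA),
  Date = c("2012-06-28", "2020-01-10", "2020-01-10", "2014-11-28", NA, NA,
           "2019-04-13", "2017-10-27", "2017-12-01", NA, NA),
  point = c(25.2, 31.1, 31.1, 21.5, 18.6, 18.6, 27.77, 18.52, 18.52, 26.3, 26.3)
)

# Conditions separated by commas are combined with AND
kept <- df %>%
  filter(!is.na(No), No != "0")

print(kept)
```

Note that this only drops the rows with a missing or "0" No. On the question's data it leaves rows for IDs 479 and 147 only, so it does not by itself keep the groups whose No values are all NA or pick the latest Date.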