Filter rows based on conditions in R

Question

I need to filter rows of a data frame based on multiple conditions.
After grouping by ID, I want to keep the rows where No is neither NA nor "0", and among those, the row with the latest Date.

Here is my data:

    df <- data.frame(ID = rep(c(479,147,643,228,501), c(3,3,1,2,2)),
                     No = c("A009145","A009146","0","A146754",NA,"A156764",NA,NA,NA,NA,NA),
                     Date = c("2012-06-28","2020-01-10","2020-01-10","2014-11-28",NA,NA,"2019-04-13","2017-10-27","2017-12-01",NA,NA),
                     point = c(25.2,31.1,31.1,21.5,18.6,18.6,27.77,18.52,18.52,26.3,26.3))

        ID      No       Date point
    1  479 A009145 2012-06-28  25.2
    2  479 A009146 2020-01-10  31.1
    3  479       0 2020-01-10  31.1
    4  147 A146754 2014-11-28  21.5
    5  147      NA         NA  18.6
    6  147 A156764         NA  18.6
    7  643      NA 2019-04-13 27.77
    8  228      NA 2017-10-27 18.52
    9  228      NA 2017-12-01 18.52
    10 501      NA         NA  26.3
    11 501      NA         NA  26.3

I want to get a result like this:

        ID      No       Date point
    2  479 A009146 2020-01-10  31.1
    4  147 A146754 2014-11-28  21.5
    7  643      NA 2019-04-13 27.77
    9  228      NA 2017-12-01 18.52
    10 501      NA         NA  26.3
    11 501      NA         NA  26.3

For example, in ID 479:

       ID      No       Date point
    1 479 A009145 2012-06-28  25.2   (remove because Date is not the latest)
    2 479 A009146 2020-01-10  31.1
    3 479       0 2020-01-10  31.1   (remove because No is 0)

In ID 228, both values of No are NA, so the row with the latest Date is kept:

       ID No       Date point
    8 228 NA 2017-10-27 18.52   (remove because Date is not the latest)
    9 228 NA 2017-12-01 18.52

In ID 501, No and Date are both NA in every row, so both rows are kept.

Here is the code I have tried:

    library(dplyr)

    df %>%
      group_by(ID) %>%
      filter(!(No == "0" | is.na(No))) %>%
      slice(which.max(as.Date(Date, "%Y-%m-%d")))

But I got this:

    # A tibble: 2 × 4
    # Groups:   ID [2]
         ID No      Date       point
      <dbl> <chr>   <chr>      <dbl>
    1   147 A146754 2014-11-28  21.5
    2   479 A009146 2020-01-10  31.1

Any help will be appreciated.

Answer 1

Score: 2

With dplyr, you can use slice_max() ordered by the following two variables:

  1. !(is.na(No) | No == '0')
  2. Date

Note that to order by multiple variables, you need to wrap them in a data frame or tibble.

    library(dplyr)

    # the `by` argument requires dplyr >= 1.1.0
    df %>%
      slice_max(order_by = tibble(!(is.na(No) | No == '0'), Date),
                by = ID)

    #     ID      No       Date point
    # 2  479 A009146 2020-01-10 31.10
    # 4  147 A146754 2014-11-28 21.50
    # 7  643    <NA> 2019-04-13 27.77
    # 9  228    <NA> 2017-12-01 18.52
    # 10 501    <NA>       <NA> 26.30
    # 11 501    <NA>       <NA> 26.30

Answer 2

Score: -1

Using dplyr from the tidyverse:

    # filter by conditions
    df %>%
      dplyr::filter(!is.na(No)) %>%
      dplyr::filter(No != "0")

You can also put multiple conditions in a single filter() call, separated by commas. They are combined with AND.

More info is in the dplyr documentation: https://dplyr.tidyverse.org

I learnt this from a resource that is useful for working with spreadsheets in R: https://nacnudus.github.io/spreadsheet-munging-strategies/
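
To illustrate the point about commas, the sketch below checks that chained filter() calls, comma-separated conditions, and an explicit `&` select the same rows on the question's data. The object names `chained`, `combined`, and `with_and` are made up for this example. It also shows that filtering on No alone removes every group whose No values are all NA, so this approach does not by itself produce the result the question asks for.

```r
library(dplyr)

df <- data.frame(ID = rep(c(479,147,643,228,501), c(3,3,1,2,2)),
                 No = c("A009145","A009146","0","A146754",NA,"A156764",NA,NA,NA,NA,NA),
                 Date = c("2012-06-28","2020-01-10","2020-01-10","2014-11-28",NA,NA,
                          "2019-04-13","2017-10-27","2017-12-01",NA,NA),
                 point = c(25.2,31.1,31.1,21.5,18.6,18.6,27.77,18.52,18.52,26.3,26.3),
                 stringsAsFactors = FALSE)

# Three equivalent ways to require both conditions
chained  <- df %>% filter(!is.na(No)) %>% filter(No != "0")
combined <- df %>% filter(!is.na(No), No != "0")
with_and <- df %>% filter(!is.na(No) & No != "0")

all.equal(chained, combined)
all.equal(combined, with_and)

# Only IDs 479 and 147 have a usable No, so IDs 643, 228 and 501 are gone
unique(combined$ID)
```

This matches the two-group output reported in the question: once the rows with a missing No are filtered out, there is nothing left in the other groups for a later step to pick the latest Date from.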

Posted by huangapple on 2023-07-13 16:15:35
When reposting, please keep the link to this article: https://go.coder-hub.com/76677276.html