Values falling into different ranges in R

Question

I have a grid grd of different ranges like this one:

    > grd
      count treshold
    1     1     0.01
    2     2     0.02
    3     3     0.05
    4     4     0.10
    5     5     0.20

and a dataframe df like this one:

    > df
      param    name
    1 0.124     Tim
    2 0.011    John
    3 0.002    Alex
    4 0.023 Jessica
    5 0.056    Rose

I would like to use grd$treshold to add another column to the dataframe, df$bucket, reporting which range each value in df$param falls into.

For instance, the first value of param, 0.124, is higher than the threshold 0.10, so it falls into count 5. The second one, 0.011, is between 0.01 and 0.02, so it falls into count 2, and so on.

This is the final result:

    > df
      param    name bucket
    1 0.124     Tim      5
    2 0.011    John      2
    3 0.002    Alex      1
    4 0.023 Jessica      3
    5 0.056    Rose      4

Answer 1

Score: 2

A base solution with findInterval():

    df$bucket <- findInterval(df$param, grd$treshold) + 1
    df$bucket
    # [1] 5 2 1 3 4
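
One detail worth checking is how values that sit exactly on a threshold are bucketed. The sketch below is only an illustration: the `edge` values are hypothetical and not part of the question's data, and it assumes the thresholds are sorted in increasing order, as findInterval() requires. It compares the default behaviour with `left.open = TRUE`.

```r
# Thresholds and sample values copied from the question.
thresholds <- c(0.01, 0.02, 0.05, 0.10, 0.20)
param <- c(0.124, 0.011, 0.002, 0.023, 0.056)

# Default: intervals are closed on the left, [t_i, t_{i+1}),
# so a value equal to a threshold moves up to the next bucket.
bucket_default <- findInterval(param, thresholds) + 1

# left.open = TRUE: intervals become (t_i, t_{i+1}],
# so a value equal to a threshold stays in the lower bucket.
bucket_left_open <- findInterval(param, thresholds, left.open = TRUE) + 1

# Hypothetical edge cases: on the first threshold, on the last one, above it.
edge <- c(0.01, 0.20, 0.30)
edge_default <- findInterval(edge, thresholds) + 1
edge_left_open <- findInterval(edge, thresholds, left.open = TRUE) + 1
```

On the question's data both variants agree, because no value coincides with a threshold. Note that anything above the last threshold lands in bucket 6 either way, so the grid may need a final upper bound if that case can occur.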

You can also use a rolling join with dplyr:

    library(dplyr)
    df %>%
      left_join(grd, by = join_by(closest(param < treshold))) %>%
      select(-treshold)
    #   param    name count
    # 1 0.124     Tim     5
    # 2 0.011    John     2
    # 3 0.002    Alex     1
    # 4 0.023 Jessica     3
    # 5 0.056    Rose     4
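
The rolling join pairs each row with the smallest treshold that is strictly greater than param (join_by() needs dplyr 1.1.0 or later). For readers without dplyr, the same matching rule can be sketched in base R. This is an illustrative equivalent rather than what dplyr does internally; the helper name `first_greater` is made up for the example, and grd is assumed to be sorted by treshold.

```r
# Data rebuilt from the question.
grd <- data.frame(count = 1:5,
                  treshold = c(0.01, 0.02, 0.05, 0.10, 0.20))
df <- data.frame(param = c(0.124, 0.011, 0.002, 0.023, 0.056),
                 name = c("Tim", "John", "Alex", "Jessica", "Rose"))

# Hypothetical helper: count of the first threshold strictly greater
# than p, or NA when p is at or above the largest threshold.
first_greater <- function(p) grd$count[which(grd$treshold > p)[1]]

df$bucket <- vapply(df$param, first_greater, integer(1))
```

Unlike findInterval() + 1, this returns NA for a value with no larger threshold, which matches what a left join yields for an unmatched row.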

Data

    grd <- read.table(text = "
      count treshold
    1     1     0.01
    2     2     0.02
    3     3     0.05
    4     4     0.10
    5     5     0.20")
    df <- read.table(text = "
      param    name
    1 0.124     Tim
    2 0.011    John
    3 0.002    Alex
    4 0.023 Jessica
    5 0.056    Rose")

Answer 2

Score: 0

Here is a possible solution using dplyr:

    library(dplyr)
    df <- df |>
      mutate(
        bucket = case_when(
          param <= 0.01 ~ 1,
          param <= 0.02 ~ 2,
          param <= 0.05 ~ 3,
          param <= 0.10 ~ 4,
          param <= 0.20 ~ 5
        )
      )

As far as I understand your question, the final result you shared is not correct (row 2). If I misunderstood, you can easily adjust the thresholds in case_when().
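
To avoid hard-coding the thresholds, the same chain of `<=` tests can be approximated in base R with cut(), reading the breaks from grd$treshold. The sketch below is an assumption-laden illustration, not the answerer's code: the column names `bucket_cut` and `bucket_fi` are invented for the comparison, and grd is assumed to be sorted by treshold.

```r
# Data rebuilt from the question.
grd <- data.frame(count = 1:5,
                  treshold = c(0.01, 0.02, 0.05, 0.10, 0.20))
df <- data.frame(param = c(0.124, 0.011, 0.002, 0.023, 0.056),
                 name = c("Tim", "John", "Alex", "Jessica", "Rose"))

# Right-closed intervals (-Inf, 0.01], (0.01, 0.02], ..., (0.10, 0.20],
# mirroring the `<=` chain; values above 0.20 become NA.
df$bucket_cut <- cut(df$param, breaks = c(-Inf, grd$treshold),
                     labels = FALSE, right = TRUE)

# The findInterval() approach from Answer 1, for comparison.
df$bucket_fi <- findInterval(df$param, grd$treshold) + 1

# TRUE when both approaches assign the same bucket to every row.
agree <- all(df$bucket_cut == df$bucket_fi)
```

On the sample data the two columns come out identical. They would differ only for a value exactly equal to a threshold, or for one above 0.20, where cut() gives NA and findInterval() + 1 gives 6.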

huangapple
  • Posted on March 7, 2023 at 21:38:04
  • When reposting, please keep this link: https://go.coder-hub.com/75662703.html