R: Creating and Labelling Groups by Increments

Question

I am working with the R programming language.

I have the following data:

    set.seed(123)
    my_data = data.frame(var1 = rnorm(100, 100, 100))
    min = min(my_data$var1)
    max = max(my_data$var1)

Here is what I am trying to do:

  • Starting from the smallest value of var1, I would like to create a variable that groups the values of var1 by some "fixed increment" (e.g. by 10) until the maximum value of var1 is reached.
  • Then, I would like to create another variable that labels each of these groups with the min/max value of that group.

Here is my attempt to do this:

    # create a vector of increments
    breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)
    # initialize new variables
    my_data$class <- NA
    my_data$label <- NA
    # get the number of breaks
    n <- length(breaks)
    # loop
    for (i in 1:(n - 1)) {
      # find which "class" (i.e. break) each value of var1 is located within
      indices <- which(my_data$var1 > breaks[i] & my_data$var1 <= breaks[i + 1])
      # make assignment
      my_data$class[indices] <- i
      # create labels
      my_data$label[indices] <- paste(breaks[i], breaks[i + 1])
    }

The code seems to have run, but I don't think I have done this correctly, because I see some NAs in the result.

Can someone please show me how to do this correctly?

Thanks!

Answer 1

Score: 1

This could be done with a non-equi join:

    library(data.table)
    my_data1 <- copy(my_data)
    setDT(my_data1)[
      data.table(start = breaks,
                 end = shift(breaks, type = "lead", fill = last(breaks))),
      c("indices", "label") := .(.GRP, paste(start, end)),
      on = .(var1 > start, var1 <= end), by = .EACHI]

Output:

    > head(my_data1)
            var1 indices                             label
    1:  43.95244      18 39.0831124359188 49.0831124359188
    2:  76.98225      21 69.0831124359188 79.0831124359188
    3: 255.87083      39 249.083112435919 259.083112435919
    4: 107.05084      24 99.0831124359188 109.083112435919
    5: 112.92877      25 109.083112435919 119.083112435919
    6: 271.50650      41 269.083112435919 279.083112435919

Compare this with the result of the OP's for loop:

    > head(my_data)
           var1 class                             label
    1  43.95244    18 39.0831124359188 49.0831124359188
    2  76.98225    21 69.0831124359188 79.0831124359188
    3 255.87083    39 249.083112435919 259.083112435919
    4 107.05084    24 99.0831124359188 109.083112435919
    5 112.92877    25 109.083112435919 119.083112435919
    6 271.50650    41 269.083112435919 279.083112435919

The NAs in the output are a consequence of the breaks returned by seq:

    > breaks
     [1] -130.9168876 -120.9168876 -110.9168876 -100.9168876  -90.9168876  -80.9168876  -70.9168876  -60.9168876  -50.9168876  -40.9168876  -30.9168876
    [12]  -20.9168876  -10.9168876   -0.9168876    9.0831124   19.0831124   29.0831124   39.0831124   49.0831124   59.0831124   69.0831124   79.0831124
    [23]   89.0831124   99.0831124  109.0831124  119.0831124  129.0831124  139.0831124  149.0831124  159.0831124  169.0831124  179.0831124  189.0831124
    [34]  199.0831124  209.0831124  219.0831124  229.0831124  239.0831124  249.0831124  259.0831124  269.0831124  279.0831124  289.0831124  299.0831124
    [45]  309.0831124

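Note that the first break is exactly the minimum of var1, so the strict comparison var1 > -130.9168876 returns FALSE for the minimum itself and that row is never assigned; the lower bound should instead be var1 >= -130.9168876. In addition, the last break is 309.083, so any value of var1 above it falls into no interval. To correct this, we can append max to the end of the breaks and then take unique (in case max already coincides with the last break):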

    breaks <- unique(c(seq(min, max, by = 10), max))
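
As a quick check of the explanation above, the following base R sketch lists the rows that the original breaks (before the correction) leave unassigned. It rebuilds the question's data from scratch; the names last_break and uncovered are introduced here for illustration only and do not appear in the original post.

```r
# Rebuild the question's data and the original (uncorrected) breaks.
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))
breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)

# The loop's intervals (breaks[i], breaks[i + 1]] together cover exactly
# the range (breaks[1], last_break], so anything outside it stays NA.
last_break <- breaks[length(breaks)]
uncovered <- my_data$var1 <= breaks[1] | my_data$var1 > last_break

# Rows the original loop cannot assign: the minimum itself, plus any
# value larger than the last break.
my_data[uncovered, , drop = FALSE]
```

The minimum always appears in this listing, because seq starts exactly at min(my_data$var1) and the loop tests var1 > breaks[i] with a strict inequality.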

Now we run the same join with the corrected breaks, using >= for the lower bound:

    > setDT(my_data1)[data.table(start = breaks, end = shift(breaks,
    +   type = "lead", fill = last(breaks))), c("indices", "label") := .(.GRP, paste(start, end)),
    +   on = .(var1 >= start, var1 <= end), by = .EACHI]
    >
    > head(my_data1)
            var1 indices                             label
    1:  43.95244      18 39.0831124359188 49.0831124359188
    2:  76.98225      21 69.0831124359188 79.0831124359188
    3: 255.87083      39 249.083112435919 259.083112435919
    4: 107.05084      24 99.0831124359188 109.083112435919
    5: 112.92877      25 109.083112435919 119.083112435919
    6: 271.50650      41 269.083112435919 279.083112435919
    > head(my_data)
           var1 class                             label
    1  43.95244    18 39.0831124359188 49.0831124359188
    2  76.98225    21 69.0831124359188 79.0831124359188
    3 255.87083    39 249.083112435919 259.083112435919
    4 107.05084    24 99.0831124359188 109.083112435919
    5 112.92877    25 109.083112435919 119.083112435919
    6 271.50650    41 269.083112435919 279.083112435919
    > my_data1[is.na(indices)]
    Empty data.table (0 rows and 3 cols): var1,indices,label
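
For readers who prefer to stay in base R, the same binning can probably be expressed with cut(). This is a different technique from the non-equi join shown in this answer, offered only as a minimal sketch; the names lo, hi and grp are illustrative and not from the original post.

```r
# Rebuild the question's data.
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))

lo <- min(my_data$var1)
hi <- max(my_data$var1)

# Append the maximum so the last interval reaches it; unique() drops the
# duplicate if the maximum already falls exactly on a break.
breaks <- unique(c(seq(lo, hi, by = 10), hi))

# right = TRUE gives intervals of the form (start, end];
# include.lowest = TRUE additionally closes the first interval on the
# left, so the minimum itself is binned too.
grp <- cut(my_data$var1, breaks = breaks, include.lowest = TRUE, right = TRUE)

# Integer group index, and a "start end" label built from the breaks.
my_data$class <- as.integer(grp)
my_data$label <- paste(breaks[my_data$class], breaks[my_data$class + 1])

# Check whether any row was left unassigned.
anyNA(my_data$class)
```

The labels are built with paste(start, end) to mirror the format used in the outputs above; the factor returned by cut() carries its own interval labels (levels(grp)) if that notation is preferred.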

huangapple
  • Posted on 2023-01-09 04:45:57
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75051154.html