R: Creating and Labelling Groups by Increments
Question
I am working with the R programming language.
I have the following data:
set.seed(123)
my_data = data.frame(var1 = rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)
Here is what I am trying to do:
- Starting from the smallest value of var1, I would like to create a variable that groups the values of var1 by some "fixed increment" (e.g. by 10) until the maximum value of var1 is reached.
- Then, I would like to create another variable that labels each of these groups by the min/max value of that group.
Here is my attempt to do this:
# create a vector of increments
breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)
# initialize new variables
my_data$class <- NA
my_data$label <- NA
# get the number of breaks
n <- length(breaks)
# Loop
for (i in 1:(n - 1)) {
# find which "class" (i.e. break) each value of var1 is located within
indices <- which(my_data$var1 > breaks[i] & my_data$var1 <= breaks[i + 1])
# make assignment
my_data$class[indices] <- i
# create labels
my_data$label[indices] <- paste(breaks[i], breaks[i + 1])
}
The code seems to run, but I am not sure it is correct, because I see some NAs in the result.
Can someone please show me how to do this correctly?
Thanks!
Answer 1
Score: 1
This can be done with a non-equi join:
library(data.table)
my_data1 <- copy(my_data)
setDT(my_data1)[data.table(start = breaks, end = shift(breaks,
type = "lead", fill = last(breaks))), c("indices", "label") := .(.GRP, paste(start, end)),
on = .(var1 > start, var1 <= end), by = .EACHI]
Output:
> head(my_data1)
var1 indices label
1: 43.95244 18 39.0831124359188 49.0831124359188
2: 76.98225 21 69.0831124359188 79.0831124359188
3: 255.87083 39 249.083112435919 259.083112435919
4: 107.05084 24 99.0831124359188 109.083112435919
5: 112.92877 25 109.083112435919 119.083112435919
6: 271.50650 41 269.083112435919 279.083112435919
Compare it with the OP's for loop:
> head(my_data)
var1 class label
1 43.95244 18 39.0831124359188 49.0831124359188
2 76.98225 21 69.0831124359188 79.0831124359188
3 255.87083 39 249.083112435919 259.083112435919
4 107.05084 24 99.0831124359188 109.083112435919
5 112.92877 25 109.083112435919 119.083112435919
6 271.50650 41 269.083112435919 279.083112435919
Regarding the NAs in the output, they are a result of the seq output:
> breaks
[1] -130.9168876 -120.9168876 -110.9168876 -100.9168876 -90.9168876 -80.9168876 -70.9168876 -60.9168876 -50.9168876 -40.9168876 -30.9168876
[12] -20.9168876 -10.9168876 -0.9168876 9.0831124 19.0831124 29.0831124 39.0831124 49.0831124 59.0831124 69.0831124 79.0831124
[23] 89.0831124 99.0831124 109.0831124 119.0831124 129.0831124 139.0831124 149.0831124 159.0831124 169.0831124 179.0831124 189.0831124
[34] 199.0831124 209.0831124 219.0831124 229.0831124 239.0831124 249.0831124 259.0831124 269.0831124 279.0831124 289.0831124 299.0831124
[45] 309.0831124
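Note that the first break equals the minimum of var1 (-130.9168876), so the condition var1 > -130.9168876 returns FALSE for the row holding exactly that value; the lower bound should instead be inclusive, i.e. var1 >= -130.9168876. In addition, the largest break is 309.083, and because seq stops at the last value that does not exceed its upper limit, this can fall short of the maximum of var1, leaving the top values outside every interval. To correct this, we can concatenate max at the end of the breaks and then take unique (in case there are duplicates):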
breaks <- unique(c(seq(min, max, by = 10), max))
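To see which rows the original breaks leave unmatched, a small check along these lines may help. It is only a sketch: the names lo, hi, old_breaks, at_minimum, above_last and unmatched are made up for illustration (lo and hi are used so that the base functions min and max are not masked), and it rebuilds the question's data from scratch.

```r
# Rebuild the question's data and its original breaks
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))
lo <- min(my_data$var1)
hi <- max(my_data$var1)
old_breaks <- seq(lo, hi, by = 10)

# Rows that no (start, end] interval can capture:
at_minimum <- my_data$var1 == lo              # fails var1 > old_breaks[1]
above_last <- my_data$var1 > max(old_breaks)  # lies beyond the last break
unmatched  <- at_minimum | above_last

# These are the values that end up as NA in the loop
my_data$var1[unmatched]
```

The rows flagged here should be the same rows that show NA in class and label after the for loop from the question.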
Now we rerun the join with the extended breaks and an inclusive lower bound:
> setDT(my_data1)[data.table(start = breaks, end = shift(breaks,
+ type = "lead", fill = last(breaks))), c("indices", "label") := .(.GRP, paste(start, end)),
+ on = .(var1 >= start, var1 <= end), by = .EACHI]
>
> head(my_data1)
var1 indices label
1: 43.95244 18 39.0831124359188 49.0831124359188
2: 76.98225 21 69.0831124359188 79.0831124359188
3: 255.87083 39 249.083112435919 259.083112435919
4: 107.05084 24 99.0831124359188 109.083112435919
5: 112.92877 25 109.083112435919 119.083112435919
6: 271.50650 41 269.083112435919 279.083112435919
> head(my_data)
var1 class label
1 43.95244 18 39.0831124359188 49.0831124359188
2 76.98225 21 69.0831124359188 79.0831124359188
3 255.87083 39 249.083112435919 259.083112435919
4 107.05084 24 99.0831124359188 109.083112435919
5 112.92877 25 109.083112435919 119.083112435919
6 271.50650 41 269.083112435919 279.083112435919
> my_data1[is.na(indices)]
Empty data.table (0 rows and 3 cols): var1,indices,label
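As a possible alternative that needs no extra package, the same grouping can be sketched in base R with cut. This is not the approach used above, only an illustration under the same assumptions (fixed increment of 10, maximum appended to the breaks); lo and hi are hypothetical names chosen to avoid masking min and max.

```r
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))

lo <- min(my_data$var1)
hi <- max(my_data$var1)

# Fixed-width breaks, with the maximum appended so the top values are covered
breaks <- unique(c(seq(lo, hi, by = 10), hi))

# include.lowest = TRUE closes the first interval on the left:
# [b1, b2], (b2, b3], (b3, b4], ...
my_data$class <- cut(my_data$var1, breaks = breaks,
                     include.lowest = TRUE, labels = FALSE)

# Label each row with the lower and upper bound of its interval
my_data$label <- paste(breaks[my_data$class], breaks[my_data$class + 1])

# Should be 0 if every row received a group
sum(is.na(my_data$class))
```

One difference worth keeping in mind: with cut, a value that sits exactly on an interior break belongs to exactly one interval, whereas a join on var1 >= start and var1 <= end would match such a value against two adjacent intervals.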