Posted January 9, 2023, 04:45:57

R: Creating and Labelling Groups by Increments

Question

I am working with the R programming language.

I have the following data:

set.seed(123)
my_data = data.frame(var1 =  rnorm(100,100,100))
min = min(my_data$var1)
max = max(my_data$var1)

Here is what I am trying to do:

Starting from the smallest value of var1, I would like to create a variable that groups the values of var1 by some "fixed increment" (e.g. by 10) until the maximum value of var1 is reached.
Then, I would like to create another variable that labels each of these groups by the min/max value of that group.

Here is my attempt to do this:

# create a vector of increments
breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)
# initialize new variables
my_data$class <- NA
my_data$label <- NA
# get the number of breaks
n <- length(breaks)
# Loop
for (i in 1:(n - 1)) {
    # find which "class" (i.e. break) each value of var1 is located within
    indices <- which(my_data$var1 > breaks[i] & my_data$var1 <= breaks[i + 1])

    # make assignment
    my_data$class[indices] <- i

    # create labels
    my_data$label[indices] <- paste(breaks[i], breaks[i + 1])
}

The code seems to run, but I don't think it is correct, because I see some NAs in the result.

Can someone please show me how to do this correctly?

Thanks!

Answer 1

Score: 1

This can be done with a non-equi join in data.table:

library(data.table)
my_data1 <- copy(my_data)
setDT(my_data1)[data.table(start = breaks, end = shift(breaks, 
   type = "lead", fill = last(breaks))),  c("indices", "label") := .(.GRP, paste(start, end)), 
    on = .(var1 > start, var1 <= end), by = .EACHI]

Output:

> head(my_data1)
        var1 indices                             label
1:  43.95244      18 39.0831124359188 49.0831124359188
2:  76.98225      21 69.0831124359188 79.0831124359188
3: 255.87083      39 249.083112435919 259.083112435919
4: 107.05084      24 99.0831124359188 109.083112435919
5: 112.92877      25 109.083112435919 119.083112435919
6: 271.50650      41 269.083112435919 279.083112435919
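The table of intervals that the join above matches against can be looked at on its own. This is a minimal sketch on a made-up break vector (`breaks_demo` is a hypothetical name, not part of the original data), showing how `shift(type = "lead")` pairs each break with the next one:

```r
library(data.table)

# Hypothetical small break vector, used only for illustration
breaks_demo <- c(0, 10, 20, 30)

# Each row is one interval: start is a break, end is the following break.
# shift(type = "lead") moves the vector up by one position; the last row has
# no successor, so fill repeats the final break there (start == end).
intervals <- data.table(
  start = breaks_demo,
  end   = shift(breaks_demo, type = "lead", fill = last(breaks_demo))
)

print(intervals)
```

In the join, `by = .EACHI` evaluates the assignment once per row of this table, so `.GRP` is the row number of the interval and `paste(start, end)` is its label.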

Compare this with the result of the OP's for loop:

> head(my_data)
       var1 class                             label
1  43.95244    18 39.0831124359188 49.0831124359188
2  76.98225    21 69.0831124359188 79.0831124359188
3 255.87083    39 249.083112435919 259.083112435919
4 107.05084    24 99.0831124359188 109.083112435919
5 112.92877    25 109.083112435919 119.083112435919
6 271.50650    41 269.083112435919 279.083112435919

The NAs in the output come from how the breaks are generated by seq:

> breaks
 [1] -130.9168876 -120.9168876 -110.9168876 -100.9168876  -90.9168876  -80.9168876  -70.9168876  -60.9168876  -50.9168876  -40.9168876  -30.9168876
[12]  -20.9168876  -10.9168876   -0.9168876    9.0831124   19.0831124   29.0831124   39.0831124   49.0831124   59.0831124   69.0831124   79.0831124
[23]   89.0831124   99.0831124  109.0831124  119.0831124  129.0831124  139.0831124  149.0831124  159.0831124  169.0831124  179.0831124  189.0831124
[34]  199.0831124  209.0831124  219.0831124  229.0831124  239.0831124  249.0831124  259.0831124  269.0831124  279.0831124  289.0831124  299.0831124
[45]  309.0831124

Note that the last break is 309.083, which is below the maximum of var1, so values above it fall into no interval. In addition, var1 > -130.9168876 returns FALSE for the value that is exactly equal to the minimum; the condition for the lower bound should be var1 >= -130.9168876. To correct this, we can append max to the end of the breaks and then take unique (in case max is already the last break):

breaks <- unique(c(seq(min, max, by = 10), max))
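To see which rows the original, uncorrected breaks leave uncovered, a quick base R check may help. This is only a sketch: it rebuilds the data from the question, and the names `old_breaks`, `at_minimum` and `above_last` are chosen here for illustration.

```r
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))

# Breaks as built in the question: they start at the minimum and stop at the
# last multiple of 10 that does not exceed the maximum
old_breaks <- seq(min(my_data$var1), max(my_data$var1), by = 10)

# Rows that no interval (old_breaks[i], old_breaks[i + 1]] can cover:
at_minimum <- my_data$var1 == min(old_breaks)  # excluded by the strict ">"
above_last <- my_data$var1 > max(old_breaks)   # beyond the last break

# Number of affected rows, and the values themselves
sum(at_minimum)
sum(above_last)
my_data$var1[at_minimum | above_last]
```

These are the rows that end up as NA in the loop from the question, which is why the fix needs both the extra break at the maximum and an inclusive lower bound.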

Now, we run the same join with the corrected breaks and an inclusive lower bound:

> setDT(my_data1)[data.table(start = breaks, end = shift(breaks, 
+    type = "lead", fill = last(breaks))),  c("indices", "label") := .(.GRP, paste(start, end)), 
+     on = .(var1 >= start, var1 <= end), by = .EACHI]
> 
> head(my_data1)
        var1 indices                             label
1:  43.95244      18 39.0831124359188 49.0831124359188
2:  76.98225      21 69.0831124359188 79.0831124359188
3: 255.87083      39 249.083112435919 259.083112435919
4: 107.05084      24 99.0831124359188 109.083112435919
5: 112.92877      25 109.083112435919 119.083112435919
6: 271.50650      41 269.083112435919 279.083112435919
> head(my_data)
       var1 class                             label
1  43.95244    18 39.0831124359188 49.0831124359188
2  76.98225    21 69.0831124359188 79.0831124359188
3 255.87083    39 249.083112435919 259.083112435919
4 107.05084    24 99.0831124359188 109.083112435919
5 112.92877    25 109.083112435919 119.083112435919
6 271.50650    41 269.083112435919 279.083112435919
> my_data1[is.na(indices)]
Empty data.table (0 rows and 3 cols): var1,indices,label
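As a possible alternative that is not part of the answer above, the same grouping can be sketched in base R with `cut()`. The variable names `lo`, `hi` and `grp` are chosen here for illustration (and avoid shadowing the `min` and `max` functions); the breaks are built the same way as the corrected ones above.

```r
set.seed(123)
my_data <- data.frame(var1 = rnorm(100, 100, 100))

lo <- min(my_data$var1)
hi <- max(my_data$var1)

# Breaks every 10 units from the minimum, with the maximum appended so the
# last (possibly shorter) interval still covers the largest value
breaks <- unique(c(seq(lo, hi, by = 10), hi))

# labels = FALSE returns the integer index of the interval for each value.
# include.lowest = TRUE closes the first interval on the left, so the minimum
# itself lands in group 1 instead of becoming NA.
grp <- cut(my_data$var1, breaks = breaks, include.lowest = TRUE, labels = FALSE)

my_data$class <- grp
my_data$label <- paste(breaks[grp], breaks[grp + 1])

head(my_data)
anyNA(my_data$class)
```

With the default `right = TRUE`, the intervals are open on the left and closed on the right, which matches the `>` and `<=` comparisons in the loop from the question.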
