June 29, 2023 05:00:44

Creating a matrix of dummies for each hour in R?

Question

My data set spans one year. It originally contained quarter-hourly values, which I then aggregated into hours. So now I have 8760 hourly observations, with the hour running from 0 to 23 each day. For a linear regression, I would now like to create a dummy matrix with 24 rows and 24 columns, something like this:

| hour 1 | hour 2 | ... |
| -------- | -------- | --- |
| 1 | 0 | ... |
| 0 | 1 | ... |

I tried several functions, but nothing worked. I hope one of you can help me.
This is the code for the current dataset:

library(dplyr)      # mutate(), group_by(), summarise(), %>%
library(lubridate)  # hour(), floor_date()

data$time = substr(data$x, 1, 16)
data$time <- as.POSIXct(data$start_time, format = "%d.%m.%Y %H:%M", tz = "UTC")
data = subset(data, select = -c(x, y))
data$hour <- hour(data$time)
head(data)
# Aggregate the quarter-hourly values to hourly sums
df = data %>%
  mutate(data_aggregate = floor_date(time, unit = "hour")) %>%
  group_by(data_aggregate) %>%
  summarise(W = sum(W, na.rm = TRUE))
# Store the hour of day (0 to 23) as a factor
df1 <- df %>% mutate(hour = as.factor(hour(data_aggregate)))

Answer 1

Score: 0

As @Onyambu explained, you don't need to do this yourself in R. When you fit a linear regression (or another kind of statistical model) in R and include a factor variable as a predictor, R automatically generates a dummy variable for each level of the factor except one, which serves as the reference level. This is known as dummy coding (treatment contrasts in R's terminology); it differs from one-hot encoding, which keeps a column for every level.

In your case, when you create a factor variable for the hour, R will automatically create 23 dummy variables (since there are 24 hours, and one is used as the reference level).

df1$hour <- as.factor(df1$hour)
model <- lm(W ~ hour, data = df1)
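
To see what this looks like in practice, here is a minimal sketch on synthetic data. The data frame `toy`, the seed, and the random `W` values are assumptions made purely for illustration; they stand in for the asker's `df1`.

```r
# Synthetic stand-in for df1: one year of hourly timestamps (2023 has 8760 hours)
set.seed(1)
time <- seq(as.POSIXct("2023-01-01 00:00", tz = "UTC"),
            by = "hour", length.out = 8760)
toy <- data.frame(
  hour = factor(as.integer(format(time, "%H")), levels = 0:23),
  W    = rnorm(8760)  # made-up response values
)

fit <- lm(W ~ hour, data = toy)

# lm() expands the factor itself: an intercept plus 23 hour dummies
length(coef(fit))
names(coef(fit))[1:3]
```

With the default treatment contrasts, hour 0 is the reference level, so each `hourN` coefficient is the estimated difference from hour 0.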

If you still want to create the dummy variables yourself, here is how to do it, although it is not recommended.

To create a dummy matrix using base R's model.matrix function, you can use:

df1$hour <- as.factor(df1$hour)
dummy_matrix <- model.matrix(~hour-1, data = df1)

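The ~hour-1 formula requests a model matrix built from the hour variable without an intercept (-1), so that all 24 hour columns are kept instead of one being absorbed into the reference level. Note that the matrix has one row per observation (8760 here), not 24 rows.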
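
The shape of the result can be checked on a small example. This is only a sketch: `toy` is a hypothetical data frame with two days of hourly observations, not the asker's data.

```r
# Hypothetical toy data: two days of hourly observations
toy <- data.frame(hour = factor(rep(0:23, times = 2), levels = 0:23))

dummy_matrix <- model.matrix(~ hour - 1, data = toy)

dim(dummy_matrix)                # rows = observations, columns = factor levels
colnames(dummy_matrix)[1:3]      # columns are named after the levels
all(rowSums(dummy_matrix) == 1)  # each row flags exactly one hour

# Keeping only the distinct rows gives the 24 x 24 pattern from the question
dim(unique(dummy_matrix))
```

The 24 x 24 matrix sketched in the question is just the set of distinct rows; for a regression, the full matrix with one row per observation is the one that lines up with the response.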

You can reattach this matrix to the original data frame using cbind().
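
A minimal sketch of that step, again on made-up data (the `toy` data frame, its values, and the name `toy_with_dummies` are illustrative assumptions):

```r
# Hypothetical four-row example with three distinct hours
toy <- data.frame(
  W    = c(10, 12, 9, 11),
  hour = factor(c(0, 1, 2, 0))
)

dummy_matrix <- model.matrix(~ hour - 1, data = toy)

# cbind() on a data frame and a matrix returns a data frame;
# the matrix columns keep their names
toy_with_dummies <- cbind(toy, dummy_matrix)
names(toy_with_dummies)
```

If all of the dummy columns are then used as predictors together with an intercept, they are perfectly collinear, which is one reason letting `lm()` handle the factor is usually the simpler route.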
