Creating a matrix of dummies for each hour in R?
Question
My data set spans one year. It originally had quarter-hourly values, which I then aggregated to hourly values. So I now have 8760 hours, with hours 0 to 23 for each day. For a linear regression, I would like to create a dummy matrix with 24 rows and 24 columns, something like this:
| hour 1 | hour 2 | ... |
| ------ | ------ | --- |
| 1 | 0 | ... |
| 0 | 1 | ... |
I tried different functions, but nothing worked. I hope someone can help me.
This is the code for the current dataset:
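library(dplyr)      # %>%, mutate(), group_by(), summarise()
library(lubridate)  # hour(), floor_date()

# Parse the timestamp and drop the columns that are no longer needed
data$time <- substr(data$x, 1, 16)
data$time <- as.POSIXct(data$start_time, format = "%d.%m.%Y %H:%M", tz = "UTC")
data <- subset(data, select = -c(x, y))
data$hour <- hour(data$time)
head(data)

# Aggregate the quarter-hourly values to one value per hour
df <- data %>%
  mutate(data_aggregate = floor_date(time, unit = "hour")) %>%
  group_by(data_aggregate) %>%
  summarise(W = sum(W, na.rm = TRUE))

# Store the hour of day (0 to 23) as a factor
df1 <- df %>% mutate(hour = as.factor(hour(data_aggregate)))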
Answer 1
Score: 0
As @Onyambu explained, you don't need to do this by hand in R. When you fit a linear regression (or another type of statistical model) and include a factor variable as a predictor, R automatically generates a dummy variable for each level of the factor except one, which serves as the reference level. This is known as "dummy coding" (treatment contrasts); it is closely related to "one-hot encoding", which keeps a column for every level.
In your case, when you create a factor variable for the hour, R will automatically create 23 dummy variables (since there are 24 hours, and one is used as the reference level).
df1$hour <- as.factor(df1$hour)
model = lm(W ~ hour, data = df1)
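To see the automatic dummy coding in action, here is a minimal sketch on synthetic data. The `toy` data frame and its values are made up for illustration and are not the asker's data. It fits the same kind of model and counts the coefficients: one intercept plus 23 hour dummies, with hour 0 as the reference level.

```r
# Synthetic stand-in for df1: 10 days of hourly data (values are made up)
set.seed(1)
toy <- data.frame(hour = factor(rep(0:23, times = 10)))
toy$W <- rnorm(nrow(toy), mean = 100 + as.integer(as.character(toy$hour)))

# lm() expands the factor into dummy variables internally
toy_model <- lm(W ~ hour, data = toy)

# 24 coefficients: the intercept (reference level, hour 0) plus 23 dummies
n_coef <- length(coef(toy_model))
coef_names <- names(coef(toy_model))
head(coef_names, 3)
```

Each `hourK` coefficient is then the estimated difference in `W` between hour K and the reference hour 0.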
If you still want to create the dummy variables yourself, here's how to do it, although it is not recommended.
To create a dummy matrix using base R's model.matrix function, you can use:
df1$hour <- as.factor(df1$hour)
dummy_matrix <- model.matrix(~hour-1, data = df1)
The ~hour-1 formula means that we want a model matrix built from the hour variable without an intercept (-1), because we want all 24 columns, one for each hour.
You can reattach this matrix to the original data frame using cbind().
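The manual route described above can be sketched on a small synthetic data frame. Again, `toy` and its values are assumptions for illustration only. The sketch checks the shape of the matrix and that every row contains exactly one 1.

```r
# Synthetic stand-in for df1: 2 days of hourly data (values are made up)
toy <- data.frame(hour = factor(rep(0:23, times = 2)), W = seq_len(48))

# One indicator column per hour; "- 1" drops the intercept so no level is omitted
dummy_matrix <- model.matrix(~ hour - 1, data = toy)
dim(dummy_matrix)  # one row per observation, 24 columns

# Each row has exactly one 1, marking that observation's hour
one_per_row <- all(rowSums(dummy_matrix) == 1)

# Reattach the indicator columns to the data frame
toy_with_dummies <- cbind(toy, dummy_matrix)
ncol(toy_with_dummies)
```

Note that passing all 24 indicator columns to lm() together with an intercept makes them perfectly collinear, so lm() would report NA for one of the coefficients. This is one reason to let R handle the factor itself.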