英文:
Running several year linear regressions from a single dataframe in R and storing coefficients into new dataframe
问题
我有以下的数据框:
df <- data.frame(
Company = rep(LETTERS[1:10], each = 10),
Year = rep(2010:2019, times = 10),
Var1 = runif(100, -1, 1),
Var2 = runif(100, -1, 1),
Var3 = runif(100, -1, 1),
Var4 = runif(100, -1, 1),
Var5 = runif(100, -1, 1)
)
我尝试每年运行 10 个 OLS 回归,将所有系数、标准误差、t-统计量和 p 值添加到一个新的大数据框中。我看到这个链接并尝试了以下方法:
coef_matrix <- df %>% group_by(Year) %>% do(tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df)))
然而,这给了我以下输出:
head(coef_matrix,15)
# A tibble: 15 x 6
# Groups: Year [3]
Year term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2010 (Intercept) 0.0274 0.0591 0.463 0.644
2 2010 Var2 -0.0339 0.113 -0.301 0.764
3 2010 Var3 -0.0705 0.112 -0.628 0.532
4 2010 Var4 0.0663 0.0973 0.682 0.497
5 2010 Var5 0.201 0.0983 2.04 0.0440
6 2011 (Intercept) 0.0274 0.0591 0.463 0.644
7 2011 Var2 -0.0339 0.113 -0.301 0.764
8 2011 Var3 -0.0705 0.112 -0.628 0.532
9 2011 Var4 0.0663 0.0973 0.682 0.497
10 2011 Var5 0.201 0.0983 2.04 0.0440
11 2012 (Intercept) 0.0274 0.0591 0.463 0.644
12 2012 Var2 -0.0339 0.113 -0.301 0.764
13 2012 Var3 -0.0705 0.112 -0.628 0.532
14 2012 Var4 0.0663 0.0973 0.682 0.497
15 2012 Var5 0.201 0.0983 2.04 0.0440
我想尝试为每一年运行 10 个年度回归,然后将它们的所有系数、标准误差等存储到一个数据框中。现在,系数与如果我为所有年份一起运行普通的 OLS 回归是相同的。
summary(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df))
这是如何更改的?我以为 group_by
函数应该运行年度回归,但似乎我犯了一个错误。因此,对于每一年,系数、标准误差等都应该不同(当然)。
英文:
i have the following df:
df <- data.frame(
Company = rep(LETTERS[1:10], each = 10),
Year = rep(2010:2019, times = 10),
Var1 = runif(100, -1, 1),
Var2 = runif(100, -1, 1),
Var3 = runif(100, -1, 1),
Var4 = runif(100, -1, 1),
Var5 = runif(100, -1, 1)
)
I am trying to run 10 ols regressions per year and adding all the coefficients, standard errors, t-statistic and p value into one big new dataframe. I came across this link and tried the following:
https://stackoverflow.com/questions/37395059/running-several-linear-regressions-from-a-single-dataframe-in-r
coef_matrix <- df %>% group_by(Year) %>% do(tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df)))
However, this gives me this output:
head(coef_matrix,15)
# A tibble: 15 x 6
# Groups: Year [3]
Year term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2010 (Intercept) 0.0274 0.0591 0.463 0.644
2 2010 Var2 -0.0339 0.113 -0.301 0.764
3 2010 Var3 -0.0705 0.112 -0.628 0.532
4 2010 Var4 0.0663 0.0973 0.682 0.497
5 2010 Var5 0.201 0.0983 2.04 0.0440
6 2011 (Intercept) 0.0274 0.0591 0.463 0.644
7 2011 Var2 -0.0339 0.113 -0.301 0.764
8 2011 Var3 -0.0705 0.112 -0.628 0.532
9 2011 Var4 0.0663 0.0973 0.682 0.497
10 2011 Var5 0.201 0.0983 2.04 0.0440
11 2012 (Intercept) 0.0274 0.0591 0.463 0.644
12 2012 Var2 -0.0339 0.113 -0.301 0.764
13 2012 Var3 -0.0705 0.112 -0.628 0.532
14 2012 Var4 0.0663 0.0973 0.682 0.497
15 2012 Var5 0.201 0.0983 2.04 0.0440
>
I want to try and run 10 annual regressions for every year and then store all their coefficients, standard errors etcetera into a dataframe. Right now the coefficients are the same as if i would run a normal ols regression for all years together:
summary(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df))
Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.10859 -0.51344 -0.05597 0.49481 1.13573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02736 0.05909 0.463 0.644
Var2 -0.03392 0.11252 -0.301 0.764
Var3 -0.07045 0.11223 -0.628 0.532
Var4 0.06633 0.09730 0.682 0.497
Var5 0.20064 0.09828 2.041 0.044 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5851 on 95 degrees of freedom
Multiple R-squared: 0.05331, Adjusted R-squared: 0.01345
F-statistic: 1.338 on 4 and 95 DF, p-value: 0.2617
How do i change this? I thought the group_by function was supposed to run the annual regression, but it seems like i am making a mistake. So for every year, the coefficients, standard errors etcetera should be different (of course). So for 2010, the output should be this in the newly created df:
summary(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df[df$Year=="2010", ]))
Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df[df$Year ==
"2010", ])
Residuals:
1 11 21 31 41 51 61 71 81 91
-0.04871 -0.08103 -0.79737 0.10159 0.18243 0.41049 -0.05951 -0.12389 0.18528 0.23072
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.3293 0.2223 -1.481 0.1986
Var2 0.8657 0.3569 2.426 0.0597 .
Var3 0.5305 0.4735 1.120 0.3134
Var4 -0.2685 0.2446 -1.097 0.3225
Var5 -0.1988 0.3729 -0.533 0.6167
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4389 on 5 degrees of freedom
Multiple R-squared: 0.612, Adjusted R-squared: 0.3017
F-statistic: 1.972 on 4 and 5 DF, p-value: 0.2372
And this for 2011, and so on:
Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df[df$Year ==
"2011", ])
Residuals:
2 12 22 32 42 52 62 72 82 92
-0.541020 -0.626232 0.627003 0.326570 0.450827 -0.505888 -0.008373 0.264653 0.190374 -0.177914
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.23500 0.27901 0.842 0.438
Var2 -0.49973 0.59313 -0.843 0.438
Var3 0.21137 0.55453 0.381 0.719
Var4 0.34993 0.40960 0.854 0.432
Var5 0.09965 0.37772 0.264 0.802
Residual standard error: 0.597 on 5 degrees of freedom
Multiple R-squared: 0.2734, Adjusted R-squared: -0.3079
F-statistic: 0.4703 on 4 and 5 DF, p-value: 0.7577
答案1
得分: 2
df %>%
group_by(Year) %>%
summarise(broom::tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data = cur_data_all())))
# 一个数据框: 50 × 6
# 组: Year [10]
Year term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2010 (Intercept) 0.224 0.383 0.584 0.585
2 2010 Var2 0.643 0.689 0.933 0.394
3 2010 Var3 -0.251 0.512 -0.490 0.645
4 2010 Var4 -0.386 0.653 -0.592 0.580
5 2010 Var5 -0.222 0.388 -0.574 0.591
6 2011 (Intercept) 0.243 0.270 0.898 0.410
7 2011 Var2 -0.300 0.446 -0.673 0.531
8 2011 Var3 0.401 0.418 0.960 0.381
9 2011 Var4 0.424 0.330 1.29 0.255
10 2011 Var5 -0.465 0.291 -1.60 0.171
如果你有dplyr>=1.1
,考虑使用summarise
的替代方法reframe
,并使用pick(Var1:Var5)
代替cur_data()
。
英文:
df %>%
group_by(Year) %>%
summarise(broom::tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data= cur_data_all())))
# A tibble: 50 × 6
# Groups: Year [10]
Year term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 2010 (Intercept) 0.224 0.383 0.584 0.585
2 2010 Var2 0.643 0.689 0.933 0.394
3 2010 Var3 -0.251 0.512 -0.490 0.645
4 2010 Var4 -0.386 0.653 -0.592 0.580
5 2010 Var5 -0.222 0.388 -0.574 0.591
6 2011 (Intercept) 0.243 0.270 0.898 0.410
7 2011 Var2 -0.300 0.446 -0.673 0.531
8 2011 Var3 0.401 0.418 0.960 0.381
9 2011 Var4 0.424 0.330 1.29 0.255
10 2011 Var5 -0.465 0.291 -1.60 0.171
If you havedplyr>=1.1
then consider using reframe
instead of summarise
and use pick(Var1:Var5)
instead of cur_data()
答案2
得分: 0
以下是 data.table 实现的代码部分的中文翻译:
# 载入 data.table 库
library(data.table)
# 将数据框 df 转换为 data.table,并计算线性回归的摘要信息
setDT(df)[, as.data.table(
summary(lm(Var1~Var2+Var3+Var4+Var5, data=.SD))$coef,
keep.rownames = T),
Year
]
以下是输出部分的中文翻译:
年份 变量名 估计值 标准误差 t值 Pr(>|t|)
<int> <char> <num> <num> <num> <num>
1: 2010 截距 -0.0185139123 0.1745433 -0.106070604 0.91965050
2: 2010 Var2 0.8925268150 0.3074344 2.903145262 0.03366718
3: 2010 Var3 0.2056756465 0.2588893 0.794453941 0.46295473
4: 2010 Var4 0.1415611201 0.2333638 0.606611315 0.57059491
5: 2010 Var5 0.5847579888 0.2458589 2.378429109 0.06328800
6: 2011 截距 0.0226610822 0.2697575 0.084005375 0.93631186
7: 2011 Var2 0.1713025466 0.4501861 0.380514940 0.71919278
8: 2011 Var3 -0.0365874385 0.4618825 -0.079213730 0.93993524
9: 2011 Var4 0.4523682769 0.3970092 1.139440262 0.30614687
10: 2011 Var5 0.0754535192 0.4244909 0.177750631 0.86589569
11: 2012 截距 -0.0959599437 0.2731497 -0.351308981 0.73967566
12: 2012 Var2 0.3776595156 0.6546532 0.576884860 0.58901564
13: 2012 Var3 -0.0061169165 0.3507799 -0.017438049 0.98676160
14: 2012 Var4 0.0859049128 0.5806072 0.147957021 0.88815829
15: 2012 Var5 -0.0853547099 0.4864285 -0.175472270 0.86759347
16: 2013 截距 -0.1408782361 0.1712474 -0.822659140 0.44814357
17: 2013 Var2 0.1446858677 0.3215907 0.449906923 0.67161187
18: 2013 Var3 -0.2353407216 0.3858692 -0.609897700 0.56858084
19: 2013 Var4 -0.1236695218 0.3080081 -0.401513870 0.70462777
20: 2013 Var5 -0.6461539283 0.2538933 -2.544982676 0.05157997
21: 2014 截距 0.1457589874 0.2361635 0.617195160 0.56412468
22: 2014 Var2 -0.1902867534 0.6787563 -0.280346220 0.79044108
23: 2014 Var3 -0.0664099633 0.5026280 -0.132125479 0.90003734
24: 2014 Var4 0.3274313087 0.4057963 0.806885835 0.45638230
25: 2014 Var5 0.0009242792 0.5080055 0.001819428 0.99861867
26: 2015 截距 -0.2835090756 0.1837970 -1.542512028 0.18359216
27: 2015 Var2 -0.7611936157 0.3350701 -2.271744145 0.07228084
28: 2015 Var3 -0.0460680232 0.3194645 -0.144203885 0.89097155
29: 2015 Var4 -0.4157749562 0.2315036 -1.795976121 0.13244194
30: 2015 Var5 0.0744453929 0.2760067 0.269723121 0.79815100
31: 2016 截距 0.3445679145 0.2215998 1.554910686 0.18068667
32: 2016 Var2 0.2026557534 0.4119664 0.491923044 0.64360822
33: 2016 Var3 0.1997847174 0.3087333 0.647110955 0.54609126
34: 2016 Var4 0.0680401474 0.2937252 0.231645560 0.82599500
35: 2016 Var5 -0.0948129954 0.3593450 -0.263849455 0.80242521
36: 2017 截距 0.1813009784 0.1807830 1.002865311 0.36196037
37: 2017 Var2 -0.1404884248 0.1961502 -0.
<details>
<summary>英文:</summary>
Here is a data.table implementation:
library(data.table)
setDT(df)[, as.data.table(
summary(lm(Var1~Var2+Var3+Var4+Var5, data=.SD))$coef,
keep.rownames = T),
Year
]
Output:
Year rn Estimate Std. Error t value Pr(>|t|)
<int> <char> <num> <num> <num> <num>
1: 2010 (Intercept) -0.0185139123 0.1745433 -0.106070604 0.91965050
2: 2010 Var2 0.8925268150 0.3074344 2.903145262 0.03366718
3: 2010 Var3 0.2056756465 0.2588893 0.794453941 0.46295473
4: 2010 Var4 0.1415611201 0.2333638 0.606611315 0.57059491
5: 2010 Var5 0.5847579888 0.2458589 2.378429109 0.06328800
6: 2011 (Intercept) 0.0226610822 0.2697575 0.084005375 0.93631186
7: 2011 Var2 0.1713025466 0.4501861 0.380514940 0.71919278
8: 2011 Var3 -0.0365874385 0.4618825 -0.079213730 0.93993524
9: 2011 Var4 0.4523682769 0.3970092 1.139440262 0.30614687
10: 2011 Var5 0.0754535192 0.4244909 0.177750631 0.86589569
11: 2012 (Intercept) -0.0959599437 0.2731497 -0.351308981 0.73967566
12: 2012 Var2 0.3776595156 0.6546532 0.576884860 0.58901564
13: 2012 Var3 -0.0061169165 0.3507799 -0.017438049 0.98676160
14: 2012 Var4 0.0859049128 0.5806072 0.147957021 0.88815829
15: 2012 Var5 -0.0853547099 0.4864285 -0.175472270 0.86759347
16: 2013 (Intercept) -0.1408782361 0.1712474 -0.822659140 0.44814357
17: 2013 Var2 0.1446858677 0.3215907 0.449906923 0.67161187
18: 2013 Var3 -0.2353407216 0.3858692 -0.609897700 0.56858084
19: 2013 Var4 -0.1236695218 0.3080081 -0.401513870 0.70462777
20: 2013 Var5 -0.6461539283 0.2538933 -2.544982676 0.05157997
21: 2014 (Intercept) 0.1457589874 0.2361635 0.617195160 0.56412468
22: 2014 Var2 -0.1902867534 0.6787563 -0.280346220 0.79044108
23: 2014 Var3 -0.0664099633 0.5026280 -0.132125479 0.90003734
24: 2014 Var4 0.3274313087 0.4057963 0.806885835 0.45638230
25: 2014 Var5 0.0009242792 0.5080055 0.001819428 0.99861867
26: 2015 (Intercept) -0.2835090756 0.1837970 -1.542512028 0.18359216
27: 2015 Var2 -0.7611936157 0.3350701 -2.271744145 0.07228084
28: 2015 Var3 -0.0460680232 0.3194645 -0.144203885 0.89097155
29: 2015 Var4 -0.4157749562 0.2315036 -1.795976121 0.13244194
30: 2015 Var5 0.0744453929 0.2760067 0.269723121 0.79815100
31: 2016 (Intercept) 0.3445679145 0.2215998 1.554910686 0.18068667
32: 2016 Var2 0.2026557534 0.4119664 0.491923044 0.64360822
33: 2016 Var3 0.1997847174 0.3087333 0.647110955 0.54609126
34: 2016 Var4 0.0680401474 0.2937252 0.231645560 0.82599500
35: 2016 Var5 -0.0948129954 0.3593450 -0.263849455 0.80242521
36: 2017 (Intercept) 0.1813009784 0.1807830 1.002865311 0.36196037
37: 2017 Var2 -0.1404884248 0.1961502 -0.716228893 0.50589908
38: 2017 Var3 0.3181799345 0.2697474 1.179547527 0.29124163
39: 2017 Var4 0.7311305687 0.3620538 2.019397582 0.09944668
40: 2017 Var5 0.1843115867 0.5512637 0.334343762 0.75168854
41: 2018 (Intercept) 0.3315016731 0.1445671 2.293063654 0.07037946
42: 2018 Var2 -0.5653688206 0.3228250 -1.751316579 0.14028416
43: 2018 Var3 0.2804616367 0.2291430 1.223958846 0.27549148
44: 2018 Var4 -0.1027345987 0.2652794 -0.387269471 0.71449263
45: 2018 Var5 -0.4464118180 0.2198853 -2.030203222 0.09808576
46: 2019 (Intercept) 0.1793316586 0.3700288 0.484642375 0.64841488
47: 2019 Var2 -0.1862647333 0.5215482 -0.357138118 0.73556723
48: 2019 Var3 0.3001795075 0.4801942 0.625121014 0.55931008
49: 2019 Var4 0.3334414516 0.6686747 0.498660191 0.63917790
50: 2019 Var5 -0.5558946009 0.5209731 -1.067031261 0.33474572
Year rn Estimate Std. Error t value Pr(>|t|)
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论