在R中从单个数据框中运行多年线性回归并将系数存储到新数据框中。

huangapple go评论69阅读模式
英文:

Running several year linear regressions from a single dataframe in R and storing coefficients into new dataframe

问题

我有以下的数据框:

df <- data.frame(
  Company = rep(LETTERS[1:10], each = 10),
  Year = rep(2010:2019, times = 10),
  Var1 = runif(100, -1, 1),
  Var2 = runif(100, -1, 1),
  Var3 = runif(100, -1, 1),
  Var4 = runif(100, -1, 1),
  Var5 = runif(100, -1, 1)
)

我尝试每年运行 10 个 OLS 回归,将所有系数、标准误差、t-统计量和 p 值添加到一个新的大数据框中。我看到这个链接并尝试了以下方法:

coef_matrix <- df %>% group_by(Year) %>% do(tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df)))

然而,这给了我以下输出:

head(coef_matrix,15)
# A tibble: 15 x 6
# Groups:   Year [3]
   Year term        estimate std.error statistic p.value
  <int> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1  2010 (Intercept)   0.0274    0.0591     0.463  0.644 
2  2010 Var2         -0.0339    0.113     -0.301  0.764 
3  2010 Var3         -0.0705    0.112     -0.628  0.532 
4  2010 Var4          0.0663    0.0973     0.682  0.497 
5  2010 Var5          0.201     0.0983     2.04   0.0440
6  2011 (Intercept)   0.0274    0.0591     0.463  0.644 
7  2011 Var2         -0.0339    0.113     -0.301  0.764 
8  2011 Var3         -0.0705    0.112     -0.628  0.532 
9  2011 Var4          0.0663    0.0973     0.682  0.497 
10  2011 Var5          0.201     0.0983     2.04   0.0440
11  2012 (Intercept)   0.0274    0.0591     0.463  0.644 
12  2012 Var2         -0.0339    0.113     -0.301  0.764 
13  2012 Var3         -0.0705    0.112     -0.628  0.532 
14  2012 Var4          0.0663    0.0973     0.682  0.497 
15  2012 Var5          0.201     0.0983     2.04   0.0440

我想尝试为每一年运行 10 个年度回归,然后将它们的所有系数、标准误差等存储到一个数据框中。现在,系数与如果我为所有年份一起运行普通的 OLS 回归是相同的。

summary(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df))

这是如何更改的?我以为 group_by 函数应该运行年度回归,但似乎我犯了一个错误。因此,对于每一年,系数、标准误差等都应该不同(当然)。

英文:

i have the following df:

 df &lt;- data.frame(
      Company = rep(LETTERS[1:10], each = 10),
      Year = rep(2010:2019, times = 10),
      Var1 = runif(100, -1, 1),
      Var2 = runif(100, -1, 1),
      Var3 = runif(100, -1, 1),
      Var4 = runif(100, -1, 1),
      Var5 = runif(100, -1, 1)
    )

I am trying to run 10 ols regressions per year and adding all the coefficients, standard errors, t-statistic and p value into one big new dataframe. I came across this link and tried the following:
https://stackoverflow.com/questions/37395059/running-several-linear-regressions-from-a-single-dataframe-in-r

coef_matrix &lt;- df %&gt;% group_by(Year) %&gt;% do(tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df)))

However, this gives me this output:

head(coef_matrix,15)
# A tibble: 15 x 6
# Groups:   Year [3]
    Year term        estimate std.error statistic p.value
   &lt;int&gt; &lt;chr&gt;          &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;
 1  2010 (Intercept)   0.0274    0.0591     0.463  0.644 
 2  2010 Var2         -0.0339    0.113     -0.301  0.764 
 3  2010 Var3         -0.0705    0.112     -0.628  0.532 
 4  2010 Var4          0.0663    0.0973     0.682  0.497 
 5  2010 Var5          0.201     0.0983     2.04   0.0440
 6  2011 (Intercept)   0.0274    0.0591     0.463  0.644 
 7  2011 Var2         -0.0339    0.113     -0.301  0.764 
 8  2011 Var3         -0.0705    0.112     -0.628  0.532 
 9  2011 Var4          0.0663    0.0973     0.682  0.497 
10  2011 Var5          0.201     0.0983     2.04   0.0440
11  2012 (Intercept)   0.0274    0.0591     0.463  0.644 
12  2012 Var2         -0.0339    0.113     -0.301  0.764 
13  2012 Var3         -0.0705    0.112     -0.628  0.532 
14  2012 Var4          0.0663    0.0973     0.682  0.497 
15  2012 Var5          0.201     0.0983     2.04   0.0440
&gt; 

I want to try and run 10 annual regressions for every year and then store all their coefficients, standard errors etcetera into a dataframe. Right now the coefficients are the same as if i would run a normal ols regression for all years together:

summary(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df))

Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.10859 -0.51344 -0.05597  0.49481  1.13573 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)  
(Intercept)  0.02736    0.05909   0.463    0.644  
Var2        -0.03392    0.11252  -0.301    0.764  
Var3        -0.07045    0.11223  -0.628    0.532  
Var4         0.06633    0.09730   0.682    0.497  
Var5         0.20064    0.09828   2.041    0.044 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5851 on 95 degrees of freedom
Multiple R-squared:  0.05331,	Adjusted R-squared:  0.01345 
F-statistic: 1.338 on 4 and 95 DF,  p-value: 0.2617

How do i change this? I thought the group_by function was supposed to run the annual regression, but it seems like i am making a mistake. So for every year, the coefficients, standard errors etcetera should be different (of course). So for 2010, the output should be this in the newly created df:

summary(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df[df$Year==&quot;2010&quot;, ]))
Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df[df$Year == 
    &quot;2010&quot;, ])

Residuals:
       1       11       21       31       41       51       61       71       81       91 
-0.04871 -0.08103 -0.79737  0.10159  0.18243  0.41049 -0.05951 -0.12389  0.18528  0.23072 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)  
(Intercept)  -0.3293     0.2223  -1.481   0.1986  
Var2          0.8657     0.3569   2.426   0.0597 .
Var3          0.5305     0.4735   1.120   0.3134  
Var4         -0.2685     0.2446  -1.097   0.3225  
Var5         -0.1988     0.3729  -0.533   0.6167  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4389 on 5 degrees of freedom
Multiple R-squared:  0.612,	Adjusted R-squared:  0.3017 
F-statistic: 1.972 on 4 and 5 DF,  p-value: 0.2372

And this for 2011, and so on:

Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df[df$Year == 
    &quot;2011&quot;, ])

Residuals:
        2        12        22        32        42        52        62        72        82        92 
-0.541020 -0.626232  0.627003  0.326570  0.450827 -0.505888 -0.008373  0.264653  0.190374 -0.177914 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)  0.23500    0.27901   0.842    0.438
Var2        -0.49973    0.59313  -0.843    0.438
Var3         0.21137    0.55453   0.381    0.719
Var4         0.34993    0.40960   0.854    0.432
Var5         0.09965    0.37772   0.264    0.802

Residual standard error: 0.597 on 5 degrees of freedom
Multiple R-squared:  0.2734,	Adjusted R-squared:  -0.3079 
F-statistic: 0.4703 on 4 and 5 DF,  p-value: 0.7577

答案1

得分: 2

df %>%
  group_by(Year) %>%
  summarise(broom::tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data = cur_data_all())))

# 一个数据框: 50 × 6
# 组:   Year [10]
    Year term        estimate std.error statistic p.value
   <int> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1  2010 (Intercept)    0.224     0.383     0.584   0.585
 2  2010 Var2           0.643     0.689     0.933   0.394
 3  2010 Var3          -0.251     0.512    -0.490   0.645
 4  2010 Var4          -0.386     0.653    -0.592   0.580
 5  2010 Var5          -0.222     0.388    -0.574   0.591
 6  2011 (Intercept)    0.243     0.270     0.898   0.410
 7  2011 Var2          -0.300     0.446    -0.673   0.531
 8  2011 Var3           0.401     0.418     0.960   0.381
 9  2011 Var4           0.424     0.330     1.29    0.255
10  2011 Var5          -0.465     0.291    -1.60    0.171

如果你有dplyr>=1.1,考虑使用summarise的替代方法reframe,并使用pick(Var1:Var5)代替cur_data()

英文:
df %&gt;%
  group_by(Year) %&gt;% 
  summarise(broom::tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data= cur_data_all())))

# A tibble: 50 &#215; 6
# Groups:   Year [10]
    Year term        estimate std.error statistic p.value
   &lt;int&gt; &lt;chr&gt;          &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt;
 1  2010 (Intercept)    0.224     0.383     0.584   0.585
 2  2010 Var2           0.643     0.689     0.933   0.394
 3  2010 Var3          -0.251     0.512    -0.490   0.645
 4  2010 Var4          -0.386     0.653    -0.592   0.580
 5  2010 Var5          -0.222     0.388    -0.574   0.591
 6  2011 (Intercept)    0.243     0.270     0.898   0.410
 7  2011 Var2          -0.300     0.446    -0.673   0.531
 8  2011 Var3           0.401     0.418     0.960   0.381
 9  2011 Var4           0.424     0.330     1.29    0.255
10  2011 Var5          -0.465     0.291    -1.60    0.171

If you havedplyr&gt;=1.1 then consider using reframe instead of summarise and use pick(Var1:Var5) instead of cur_data()

答案2

得分: 0

以下是 data.table 实现的代码部分的中文翻译:

# 载入 data.table 库
library(data.table)

# 将数据框 df 转换为 data.table,并计算线性回归的摘要信息
setDT(df)[, as.data.table(
  summary(lm(Var1~Var2+Var3+Var4+Var5, data=.SD))$coef,
  keep.rownames = T),
  Year
]

以下是输出部分的中文翻译:

     年份          变量名      估计值 标准误差       t值   Pr(>|t|)
    <int>      <char>         <num>      <num>        <num>      <num>
 1:  2010 截距 -0.0185139123  0.1745433 -0.106070604 0.91965050
 2:  2010 Var2  0.8925268150  0.3074344  2.903145262 0.03366718
 3:  2010 Var3  0.2056756465  0.2588893  0.794453941 0.46295473
 4:  2010 Var4  0.1415611201  0.2333638  0.606611315 0.57059491
 5:  2010 Var5  0.5847579888  0.2458589  2.378429109 0.06328800
 6:  2011 截距  0.0226610822  0.2697575  0.084005375 0.93631186
 7:  2011 Var2  0.1713025466  0.4501861  0.380514940 0.71919278
 8:  2011 Var3 -0.0365874385  0.4618825 -0.079213730 0.93993524
 9:  2011 Var4  0.4523682769  0.3970092  1.139440262 0.30614687
10:  2011 Var5  0.0754535192  0.4244909  0.177750631 0.86589569
11:  2012 截距 -0.0959599437  0.2731497 -0.351308981 0.73967566
12:  2012 Var2  0.3776595156  0.6546532  0.576884860 0.58901564
13:  2012 Var3 -0.0061169165  0.3507799 -0.017438049 0.98676160
14:  2012 Var4  0.0859049128  0.5806072  0.147957021 0.88815829
15:  2012 Var5 -0.0853547099  0.4864285 -0.175472270 0.86759347
16:  2013 截距 -0.1408782361  0.1712474 -0.822659140 0.44814357
17:  2013 Var2  0.1446858677  0.3215907  0.449906923 0.67161187
18:  2013 Var3 -0.2353407216  0.3858692 -0.609897700 0.56858084
19:  2013 Var4 -0.1236695218  0.3080081 -0.401513870 0.70462777
20:  2013 Var5 -0.6461539283  0.2538933 -2.544982676 0.05157997
21:  2014 截距  0.1457589874  0.2361635  0.617195160 0.56412468
22:  2014 Var2 -0.1902867534  0.6787563 -0.280346220 0.79044108
23:  2014 Var3 -0.0664099633  0.5026280 -0.132125479 0.90003734
24:  2014 Var4  0.3274313087  0.4057963  0.806885835 0.45638230
25:  2014 Var5  0.0009242792  0.5080055  0.001819428 0.99861867
26:  2015 截距 -0.2835090756  0.1837970 -1.542512028 0.18359216
27:  2015 Var2 -0.7611936157  0.3350701 -2.271744145 0.07228084
28:  2015 Var3 -0.0460680232  0.3194645 -0.144203885 0.89097155
29:  2015 Var4 -0.4157749562  0.2315036 -1.795976121 0.13244194
30:  2015 Var5  0.0744453929  0.2760067  0.269723121 0.79815100
31:  2016 截距  0.3445679145  0.2215998  1.554910686 0.18068667
32:  2016 Var2  0.2026557534  0.4119664  0.491923044 0.64360822
33:  2016 Var3  0.1997847174  0.3087333  0.647110955 0.54609126
34:  2016 Var4  0.0680401474  0.2937252  0.231645560 0.82599500
35:  2016 Var5 -0.0948129954  0.3593450 -0.263849455 0.80242521
36:  2017 截距  0.1813009784  0.1807830  1.002865311 0.36196037
37:  2017 Var2 -0.1404884248  0.1961502 -0.

<details>
<summary>英文:</summary>

Here is a data.table implementation:

library(data.table)

setDT(df)[, as.data.table(
summary(lm(Var1~Var2+Var3+Var4+Var5, data=.SD))$coef,
keep.rownames = T),
Year
]


Output:
 Year          rn      Estimate Std. Error      t value   Pr(&gt;|t|)
&lt;int&gt;      &lt;char&gt;         &lt;num&gt;      &lt;num&gt;        &lt;num&gt;      &lt;num&gt;

1: 2010 (Intercept) -0.0185139123 0.1745433 -0.106070604 0.91965050
2: 2010 Var2 0.8925268150 0.3074344 2.903145262 0.03366718
3: 2010 Var3 0.2056756465 0.2588893 0.794453941 0.46295473
4: 2010 Var4 0.1415611201 0.2333638 0.606611315 0.57059491
5: 2010 Var5 0.5847579888 0.2458589 2.378429109 0.06328800
6: 2011 (Intercept) 0.0226610822 0.2697575 0.084005375 0.93631186
7: 2011 Var2 0.1713025466 0.4501861 0.380514940 0.71919278
8: 2011 Var3 -0.0365874385 0.4618825 -0.079213730 0.93993524
9: 2011 Var4 0.4523682769 0.3970092 1.139440262 0.30614687
10: 2011 Var5 0.0754535192 0.4244909 0.177750631 0.86589569
11: 2012 (Intercept) -0.0959599437 0.2731497 -0.351308981 0.73967566
12: 2012 Var2 0.3776595156 0.6546532 0.576884860 0.58901564
13: 2012 Var3 -0.0061169165 0.3507799 -0.017438049 0.98676160
14: 2012 Var4 0.0859049128 0.5806072 0.147957021 0.88815829
15: 2012 Var5 -0.0853547099 0.4864285 -0.175472270 0.86759347
16: 2013 (Intercept) -0.1408782361 0.1712474 -0.822659140 0.44814357
17: 2013 Var2 0.1446858677 0.3215907 0.449906923 0.67161187
18: 2013 Var3 -0.2353407216 0.3858692 -0.609897700 0.56858084
19: 2013 Var4 -0.1236695218 0.3080081 -0.401513870 0.70462777
20: 2013 Var5 -0.6461539283 0.2538933 -2.544982676 0.05157997
21: 2014 (Intercept) 0.1457589874 0.2361635 0.617195160 0.56412468
22: 2014 Var2 -0.1902867534 0.6787563 -0.280346220 0.79044108
23: 2014 Var3 -0.0664099633 0.5026280 -0.132125479 0.90003734
24: 2014 Var4 0.3274313087 0.4057963 0.806885835 0.45638230
25: 2014 Var5 0.0009242792 0.5080055 0.001819428 0.99861867
26: 2015 (Intercept) -0.2835090756 0.1837970 -1.542512028 0.18359216
27: 2015 Var2 -0.7611936157 0.3350701 -2.271744145 0.07228084
28: 2015 Var3 -0.0460680232 0.3194645 -0.144203885 0.89097155
29: 2015 Var4 -0.4157749562 0.2315036 -1.795976121 0.13244194
30: 2015 Var5 0.0744453929 0.2760067 0.269723121 0.79815100
31: 2016 (Intercept) 0.3445679145 0.2215998 1.554910686 0.18068667
32: 2016 Var2 0.2026557534 0.4119664 0.491923044 0.64360822
33: 2016 Var3 0.1997847174 0.3087333 0.647110955 0.54609126
34: 2016 Var4 0.0680401474 0.2937252 0.231645560 0.82599500
35: 2016 Var5 -0.0948129954 0.3593450 -0.263849455 0.80242521
36: 2017 (Intercept) 0.1813009784 0.1807830 1.002865311 0.36196037
37: 2017 Var2 -0.1404884248 0.1961502 -0.716228893 0.50589908
38: 2017 Var3 0.3181799345 0.2697474 1.179547527 0.29124163
39: 2017 Var4 0.7311305687 0.3620538 2.019397582 0.09944668
40: 2017 Var5 0.1843115867 0.5512637 0.334343762 0.75168854
41: 2018 (Intercept) 0.3315016731 0.1445671 2.293063654 0.07037946
42: 2018 Var2 -0.5653688206 0.3228250 -1.751316579 0.14028416
43: 2018 Var3 0.2804616367 0.2291430 1.223958846 0.27549148
44: 2018 Var4 -0.1027345987 0.2652794 -0.387269471 0.71449263
45: 2018 Var5 -0.4464118180 0.2198853 -2.030203222 0.09808576
46: 2019 (Intercept) 0.1793316586 0.3700288 0.484642375 0.64841488
47: 2019 Var2 -0.1862647333 0.5215482 -0.357138118 0.73556723
48: 2019 Var3 0.3001795075 0.4801942 0.625121014 0.55931008
49: 2019 Var4 0.3334414516 0.6686747 0.498660191 0.63917790
50: 2019 Var5 -0.5558946009 0.5209731 -1.067031261 0.33474572
Year rn Estimate Std. Error t value Pr(>|t|)



</details>



huangapple
  • 本文由 发表于 2023年5月30日 01:50:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76359398.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定