How do I store the estimation sample from a fixest regression to calculate summary statistics?

Question

How do I store the estimation sample from a fixest::feols regression so that I can calculate summary statistics? In Stata this can be done with e(sample), e.g. sum y if e(sample) to summarize the dependent variable (including its mean) on the estimation sample.

From a previous question, I see that obs(model) can be used to store the estimation sample to run further regressions using subset, but I don't see how to use it to calculate summary statistics, because obs(model) returns integers instead of Booleans.

Answer 1

Score: 0

R is not Stata: this looks more like a general question about how to report descriptive statistics in R.

There are many ways to report descriptive statistics. The example below shows how to restrict them to the estimation sample.

library(fixest)
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
# let's add a few missing values
base$x1[1:5] = NA
# estimation
est = feols(y ~ x1, base)
#> NOTE: 5 observations removed because of NA values (RHS: 5).

# There is an infinity of ways to get descriptive stats
# As an example, let's use the collapse pkg
library(collapse)
descr(base[obs(est), all.vars(est$fml)])
#> Dataset: all.vars(x$fml), 2 Variables, N = 145
#> ------------------------------------------------------- 
#> y (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     35  5.88  0.82  4.3  7.9  0.27  2.46
#> Quantiles
#>    1%    5%  10%  25%  50%  75%  90%   95%  99%
#>   4.4  4.62  4.9  5.2  5.8  6.4  6.9  7.28  7.7
#> -------------------------------------------------------
#> x1 (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     23  3.05  0.44    2  4.4  0.35  3.19
#> Quantiles
#>    1%    5%  10%  25%  50%  75%   90%  95%   99%
#>   2.2  2.32  2.5  2.8    3  3.3  3.66  3.8  4.16
#> ---------------------------------------------------------
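
For something closer to Stata's e(sample), the integer indices returned by obs() can be converted into a logical vector with one entry per row of the data. The following is a minimal sketch that rebuilds the same base and est as above; the name in_sample is only an illustrative choice and is not part of fixest.

```r
library(fixest)

# Same setup as in the example above
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
base$x1[1:5] = NA
est = feols(y ~ x1, base)

# obs() returns the row indices kept in the estimation;
# turn them into a TRUE/FALSE flag with one entry per row of `base`
in_sample = seq_len(nrow(base)) %in% obs(est)

# Rough analogue of Stata's `sum y if e(sample)`
mean(base$y[in_sample])

# The number of flagged rows should match the number of observations used
sum(in_sample) == nobs(est)
```

The flag only lines up with the data as long as base keeps the row order it had when feols() was called.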

You can also automate this with a function:

# Let's create a function that automates that
summ = function(x){
  # x: fixest estimation
  
  # we fetch the data
  data = model.matrix(x, type = c("lhs", "rhs"))
  # we remove the intercept
  var_keep = names(data) != "(Intercept)"
  data = data[var_keep]
  # the sumstat
  descr(data)
}

summ(est)
#> Dataset: data, 2 Variables, N = 145
#> ----------------------------------------------------- 
#> y (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     35  5.88  0.82  4.3  7.9  0.27  2.46
#> Quantiles
#>    1%    5%  10%  25%  50%  75%  90%   95%  99%
#>   4.4  4.62  4.9  5.2  5.8  6.4  6.9  7.28  7.7
#> ----------------------------------------------------- 
#> x1 (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     23  3.05  0.44    2  4.4  0.35  3.19
#> Quantiles
#>    1%    5%  10%  25%  50%  75%   90%  95%   99%
#>   2.2  2.32  2.5  2.8    3  3.3  3.66  3.8  4.16
#> ------------------------------------------------------

If you would prefer a table, you can easily adapt the code to work with datasummary:

library(modelsummary)
summ = function(x){
  # x: fixest estimation
  
  # we fetch the data
  data = model.matrix(x, type = c("lhs", "rhs"))
  # we remove the intercept
  var_keep = names(data) != "(Intercept)"
  data = data[var_keep]
  # the sumstat
  datasummary(All(data) ~ Mean + SD + Min + Median + Max, data, output = "dataframe")
}

summ(est)
#>      Mean   SD  Min Median  Max
#> 1  y 5.88 0.82 4.30   5.80 7.90
#> 2 x1 3.05 0.44 2.00   3.00 4.40

Answer 2

Score: 0

One solution is filtering on the indices of the dataframe:

library(dplyr)
df %>%
  filter(row_number() %in% obs(model)) %>%
  summarize(y_mean = mean(y))
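
Because row_number() depends on the current row order, filtering this way is only safe while the data frame still has the order it had at estimation time. One possible safeguard is to store the estimation-sample flag as a column right after estimating. The sketch below uses stand-in data built from iris, and the column name e_sample and the object y_stats are hypothetical names chosen for illustration.

```r
library(fixest)
library(dplyr)

# Stand-in data with a few missing regressor values
df = setNames(iris, c("y", "x1", "x2", "x3", "species"))
df$x1[1:5] = NA
model = feols(y ~ x1, df)

# Flag the estimation sample once, while the row order still matches the model
df = df %>% mutate(e_sample = row_number() %in% obs(model))

# The flag travels with the rows, so later sorting does not break it
y_stats = df %>%
  arrange(y) %>%
  filter(e_sample) %>%
  summarize(y_mean = mean(y), n = n())
y_stats
```

With the flag stored as a column, later summaries read much like Stata's `if e(sample)` condition.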

huangapple
  • Published on June 1, 2023, 23:01:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76383278.html