How do I store the estimation sample from a fixest regression to calculate summary statistics?

Question

How do I store the estimation sample from a fixest::feols regression so that I can calculate summary statistics? In Stata this can be done with e(sample), e.g. sum y if e(sample) to calculate the mean of the dependent variable on the estimation sample.

From a previous question, I see that obs(model) can be used to store the estimation sample and run further regressions using subset, but I don't see how to use it to calculate summary statistics, because obs(model) returns integers instead of Booleans.

Answer 1

Score: 0

R is not Stata: this looks more like a general question about how to report descriptive statistics in R.

There are many ways to report descriptive statistics. The example below shows how to restrict them to the estimation sample.

    library(fixest)
    base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
    # let's add a few missing values
    base$x1[1:5] = NA
    # estimation
    est = feols(y ~ x1, base)
    #> NOTE: 5 observations removed because of NA values (RHS: 5).
    # There is an infinity of ways to get descriptive stats
    # As an example, let's use the collapse pkg
    library(collapse)
    # obs(est) gives the row indices used in the estimation
    descr(base[obs(est), all.vars(est$fml)])
    #> Dataset: all.vars(x$fml), 2 Variables, N = 145
    #> -------------------------------------------------------
    #> y (numeric):
    #> Statistics
    #> N Ndist Mean SD Min Max Skew Kurt
    #> 145 35 5.88 0.82 4.3 7.9 0.27 2.46
    #> Quantiles
    #> 1% 5% 10% 25% 50% 75% 90% 95% 99%
    #> 4.4 4.62 4.9 5.2 5.8 6.4 6.9 7.28 7.7
    #> -------------------------------------------------------
    #> x1 (numeric):
    #> Statistics
    #> N Ndist Mean SD Min Max Skew Kurt
    #> 145 23 3.05 0.44 2 4.4 0.35 3.19
    #> Quantiles
    #> 1% 5% 10% 25% 50% 75% 90% 95% 99%
    #> 2.2 2.32 2.5 2.8 3 3.3 3.66 3.8 4.16
    #> ---------------------------------------------------------
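If the collapse package is not available, the same subset can be summarised with base R. This is a minimal sketch that rebuilds the toy data from the example above; the names est_sample and stats are illustrative and not part of fixest.

```r
library(fixest)

# Same toy data as in the example above: iris with five missing x1 values
base <- setNames(iris, c("y", "x1", "x2", "x3", "species"))
base$x1[1:5] <- NA

est <- feols(y ~ x1, base)

# Keep only the rows used in the estimation and the variables in the formula
est_sample <- base[obs(est), all.vars(est$fml)]

# One column per variable, one row per statistic
stats <- sapply(est_sample, function(v) {
  c(N = length(v), Mean = mean(v), SD = sd(v), Min = min(v), Max = max(v))
})
stats
```

Because the subsetting happens before any statistic is computed, any other summary function can be swapped in for the anonymous function.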

Further, you can easily automate this with a function:

    # Let's create a function that automates that
    summ = function(x){
      # x: fixest estimation
      # we fetch the data actually used in the estimation
      data = model.matrix(x, type = c("lhs", "rhs"))
      # we remove the intercept
      var_keep = names(data) != "(Intercept)"
      data = data[var_keep]
      # the sumstat
      descr(data)
    }
    summ(est)
    #> Dataset: data, 2 Variables, N = 145
    #> -----------------------------------------------------
    #> y (numeric):
    #> Statistics
    #> N Ndist Mean SD Min Max Skew Kurt
    #> 145 35 5.88 0.82 4.3 7.9 0.27 2.46
    #> Quantiles
    #> 1% 5% 10% 25% 50% 75% 90% 95% 99%
    #> 4.4 4.62 4.9 5.2 5.8 6.4 6.9 7.28 7.7
    #> -----------------------------------------------------
    #> x1 (numeric):
    #> Statistics
    #> N Ndist Mean SD Min Max Skew Kurt
    #> 145 23 3.05 0.44 2 4.4 0.35 3.19
    #> Quantiles
    #> 1% 5% 10% 25% 50% 75% 90% 95% 99%
    #> 2.2 2.32 2.5 2.8 3 3.3 3.66 3.8 4.16
    #> ------------------------------------------------------

If you would prefer a table, you can easily adapt the code to work with datasummary:

    library(modelsummary)
    summ = function(x){
      # x: fixest estimation
      # we fetch the data
      data = model.matrix(x, type = c("lhs", "rhs"))
      # we remove the intercept
      var_keep = names(data) != "(Intercept)"
      data = data[var_keep]
      # the sumstat
      datasummary(All(data) ~ Mean + SD + Min + Median + Max, data, output = "dataframe")
    }
    summ(est)
    #> Mean SD Min Median Max
    #> 1 y 5.88 0.82 4.30 5.80 7.90
    #> 2 x1 3.05 0.44 2.00 3.00 4.40

Answer 2

Score: 0

One solution is to filter on the row indices of the data frame:

    library(dplyr)
    df %>%
      filter(row_number() %in% obs(model)) %>%
      summarize(y_mean = mean(y))
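For something closer to Stata's e(sample), the integer positions returned by obs() can be turned into a logical vector with base R. The sketch below assumes the iris-based toy data from the first answer; the names df, model and in_sample are illustrative.

```r
library(fixest)

# Toy data: iris with a few missing regressor values (illustrative only)
df <- setNames(iris, c("y", "x1", "x2", "x3", "species"))
df$x1[1:5] <- NA

model <- feols(y ~ x1, df)

# obs() returns the integer row positions used in the estimation
idx <- obs(model)

# Logical flag per row of df: TRUE if the row was used in the estimation
df$in_sample <- seq_len(nrow(df)) %in% idx

# Rough equivalent of Stata's "sum y if e(sample)"
mean(df$y[df$in_sample])
```

Storing the flag as a column keeps it aligned with the data, so it can be reused in later subsetting or grouping steps. It is only valid as long as the rows of df are not reordered or dropped after the estimation.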

huangapple
  • Posted on June 1, 2023, 23:01:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76383278.html