June 1, 2023 23:01:48

How do I store the estimation sample from a fixest regression to calculate summary statistics?

Question

How do I store the estimation sample from a fixest::feols regression so that I can calculate summary statistics? In Stata this can be done with e(sample), e.g. sum y if e(sample) to calculate the mean of the dependent variable on the estimation sample.

From a previous question, I see that obs(model) can be used to store the estimation sample and run further regressions using subset, but I don't see how to use it to calculate summary statistics, because obs(model) returns integer row indices instead of Booleans.

Answer 1

Score: 0

R is not Stata: this looks more like a general question about how to report descriptive statistics in R.

There are many ways to report descriptive statistics. The example below shows how to restrict them to the estimation sample.

library(fixest)
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
# let's add a few missing values
base$x1[1:5] = NA
# estimation
est = feols(y ~ x1, base)
#> NOTE: 5 observations removed because of NA values (RHS: 5).
# There is an infinity of ways to get descriptive stats
# As an example, let's use the collapse pkg
library(collapse)
descr(base[obs(est), all.vars(est$fml)])
#> Dataset: all.vars(x$fml), 2 Variables, N = 145
#> -------------------------------------------------------
#> y (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     35  5.88  0.82  4.3  7.9  0.27  2.46
#> Quantiles
#>    1%    5%  10%  25%  50%  75%  90%   95%  99%
#>   4.4  4.62  4.9  5.2  5.8  6.4  6.9  7.28  7.7
#> -------------------------------------------------------
#> x1 (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     23  3.05  0.44    2  4.4  0.35  3.19
#> Quantiles
#>    1%    5%  10%  25%  50%  75%   90%  95%   99%
#>   2.2  2.32  2.5  2.8    3  3.3  3.66  3.8  4.16
#> -------------------------------------------------------

Further, you can very easily automate that with a function, as below:

# Let's create a function that automates that
summ = function(x){
  # x: fixest estimation
  
  # we fetch the data
  data = model.matrix(x, type = c("lhs", "rhs"))
  # we remove the intercept
  var_keep = names(data) != "(Intercept)"
  data = data[var_keep]
  # the sumstat
  descr(data)
}
summ(est)
#> Dataset: data, 2 Variables, N = 145
#> -----------------------------------------------------
#> y (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     35  5.88  0.82  4.3  7.9  0.27  2.46
#> Quantiles
#>    1%    5%  10%  25%  50%  75%  90%   95%  99%
#>   4.4  4.62  4.9  5.2  5.8  6.4  6.9  7.28  7.7
#> -----------------------------------------------------
#> x1 (numeric):
#> Statistics
#>     N  Ndist  Mean    SD  Min  Max  Skew  Kurt
#>   145     23  3.05  0.44    2  4.4  0.35  3.19
#> Quantiles
#>    1%    5%  10%  25%  50%  75%   90%  95%   99%
#>   2.2  2.32  2.5  2.8    3  3.3  3.66  3.8  4.16
#> -----------------------------------------------------

If you would prefer a table, you can easily adapt the code to work with datasummary:

library(modelsummary)
summ = function(x){
  # x: fixest estimation
  
  # we fetch the data
  data = model.matrix(x, type = c("lhs", "rhs"))
  # we remove the intercept
  var_keep = names(data) != "(Intercept)"
  data = data[var_keep]
  # the sumstat
  datasummary(All(data) ~ Mean + SD + Min + Median + Max, data, output = "dataframe")
}
summ(est)
#>      Mean   SD  Min Median  Max
#> 1  y 5.88 0.82 4.30   5.80 7.90
#> 2 x1 3.05 0.44 2.00   3.00 4.40
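
For the single statistic asked about in the question (the mean of y on the estimation sample), a descriptive-statistics package is not strictly needed. The sketch below reuses the toy data from this answer and indexes the original data frame with obs(est). It assumes that obs() returns row positions in the data passed to feols, which is how it is used above.

```r
library(fixest)

# Same toy data as in the answer: iris with renamed columns and 5 NAs in x1
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
base$x1[1:5] = NA
est = feols(y ~ x1, base)

# Integer row indices of the observations used in the estimation
idx = obs(est)

# Rough counterpart of Stata's `sum y if e(sample)`
y_mean = mean(base$y[idx])
y_mean
summary(base$y[idx])
```

The same indexing works for any other base-R summary, for example sd(base$y[idx]).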

Answer 2

Score: 0

One solution is filtering on the indices of the dataframe:

# requires dplyr; df is the data passed to feols(), model is the fitted object
df %>%
  filter(row_number() %in% obs(model)) %>%
  summarize(y_mean = mean(y))
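
If a Boolean indicator closer to Stata's e(sample) is preferred, the integer indices can be turned into a logical vector. This is a minimal sketch, assuming the same toy data as in Answer 1; the column name in_sample is made up for illustration.

```r
library(fixest)

# Toy data as in Answer 1: iris with renamed columns and 5 NAs in x1
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
base$x1[1:5] = NA
est = feols(y ~ x1, base)

# Hypothetical column name: TRUE for rows used in the estimation, FALSE otherwise
base$in_sample = seq_len(nrow(base)) %in% obs(est)

# How many rows were kept and dropped
table(base$in_sample)

# Mean of y on the estimation sample, and on the dropped rows for comparison
mean(base$y[base$in_sample])
mean(base$y[!base$in_sample])
```

A logical column like this can be reused in subset(), in dplyr::filter(in_sample), or in any function that takes a Boolean condition.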
