How do I store the estimation sample from a fixest regression to calculate summary statistics?
Question
How do I store the estimation sample from a fixest::feols regression so that I can calculate summary statistics? In Stata this can be done with e(sample), e.g. sum y if e(sample) to calculate the mean of the dependent variable on the estimation sample.
From a previous question, I see that obs(model) can be used to store the estimation sample to run further regressions using subset, but I don't see how to use it to calculate summary statistics, because obs(model) returns integers instead of Booleans.
Answer 1
Score: 0
R is not Stata: this looks more like a general question about how to report descriptive statistics in R.
There are many ways to report descriptive statistics. The example below shows how to restrict them to the estimation sample.
library(fixest)
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
# let's add a few missing values
base$x1[1:5] = NA
# estimation
est = feols(y ~ x1, base)
#> NOTE: 5 observations removed because of NA values (RHS: 5).
# There is an infinity of ways to get descriptive stats
# As an example, let's use the collapse pkg
library(collapse)
descr(base[obs(est), all.vars(est$fml)])
#> Dataset: all.vars(x$fml), 2 Variables, N = 145
#> -------------------------------------------------------
#> y (numeric):
#> Statistics
#> N Ndist Mean SD Min Max Skew Kurt
#> 145 35 5.88 0.82 4.3 7.9 0.27 2.46
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95% 99%
#> 4.4 4.62 4.9 5.2 5.8 6.4 6.9 7.28 7.7
#> -------------------------------------------------------
#> x1 (numeric):
#> Statistics
#> N Ndist Mean SD Min Max Skew Kurt
#> 145 23 3.05 0.44 2 4.4 0.35 3.19
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95% 99%
#> 2.2 2.32 2.5 2.8 3 3.3 3.66 3.8 4.16
#> ---------------------------------------------------------
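To address the "integers instead of Booleans" point directly, the row indices from obs() can be turned into a logical flag that plays roughly the role of Stata's e(sample). This is a minimal sketch, assuming the data frame has not been reordered or subset since the estimation; the name in_sample is only illustrative.

```r
library(fixest)

# Same toy data as in the answer: iris with renamed columns and 5 NAs in x1
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
base$x1[1:5] = NA
est = feols(y ~ x1, base)

# obs(est) gives the indices of the rows kept in the estimation;
# turn them into a TRUE/FALSE flag with one entry per row of `base`
in_sample = seq_len(nrow(base)) %in% obs(est)

# Rough equivalent of Stata's `sum y if e(sample)`
sum(in_sample)            # number of observations in the estimation sample
mean(base$y[in_sample])   # mean of y on the estimation sample
```

The flag can also be stored as a column (for instance base$in_sample = in_sample) if it needs to travel with the data.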
Further, you can very easily automate that with a function, as below:
#Let's create a function that automates that
summ = function(x){
# x: fixest estimation
# we fetch the data
data = model.matrix(x, type = c("lhs", "rhs"))
# we remove the intercept
var_keep = names(data) != "(Intercept)"
data = data[var_keep]
# the sumstat
descr(data)
}
summ(est)
#> Dataset: data, 2 Variables, N = 145
#> -----------------------------------------------------
#> y (numeric):
#> Statistics
#> N Ndist Mean SD Min Max Skew Kurt
#> 145 35 5.88 0.82 4.3 7.9 0.27 2.46
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95% 99%
#> 4.4 4.62 4.9 5.2 5.8 6.4 6.9 7.28 7.7
#> -----------------------------------------------------
#> x1 (numeric):
#> Statistics
#> N Ndist Mean SD Min Max Skew Kurt
#> 145 23 3.05 0.44 2 4.4 0.35 3.19
#> Quantiles
#> 1% 5% 10% 25% 50% 75% 90% 95% 99%
#> 2.2 2.32 2.5 2.8 3 3.3 3.66 3.8 4.16
#> ------------------------------------------------------
If you would prefer a table, you can easily adapt the code to work with datasummary:
library(modelsummary)
summ = function(x){
# x: fixest estimation
# we fetch the data
data = model.matrix(x, type = c("lhs", "rhs"))
# we remove the intercept
var_keep = names(data) != "(Intercept)"
data = data[var_keep]
# the sumstat
datasummary(All(data) ~ Mean + SD + Min + Median + Max, data, output = "dataframe")
}
summ(est)
#> Mean SD Min Median Max
#> 1 y 5.88 0.82 4.30 5.80 7.90
#> 2 x1 3.05 0.44 2.00 3.00 4.40
Answer 2
Score: 0
One solution is to filter on the row indices of the data frame:
library(dplyr)
df %>%
  filter(row_number() %in% obs(model)) %>%
  summarize(y_mean = mean(y))
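Since df and model are not defined in this answer, here is a self-contained sketch of the same idea. It assumes the toy iris data from Answer 1 stands in for df. It also shows dplyr's slice(), which selects rows by integer position and so can take the output of obs() directly.

```r
library(fixest)
library(dplyr)

# Illustrative data and model (assumed; borrowed from Answer 1)
df = setNames(iris, c("y", "x1", "x2", "x3", "species"))
df$x1[1:5] = NA
model = feols(y ~ x1, df)

# filter() on row numbers, as in the answer
by_filter = df %>%
  filter(row_number() %in% obs(model)) %>%
  summarize(y_mean = mean(y))

# slice() picks rows by integer position
by_slice = df %>%
  slice(obs(model)) %>%
  summarize(y_mean = mean(y))

# Both routes should give the same summary
all.equal(by_filter, by_slice)
```

Note that the pipe has to end a line, not start the next one. Otherwise R treats the first line as a complete expression and fails on the line that begins with %>%.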