英文:
Learning user defined functions to do ANOVA and emmeans pairwise comparison
问题
我正在尝试学习编写函数,并尝试创建一个执行ANOVA和后续F检验的函数。我将问题简化为获取emmeans和相关的两两比较。问题是两两比较不起作用?我对函数还很陌生,尝试了很多方法来修复这个问题,但都没有成功...
我正在使用diamonds数据集以便复现问题...
以下是代码:
# 导入库
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)
dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))
results_LM <- function(data, iv, dv) {
iv_name <- as.character(substitute(iv))
dv_name <- as.character(substitute(dv))
formula_str <- paste(dv_name, "~", iv_name)
aov.model <- lm(formula_str, data = data)
emm.means <- emmeans(aov.model,
specs = iv_name,
by = iv_name)
emm.pairs <- pairs(emm.means)
output <- list(
aov_model_summary = summary(aov.model),
emm.means,
emm.pairs)
return(output)
}
results_LM(dia, cut, price)
问题是emmmeans的输出看起来很奇怪,而且两两比较的结果也没有显示出来。
请注意,当我在函数之外进行ANOVA和emmmeans后续F检验时,一切正常。
非常感谢您的帮助,并希望能给我一些关于在数据框上编写这样的函数的指导。谢谢!
英文:
I am trying to learn to write functions and exploring making a function to do an ANOVA and post F test. I have simplified this to the problem which is obtaining emmeans and associated all pairwise comparisons. The issue is the pairwise comparisons do not work??? I am real new at functions and have tried a lot of ways to fix this to no avail...
I am using the diamonds data set for reproducibility...
Here is the code:
# libraries
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)
dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))
results_LM <- function(data, iv, dv) {
iv_name <- as.character(substitute(iv))
dv_name <- as.character(substitute(dv))
formula_str <- paste(dv_name, "~", iv_name)
aov.model <- lm(formula_str, data = data)
emm.means <- emmeans(aov.model,
specs = iv_name,
by = iv_name)
emm.pairs <- pairs(emm.means)
output <- list(
aov_model_summary = summary(aov.model),
emm.means,
emm.pairs)
return(output)
}
results_LM(dia, cut, price)
This issue is the emmeans output looks odd and the pairwise comparisons are missing
[[3]]
cut = Fair:
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
cut = Good:
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
cut = Ideal:
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
cut = Premium:
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA
cut = Very Good:
contrast estimate SE df z.ratio p.value
(nothing) nonEst NA NA NA NA```
Note that when I do the anova and emmeans post F test outside of the function it works fine?
Any help is appreciated and pointers to learn to write functions like this on dataframes would be GREAT. Thanks Bill
</details>
# 答案1
**得分**: 1
你离答案很接近了。我认为唯一的问题是在调用`emmeans()`函数时不需要使用`by`参数。我还建议使用`reformulate()`函数来创建公式。请参考下面的代码:
``` r
library(tidyverse)
library(emmeans)
dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))
results_LM <- function(data, iv, dv) {
iv_name <- as.character(substitute(iv))
dv_name <- as.character(substitute(dv))
formula_str <- reformulate(iv_name, response=dv_name)
aov.model <- lm(formula_str, data = data)
emm.means <- emmeans(aov.model,
specs = iv_name)
emm.pairs <- pairs(emm.means)
output <- list(
aov_model_summary = summary(aov.model),
emm.means,
emm.pairs)
names(output) <- c("Model", "Marginal Means", "Pairwise Comparisons")
return(output)
}
results_LM(dia, cut, price)
#> $Model
#>
#> Call:
#> lm(formula = formula_str, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4258 -2741 -1494 1360 15348
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4358.76 98.79 44.122 < 2e-16 ***
#> cutGood -429.89 113.85 -3.776 0.000160 ***
#> cutIdeal -901.22 102.41 -8.800 < 2e-16 ***
#> cutPremium 225.50 104.40 2.160 0.030772 *
#> cutVery Good -377.00 105.16 -3.585 0.000338 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3964 on 53935 degrees of freedom
#> Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
#> F-statistic: 175.7 on 4 and 53935 DF, p-value: < 2.2e-16
#>
#>
#> $`Marginal Means`
#> cut emmean SE df lower.CL upper.CL
#> Fair 4359 98.8 53935 4165 4552
#> Good 3929 56.6 53935 3818 4040
#> Ideal 3458 27.0 53935 3405 3510
#> Premium 4584 33.8 53935 4518 4650
#> Very Good 3982 36.1 53935 3911 4052
#>
#> Confidence level used: 0.95
#>
#> $`Pairwise Comparisons`
#> contrast estimate SE df t.ratio p.value
#> Fair - Good 429.9 113.8 53935 3.776 0.0015
#> Fair - Ideal 901.2 102.4 53935 8.800 <.0001
#> Fair - Premium -225.5 104.4 53935 -2.160 0.1950
#> Fair - Very Good 377.0 105.2 53935 3.585 0.0031
#> Good - Ideal 471.3 62.7 53935 7.517 <.0001
#> Good - Premium -655.4 65.9 53935 -9.946 <.0001
#> Good - Very Good -52.9 67.1 53935 -0.788 0.9341
#> Ideal - Premium -1126.7 43.2 53935 -26.067 <.0001
#> Ideal - Very Good -524.2 45.1 53935 -11.636 <.0001
#> Premium - Very Good 602.5 49.4 53935 12.198 <.0001
#>
#> P value adjustment: tukey method for comparing a family of 5 estimates
如果你想要使用rlang
的评估,你可以将as.character(substitute())
替换为as_label(enquo())
。
由于你想要了解函数的工作原理以及可以做什么,让我演示另一件事情。你可以给返回的结果设置一个类(例如"lm_pwc"),通过使用class(output) <- "lm_pwc"
来设置输出列表的类。
library(tidyverse)
library(emmeans)
dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))
results_LM <- function(data, iv, dv) {
iv_name <- as_label(enquo(iv))
dv_name <- as_label(enquo(dv))
formula_str <- reformulate(iv_name, response=dv_name)
aov.model <- lm(formula_str, data = data)
emm.means <- emmeans(aov.model,
specs = iv_name)
emm.pairs <- pairs(emm.means)
output <- list(
aov_model_summary = summary(aov.model),
means = emm.means,
pairs = emm.pairs)
class(output) <- "lm_pwc"
return(output)
}
然后,你可以创建一个单独的print.lm_pwc()
函数,允许你只打印结果的特定部分。例如,下面的函数默认只打印配对比较部分,但返回的结果包含初始返回的所有三个元素。这样可以返回比打印更多的对象。下面是一个示例:
print.lm_pwc <- function(x, ..., which=c("pairs", "model", "means", "all")){
w <- match.arg(which, several.ok=FALSE)
if(w %in% c("model", "all")){
cat("Model Results:\n")
print(x[[1]])
}
if(w %in% c("means", "all")){
cat("Marginal Means:\n")
print(x[[2]])
}
if(w %in% c("pairs", "all")){
cat("Pairwise Comparisons:\n")
print(x[[3]])
}
}
res <- results_LM(dia, cut, price)
res
#> Pairwise Comparisons:
#> contrast estimate SE df t.ratio p.value
#> Fair - Good 429.9 113.8 53935 3.776 0.0015
#> Fair - Ideal 901.2 102.4 53935 8.800 <.0001
#> Fair - Premium -225.5 104.4 53935 -2.160 0.1950
#> Fair - Very Good 377.0 105.2 53935 3.585 0.0031
#> Good - Ideal 471.3 62.7 53935 7.517 <.0001
#> Good - Premium -655.4 65.9 53935 -9.946 <.0001
#> Good - Very Good -52.9 67.1 53935 -0.788 0.9341
#> Ideal - Premium -1126.7 43.2 53935 -26.067 <.0001
#> Ideal - Very Good -524.2 45.1 53935 -11.636 <.0001
#> Premium - Very Good 602.5 49.4 53935 12.198 <.0001
#>
#> P value adjustment: tukey method for comparing a family of 5 estimates
注意,上面的代码只打印了配对比较部分,但返回的对象包含所有三个元素。
names(res)
#> [1] "aov_model_summary" "means" "pairs"
你可以使用print(res, which = "model")
只打印模型,或者使用print(res, which="means")
只打印边际均值,或者使用print(res, which="all")
显示所有结果。注意,函数定义中的第一个选择(例如上面的"pairs")是默认值。
英文:
You were really close. I think the only thing is you don't need the by
argument in the call to emmeans()
. I might also suggest using reformulate()
to make the formula. See below:
library(tidyverse)
library(emmeans)
dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))
results_LM <- function(data, iv, dv) {
iv_name <- as.character(substitute(iv))
dv_name <- as.character(substitute(dv))
formula_str <- reformulate(iv_name, response=dv_name)
aov.model <- lm(formula_str, data = data)
emm.means <- emmeans(aov.model,
specs = iv_name)
emm.pairs <- pairs(emm.means)
output <- list(
aov_model_summary = summary(aov.model),
emm.means,
emm.pairs)
names(output) <- c("Model", "Marginal Means", "Pairwise Comparisons")
return(output)
}
results_LM(dia, cut, price)
#> $Model
#>
#> Call:
#> lm(formula = formula_str, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -4258 -2741 -1494 1360 15348
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4358.76 98.79 44.122 < 2e-16 ***
#> cutGood -429.89 113.85 -3.776 0.000160 ***
#> cutIdeal -901.22 102.41 -8.800 < 2e-16 ***
#> cutPremium 225.50 104.40 2.160 0.030772 *
#> cutVery Good -377.00 105.16 -3.585 0.000338 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3964 on 53935 degrees of freedom
#> Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
#> F-statistic: 175.7 on 4 and 53935 DF, p-value: < 2.2e-16
#>
#>
#> $`Marginal Means`
#> cut emmean SE df lower.CL upper.CL
#> Fair 4359 98.8 53935 4165 4552
#> Good 3929 56.6 53935 3818 4040
#> Ideal 3458 27.0 53935 3405 3510
#> Premium 4584 33.8 53935 4518 4650
#> Very Good 3982 36.1 53935 3911 4052
#>
#> Confidence level used: 0.95
#>
#> $`Pairwise Comparisons`
#> contrast estimate SE df t.ratio p.value
#> Fair - Good 429.9 113.8 53935 3.776 0.0015
#> Fair - Ideal 901.2 102.4 53935 8.800 <.0001
#> Fair - Premium -225.5 104.4 53935 -2.160 0.1950
#> Fair - Very Good 377.0 105.2 53935 3.585 0.0031
#> Good - Ideal 471.3 62.7 53935 7.517 <.0001
#> Good - Premium -655.4 65.9 53935 -9.946 <.0001
#> Good - Very Good -52.9 67.1 53935 -0.788 0.9341
#> Ideal - Premium -1126.7 43.2 53935 -26.067 <.0001
#> Ideal - Very Good -524.2 45.1 53935 -11.636 <.0001
#> Premium - Very Good 602.5 49.4 53935 12.198 <.0001
#>
#> P value adjustment: tukey method for comparing a family of 5 estimates
If you wanted the rlang
evaluation instead, you could replace as.character(substitute())
with as_label(enquo())
.
Since you're trying to learn how functions work and what you can do with them, let me demonstrate one other thing. You could give your returned result a class (e.g., "lm_pwc"), by setting the class of the output list with class(output) <- "lm_pwc"
.
library(tidyverse)
library(emmeans)
dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))
results_LM <- function(data, iv, dv) {
iv_name <- as_label(enquo(iv))
dv_name <- as_label(enquo(dv))
formula_str <- reformulate(iv_name, response=dv_name)
aov.model <- lm(formula_str, data = data)
emm.means <- emmeans(aov.model,
specs = iv_name)
emm.pairs <- pairs(emm.means)
output <- list(
aov_model_summary = summary(aov.model),
means = emm.means,
pairs = emm.pairs)
class(output) <- "lm_pwc"
return(output)
}
Then, you could make a separate print.lm_pwc()
function that would allow you to print only the desired part of the results. For example, the function below by default prints only the pairwise comparisons, but the returned result contains all three elements in the initial return. This allows you to return more objects than you print. Here's an example:
print.lm_pwc <- function(x, ..., which=c("pairs", "model", "means", "all")){
w <- match.arg(which, several.ok=FALSE)
if(w %in% c("model", "all")){
cat("Model Results:\n")
print(x[[1]])
}
if(w %in% c("means", "all")){
cat("Marginal Means:\n")
print(x[[2]])
}
if(w %in% c("pairs", "all")){
cat("Pairwise Comparisons:\n")
print(x[[3]])
}
}
res <- results_LM(dia, cut, price)
res
#> Pairwise Comparisons:
#> contrast estimate SE df t.ratio p.value
#> Fair - Good 429.9 113.8 53935 3.776 0.0015
#> Fair - Ideal 901.2 102.4 53935 8.800 <.0001
#> Fair - Premium -225.5 104.4 53935 -2.160 0.1950
#> Fair - Very Good 377.0 105.2 53935 3.585 0.0031
#> Good - Ideal 471.3 62.7 53935 7.517 <.0001
#> Good - Premium -655.4 65.9 53935 -9.946 <.0001
#> Good - Very Good -52.9 67.1 53935 -0.788 0.9341
#> Ideal - Premium -1126.7 43.2 53935 -26.067 <.0001
#> Ideal - Very Good -524.2 45.1 53935 -11.636 <.0001
#> Premium - Very Good 602.5 49.4 53935 12.198 <.0001
#>
#> P value adjustment: tukey method for comparing a family of 5 estimates
Notice that the above only prints the pairwise comparisons, but the returned object contains all three elements.
names(res)
#> [1] "aov_model_summary" "means" "pairs"
You could print just the model with print(res, which = "model")
or the marginal means with print(res, which="means")
or you could show all results with print(res, which="all")
. Note, that the first choice in the function definition (e.g., "pairs" above) is the default.
<sup>Created on 2023-08-09 with reprex v2.0.2</sup>
答案2
得分: 0
@DaveArmstrong - 再次感谢 - 我仍在努力理解class(output) <- "lm_pwc"
这一行代码。
我已经更新了代码,如果有人感兴趣,可以包含更多可能感兴趣的方差分析材料。我的部门使用SAS,并且抱怨必须输入每一行代码才能得到SAS在一行中完成的输出。因此,我正在编写一套用于执行基本分析的代码,这些分析通常不是统计人员经常进行的。
我还尝试在输出中添加注释,但现在这是我目前的代码...
# # # # # #
# libraries
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)
dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))
results_LM <- function(data, iv, dv, block=NULL) {
iv_name <- as_label(enquo(iv))
dv_name <- as_label(enquo(dv))
formula_str <- reformulate(iv_name, response=dv_name)
if (!is.null(block)) {
block_name <- quo_name(enquo(block))
formula_str <- paste0(formula_str, " + ", block_name)
}
aov.model <- lm(formula_str, data = data)
residuals <- resid(aov.model)
par(mfrow=c(2,2))
assumpt.plot <- plot(aov.model)
# Perform Levene test
levene_test <- leveneTest(residuals, data[[iv_name]])
# Get the Shapiro Wilk test
# # Checking normality statistically ----
# shapiro_stat <- shapiro.test(aov.model$residuals)
# Calculate emmeans for model ---
emm.means <- emmeans(aov.model,
specs = iv_name)
# calculate pairwise comparisons of emmeans
emm.pairs <- pairs(emm.means)
# plot the emmenas by IV
plot.1 <- ggplot(tidy(emm.means, conf.int = TRUE), aes(cut, estimate)) +
geom_point() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high))
# generate the output
output <- list(
aov_model_summary = summary(aov.model),
levene_test,
# shapiro_stat,
means = emm.means,
pairs = emm.pairs,
# call the assumption plot
assumpt.plot,
# call the mean SE plot
plot.1)
# set the class of the output - still not sure what this does
class(output) <- "lm_pwc"
# return all the output ----
return(output)
}
results_LM(dia, cut, price)
英文:
@DaveArmstrong - thanks again - I am still trying to understand the
class(output) <- "lm_pwc"
I have updated the code if anyone was interested to include more of the anova material someone might be interested in. My department uses SAS and complains that you have to enter ever line of code to get the output SAS does in one line. So I was making a set of code to do basic analyses that not stat people do a lot.
I am also trying to learn to put comments in the output but here is what I have now...
# # # # # #
# libraries
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)
dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))
results_LM <- function(data, iv, dv, block=NULL) {
iv_name <- as_label(enquo(iv))
dv_name <- as_label(enquo(dv))
formula_str <- reformulate(iv_name, response=dv_name)
if (!is.null(block)) {
block_name <- quo_name(enquo(block))
formula_str <- paste0(formula_str, " + ", block_name)
}
aov.model <- lm(formula_str, data = data)
residuals <- resid(aov.model)
par(mfrow=c(2,2))
assumpt.plot <- plot(aov.model)
# Perform Levene test
levene_test <- leveneTest(residuals, data[[iv_name]])
# Get the Shapiro Wilk test
# # Checking normality statistically ----
# shapiro_stat <- shapiro.test(aov.model$residuals)
# Calculate emmeans for model ---
emm.means <- emmeans(aov.model,
specs = iv_name)
# calculate pairwise comparisons of emmeans
emm.pairs <- pairs(emm.means)
# plot the emmenas by IV
plot.1 <- ggplot(tidy(emm.means, conf.int = TRUE), aes(cut, estimate)) +
geom_point() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high))
# generate the output
output <- list(
aov_model_summary = summary(aov.model),
levene_test,
# shapiro_stat,
means = emm.means,
pairs = emm.pairs,
# call the assumption plot
assumpt.plot,
# call the mean SE plot
plot.1)
# set the class of the output - still not sure what this does
class(output) <- "lm_pwc"
# return all the output ----
return(output)
}
results_LM(dia, cut, price)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论