2023年8月9日 05:12:51go评论109阅读模式

英文:

Learning user defined functions to do ANOVA and emmeans pairwise comparison

问题

我正在尝试学习编写函数，并尝试创建一个执行ANOVA和后续F检验的函数。我将问题简化为获取emmeans和相关的两两比较。问题是两两比较不起作用？我对函数还很陌生，尝试了很多方法来修复这个问题，但都没有成功...

我正在使用diamonds数据集以便复现问题...

以下是代码：

# 导入库
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)

dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))

results_LM <- function(data, iv, dv) {
  iv_name <- as.character(substitute(iv))
  dv_name <- as.character(substitute(dv))
  formula_str <- paste(dv_name, "~", iv_name)
  aov.model <- lm(formula_str, data = data)
  emm.means <- emmeans(aov.model,
                       specs = iv_name,
                       by = iv_name)
  emm.pairs <- pairs(emm.means)
  output <- list(
    aov_model_summary = summary(aov.model),
    emm.means,
    emm.pairs)
  return(output)
}

results_LM(dia, cut, price)

问题是emmmeans的输出看起来很奇怪，而且两两比较的结果也没有显示出来。

请注意，当我在函数之外进行ANOVA和emmmeans后续F检验时，一切正常。

非常感谢您的帮助，并希望能给我一些关于在数据框上编写这样的函数的指导。谢谢！

英文:

I am trying to learn to write functions and exploring making a function to do an ANOVA and post F test. I have simplified this to the problem which is obtaining emmeans and associated all pairwise comparisons. The issue is the pairwise comparisons do not work??? I am real new at functions and have tried a lot of ways to fix this to no avail...

I am using the diamonds data set for reproducibility...

Here is the code:

# libraries
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)

dia &lt;- diamonds
dia$cut &lt;- as.factor(as.character(dia$cut))

results_LM &lt;- function(data, iv, dv) {
  iv_name &lt;- as.character(substitute(iv))
  dv_name &lt;- as.character(substitute(dv))
  formula_str &lt;- paste(dv_name, &quot;~&quot;, iv_name)
  aov.model &lt;- lm(formula_str, data = data)
  emm.means &lt;- emmeans(aov.model,
                       specs = iv_name,
                       by = iv_name)
  emm.pairs &lt;- pairs(emm.means)
  output &lt;- list(
    aov_model_summary = summary(aov.model),
    emm.means,
    emm.pairs)
  return(output)
}

results_LM(dia, cut, price)

This issue is the emmeans output looks odd and the pairwise comparisons are missing

[[3]]
cut = Fair:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA

cut = Good:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA

cut = Ideal:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA

cut = Premium:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA

cut = Very Good:
 contrast  estimate SE df z.ratio p.value
 (nothing)   nonEst NA NA      NA      NA```

Note that when I do the anova and emmeans post F test outside of the function it works fine? 

Any help is appreciated and pointers to learn to write functions like this on dataframes would be GREAT. Thanks Bill

</details>


# 答案1
**得分**: 1

你离答案很接近了。我认为唯一的问题是在调用`emmeans()`函数时不需要使用`by`参数。我还建议使用`reformulate()`函数来创建公式。请参考下面的代码：

``` r
library(tidyverse)
library(emmeans)

dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))

results_LM <- function(data, iv, dv) {
  iv_name <- as.character(substitute(iv))
  dv_name <- as.character(substitute(dv))
  formula_str <- reformulate(iv_name, response=dv_name)
  aov.model <- lm(formula_str, data = data)
  emm.means <- emmeans(aov.model,
                       specs = iv_name)
  emm.pairs <- pairs(emm.means)
  output <- list(
    aov_model_summary = summary(aov.model),
    emm.means,
    emm.pairs)
  names(output) <- c("Model", "Marginal Means", "Pairwise Comparisons")
  return(output)
}

results_LM(dia, cut, price)
#> $Model
#> 
#> Call:
#> lm(formula = formula_str, data = data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -4258  -2741  -1494   1360  15348 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   4358.76      98.79  44.122  < 2e-16 ***
#> cutGood       -429.89     113.85  -3.776 0.000160 ***
#> cutIdeal      -901.22     102.41  -8.800  < 2e-16 ***
#> cutPremium     225.50     104.40   2.160 0.030772 *  
#> cutVery Good  -377.00     105.16  -3.585 0.000338 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3964 on 53935 degrees of freedom
#> Multiple R-squared:  0.01286,   Adjusted R-squared:  0.01279 
#> F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16
#> 
#> 
#> $`Marginal Means`
#>  cut       emmean   SE    df lower.CL upper.CL
#>  Fair        4359 98.8 53935     4165     4552
#>  Good        3929 56.6 53935     3818     4040
#>  Ideal       3458 27.0 53935     3405     3510
#>  Premium     4584 33.8 53935     4518     4650
#>  Very Good   3982 36.1 53935     3911     4052
#> 
#> Confidence level used: 0.95 
#> 
#> $`Pairwise Comparisons`
#>  contrast            estimate    SE    df t.ratio p.value
#>  Fair - Good            429.9 113.8 53935   3.776  0.0015
#>  Fair - Ideal           901.2 102.4 53935   8.800  <.0001
#>  Fair - Premium        -225.5 104.4 53935  -2.160  0.1950
#>  Fair - Very Good       377.0 105.2 53935   3.585  0.0031
#>  Good - Ideal           471.3  62.7 53935   7.517  <.0001
#>  Good - Premium        -655.4  65.9 53935  -9.946  <.0001
#>  Good - Very Good       -52.9  67.1 53935  -0.788  0.9341
#>  Ideal - Premium      -1126.7  43.2 53935 -26.067  <.0001
#>  Ideal - Very Good     -524.2  45.1 53935 -11.636  <.0001
#>  Premium - Very Good    602.5  49.4 53935  12.198  <.0001
#> 
#> P value adjustment: tukey method for comparing a family of 5 estimates

如果你想要使用rlang的评估，你可以将as.character(substitute())替换为as_label(enquo())。

由于你想要了解函数的工作原理以及可以做什么，让我演示另一件事情。你可以给返回的结果设置一个类（例如"lm_pwc"），通过使用class(output) <- "lm_pwc"来设置输出列表的类。

library(tidyverse)
library(emmeans)

dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))

results_LM <- function(data, iv, dv) {
  iv_name <- as_label(enquo(iv))
  dv_name <- as_label(enquo(dv))
  formula_str <- reformulate(iv_name, response=dv_name)
  aov.model <- lm(formula_str, data = data)
  emm.means <- emmeans(aov.model,
                       specs = iv_name)
  emm.pairs <- pairs(emm.means)
  output <- list(
    aov_model_summary = summary(aov.model),
    means = emm.means,
    pairs = emm.pairs)
  class(output) <- "lm_pwc"
  return(output)
}

然后，你可以创建一个单独的print.lm_pwc()函数，允许你只打印结果的特定部分。例如，下面的函数默认只打印配对比较部分，但返回的结果包含初始返回的所有三个元素。这样可以返回比打印更多的对象。下面是一个示例：


print.lm_pwc <- function(x, ..., which=c("pairs", "model", "means", "all")){
  w <- match.arg(which, several.ok=FALSE)
  if(w %in% c("model", "all")){
    cat("Model Results:\n")
    print(x[[1]])
  }
  if(w %in% c("means", "all")){
    cat("Marginal Means:\n")
    print(x[[2]])
  }
  if(w %in% c("pairs", "all")){
    cat("Pairwise Comparisons:\n")
    print(x[[3]])
  }
}

res <- results_LM(dia, cut, price)
res
#> Pairwise Comparisons:
#>  contrast            estimate    SE    df t.ratio p.value
#>  Fair - Good            429.9 113.8 53935   3.776  0.0015
#>  Fair - Ideal           901.2 102.4 53935   8.800  <.0001
#>  Fair - Premium        -225.5 104.4 53935  -2.160  0.1950
#>  Fair - Very Good       377.0 105.2 53935   3.585  0.0031
#>  Good - Ideal           471.3  62.7 53935   7.517  <.0001
#>  Good - Premium        -655.4  65.9 53935  -9.946  <.0001
#>  Good - Very Good       -52.9  67.1 53935  -0.788  0.9341
#>  Ideal - Premium      -1126.7  43.2 53935 -26.067  <.0001
#>  Ideal - Very Good     -524.2  45.1 53935 -11.636  <.0001
#>  Premium - Very Good    602.5  49.4 53935  12.198  <.0001
#> 
#> P value adjustment: tukey method for comparing a family of 5 estimates

注意，上面的代码只打印了配对比较部分，但返回的对象包含所有三个元素。

names(res)
#> [1] "aov_model_summary" "means"             "pairs"

你可以使用print(res, which = "model")只打印模型，或者使用print(res, which="means")只打印边际均值，或者使用print(res, which="all")显示所有结果。注意，函数定义中的第一个选择（例如上面的"pairs"）是默认值。

英文:

You were really close. I think the only thing is you don't need the by argument in the call to emmeans(). I might also suggest using reformulate() to make the formula. See below:

library(tidyverse)
library(emmeans)

dia &lt;- diamonds
dia$cut &lt;- as.factor(as.character(dia$cut))

results_LM &lt;- function(data, iv, dv) {
  iv_name &lt;- as.character(substitute(iv))
  dv_name &lt;- as.character(substitute(dv))
  formula_str &lt;- reformulate(iv_name, response=dv_name)
  aov.model &lt;- lm(formula_str, data = data)
  emm.means &lt;- emmeans(aov.model,
                       specs = iv_name)
  emm.pairs &lt;- pairs(emm.means)
  output &lt;- list(
    aov_model_summary = summary(aov.model),
    emm.means,
    emm.pairs)
  names(output) &lt;- c(&quot;Model&quot;, &quot;Marginal Means&quot;, &quot;Pairwise Comparisons&quot;)
  return(output)
}

results_LM(dia, cut, price)
#&gt; $Model
#&gt; 
#&gt; Call:
#&gt; lm(formula = formula_str, data = data)
#&gt; 
#&gt; Residuals:
#&gt;    Min     1Q Median     3Q    Max 
#&gt;  -4258  -2741  -1494   1360  15348 
#&gt; 
#&gt; Coefficients:
#&gt;              Estimate Std. Error t value Pr(&gt;|t|)    
#&gt; (Intercept)   4358.76      98.79  44.122  &lt; 2e-16 ***
#&gt; cutGood       -429.89     113.85  -3.776 0.000160 ***
#&gt; cutIdeal      -901.22     102.41  -8.800  &lt; 2e-16 ***
#&gt; cutPremium     225.50     104.40   2.160 0.030772 *  
#&gt; cutVery Good  -377.00     105.16  -3.585 0.000338 ***
#&gt; ---
#&gt; Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
#&gt; 
#&gt; Residual standard error: 3964 on 53935 degrees of freedom
#&gt; Multiple R-squared:  0.01286,    Adjusted R-squared:  0.01279 
#&gt; F-statistic: 175.7 on 4 and 53935 DF,  p-value: &lt; 2.2e-16
#&gt; 
#&gt; 
#&gt; $`Marginal Means`
#&gt;  cut       emmean   SE    df lower.CL upper.CL
#&gt;  Fair        4359 98.8 53935     4165     4552
#&gt;  Good        3929 56.6 53935     3818     4040
#&gt;  Ideal       3458 27.0 53935     3405     3510
#&gt;  Premium     4584 33.8 53935     4518     4650
#&gt;  Very Good   3982 36.1 53935     3911     4052
#&gt; 
#&gt; Confidence level used: 0.95 
#&gt; 
#&gt; $`Pairwise Comparisons`
#&gt;  contrast            estimate    SE    df t.ratio p.value
#&gt;  Fair - Good            429.9 113.8 53935   3.776  0.0015
#&gt;  Fair - Ideal           901.2 102.4 53935   8.800  &lt;.0001
#&gt;  Fair - Premium        -225.5 104.4 53935  -2.160  0.1950
#&gt;  Fair - Very Good       377.0 105.2 53935   3.585  0.0031
#&gt;  Good - Ideal           471.3  62.7 53935   7.517  &lt;.0001
#&gt;  Good - Premium        -655.4  65.9 53935  -9.946  &lt;.0001
#&gt;  Good - Very Good       -52.9  67.1 53935  -0.788  0.9341
#&gt;  Ideal - Premium      -1126.7  43.2 53935 -26.067  &lt;.0001
#&gt;  Ideal - Very Good     -524.2  45.1 53935 -11.636  &lt;.0001
#&gt;  Premium - Very Good    602.5  49.4 53935  12.198  &lt;.0001
#&gt; 
#&gt; P value adjustment: tukey method for comparing a family of 5 estimates

If you wanted the rlang evaluation instead, you could replace as.character(substitute()) with as_label(enquo()).

Since you're trying to learn how functions work and what you can do with them, let me demonstrate one other thing. You could give your returned result a class (e.g., "lm_pwc"), by setting the class of the output list with class(output) <- "lm_pwc".

library(tidyverse)
library(emmeans)

dia &lt;- diamonds
dia$cut &lt;- as.factor(as.character(dia$cut))

results_LM &lt;- function(data, iv, dv) {
  iv_name &lt;- as_label(enquo(iv))
  dv_name &lt;- as_label(enquo(dv))
  formula_str &lt;- reformulate(iv_name, response=dv_name)
  aov.model &lt;- lm(formula_str, data = data)
  emm.means &lt;- emmeans(aov.model,
                       specs = iv_name)
  emm.pairs &lt;- pairs(emm.means)
  output &lt;- list(
    aov_model_summary = summary(aov.model),
    means = emm.means,
    pairs = emm.pairs)
  class(output) &lt;- &quot;lm_pwc&quot;
  return(output)
}

Then, you could make a separate print.lm_pwc() function that would allow you to print only the desired part of the results. For example, the function below by default prints only the pairwise comparisons, but the returned result contains all three elements in the initial return. This allows you to return more objects than you print. Here's an example:


print.lm_pwc &lt;- function(x, ..., which=c(&quot;pairs&quot;, &quot;model&quot;, &quot;means&quot;, &quot;all&quot;)){
  w &lt;- match.arg(which, several.ok=FALSE)
  if(w %in% c(&quot;model&quot;, &quot;all&quot;)){
    cat(&quot;Model Results:\n&quot;)
    print(x[[1]])
  }
  if(w %in% c(&quot;means&quot;, &quot;all&quot;)){
    cat(&quot;Marginal Means:\n&quot;)
    print(x[[2]])
  }
  if(w %in% c(&quot;pairs&quot;, &quot;all&quot;)){
    cat(&quot;Pairwise Comparisons:\n&quot;)
    print(x[[3]])
  }
}

res &lt;- results_LM(dia, cut, price)
res
#&gt; Pairwise Comparisons:
#&gt;  contrast            estimate    SE    df t.ratio p.value
#&gt;  Fair - Good            429.9 113.8 53935   3.776  0.0015
#&gt;  Fair - Ideal           901.2 102.4 53935   8.800  &lt;.0001
#&gt;  Fair - Premium        -225.5 104.4 53935  -2.160  0.1950
#&gt;  Fair - Very Good       377.0 105.2 53935   3.585  0.0031
#&gt;  Good - Ideal           471.3  62.7 53935   7.517  &lt;.0001
#&gt;  Good - Premium        -655.4  65.9 53935  -9.946  &lt;.0001
#&gt;  Good - Very Good       -52.9  67.1 53935  -0.788  0.9341
#&gt;  Ideal - Premium      -1126.7  43.2 53935 -26.067  &lt;.0001
#&gt;  Ideal - Very Good     -524.2  45.1 53935 -11.636  &lt;.0001
#&gt;  Premium - Very Good    602.5  49.4 53935  12.198  &lt;.0001
#&gt; 
#&gt; P value adjustment: tukey method for comparing a family of 5 estimates

Notice that the above only prints the pairwise comparisons, but the returned object contains all three elements.

names(res)
#&gt; [1] &quot;aov_model_summary&quot; &quot;means&quot;             &quot;pairs&quot;

You could print just the model with print(res, which = "model") or the marginal means with print(res, which="means") or you could show all results with print(res, which="all"). Note, that the first choice in the function definition (e.g., "pairs" above) is the default.

<sup>Created on 2023-08-09 with reprex v2.0.2</sup>

答案2

得分: 0

@DaveArmstrong - 再次感谢 - 我仍在努力理解class(output) <- "lm_pwc"这一行代码。

我已经更新了代码，如果有人感兴趣，可以包含更多可能感兴趣的方差分析材料。我的部门使用SAS，并且抱怨必须输入每一行代码才能得到SAS在一行中完成的输出。因此，我正在编写一套用于执行基本分析的代码，这些分析通常不是统计人员经常进行的。

我还尝试在输出中添加注释，但现在这是我目前的代码...

# # # # # # 
# libraries
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)

dia <- diamonds
dia$cut <- as.factor(as.character(dia$cut))

results_LM <- function(data, iv, dv, block=NULL) {
  iv_name <- as_label(enquo(iv))
  dv_name <- as_label(enquo(dv))
  formula_str <- reformulate(iv_name, response=dv_name)
  if (!is.null(block)) {
    block_name <- quo_name(enquo(block))
    formula_str <- paste0(formula_str, " + ", block_name)
  }
  aov.model <- lm(formula_str, data = data)
  residuals <- resid(aov.model)
  par(mfrow=c(2,2))
  assumpt.plot <- plot(aov.model)
    # Perform Levene test
  levene_test <- leveneTest(residuals, data[[iv_name]])
  # Get the Shapiro Wilk test
  # # Checking normality statistically ----
  # shapiro_stat <- shapiro.test(aov.model$residuals)
  # Calculate emmeans for model ---
  emm.means <- emmeans(aov.model,
                       specs = iv_name)
  # calculate pairwise comparisons of emmeans
  emm.pairs <- pairs(emm.means)
  # plot the emmenas by IV
  plot.1 <- ggplot(tidy(emm.means, conf.int = TRUE), aes(cut, estimate)) +
    geom_point() +
    geom_errorbar(aes(ymin = conf.low, ymax = conf.high))
  # generate the output 
  output <- list(
    aov_model_summary = summary(aov.model),
    levene_test,
    # shapiro_stat,
    means = emm.means,
    pairs = emm.pairs,
    # call the assumption plot
    assumpt.plot,
    # call the mean SE plot
    plot.1)
  # set the class of the output - still not sure what this does
  class(output) <- "lm_pwc"

  # return all the output ----
  return(output)
}

results_LM(dia, cut, price)

英文:

@DaveArmstrong - thanks again - I am still trying to understand the
class(output) <- "lm_pwc"

I have updated the code if anyone was interested to include more of the anova material someone might be interested in. My department uses SAS and complains that you have to enter ever line of code to get the output SAS does in one line. So I was making a set of code to do basic analyses that not stat people do a lot.

I am also trying to learn to put comments in the output but here is what I have now...

# # # # # # 
# libraries
library(tidyverse)
library(car)
library(emmeans)
library(readxl)
library(rlang)
library(broom)

dia &lt;- diamonds
dia$cut &lt;- as.factor(as.character(dia$cut))

results_LM &lt;- function(data, iv, dv, block=NULL) {
  iv_name &lt;- as_label(enquo(iv))
  dv_name &lt;- as_label(enquo(dv))
  formula_str &lt;- reformulate(iv_name, response=dv_name)
  if (!is.null(block)) {
    block_name &lt;- quo_name(enquo(block))
    formula_str &lt;- paste0(formula_str, &quot; + &quot;, block_name)
  }
  aov.model &lt;- lm(formula_str, data = data)
  residuals &lt;- resid(aov.model)
  par(mfrow=c(2,2))
  assumpt.plot &lt;- plot(aov.model)
    # Perform Levene test
  levene_test &lt;- leveneTest(residuals, data[[iv_name]])
  # Get the Shapiro Wilk test
  # # Checking normality statistically ----
  # shapiro_stat &lt;- shapiro.test(aov.model$residuals)
  # Calculate emmeans for model ---
  emm.means &lt;- emmeans(aov.model,
                       specs = iv_name)
  # calculate pairwise comparisons of emmeans
  emm.pairs &lt;- pairs(emm.means)
  # plot the emmenas by IV
  plot.1 &lt;- ggplot(tidy(emm.means, conf.int = TRUE), aes(cut, estimate)) +
    geom_point() +
    geom_errorbar(aes(ymin = conf.low, ymax = conf.high))
  # generate the output 
  output &lt;- list(
    aov_model_summary = summary(aov.model),
    levene_test,
    # shapiro_stat,
    means = emm.means,
    pairs = emm.pairs,
    # call the assumption plot
    assumpt.plot,
    # call the mean SE plot
    plot.1)
  # set the class of the output - still not sure what this does
  class(output) &lt;- &quot;lm_pwc&quot;

  # return all the output ----
  return(output)
}

results_LM(dia, cut, price)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

学习用户定义的函数来进行方差分析（ANOVA）和emmeans成对比较。

问题

答案2

有没有一种方法可以在保留索引的同时对按年分组的值进行总结？

使用Shiny应用中的verbatimTextOutput和shinyDirChoose时，可以看到目录路径。

Pandas Yahoo Finance: AttributeError: ‘Index’ 对象没有属性 ‘tz_localize’

在 ggplot 中为多边形创建一个孔以用作遮罩。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论