如何在R中修复glm变量中的奇异性。

huangapple go评论51阅读模式
英文:

how to fix singularities in glm variables in R

问题

I have an issue with adding one of my (dummy) variables into my poisson model. The main issue is that when I run my model with just this dummy as the explanatory variable (it consists of six dummy variables) and no intercept, it works perfectly fine. But when I try to add it on top of an existing model with more explanatory variables, it doesn't compute the final variable because of colinearity errors. That's confusing me, as it works normally when just the dummies are in the model.

I have already tried changing up the order of the explanatory variables, but as soon as I add the 'location' to any setup, it doesn't calculate the final variable anymore.

As the final variable is relevant for my results, even if it is directly colinear with another variable, I am lost as for what to do.

Thanks in advance for your help!

The model with just the variable location (R makes it into a dummy automatically):

model = glm(formula = taeni_number ~ 0 + location,
                         data=data, family=quasipoisson(link="log"))

The model with the other explanatory variables first:

model = glm(formula = taeni_number ~ 0 + grass + gravel + multi + location,
                         data=data, family=quasipoisson(link="log"))

The used dataset is:

structure(list(grass = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), gravel = c(1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), multi = c(0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), taeni_number = c(0L, 
0L, 0L, 0L, 1L, 1L, 1L, 2L, 0L, 0L, 2L, 0L, 1L, 3L, 0L, 0L, 0L, 
1L, 8L, 34L, 5L, 11L, 19L,

<details>
<summary>英文:</summary>

I have an issue with adding one of my (dummy) variables into my poisson model. The main issue is that when I run my model with just this dummy as the explanatory variable (it consists of six dummy variables) and no intercept, it works perfectly fine. But when I try to add it on top of an existing model with more explanatory variables, it doesn&#39;t compute the final variable because of colinearity errors. That&#39;s confusing me, as it works normally when just the dummies are in the model. 

I have already tried changing up the order of the explanatory variables, but as soon as I add the &#39; location&#39; to any setup it doesn&#39;t calculate the final variable anymore.

As the final variable is relevant for my results, even if it is directly colinear with another variable, I am lost as for what to do

Thanks in advance for your help!

The model with just the variable location (R makes it into a dummy automatically):

model = glm(formula = taeni_number ~ 0 + location,
data=data, family=quasipoisson(link="log"))


The model with the other explanatory variables first:

model = glm(formula = taeni_number ~ 0 + grass + gravel + multi + location,
data=data, family=quasipoisson(link="log"))


The used dataset is: 

structure(list(grass = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L), gravel = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L), multi = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
1L), taeni_number = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 0L, 0L,
2L, 0L, 1L, 3L, 0L, 0L, 0L, 1L, 8L, 34L, 5L, 11L, 19L, 7L, 0L,
1L, 2L, 0L, 1L, 0L, 14L, 19L, 10L, 15L, 8L, 3L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 3L, 1L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 4L, 9L,
5L, 8L, 8L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 6L, 15L, 2L, 2L, 9L,
3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 2L, 0L, 1L, 2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
6L, 1L, 1L, 4L, 1L), location = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L,
4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L,
6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
5L, 6L, 6L, 6L, 6L, 6L, 6L), levels = c("0", "1", "2", "3", "4",
"5"), class = "factor")), class = "data.frame", row.names = c(NA,
-108L))



</details>


# 答案1
**得分**: 1

以下是代码部分的翻译:

```r
问题实际上是`grass`,`gravel`和`multi`也完全共线。在任何一行中,这些值中只有一个值为1。由于您已经排除了截距项,可以估算`grass`,`gravel`和`multi`的所有系数,但由于位置的虚拟变量也是完全共线的,其中一个被删除。如果您更改变量的排序,将位置放在首位,您会看到共线变量的最后一个系数(例如,本例中的`multi`)被删除。

```r
glm(formula = taeni_number ~ 0 + location + gravel + grass + multi,
    data=d, family=quasipoisson(link="log"))

编辑:添加请求的测试

建议的测试无需通过包含所有组的水平来计算。您可以按照传统方式执行此操作 - 为每个分类变量保留一个参考组,然后在事后执行测试。一种方法是执行广义线性假设检验,这将允许您测试所有值对之间的差异。首先,我们可以处理数据并使用包含分类变量的模型重新估算数据:

library(dplyr)
library(tidyr)
library(multcomp)

d <- d %>%
  mutate(substrate = multi + 2*gravel + 3*grass, 
         substrate = factor(substrate, labels=c("Multiple", "Gravel", "Grass")))

model <- glm(formula = taeni_number ~ location + substrate,
            data=d, family=quasipoisson(link="log"))

然后,您可以使用multcomp包中的glht()和其摘要方法来测试所有成对比较。请注意,我已经通过在下面的adjusted()函数中更改type参数来请求不进行多重性校正的测试:

pwc_substrate <- summary(glht(model, linfct=mcp(substrate = "Tukey")), 
                         test=adjusted(type="none"))

pwc_substrate

这告诉您,Grass的预测显著高于Multiple或Gravel,但Gravel和Multiple彼此之间没有显著差异。以下是位置的一部分:

pwc_location <- summary(glht(model, linfct=mcp(location = "Tukey")), 
                        test=adjusted(type="none"))

pwc_location

这有点难以理解,但如果您重新排序位置因子的水平,以平均捕获的动物数量递增的方式,结果会变得更清晰(尽管请注意,它们完全相同,只是重新排序):

d$location <- reorder(d$location, d$taeni_number, mean)
model <- glm(formula = taeni_number ~ location + substrate,
             data=d, family=quasipoisson(link="log"))

pwc_location <- summary(glht(model, linfct=mcp(location = "Tukey")), 
                        test=adjusted(type="none"))

pwc_location

在这里,您可以看到5和3显著大于所有其他水平,但根据输出的最后一行,5和3之间没有显著差异。此外,位置0、1、2和4之间也没有显著差异。这可能是您需要的全部内容,但还有一些可视化这些结果的方法。我有一个在psre包中的可视化方法,您可以通过以下方式安装:

install.packages("remotes")
remotes::install_github("davidaarmstrong/psre")

除了您上面所做的工作之外,您还需要使用ggpredict()为每个变量创建效应:

library(psre)
library(ggeffects)
library(ggplot2)

eff_substrate <- ggpredict(model, 
                 "substrate", 
                 ci.lvl = .95)

eff_location <- ggpredict(model, 
                      "location", 
                      ci.lvl = .95)

然后,您可以从上面生成的成对比较中提取紧凑字母显示信息:

cld_substrate <- cld(pwc_substrate)
cld_location <- cld(pwc_location)
substrate_lmat <- cld_substrate$mcletters$LetterMatrix
location_lmat <- cld_location$mcletters$LetterMatrix

接下来,按照捕获的动物数量的预测值重新排序因子水平:

eff_substrate$x <- reorder(eff_substrate$x, eff_substrate$predicted, mean)
eff_location$x <- reorder(eff_location$x, eff_location$predicted, mean)

最后,将效应和字母信息传递给letter_plot()函数。带有线的点是带有95%置信区间的预测值。右侧边缘的点表示紧凑字母显示。具有相同字母的组之间没有显著差异。具有不同字母的组之间存在显著差异。

letter_plot(eff_substrate, substrate_lmat) + 
  labs(x="Predicted # Animals Caught\n(95% Confidence Interval)") + 
  ggtitle("Effect of Substrate")
letter_plot(eff_location, location_lmat) + 
  labs(x="Predicted # Animals Caught\n(95% Confidence Interval)") + 
  ggtitle("Effect of Location")

希望这可以帮助您理解和可视化模型的结果。

英文:

The problem is actually that grass, gravel and multi are perfectly collinear as well. One and only one of those values will be 1 in any row. Since you've taken the intercept out, all coefficients for grass, gravel and multi can be estimated, but since the location dummies are also perfectly collinear one of them is dropped. If you change the ordering of the variables, putting location first, you'll see that the last coefficient from the collinear variables (e.g., multi in this case) is dropped.

glm(formula = taeni_number ~ 0 + location + gravel + grass + multi ,
+             data=d, family=quasipoisson(link=&quot;log&quot;))

# Call:  glm(formula = taeni_number ~ 0 + location + gravel + grass + 
#     multi, family = quasipoisson(link = &quot;log&quot;), data = d)
# 
# Coefficients:
# location0  location1  location2  location3  location4  location5     gravel      grass      multi  
#   -2.4634    -1.0771    -1.9526     1.2583    -1.0771     1.2255     0.5193     1.1605         NA  
# 
# Degrees of Freedom: 108 Total (i.e. Null);  100 Residual
# Null Deviance:	    850.9 
# Residual Deviance: 236.5 	AIC: NA

Edit: Adding Requested Tests

The tests suggested in the comments do not need to be computed by including levels for all groups. You can do this in the conventional way - leaving a reference group for each categorical variable and then doing the tests after the fact. One way would be to do a generalized linear hypothesis test - which would allow you to test the difference between all pairs of values. First, we can manage the data and re-estimate the model with the categorical variables included:

library(dplyr)
library(tidyr)
library(multcomp)
 
d &lt;- structure(list(grass = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L), gravel = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L), multi = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 
0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 
1L), taeni_number = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 0L, 0L, 
2L, 0L, 1L, 3L, 0L, 0L, 0L, 1L, 8L, 34L, 5L, 11L, 19L, 7L, 0L, 
1L, 2L, 0L, 1L, 0L, 14L, 19L, 10L, 15L, 8L, 3L, 0L, 1L, 0L, 0L, 
0L, 0L, 0L, 3L, 1L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 4L, 9L, 
5L, 8L, 8L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 6L, 15L, 2L, 2L, 9L, 
3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 2L, 0L, 1L, 2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 
6L, 1L, 1L, 4L, 1L), location = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 
4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 
6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 
5L, 6L, 6L, 6L, 6L, 6L, 6L), levels = c(&quot;0&quot;, &quot;1&quot;, &quot;2&quot;, &quot;3&quot;, &quot;4&quot;, 
&quot;5&quot;), class = &quot;factor&quot;)), class = &quot;data.frame&quot;, row.names = c(NA, 
-108L))
d &lt;- d %&gt;% 
  mutate(substrate = multi + 2*gravel + 3*grass, 
         substrate = factor(substrate, labels=c(&quot;Multiple&quot;, &quot;Gravel&quot;, &quot;Grass&quot;)))

model &lt;- glm(formula = taeni_number ~ location + substrate,
            data=d, family=quasipoisson(link=&quot;log&quot;))

You a use glht() from multcomp and its summary method to test all pairwise comparisons. Note, I have requested no multiplicity correction for the test by that can be changed by changing the type argument in the adjusted() function below:

pwc_substrate &lt;- summary(glht(model, linfct=mcp(substrate = &quot;Tukey&quot;)), 
                         test=adjusted(type=&quot;none&quot;))

pwc_substrate
#&gt; 
#&gt;   Simultaneous Tests for General Linear Hypotheses
#&gt; 
#&gt; Multiple Comparisons of Means: Tukey Contrasts
#&gt; 
#&gt; 
#&gt; Fit: glm(formula = taeni_number ~ location + substrate, family = quasipoisson(link = &quot;log&quot;), 
#&gt;     data = d)
#&gt; 
#&gt; Linear Hypotheses:
#&gt;                        Estimate Std. Error z value Pr(&gt;|z|)    
#&gt; Gravel - Multiple == 0   0.5193     0.2869   1.810  0.07031 .  
#&gt; Grass - Multiple == 0    1.1605     0.2604   4.457  8.3e-06 ***
#&gt; Grass - Gravel == 0      0.6412     0.2165   2.961  0.00306 ** 
#&gt; ---
#&gt; Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
#&gt; (Adjusted p values reported -- none method)

This tells you that Grass has significantly higher predictions than either Multiple or Gravel, but that Gravel and Multiple are not different from each other. The one for Location is below:

pwc_location &lt;- summary(glht(model, linfct=mcp(location = &quot;Tukey&quot;)), 
                        test=adjusted(type=&quot;none&quot;))

pwc_location
#&gt; 
#&gt;   Simultaneous Tests for General Linear Hypotheses
#&gt; 
#&gt; Multiple Comparisons of Means: Tukey Contrasts
#&gt; 
#&gt; 
#&gt; Fit: glm(formula = taeni_number ~ location + substrate, family = quasipoisson(link = &quot;log&quot;), 
#&gt;     data = d)
#&gt; 
#&gt; Linear Hypotheses:
#&gt;              Estimate Std. Error z value Pr(&gt;|z|)    
#&gt; 1 - 0 == 0  1.386e+00  1.005e+00   1.379    0.168    
#&gt; 2 - 0 == 0  5.108e-01  1.137e+00   0.449    0.653    
#&gt; 3 - 0 == 0  3.722e+00  9.101e-01   4.089 4.32e-05 ***
#&gt; 4 - 0 == 0  1.386e+00  1.005e+00   1.379    0.168    
#&gt; 5 - 0 == 0  3.689e+00  9.104e-01   4.052 5.08e-05 ***
#&gt; 2 - 1 == 0 -8.755e-01  8.291e-01  -1.056    0.291    
#&gt; 3 - 1 == 0  2.335e+00  4.709e-01   4.960 7.06e-07 ***
#&gt; 4 - 1 == 0  1.110e-15  6.359e-01   0.000    1.000    
#&gt; 5 - 1 == 0  2.303e+00  4.716e-01   4.883 1.05e-06 ***
#&gt; 3 - 2 == 0  3.211e+00  7.105e-01   4.519 6.20e-06 ***
#&gt; 4 - 2 == 0  8.755e-01  8.291e-01   1.056    0.291    
#&gt; 5 - 2 == 0  3.178e+00  7.109e-01   4.470 7.81e-06 ***
#&gt; 4 - 3 == 0 -2.335e+00  4.709e-01  -4.960 7.06e-07 ***
#&gt; 5 - 3 == 0 -3.279e-02  1.995e-01  -0.164    0.869    
#&gt; 5 - 4 == 0  2.303e+00  4.716e-01   4.883 1.05e-06 ***
#&gt; ---
#&gt; Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
#&gt; (Adjusted p values reported -- none method)

It's a bit difficult to see what's going on here, but if you re-order the levels of the location factor to be increasing in the average number of animals collected, the results become a bit clearer (though note that they are exactly the same, just reordered).

d$location &lt;- reorder(d$location, d$taeni_number, mean)
model &lt;- glm(formula = taeni_number ~ location + substrate,
             data=d, family=quasipoisson(link=&quot;log&quot;))

pwc_location &lt;- summary(glht(model, linfct=mcp(location = &quot;Tukey&quot;)), 
                        test=adjusted(type=&quot;none&quot;))

pwc_location
#&gt; 
#&gt;   Simultaneous Tests for General Linear Hypotheses
#&gt; 
#&gt; Multiple Comparisons of Means: Tukey Contrasts
#&gt; 
#&gt; 
#&gt; Fit: glm(formula = taeni_number ~ location + substrate, family = quasipoisson(link = &quot;log&quot;), 
#&gt;     data = d)
#&gt; 
#&gt; Linear Hypotheses:
#&gt;             Estimate Std. Error z value Pr(&gt;|z|)    
#&gt; 2 - 0 == 0 5.108e-01  1.137e+00   0.449    0.653    
#&gt; 1 - 0 == 0 1.386e+00  1.005e+00   1.379    0.168    
#&gt; 4 - 0 == 0 1.386e+00  1.005e+00   1.379    0.168    
#&gt; 5 - 0 == 0 3.689e+00  9.104e-01   4.052 5.08e-05 ***
#&gt; 3 - 0 == 0 3.722e+00  9.101e-01   4.089 4.32e-05 ***
#&gt; 1 - 2 == 0 8.755e-01  8.291e-01   1.056    0.291    
#&gt; 4 - 2 == 0 8.755e-01  8.291e-01   1.056    0.291    
#&gt; 5 - 2 == 0 3.178e+00  7.109e-01   4.470 7.81e-06 ***
#&gt; 3 - 2 == 0 3.211e+00  7.105e-01   4.519 6.20e-06 ***
#&gt; 4 - 1 == 0 1.110e-15  6.359e-01   0.000    1.000    
#&gt; 5 - 1 == 0 2.303e+00  4.716e-01   4.883 1.05e-06 ***
#&gt; 3 - 1 == 0 2.335e+00  4.709e-01   4.960 7.06e-07 ***
#&gt; 5 - 4 == 0 2.303e+00  4.716e-01   4.883 1.05e-06 ***
#&gt; 3 - 4 == 0 2.335e+00  4.709e-01   4.960 7.06e-07 ***
#&gt; 3 - 5 == 0 3.279e-02  1.995e-01   0.164    0.869    
#&gt; ---
#&gt; Signif. codes:  0 &#39;***&#39; 0.001 &#39;**&#39; 0.01 &#39;*&#39; 0.05 &#39;.&#39; 0.1 &#39; &#39; 1
#&gt; (Adjusted p values reported -- none method)

Here, you can see that 5 and 3 are significantly bigger than all other levels, but according to the last row of the output, there is no significant difference between 5 and 3. Further, there are no significant differences among locations 0, 1, 2, and 4. This may be all you need, but there are some ways of visualizing these results, too. I have one that's in the psre packages, which you an install with

install.packages(&quot;remotes&quot;)
remotes::install_github(&quot;davidaarmstrong/psre&quot;)

In addition to what you've done above, you'll need to create the effects for each of the variables using ggpredict():

library(psre)
library(ggeffects)
library(ggplot2)

eff_substrate &lt;- ggpredict(model, 
                 &quot;substrate&quot;, 
                 ci.lvl = .95)

eff_location &lt;- ggpredict(model, 
                      &quot;location&quot;, 
                      ci.lvl = .95)

Then, you can extract the compact letter display information from the pairwise comparisons generated above:

cld_substrate &lt;- cld(pwc_substrate)
cld_location &lt;- cld(pwc_location)
substrate_lmat &lt;- cld_substrate$mcletters$LetterMatrix
location_lmat &lt;- cld_location$mcletters$LetterMatrix

Next, reorder the factor levels by predicted number of animals caught:

eff_substrate$x &lt;- reorder(eff_substrate$x, eff_substrate$predicted, mean)
eff_location$x &lt;- reorder(eff_location$x, eff_location$predicted, mean)

Finally, pass the effects and letter information to the letter_plot() function. The points with lines through them are predicted values with 95% confidence intervals. The points in the right-hand margin of the plot represent the compact letter display. Groups with the same letters are not significantly different from each other. Groups with different letters are significantly different from each other.

letter_plot(eff_substrate, substrate_lmat) + 
  labs(x=&quot;Predicted # Animals Caught\n(95% Confidence Interval)&quot;) + 
  ggtitle(&quot;Effect of Substrate&quot;)
#&gt; Joining with `by = join_by(x)`
#&gt; Warning: Removed 1 rows containing missing values (`geom_point()`).
#&gt; Warning: Removed 2 rows containing missing values (`geom_point()`).

如何在R中修复glm变量中的奇异性。<!-- -->


letter_plot(eff_location, location_lmat) + 
  labs(x=&quot;Predicted # Animals Caught\n(95% Confidence Interval)&quot;) + 
  ggtitle(&quot;Effect of Location&quot;)
#&gt; Joining with `by = join_by(x)`
#&gt; Warning: Removed 2 rows containing missing values (`geom_point()`).
#&gt; Warning: Removed 4 rows containing missing values (`geom_point()`).

如何在R中修复glm变量中的奇异性。<!-- -->

<sup>Created on 2023-05-26 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年5月25日 17:55:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/76331023.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定