英文:
Nested sample design in mgcv syntax
问题
以下是翻译好的内容:
我在将示例设计正确翻译为mgcv包的随机效应语法方面遇到了问题。
这是设置:我们每个“收集”站点都会拖鱼,每个月相同的12个站点(希望如此),连续7个月,每年一次(这些是我认为的随机截距)。然而,有时数据库中的站点要么由于意外而被跳过(NA),要么被备用站点替代(当恶劣条件阻止我们在那里拖网时,附近的站点会替代原来的站点)。数据中还存在重复测量的因素,其中同一站点记录了许多鱼的长度(这只是一个协变量,但认为随机截距也用于重复测量)。如果我的目标是忠实地表示这个设计,我应该如何构造我的随机效应?是否应该类似于这样(?):
(fSite = 因子站点; fCYR = 因子日历年)
s(fSite, bs='re') + s(fCYR, , bs='re')
或者
s(fSite, fCYR, bs='re') - 如果每个fCYR级别中都存在相同的12个站点,这是否只适用?
数据:
> dput(station)
structure(list(CYR = c(2009, 2009, 2009, 2009, 2009, 2009, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2013, 2013,
2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014,
2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2018,
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019,
2019, 2019, 2019, 2019, 2019, 2019), Month = c(6, 7, 8, 9, 10,
11, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8,
9, 10, 11, 4, 5, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10, 5, 6, 7,
8, 9, 10, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 1,
3, 5, 6, 7, 8, 9, 10, 11, 2, 3, 5, 6, 7, 8, 9, 10, 11), `Coll. Site 1` = c(21,
23, 22, 23, 22, 23, 22, 22, 21, 68, 23, 22, 22, 70, 20, 23, 22,
20, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 22, 22, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
22, 22, NA, NA, 22, 22, 22, NA, NA, 22, 22, 22, 22, 22), `Coll. Site 2` = c(40,
70, 70, 70, 68, 68, 68, 68, 70, 101, 112, 70, 70, 143, 70, 70,
68, 67, 70, 67, 67, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70),
`Coll. Site 3` = c(70, 112, 112, 122, 123, 112, 122, 122,
111, 136, 134, 124, 135, 158, 122, 123, 124, 124, 112, 123,
123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123,
<details>
<summary>英文:</summary>
I'm having trouble translating my sample design into the correct mgcv package syntax for random effects.
Here is the set-up: we trawl for fish, once per "collection" site, at the same 12 sites per month (hopefully), for 7 consecutive months, every year (these are my random intercepts, I think). However, sometimes sites in the database are either skipped by accident (NA), or are replaced by back-up sites (which is a near-by site to replace the intended one when poor conditions hinder us from trawling there). There is also a repeated measures aspect to the data wherein many fish lengths are recorded within the same site (this is just a covariate, but thought random intercepts are also for repeated measures). If my goal is to faithfully represent this design, how should I structure my random effects? Something like this (?):
(fSite = factor site; fCYR = factor calendar year)
**s(fSite, bs='re') +
s(fCYR, , bs='re')**
or
**s(fSite, fCYR, bs='re')** <- Does this only work if the same 12 sites are present in each fCYR level?
Data:
> dput(station)
structure(list(CYR = c(2009, 2009, 2009, 2009, 2009, 2009, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011,
2011, 2011, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2013, 2013,
2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2014, 2014,
2015, 2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016,
2016, 2016, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2018,
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019,
2019, 2019, 2019, 2019, 2019, 2019), Month = c(6, 7, 8, 9, 10,
11, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8,
9, 10, 11, 4, 5, 6, 7, 8, 9, 10, 5, 6, 7, 8, 9, 10, 5, 6, 7,
8, 9, 10, 5, 6, 7, 8, 9, 10, 11, 5, 6, 7, 8, 9, 10, 11, 12, 1,
3, 5, 6, 7, 8, 9, 10, 11, 2, 3, 5, 6, 7, 8, 9, 10, 11), `Coll. Site 1` = c(21,
23, 22, 23, 22, 23, 22, 22, 21, 68, 23, 22, 22, 70, 20, 23, 22,
20, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 21, 22, 22, 22,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22,
22, 22, NA, NA, 22, 22, 22, NA, NA, 22, 22, 22, 22, 22), `Coll. Site 2` = c(40,
70, 70, 70, 68, 68, 68, 68, 70, 101, 112, 70, 70, 143, 70, 70,
68, 67, 70, 67, 67, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,
70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70),
`Coll. Site 3` = c(70, 112, 112, 122, 123, 112, 122, 122,
111, 136, 134, 124, 135, 158, 122, 123, 124, 124, 112, 123,
123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123,
123, 123, 123, 123, 123, 123, 124, 123, 123, 123, 123, 123,
123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123,
123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123, 123,
123, 123, 123, NA, 123, 123, 123, 123, 123, 123, 123), `Coll. Site 4` = c(147,
147, 133, 134, 146, 159, 54, 134, 134, 159, 156, 145, 146,
167, 159, 146, 144, 146, 147, 146, 145, 146, 146, 146, 146,
146, 146, 146, 146, 146, 146, 146, 145, 146, 146, 146, 146,
146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146,
146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146,
146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146, 146,
146, 146, 146, 146, 146, 146), `Coll. Site 5` = c(176, 173,
168, 168, 169, 172, 168, 168, 168, 168, 168, 169, 167, 241,
169, 172, 146, 168, 168, 169, 167, 168, 168, 168, 168, 168,
168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
168, 168, 168, 168, 168, 168, 168, 168, NA, 168, 168, 168,
168, 168, 168, 168, 168), `Coll. Site 6` = c(241, 258, 241,
241, 241, 241, 144, 241, 169, 241, 241, 241, 241, 266, 241,
241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 240, 241,
241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241,
241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 240,
241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241,
241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241, 241,
241, 241, 241, 241), `Coll. Site 7` = c(280, 279, 266, 266,
253, 266, 24, 266, 241, 270, 266, 266, 266, 270, 270, 266,
266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 257, 266,
266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266,
266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266,
257, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266,
266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266, 266,
266, NA, 266), `Coll. Site 8` = c(281, 283, 270, 270, 257,
270, NA, 270, 266, 279, 270, 270, 270, 301, 301, 270, 270,
270, 270, 270, 270, NA, 270, 270, 270, 270, 265, 270, 270,
270, 270, 270, 270, 270, 270, NA, 270, 270, 270, 270, 270,
270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 266,
270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270,
270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270, 270,
270, 270), `Coll. Site 9` = c(314, 302, 301, 314, 301, 301,
NA, 301, 270, 301, 301, 301, 301, 402, 402, 301, 301, 301,
314, 301, 314, 312, 301, 301, 302, 301, 301, 301, 301, 301,
NA, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301,
301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301,
301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301,
301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301, 301,
301), `Coll. Site 10` = c(401, 314, 303, 609, 314, 402, NA,
314, 301, 314, 314, 314, 314, 618, 618, 314, 314, 314, 403,
402, 402, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314,
314, 314, 314, 314, 314, 314, 314, 315, 314, 314, 314, 314,
314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314,
314, 314, 314, 314, 314, 314, 314, 314, NA, 314, 314, 314,
314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314, 314
), `Coll. Site 11` = c(618, 402, 402, 618, 402, 618, NA,
402, 314, 402, 402, 402, 402, 314, 266, 402, 402, 402, 609,
618, 618, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402,
402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402,
402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402,
402, NA, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402,
402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402, 402
), `Coll. Site 12` = c(NA, 621, 618, NA, 621, NA, NA, 618,
402, 618, 619, 618, 618, NA, NA, 621, 618, 618, 616, 314,
301, 618, 621, 618, 618, 618, 616, 618, 618, 618, 618, 618,
618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618,
618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618,
618, 618, 618, 618, 618, 618, 621, 618, 618, 618, 618, 618,
618, 618, 618, 618, 618, 618, 618, 618, 618, 618, 618)), row.names = c(NA,
-79L), class = c("tbl_df", "tbl", "data.frame"))
</details>
# 答案1
**得分**: 1
> 这只有在每个 `fCYR` 层次中都存在相同的12个站点时才有效吗?
试一试就知道了。简而言之:是的,它有效;不,它不需要每年都存在相同的12个站点。关键是唯一的站点应该被唯一地编码。然而,在模型中同时使用你提到的随机效应形式可能会有益,因为它可以导致更简洁的拟合。
假设你的数据在 `df` 中,而且已经按我认为的模型格式进行了正确的格式化
```r
library("mgcv")
library("tidyr")
library("dplyr")
library("janitor")
df2 <- pivot_longer(df, cols = c(-Month, -CYR), names_to = "site",
values_to = "value", names_prefix = "^Coll\\. ") %>%
janitor::clean_names() %>%
mutate(fcyr = factor(cyr), site = factor(site))
我们可以拟合三个模型来比较随机效应的作用:
knots <- list(month = c(0.5, 12.5))
m1 <- gam(value ~ s(month, bs = "cc") +
s(fcyr, bs = "re") + s(site, bs = "re"),
data = df2, family = nb(), knots = knots)
m2 <- gam(value ~ s(month, bs = "cc") + s(fcyr, site, bs = "re"),
data = df2, family = nb(), knots = knots)
m3 <- gam(value ~ s(month, bs = "cc") +
s(fcyr, bs = "re") + s(site, bs = "re") +
s(fcyr, site, bs = "re"),
data = df2, family = nb(), knots = knots)
由于有132个唯一的(站点,年份)对
> with(df2, nlevels(interaction(fcyr, site, drop = TRUE)))
[1] 132
m2
使用了接近131个 EDF(由于模型截距的原因,会损失一个 df,如果我没记错)
> model_edf(m1)
# A tibble: 1 × 2
model edf
<chr> <dbl>
1 m1 19.8
r$> model_edf(m2)
# A tibble: 1 × 2
model edf
<chr> <dbl>
1 m2 132.
r$> model_edf(m3)
# A tibble: 1 × 2
model edf
<chr> <dbl>
1 m3 58.1
而包含了所有三个 ranef 项的 m3
仅使用了约58个 EDF。
正如你的评论所猜测的那样,如果你包括站点和年份的随机效应以及它们的组合,与仅拟合 site:fcyr
形式的 m2
相比,你将得到对数据更简洁的拟合。
m2
和 m3
都允许你估计所有132个(站点,年份)对的随机截距。那么两者有什么区别呢?
正如你的评论所推测的那样,这两个“主要”随机效应编码了每个站点的平均效应和每个年份的平均效应。然后,s(site, fcyr, bs = "re")
项估计了各个(站点,年份)对的离均值偏差。一旦你估计了平均的 site
和 year
效应,由于年份引起的关于各个站点的偏差(或等效地,关于各个站点均值的年份的偏差)相对较小。因此,s(site,fcyr, bs = "re")
项仅使用了约40个 EDF;很大部分潜在的偏差被压缩掉,因为随机 site
和 year
效应已经在模型中考虑进去了。
所以在这种情况下,你可以使用 任何一个 形式:
~ s(fcyr, bs = "re") + s(site, bs = "re") + s(fcyr, site, bs = "re")
,或者~ s(fcyr, site, bs = "re")
但选项1更为简洁。然而,情况可能并非总是如此。如果站点内部和年份内部的变化很大,那么将分解形式(1.)与2.没有优势也不是不可想象的。
随机效应的主要问题是确保分组标签是唯一的。假设你有3个字段,并且从每个字段中的6个位置重复测量。你不应该将这6个位置在每个字段中编码为 loc1
、loc2
、...、loc6
,因为模型会认为所有标记为 location
等于 loc1
的行都是同一个位置。假设字段标记为 A
、B
和 C
,你应该将位置编码为 locA1
、locA2
、...、locB1
、...、locC6
,以避免任何歧义。在 R 中,随机效应软件中经常涉及到这种问题;在那里,使用诸如 lme()
和 lmer()
的公式接口,如果你说 location
嵌套在 field
中,你可以使用 loc1
编码形式 如果 你这么说。因此,有关交叉或嵌套随机效应的讨论会出现。如果你正确编码你的分组变量,整个讨论都会消失。
在 mgcv
英文:
> Does this only work if the same 12 sites are present in each fCYR
level?
Try it and see. tldr: yes, it works; no it doesn't require the same 12 sites to be present in each year. The key thing is that unique sites should be coded uniquely. However, using both the random effect forms you mention in the model can be beneficial as it can lead to a more parsimonious fit.
Assuming your data is in df
and properly formatted for what I think your model is
library("mgcv")
library("tidyr")
library("dplyr")
library("janitor")
df2 <- pivot_longer(df, cols = c(-Month, -CYR), names_to = "site",
values_to = "value", names_prefix = "^Coll\\. ") |>
janitor::clean_names() |>
mutate(fcyr = factor(cyr), site = factor(site))
we can fit three models to compare what random effects are doing:
knots <- list(month = c(0.5, 12.5))
m1 <- gam(value ~ s(month, bs = "cc") +
s(fcyr, bs = "re") + s(site, bs = "re"),
data = df2, family = nb(), knots = knots)
m2 <- gam(value ~ s(month, bs = "cc") + s(fcyr, site, bs = "re"),
data = df2, family = nb(), knots = knots)
m3 <- gam(value ~ s(month, bs = "cc") +
s(fcyr, bs = "re") + s(site, bs = "re") +
s(fcyr, site, bs = "re"),
data = df2, family = nb(), knots = knots)
As there are 132 unique (site, year) pairs
> with(df2, nlevels(interaction(fcyr, site, drop = TRUE)))
[1] 132
m2
uses almost 131 EDF (one df gets lost due to the model intercept IIRC)
> model_edf(m1)
# A tibble: 1 × 2
model edf
<chr> <dbl>
1 m1 19.8
r$> model_edf(m2)
# A tibble: 1 × 2
model edf
<chr> <dbl>
1 m2 132.
r$> model_edf(m3)
# A tibble: 1 × 2
model edf
<chr> <dbl>
1 m3 58.1
while m3
which contains all three ranef terms uses on ~58 EDF.
As your comment speculates, you get a more parsimonious fit to the data if you include the site and year random effects plus their combination than if you just fitted the site:fcyr
form as per m2
.
Both m2
and m3
allow you to estimate a random intercept for all 132 (site,year) pairs. So what's the difference?
As your comment surmises, the two "main" random effects encode the average effects of each site
and the average effect of each year
. The s(site, fcyr, bs = "re")
term then estimates the deviations from the site and year averages for individual (site,year) pairs. Once you have estimated the average site
and year
effects, the deviations around individual sites due to year (or equivalently deviations over years about individual site means) are relatively small. Hence the s(site,fcyr, bs = "re")
term only uses ~40 EDF; much of the potential deviation is shrunk away because the random site
and year
effects are already accounted for in the model.
So in this instance you could use either form:
~ s(fcyr, bs = "re") + s(site, bs = "re") + s(fcyr, site, bs = "re")
, or~ s(fcyr, site, bs = "re")
but option 1 is more parsimonious. That need not be the case however. If the within site and within year variation is large, it is not inconceivable that the decomposed form (1.) will offer no advantage over 2.
The main thing with random effects is to make sure that the groups are labelled uniquely. Say you have 3 fields and take measurements repeatedly from 6 locations within each field. You wouldn't code the six locations as loc1
, loc2
, ..., loc6
in each field, because the model would think that any row with location
equal to loc1
is the same location. Say the fields are labelled A
, B
and C
, you would code the locations as locA1
, locA2
, ..., locB1
, ..., locC6
to avoid any ambiguity. In the mixed effects software world in R, this kind of question comes up a lot; there, with the way the formula interface to things like lme()
and lmer()
work you could get away with the loc1
coding form if you said location
was nested in field
. Hence discussions about crossed or nested random effects. That whole discussion goes away if you code your grouping variable correctly.
With mgcv, if you used the loc1
form and then fitted the formula 1. form of the model you'd get things wrong
y ~ s(field, bs="re") + s(location, bs="re") + s(field, location, bs="re")
as the s(location, bs="re")
term would think all observations labelled loc1
were all the same location ("subject"), when they aren't. In that case, the 2. form of the model would be the correct one
y ~ s(field, location, bs="re")
In summary, which form you used really comes down to how you coded your grouping factors and not whether the design is nested or crossed.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论