mice creates NaN and NA after imputing cluster-level variable for clustered data with maxit >0

Question

I have clustered data with 21% missing values in the cluster variable, which was derived from a 'date' variable with similar missing data. I'm trying to impute the missing data in the cluster variable without imputing anything else. No other variables have missing data (edit: the data frame has about 20 variables in total). All variables are numeric, logical, factor, or date. I'm using PMM for the imputation model because the variable is continuous but not normally distributed, and I don't want to introduce any values outside the range of the observed values.

When I run mice with maxit = 0, I don't see any logged errors. When I increase maxit to anything other than 0, there are still no errors, but when I view the imputed data sets, all the values in the variable with missing values (cluster) are set to NA, and all the values in all the non-imputed variables (regardless of their predictor value in the matrix) are set to NaN.

I've looked through these resources, some of which have tutorials, but couldn't find any solutions:

https://stefvanbuuren.name/fimd/ch-multilevel.html
https://bookdown.org/mwheymans/bookmi/multiple-imputation-models-for-multilevel-data.html
https://www.gerkovink.com/miceVignettes/
https://www.nerler.com/teaching/fgme2019/MICourse_Slides.pdf
https://nerler.github.io/EP16_Multiple_Imputation/slide/

I've read that collinearity can lead to NA imputed values, but there are no collinearity errors. I tried adding ridge=0.001 and/or threshold=1.1 to make the model more robust, without success. I wondered if having date set as a cluster was a problem, so I tried setting date = 1 (using it as a predictor); that did not give me any collinearity errors (or any other errors, for that matter) but produced the same NA and NaN result. I've also tried setting date = 0 in the predictor matrix. That does not cause any problems with the dry run, but in the model with 5 iterations I get this error, so I don't think that is a solution.

    Error in .imputation.level2(y = y, ry = ry, x = x, type = type, wy = wy,  : 
      No class variable

I'm not sure what the class variable is. Is it the cluster variable?

My code:

    md.pattern(data)
         id x y date_cr cluster
    1154  1 1 1       1       1   0
    304   1 1 1       0       0   2
          0 0 0     304     304 608

    # set predictor matrix
    pm = make.predictorMatrix(data)
    # date has collinearity with cluster - not sure if it matters whether I code it as -2 or 0
    pm[, c("date", "cluster")] = -2
    pm[, c("id")] = 0
    # all variables set to 1 have no missing values
    pm[, c("x", "y", "z")] = 1

    # set imputation method
    impmethod = character(ncol(data))
    names(impmethod) = colnames(data)
    impmethod["cluster"] = "2lonly.pmm"

    # Dry run gives no errors or logged events
    > mi = mice(data, m=5, predictorMatrix = pm, method = impmethod,
                maxit=0, printFlag = TRUE, seed=1)

    # Data is imputed correctly (range in data is 1-99)
    > mi$imp$cluster
         1  2  3  4  5
    782 79 38 34 63 41
    783 45 58 20 85 22
    784  8 54 51 12 61
    785 67 97 66 43 41
    786 32 84  8 14 31
    > mi$chainMean
    , , Chain 5
    x
    y

    # I assume it's OK that these values are blank because there are 0 iterations? Or does this result suggest a problem?

    # However, increasing the number of iterations to anything > 0 causes the imputation to fail without logging any problems:
    > mi5 = mice(data, m=5, predictorMatrix = pm, method = impmethod,
                 maxit=5, printFlag = FALSE, seed=1)
    > mi5$imp$cluster
         1  2  3  4  5
    782 NA NA NA NA NA
    783 NA NA NA NA NA
    784 NA NA NA NA NA
    785 NA NA NA NA NA
    > mi5$chainMean
    , , Chain 5
        1   2   3   4   5
    x NaN NaN NaN NaN NaN
    y NaN NaN NaN NaN NaN

edit:

R version 4.2.2 (2022-10-31)

mice_3.16.0

Answer 1

Score: 1

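The "2lonly.pmm" method imputes missing values at the cluster level. It requires the clustering variable to be part of the imputation model: in the predictor matrix, the clustering variable is the column coded -2, which is the "class variable" named in the error message. If a cluster-level variable takes different values within the same cluster, the imputation method cannot produce imputed values.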
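As a rough illustration of that setup, the sketch below builds a small made-up data set in which `z` is constant within each cluster and missing for whole clusters. The names `toy`, `grp`, `x`, and `z` are hypothetical and do not come from the question. The complete identifier `grp` is marked as the class variable with -2, and `mice` is only run if the package is installed. This is a sketch under those assumptions, not a diagnosis of the asker's data.

```r
# Hypothetical toy data: 20 clusters with 5 rows each (not the asker's data).
set.seed(1)
n_clusters  <- 20
per_cluster <- 5
toy <- data.frame(
  grp = rep(seq_len(n_clusters), each = per_cluster),  # complete cluster identifier
  x   = rnorm(n_clusters * per_cluster)                # complete unit-level predictor
)

# Cluster-level variable: one value per cluster, missing for three whole clusters.
z_by_cluster <- round(runif(n_clusters, min = 1, max = 99))
z_by_cluster[c(3, 8, 15)] <- NA
toy$z <- z_by_cluster[toy$grp]

if (requireNamespace("mice", quietly = TRUE)) {
  # Start from an all-zero predictor matrix and fill in only the row for z.
  pm <- mice::make.predictorMatrix(toy)
  pm[] <- 0
  pm["z", "grp"] <- -2  # class (cluster) variable
  pm["z", "x"]   <- 1   # ordinary predictor

  # Only z is imputed; complete variables keep the empty method "".
  meth <- mice::make.method(toy)
  meth["z"] <- "2lonly.pmm"

  imp <- mice::mice(toy, m = 2, maxit = 2, predictorMatrix = pm,
                    method = meth, printFlag = FALSE, seed = 1)

  # Inspect how many distinct z values each cluster has after imputation.
  completed <- mice::complete(imp, 1)
  print(tapply(completed$z, completed$grp, function(v) length(unique(v))))
}
```

In this sketch the class variable `grp` has no missing values. In the question, `md.pattern` shows the date variable missing in the same 304 rows as `cluster`, so it may be worth checking whether the column coded -2 is actually complete.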

For example, if X1 is a unit-level variable and X2 is a cluster-level variable, imputing the following data would cause problems, because X2 takes two different observed values ("A" and "B") within cluster 1.

    cluster    X1       X2
    1          0        "A"
    1          0.5      "B"
    1          0.5      NA
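One way to look for this kind of inconsistency before imputing is to count the distinct observed values of the cluster-level variable within each cluster. The sketch below uses base R only. Cluster 1 is copied from the table above, while cluster 2 and the object names are added purely for illustration.

```r
# Cluster 1 is taken from the example above; cluster 2 is a made-up consistent cluster.
toy <- data.frame(
  cluster = c(1, 1, 1, 2, 2),
  X1      = c(0, 0.5, 0.5, 1.2, 0.3),
  X2      = c("A", "B", NA, "C", "C"),
  stringsAsFactors = FALSE
)

# Number of distinct non-missing X2 values per cluster.
n_distinct <- tapply(toy$X2, toy$cluster,
                     function(v) length(unique(v[!is.na(v)])))

# A cluster-level variable should have at most one distinct observed value per cluster.
inconsistent <- names(n_distinct)[n_distinct > 1]
print(inconsistent)
```

Any cluster listed in `inconsistent` has conflicting observed values and would need to be resolved before using a cluster-level imputation method.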

huangapple
  • Posted on June 26, 2023, 08:16:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76552891.html