How to find columns with three or fewer distinct values
Question
I'm using the Boston Housing data set from the MASS package, and working with splines from the gam package in R. However, an error is returned with this code:
library(gam)
library(MASS)
library(tidyverse)
Boston.gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + s(dis) + s(rad) + s(tax) + s(ptratio) + s(black) + s(lstat), data = Boston)
The error message is:
A smoothing variable encountered with 3 or less unique values; at least 4 needed
The variable causing the issue is chas, which has only two values, 1 and 0.
What is a test to determine if a column has 3 or fewer unique values so it can be eliminated from the spline analysis?
Answer 1
Score: 3
base R
data("Boston", package = "MASS")
head(Boston)
# crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
# 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
# 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
# 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
# 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
# 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
# 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
head(Filter(function(z) length(unique(z)) >= 4, Boston))
# crim zn indus nox rm age dis rad tax ptratio black lstat medv
# 1 0.00632 18 2.31 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
# 2 0.02731 0 7.07 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
# 3 0.02729 0 7.07 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
# 4 0.03237 0 2.18 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
# 5 0.06905 0 2.18 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
# 6 0.02985 0 2.18 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
head(Boston[, sapply(Boston, function(z) length(unique(z)) >= 4)])
dplyr
library(dplyr)
select(Boston, where(~ n_distinct(.) >= 4)) %>%
head()
### same result
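Once the columns are identified, the same test can drive the model formula instead of dropping data. The following is only a sketch under a few assumptions: medv is the response, every predictor with at least 4 distinct values gets wrapped in s(), and the rest (here chas) enter as plain linear terms. That last choice is one option, not something stated above, and the names smooth_vars, linear_vars and gam_formula are made up for illustration.

```r
# Sketch: build the smoothing formula from columns with >= 4 distinct values.
data("Boston", package = "MASS")

# medv is the response, so count distinct values for the predictors only
predictors <- setdiff(names(Boston), "medv")
n_unique <- sapply(Boston[predictors], function(z) length(unique(z)))

smooth_vars <- predictors[n_unique >= 4]  # enough distinct values for s()
linear_vars <- predictors[n_unique < 4]   # e.g. chas; kept as plain terms (assumption)

# Right-hand side: s(x) for smooth terms, x for everything else
rhs <- c(sprintf("s(%s)", smooth_vars), linear_vars)
gam_formula <- reformulate(rhs, response = "medv")
print(gam_formula)

# With library(gam) attached, the formula can then be passed on:
# Boston.gam <- gam(gam_formula, data = Boston)
```

If you would rather exclude the low-cardinality columns entirely, pass only sprintf("s(%s)", smooth_vars) to reformulate().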
Answer 2
Score: 2
Would this work?
You can use dplyr::n_distinct() to perform the uniqueness check.
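# Needs purrr (map_dbl) and dplyr (n_distinct, %>%); both are attached by
# library(tidyverse) in the question
# Number of unique values per column
n_unique_vals <- map_dbl(Boston, n_distinct)
# Names of columns with >= 4 unique values
keep <- names(n_unique_vals)[n_unique_vals >= 4]
# Model data: keep only those columns
gam_data <- Boston %>%
  dplyr::select(all_of(keep))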
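One thing to watch if you reuse this on other data: both unique() and n_distinct() count NA as a distinct value by default, so a column with three real values plus some missing ones would pass a ">= 4" check. The original post does not raise this, so treat the following as an optional sketch. The helper has_min_distinct and the toy data frame are hypothetical, written in base R only.

```r
# Hypothetical helper: TRUE when x has at least `min_n` distinct non-missing values
has_min_distinct <- function(x, min_n = 4) {
  length(unique(x[!is.na(x)])) >= min_n
}

# Toy data, made up for illustration
toy <- data.frame(
  a = c(1, 2, 3, NA, 1),  # 3 distinct values plus NA
  b = c(1, 2, 3, 4, 5),   # 5 distinct values
  c = c(0, 1, 0, 1, 0)    # 2 distinct values, like chas
)

# One logical per column; vapply guarantees the result type
keep <- vapply(toy, has_min_distinct, logical(1))
print(keep)

# drop = FALSE keeps a data frame even when a single column survives
filtered <- toy[, keep, drop = FALSE]
```

With dplyr, n_distinct(x, na.rm = TRUE) should give the same count without the manual NA filter.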