How to find columns with three or fewer distinct values

Question

I'm using the Boston Housing data set from the MASS package, and working with splines from the gam package in R. However, an error is returned with this code:

  library(gam)
  library(MASS)
  library(tidyverse)
  Boston.gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + s(dis) + s(rad) + s(tax) + s(ptratio) + s(black) + s(lstat), data = Boston)

The error message is:

  A smoothing variable encountered with 3 or less unique values; at least 4 needed

The variable causing the issue is chas, which has only two values, 1 and 0.

What is a test to determine if a column has 3 or fewer unique values so it can be eliminated from the spline analysis?

Answer 1

Score: 3

base R

  data("Boston", package = "MASS")
  head(Boston)
  #      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
  # 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
  # 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
  # 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
  # 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
  # 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
  # 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

  # Keep only the columns with at least 4 unique values (chas is dropped)
  head(Filter(function(z) length(unique(z)) >= 4, Boston))
  #      crim zn indus   nox    rm  age    dis rad tax ptratio  black lstat medv
  # 1 0.00632 18  2.31 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
  # 2 0.02731  0  7.07 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
  # 3 0.02729  0  7.07 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
  # 4 0.03237  0  2.18 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
  # 5 0.06905  0  2.18 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
  # 6 0.02985  0  2.18 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

  # Alternative: index the columns with a logical vector from sapply()
  head(Boston[, sapply(Boston, function(z) length(unique(z)) >= 4)])
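To see which columns are the offenders rather than silently dropping them, the same `length(unique(z))` test can be turned into a per-column count. The sketch below uses a small made-up data frame (`toy`, with hypothetical column names) instead of Boston so that it runs without any extra packages; the same two lines should apply to Boston unchanged.

```r
# Toy data: `flag` is binary (like chas), `grade` has three levels.
toy <- data.frame(
  value = c(1.5, 2.5, 3.5, 4.5, 5.5),
  flag  = c(0, 1, 0, 1, 0),
  grade = c("x", "y", "z", "x", "y")
)

# Count the distinct values in every column.
n_unique <- vapply(toy, function(z) length(unique(z)), integer(1))

# Names of the columns with three or fewer distinct values.
too_few <- names(n_unique)[n_unique <= 3]
print(n_unique)
print(too_few)
```

Using `vapply()` with `integer(1)` instead of `sapply()` guarantees a plain integer vector even for an empty data frame.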

dplyr

  library(dplyr)
  select(Boston, where(~ n_distinct(.) >= 4)) %>%
    head()
  # same result as the base R versions
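Once the columns with enough distinct values are known, the model formula can be assembled from them instead of being typed by hand. The following is only a sketch under some assumptions: `smoothable_cols()` is a hypothetical helper (not part of gam or MASS), the data frame `toy` stands in for Boston, and keeping the low-cardinality columns as plain linear terms is one possible choice, not something the question asks for.

```r
# Hypothetical helper: names of columns with at least `min_unique` distinct values.
smoothable_cols <- function(df, min_unique = 4) {
  enough <- vapply(df, function(z) length(unique(z)) >= min_unique, logical(1))
  names(df)[enough]
}

# Toy data standing in for Boston: `flag` is binary, like chas.
toy <- data.frame(
  y    = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7),
  x1   = c(0.1, 0.4, 0.9, 1.6, 2.5, 3.6),
  flag = c(0, 1, 0, 1, 0, 1)
)

# Predictors that can be smoothed, and the remaining predictors.
smooth_vars <- setdiff(smoothable_cols(toy), "y")
linear_vars <- setdiff(names(toy), c("y", smooth_vars))

# Wrap smoothable predictors in s(); keep the rest as linear terms.
term_labels <- c(sprintf("s(%s)", smooth_vars), linear_vars)
gam_formula <- reformulate(term_labels, response = "y")
print(gam_formula)
```

With Boston, the response would be `medv` and the resulting formula could then be passed to `gam(gam_formula, data = Boston)`.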

Answer 2

Score: 2

Would this work?

You can use dplyr::n_distinct() to perform the unique check.

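  library(dplyr)  # n_distinct(), select(), all_of(), %>%
  library(purrr)  # map_dbl()
  data("Boston", package = "MASS")

  # Number of unique values in each column
  n_unique_vals <- map_dbl(Boston, n_distinct)
  # Names of columns with >= 4 unique values
  keep <- names(n_unique_vals)[n_unique_vals >= 4]
  # Model data (the dplyr:: prefix avoids a clash with MASS::select())
  gam_data <- Boston %>%
    dplyr::select(all_of(keep))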
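One detail worth checking with either approach is missing data: both `unique()` and `n_distinct()` count `NA` as a distinct value by default, so a column with three real values plus `NA` passes a `>= 4` test. Boston has no missing values, so this does not matter there; whether the smoother counts `NA` is not stated in the question, so treating it as a non-value is an assumption. A minimal base R sketch, with a hypothetical helper name:

```r
# Hypothetical helper: distinct values, ignoring NA.
n_distinct_non_missing <- function(z) {
  length(unique(z[!is.na(z)]))
}

x <- c(0, 1, 2, NA)

with_na    <- length(unique(x))           # NA is counted as a value
without_na <- n_distinct_non_missing(x)   # NA is dropped first
print(c(with_na = with_na, without_na = without_na))
```

In dplyr, `n_distinct(x, na.rm = TRUE)` gives the NA-free count.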

Posted by huangapple on June 1, 2023, 19:49:01
Link: https://go.coder-hub.com/76381561.html