June 1, 2023, 19:49:01

How to find columns with three or fewer distinct values

Question

I'm using the Boston Housing data set from the MASS package, and working with splines from the gam package in R. However, an error is returned with this code:

library(gam)
library(MASS)
library(tidyverse)
Boston.gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + s(dis) + s(rad) + s(tax) + s(ptratio) + s(black) + s(lstat), data = Boston)

The error message is:

A smoothing variable encountered with 3 or less unique values; at least 4 needed

The variable causing the issue is chas, which has only two values, 1 and 0.

What is a test to determine if a column has 3 or fewer unique values so it can be eliminated from the spline analysis?

Answer 1

Score: 3

base R

data("Boston", package = "MASS")
head(Boston)
#      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
# 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
# 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
# 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
# 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
# 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
# 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
head(Filter(function(z) length(unique(z)) >= 4, Boston))
#      crim zn indus   nox    rm  age    dis rad tax ptratio  black lstat medv
# 1 0.00632 18  2.31 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
# 2 0.02731  0  7.07 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
# 3 0.02729  0  7.07 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
# 4 0.03237  0  2.18 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
# 5 0.06905  0  2.18 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
# 6 0.02985  0  2.18 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
head(Boston[, sapply(Boston, function(z) length(unique(z)) >= 4)])

dplyr

library(dplyr)
select(Boston, where(~ n_distinct(.) >= 4)) %>%
  head()
### same result
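
The filters above drop chas from the data frame altogether. If the aim is only to keep it out of the smooth terms, one option is to build the model formula from the column names instead. The following is a minimal sketch in base R, not part of the original answer; the names `predictors`, `smooth_ok`, `smooth_vars`, `linear_vars` and `gam_formula` are illustrative, and the threshold of 4 is taken from the error message quoted in the question.

```r
data("Boston", package = "MASS")

# Predictors only: every column except the response, medv
predictors <- setdiff(names(Boston), "medv")

# TRUE for columns with at least 4 distinct values
# (the minimum named in the error message from the question)
smooth_ok <- sapply(Boston[predictors], function(z) length(unique(z)) >= 4)

smooth_vars <- predictors[smooth_ok]   # candidates for s()
linear_vars <- predictors[!smooth_ok]  # kept as ordinary linear terms

# Build a formula of the form medv ~ s(crim) + ... + chas
gam_formula <- reformulate(
  c(sprintf("s(%s)", smooth_vars), linear_vars),
  response = "medv"
)
gam_formula
```

Assuming the gam package is attached with library(gam), the result could then be passed on as gam(gam_formula, data = Boston). Whether chas belongs in the model as a linear term or should be left out is a modelling decision this sketch does not settle.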

Answer 2

Score: 2

Would this work?

You can use dplyr::n_distinct() to perform the unique check.

library(dplyr)
library(purrr)  # provides map_dbl()

# Number of unique values in each column
n_unique_vals <- map_dbl(Boston, n_distinct)
# Names of columns with >= 4 unique values
keep <- names(n_unique_vals)[n_unique_vals >= 4]
# Model data
gam_data <- Boston %>%
  dplyr::select(all_of(keep))
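
The question asks for a test that flags the problem columns, rather than a filter that keeps the good ones. A rough sketch of such a check in base R follows. The helper `few_unique_cols()` and the toy data frame are hypothetical, made up for illustration, and the helper ignores missing values when counting, which is an assumption about the desired behaviour.

```r
# Hypothetical helper: names of the columns that have fewer than
# `min_unique` distinct non-missing values.
few_unique_cols <- function(df, min_unique = 4) {
  n_unique <- vapply(df, function(z) length(unique(z[!is.na(z)])), integer(1))
  names(df)[n_unique < min_unique]
}

# Toy data: `flag` has 2 distinct values; `grade` has 3 plus one NA
toy <- data.frame(
  x     = c(1.5, 2.1, 3.7, 4.2, 5.9, 6.3),
  flag  = c(0, 1, 0, 0, 1, 0),
  grade = c(1, 2, 3, NA, 2, 1)
)

few_unique_cols(toy)  # returns c("flag", "grade")
```

Note that length(unique(z)) counts NA as a value of its own, so the filters shown earlier could keep a column with three real values plus missing data; dropping NA before counting avoids that.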
