February 14, 2023, 21:19:24

Data Cleaning Error in Classification KNN Algorithm Problem

Question

I believe the error is telling me I have null values in my data and I've tried fixing it but the error keeps appearing. I don't want to delete the null data because I consider it relevant to my analysis.
The columns of my data are in this order: **'Titulo'**, **'Autor'**, **'Género'**, 'Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas', **'Estado'**. The ones in bold are string data.

Code:

import numpy as np
#Load Data
import pandas as pd
dataset = pd.read_excel(r"C:\Users\renat\Documents\Data Science Projects\Classification\Book Purchases\Biblioteca.xlsx")
#print(dataset.columns)

#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

#Handling missing values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

#Convert X and y to NumPy arrays
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,8].values
print(X.shape, y.shape)

# Create a LabelEncoder instance
labelEncoderTitulo = LabelEncoder()
X[:, 0] = labelEncoderTitulo.fit_transform(X[:, 0])

labelEncoderAutor = LabelEncoder()
X[:, 1] = labelEncoderAutor.fit_transform(X[:, 1])

labelEncoderGenero = LabelEncoder()
X[:, 2] = labelEncoderGenero.fit_transform(X[:, 2])

labelEncoderEstado = LabelEncoder()
X[:, -1] = labelEncoderEstado.fit_transform(X[:, -1])

#Instantiate our KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)

y_pred = knn.predict(X)

print(y_pred)

Error Message:
ValueError: Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Answer 1

Score: 1

You have to fit and transform the data with the SimpleImputer you created. From the documentation:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # Here the imputer is created
imputer.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])  # Here the imputer is fitted, i.e. learns the mean

X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imputer.transform(X))  # Here the imputer is applied, i.e. filling the mean

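The crucial parts here are imputer.fit() and imputer.transform(X). In the question's code the imputer is created but never fitted or applied, so X still contains NaN when it reaches knn.fit().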
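To make the two steps concrete, here is a minimal sketch built on the same toy arrays as the documentation example. It shows what the imputer learns during fit() (exposed as the `statistics_` attribute) and that transform() returns a new array, which has to be assigned to a variable to have any effect. The variable names `train`, `learned_means`, `X_filled` and `X_same` are only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() only learns the per-column means; NaN entries are ignored
train = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]], dtype=float)
imputer.fit(train)
learned_means = imputer.statistics_

# transform() returns a NEW array; the input is not modified in place
X = np.array([[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]], dtype=float)
X_filled = imputer.transform(X)

print(learned_means)
print(X_filled)

# fit_transform() does both steps on the same data in one call
X_same = SimpleImputer(strategy="mean").fit_transform(X)
print(np.isnan(X_same).any())
```

Note that the result must be assigned back (for example `X = imputer.fit_transform(X)`); calling the method without keeping its return value leaves the NaN values in place.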

Additionally, I'd use another technique to handle categorical data, since LabelEncoder is not suitable here:

> This transformer should be used to encode target values, i.e. y, and not the input X.

For alternatives, see: "How to consider categorical variables in distance based algorithms like KNN or SVM?"
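As one possible alternative, the sketch below one-hot encodes a categorical feature and imputes and scales the numeric ones inside a single pipeline. This is an assumption about what a reasonable setup could look like, not the method from the linked thread. The small DataFrame is a made-up stand-in for the spreadsheet; only some of the question's column names are reused, and the high-cardinality 'Titulo' and 'Autor' columns are left out on purpose.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data standing in for Biblioteca.xlsx
df = pd.DataFrame({
    "Género": ["Novela", "Ensayo", np.nan, "Novela", "Ensayo", "Novela"],
    "Puntaje": [4.0, np.nan, 3.0, 5.0, 2.0, 4.0],
    "Paginas": [320.0, 150.0, np.nan, 410.0, 180.0, 290.0],
    "Estado": ["Leido", "Pendiente", "Leido", "Leido", "Pendiente", "Leido"],
})

# Numeric columns: fill NaN with the column mean, then scale,
# because KNN distances are sensitive to feature ranges
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill NaN with the most frequent value,
# then one-hot encode instead of using LabelEncoder on X
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["Puntaje", "Paginas"]),
    ("cat", categorical, ["Género"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

X = df.drop(columns="Estado")
y = df["Estado"]
model.fit(X, y)
y_pred = model.predict(X)
print(y_pred)
```

Keeping the imputers inside the pipeline means the same learned means and categories are reused at prediction time, so new rows with missing values do not trigger the NaN error again.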

Answer 2

Score: 0

You need SimpleImputer to impute the missing values in X. We fit the imputer on X and then transform X to replace the NaN values with the mean of each column. After imputing the missing values, we encode the target variable using LabelEncoder.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)

# Encode target variable
labelEncoderEstado = LabelEncoder()
y = labelEncoderEstado.fit_transform(y)
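A self-contained sketch of this approach follows, under the assumption that X holds only numeric columns; the 'mean' strategy cannot be applied to string columns, so the text columns from the question would need to be encoded or handled separately first. The feature values and labels below are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical numeric features (score, price, pages) with gaps
X = np.array([
    [4.0, 12.5, 320.0],
    [np.nan, 8.0, 150.0],
    [3.0, np.nan, 200.0],
    [5.0, 15.0, np.nan],
    [2.0, 7.5, 180.0],
])
y = np.array(["Leido", "Pendiente", "Leido", "Leido", "Pendiente"])

# Replace NaN with the column mean and keep the returned array
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(X)

# Encode the target; classes are sorted, so Leido -> 0, Pendiente -> 1
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y_encoded)

# Map the integer predictions back to the original labels
y_pred = encoder.inverse_transform(knn.predict(X))
print(y_pred)
```

Calling `inverse_transform` at the end is optional; it is only there so the predictions read as the original 'Estado' labels rather than integers.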
