Data Cleaning Error in Classification KNN Algorithm Problem

Question

I believe the error is telling me that my data contains null values. I've tried to fix it, but the error keeps appearing. I don't want to delete the rows with null values because I consider them relevant to my analysis.
The columns of my data are, in this order: 'Titulo', 'Autor', 'Género', 'Año Leido', 'Puntaje', 'Precio', 'Año Publicado', 'Paginas', **'Estado'**. The ones in bold are string data.

Code:

import numpy as np
#Load Data
import pandas as pd
dataset = pd.read_excel(r"C:\Users\renat\Documents\Data Science Projects\Classification\Book Purchases\Biblioteca.xlsx")
#print(dataset.columns)

#Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer

#Handling missing values
imputer = SimpleImputer(missing_values = np.nan, strategy='mean')

#Convert X and y to NumPy arrays
X=dataset.iloc[:,:-1].values
y=dataset.iloc[:,8].values
print(X.shape, y.shape)

# Create a LabelEncoder instance
labelEncoderTitulo = LabelEncoder()
X[:, 0] = labelEncoderTitulo.fit_transform(X[:, 0])

labelEncoderAutor = LabelEncoder()
X[:, 1] = labelEncoderAutor.fit_transform(X[:, 1])

labelEncoderGenero = LabelEncoder()
X[:, 2] = labelEncoderGenero.fit_transform(X[:, 2])

labelEncoderEstado = LabelEncoder()
X[:, -1] = labelEncoderEstado.fit_transform(X[:, -1])

#Instantiate our KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)

y_pred = knn.predict(X)

print(y_pred)

Error Message:
ValueError: Input X contains NaN.
KNeighborsClassifier does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values


Answer 1

Score: 1

You have to fit and transform the data with the SimpleImputer you created. From the documentation:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # Here the imputer is created
imputer.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])  # Here the imputer is fitted, i.e. learns the mean

X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imputer.transform(X))  # Here the imputer is applied, i.e. each NaN is replaced with the learned column mean

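The crucial parts here are imputer.fit() and imputer.transform(X). In the question's code the imputer is created but never fitted or applied to X, so the NaN values are still present when knn.fit(X, y) runs.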
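
As a rough sketch of how this might look on tabular data like the question's, the snippet below imputes only the numeric columns of a small DataFrame. The rows are invented for illustration, and only three of the question's column names are used; the real spreadsheet is not available here. Note that strategy='mean' only makes sense for numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the spreadsheet; values are made up.
df = pd.DataFrame({
    "Puntaje": [4.0, np.nan, 5.0, 3.0],
    "Precio": [20.0, 35.0, np.nan, 25.0],
    "Paginas": [300.0, 150.0, 450.0, np.nan],
})

numeric_cols = ["Puntaje", "Precio", "Paginas"]

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# fit_transform learns each column's mean and returns a NaN-free array.
# The result has to be assigned back; creating the imputer alone changes nothing.
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print(df)
print(df.isna().sum().sum())  # count of NaN values remaining
```

The assignment back into the DataFrame (or into X) is the step missing from the question's code.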

Additionally, I'd use another technique to handle the categorical features, since LabelEncoder is not suitable here. Its documentation says:

> This transformer should be used to encode target values, i.e. y, and not the input X.

For alternatives, see: "How to consider categorical variables in distance based algorithms like KNN or SVM?"
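
One commonly used alternative is one-hot encoding. The sketch below is only an illustration of that idea, not necessarily what the linked question recommends: it encodes a few invented genre values (standing in for the 'Género' column) with scikit-learn's OneHotEncoder and checks that any two different categories end up the same distance apart. An integer encoding such as 0, 1, 2 would instead make some genres look "closer" than others to KNN.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical genre values standing in for the 'Género' column.
generos = np.array([["Novela"], ["Poesía"], ["Ensayo"], ["Novela"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
onehot = encoder.fit_transform(generos).toarray()

print(encoder.categories_[0])
print(onehot)

# Distances between rows that belong to different genres.
d_novela_poesia = np.linalg.norm(onehot[0] - onehot[1])
d_novela_ensayo = np.linalg.norm(onehot[0] - onehot[2])
print(d_novela_poesia, d_novela_ensayo)
```

With high-cardinality columns such as 'Titulo' or 'Autor', one-hot encoding creates one column per distinct value, so it may be worth asking whether those columns belong in the feature set at all.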

Answer 2

Score: 0

You need to apply SimpleImputer to impute the missing values in X. We fit the imputer on X and then transform X to replace the NaN values with the mean of each column. After imputing the missing values, we encode the target variable using LabelEncoder.

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)

# Encode target variable
labelEncoderEstado = LabelEncoder()
y = labelEncoderEstado.fit_transform(y)
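
The error message itself suggests putting an imputer in a pipeline. A minimal sketch of that approach follows, under several assumptions: the rows and the 'Estado' values are invented, only a subset of the question's columns is used, and the scaling and one-hot encoding steps are additions that neither answer spells out. Mean imputation is applied to the numeric columns only, since it cannot be computed for strings; missing values in a categorical column would need their own imputer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Hypothetical rows using a subset of the question's column names.
df = pd.DataFrame({
    "Género": ["Novela", "Ensayo", "Novela", "Poesía", "Ensayo", "Novela"],
    "Puntaje": [4.0, np.nan, 5.0, 3.0, 2.0, np.nan],
    "Paginas": [300.0, 150.0, np.nan, 90.0, 200.0, 320.0],
    "Estado": ["Leido", "Pendiente", "Leido", "Pendiente", "Pendiente", "Leido"],
})

X = df[["Género", "Puntaje", "Paginas"]]

# LabelEncoder is meant for the target, so it is applied to y only.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Estado"])

# Numeric columns: fill NaN with the column mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, ["Puntaje", "Paginas"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Género"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline([
    ("preprocess", preprocess),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

model.fit(X, y)
y_pred = model.predict(X)

# Map the integer predictions back to the original labels.
print(label_encoder.inverse_transform(y_pred))
```

Keeping the preprocessing inside the pipeline means the same imputation and encoding learned during fit are reused at predict time, which matters once the data is split into training and test sets.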

Posted by huangapple on February 14, 2023 at 21:19:24
Source: https://go.coder-hub.com/75448437.html