KNN imputation of missing categorical string values in Python for a specific DataFrame column, returning the result as a DataFrame
Question
There are some missing values in the Gender column, and I would like to impute them using KNN imputation, but the result is not filled in. Can someone help with this?
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
data = {'ID': [1, 2, 3, 4, 5],
'Age': [20, 25, 30, 35, 40],
'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
df['Gendermap'] = pd.factorize(df['Gender'])[0]
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]
df
Output:
   ID  Age Gender  Gendermap  Gender_imputed_factorized Gender_imputed
0   1   20      M          0                        0.0              M
1   2   25      F          1                        1.0              F
2   3   30    NaN         -1                       -1.0            NaN
3   4   35      F          1                        1.0              F
4   5   40    NaN         -1                       -1.0            NaN
The "Gender_imputed" column shouldn't contain NaN values.
Answer 1
Score: 2
I believe the factorize function is what is causing the issue. It encodes the missing values as -1 instead of leaving them as NaN (see the Gendermap column in your output), so there is nothing left to impute when you call fit_transform.
Try using map to convert Gender to a numerical column instead, like this:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
data = {'ID': [1, 2, 3, 4, 5],
'Age': [20, 25, 30, 35, 40],
'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
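df['Gendermap'] = df['Gender'].map({'M': 0, 'F': 1})  # map leaves missing values as NaN
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
# round before casting so a fractional imputed value goes to the nearest code instead of being truncated
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].round().astype(int)]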
df
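To make the difference between factorize and map concrete, here is a minimal sketch using the same data as the question. Two things in it are my own assumptions rather than part of the answer above: Age is passed to the imputer as an extra feature so that there is something to measure neighbour distance on, and the names gender_to_code and code_to_gender are just illustrative. The imputed value is the mean of the neighbours' codes, so it is rounded to the nearest code before mapping back to a label.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Age': [20, 25, 30, 35, 40],
                   'Gender': ['M', 'F', np.nan, 'F', np.nan]})

# factorize() encodes missing values as -1, so no NaN is left for the imputer
codes, labels = pd.factorize(df['Gender'])
print(codes)

# map() leaves missing entries as NaN, which KNNImputer treats as missing
gender_to_code = {'M': 0, 'F': 1}  # illustrative mapping
code_to_gender = {code: label for label, code in gender_to_code.items()}
df['Gendermap'] = df['Gender'].map(gender_to_code)

# Assumption: Age is a sensible feature for finding the nearest neighbours
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(df[['Age', 'Gendermap']])

# Column 1 holds Gendermap; round the averaged code and map it back to a label
imputed_codes = pd.Series(imputed[:, 1], index=df.index).round().astype(int)
df['Gender_imputed'] = imputed_codes.map(code_to_gender)
print(df[['Age', 'Gender', 'Gender_imputed']])
```

With this data both missing rows come out as F, because the two rows closest in Age to each of them are both F. Note that averaging codes and rounding only makes sense for a two-category column like this one; with more categories the average of unrelated codes would not correspond to a meaningful label.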
Answer 2
Score: 0
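Got a solution, thanks.
codes = pd.factorize(df['Gender'])[0]  # factorize marks missing values with -1
df['Gendermap'] = np.where(codes == -1, np.nan, codes)  # restore NaN so KNNImputer has something to fill
imputer = KNNImputer(n_neighbors=2)
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
imputed_labels = pd.unique(df['Gender'].dropna())
# round to the nearest code before indexing into the labels
df['Gender_imputed'] = [imputed_labels[int(round(i))] if not np.isnan(i) else np.nan for i in df['Gender_imputed_factorized']]
print(df)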