KNN imputation for missing categorical string values in Python: impute a specific DataFrame column and return the DataFrame with the values replaced
Question
There are some missing values in the Gender column, and I would like to impute them using KNN imputation. But I am not getting a filled result. Can someone help with this?
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
data = {'ID': [1, 2, 3, 4, 5],
        'Age': [20, 25, 30, 35, 40],
        'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
df['Gendermap'] = pd.factorize(df['Gender'])[0]
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]
df
Output:
   ID  Age Gender  Gendermap  Gender_imputed_factorized Gender_imputed
0   1   20      M          0                        0.0              M
1   2   25      F          1                        1.0              F
2   3   30    NaN         -1                       -1.0            NaN
3   4   35      F          1                        1.0              F
4   5   40    NaN         -1                       -1.0            NaN
The "Gender_imputed" column shouldn't contain NaN values.
Answer 1
Score: 2
I believe the problem is the factorize function you are using. It encodes NaN as -1 instead of leaving it missing, so KNNImputer finds nothing to impute when you call fit_transform.
Try using map to convert Gender to a numerical column instead, which leaves the NaN values in place:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
data = {'ID': [1, 2, 3, 4, 5],
        'Age': [20, 25, 30, 35, 40],
        'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
df['Gendermap'] = df['Gender'].map({'M': 0, 'F': 1})  # the new map function
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]
df
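One caveat with fitting on the Gendermap column alone: KNNImputer has no other feature to measure distances on, so for the rows where Gendermap is missing it falls back to the column mean (2/3 here), and astype(int) truncates that to 0, i.e. 'M'. A minimal sketch of one way around this, assuming Age is a reasonable feature for finding neighbours. That choice, and the names codes, labels and features, are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Age': [20, 25, 30, 35, 40],
                   'Gender': ['M', 'F', np.nan, 'F', np.nan]})

codes = {'M': 0, 'F': 1}                      # explicit encoding; NaN stays NaN
labels = {v: k for k, v in codes.items()}     # inverse mapping for decoding

features = pd.DataFrame({'Age': df['Age'],
                         'Gendermap': df['Gender'].map(codes)})

# Single-column fit: no distances can be computed, so the column mean is used
single_col = KNNImputer(n_neighbors=2).fit_transform(features[['Gendermap']])

# Two-column fit: neighbours are found by Age (assumed to be a useful feature)
imputed = KNNImputer(n_neighbors=2).fit_transform(features)

# The imputed value is the mean of the neighbours' codes, so round it
# to the nearest code before decoding instead of truncating
imputed_codes = pd.Series(imputed[:, 1], index=df.index).round().astype(int)
df['Gender_imputed'] = imputed_codes.map(labels)
```

With this data both missing rows have the two 'F' rows as their nearest neighbours by Age, so both are filled with 'F'. Rounding an averaged code only makes sense for a two-category column; with more categories the mean of the codes is not meaningful.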
Answer 2
Score: 0
Got a solution, thanks.
df['Gendermap'] = pd.factorize(df['Gender'])[0]
imputer = KNNImputer(n_neighbors=2)
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
imputed_labels = pd.unique(df['Gender'].dropna())
df['Gender_imputed'] = [imputed_labels[int(i)] if not np.isnan(i) else np.nan for i in df['Gender_imputed_factorized']]
print(df)
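Note that in the snippet above factorize still encodes NaN as -1, so KNNImputer changes nothing; the missing rows end up as 'F' only because imputed_labels[-1] indexes the last label. A hedged sketch of a factorize-based variant that lets the imputer do the work, by turning the -1 sentinel back into NaN first. Using Age as the neighbour feature is an assumption, not part of the original answer:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Age': [20, 25, 30, 35, 40],
                   'Gender': ['M', 'F', np.nan, 'F', np.nan]})

# factorize marks missing entries with the sentinel -1, not NaN:
# codes is [0, 1, -1, 1, -1] and uniques holds 'M' and 'F'
codes, uniques = pd.factorize(df['Gender'])

# Turn the sentinel back into NaN so KNNImputer has something to fill
gendermap = np.where(codes == -1, np.nan, codes)

# Pair the encoded column with Age so neighbours can actually be found
X = np.column_stack([df['Age'].to_numpy(dtype=float), gendermap])
imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Round the averaged neighbour codes to the nearest valid code, then decode
imputed_codes = np.rint(imputed[:, 1]).astype(int)
df['Gender_imputed'] = np.asarray(uniques)[imputed_codes]
```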