KNN imputation of missing categorical (string) values in a specific DataFrame column in Python, returning the result as a DataFrame

Question

Some values in the Gender column are missing, and I would like to impute them using KNN imputation. However, the result still contains the missing values. Can someone help with this?

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

data = {'ID': [1, 2, 3, 4, 5],
        'Age': [20, 25, 30, 35, 40],
        'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
df['Gendermap'] = pd.factorize(df['Gender'])[0]
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]
df

Output:

   ID  Age Gender  Gendermap  Gender_imputed_factorized Gender_imputed
0   1   20      M          0                        0.0              M
1   2   25      F          1                        1.0              F
2   3   30    NaN         -1                       -1.0            NaN
3   4   35      F          1                        1.0              F
4   5   40    NaN         -1                       -1.0            NaN

The "Gender_imputed" column shouldn't contain NaN values.

Answer 1

Score: 2

I believe the factorize function you are using is causing the issue. It encodes the NaN values as -1 (see the Gendermap column in your output), so the column no longer contains any missing values and there is nothing to impute when you call fit_transform.
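To make the difference concrete, here is a minimal sketch that compares the two encodings on the Gender values from the question. The variable names (`gender`, `factorized`, `mapped`) are only illustrative.

```python
import numpy as np
import pandas as pd

gender = pd.Series(['M', 'F', np.nan, 'F', np.nan])

# factorize assigns codes in order of appearance and uses -1 for NaN
factorized = pd.Series(pd.factorize(gender)[0])
# map leaves values without a dictionary entry (here NaN) as NaN
mapped = gender.map({'M': 0, 'F': 1})

print(factorized.tolist())       # [0, 1, -1, 1, -1]
print(factorized.isna().sum())   # 0, so an imputer finds nothing to fill
print(mapped.isna().sum())       # 2, the missing rows are still missing
```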

Try using map to convert gender to a numerical column like this:

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

data = {'ID': [1, 2, 3, 4, 5],
        'Age': [20, 25, 30, 35, 40],
        'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
df['Gendermap'] = df['Gender'].map({'M': 0, 'F': 1})  # the new map function
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]
df
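A possible refinement, not part of the original answer: when only Gendermap is passed to the imputer, the rows with a missing gender have no observed feature to measure distance on, so there are no meaningful neighbours to compare. The sketch below assumes Age is a reasonable neighbour feature (an assumption made purely for illustration), passes both columns to KNNImputer, and rounds the averaged code instead of truncating it with astype(int).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Age': [20, 25, 30, 35, 40],
                   'Gender': ['M', 'F', np.nan, 'F', np.nan]})

labels = ['M', 'F']                               # list position is the numeric code
codes = {label: i for i, label in enumerate(labels)}
df['Gendermap'] = df['Gender'].map(codes)         # missing values stay NaN

# Assumption: Age is used as the feature that defines "nearest" rows.
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(df[['Age', 'Gendermap']])

# Column 1 holds Gendermap; round the neighbour average to the nearest code.
gender_codes = np.rint(imputed[:, 1]).astype(int)
df['Gender_imputed'] = np.array(labels)[gender_codes]
print(df[['Age', 'Gender', 'Gender_imputed']])
```

With this data, the two nearest rows by Age for both missing entries are the rows labelled F, so both gaps are filled with F.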

Answer 2

Score: 0

Got a solution, thanks.

df['Gendermap'] = pd.factorize(df['Gender'])[0]
imputer = KNNImputer(n_neighbors=2)
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
imputed_labels = pd.unique(df['Gender'].dropna())
df['Gender_imputed'] = [imputed_labels[int(i)] if not np.isnan(i) else np.nan for i in df['Gender_imputed_factorized']]

print(df)
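One caveat about the snippet above: pd.factorize still encodes NaN as -1, so KNNImputer has nothing to fill, and imputed_labels[int(-1)] simply picks the last label. The following is a sketch of an alternative, assuming Age can serve as the neighbour feature. The helper name `knn_impute_category` and its parameters `target` and `features` are hypothetical, introduced only for this example. It turns the -1 sentinel back into NaN before imputing and returns a new DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute_category(df, target, features, n_neighbors=2):
    """Return a copy of df with NaN in `target` filled from its nearest rows."""
    out = df.copy()
    # sort=True gives a stable label order; NaN becomes the sentinel -1.
    codes, labels = pd.factorize(out[target], sort=True)
    # Turn the sentinel back into NaN so the imputer treats it as missing.
    codes = np.where(codes == -1, np.nan, codes.astype(float))

    matrix = np.column_stack([out[features].to_numpy(dtype=float), codes])
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(matrix)

    # Round the averaged code and clip it to a valid label position.
    filled = np.clip(np.rint(imputed[:, -1]), 0, len(labels) - 1).astype(int)
    out[target] = np.asarray(labels)[filled]
    return out


df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Age': [20, 25, 30, 35, 40],
                   'Gender': ['M', 'F', np.nan, 'F', np.nan]})
result = knn_impute_category(df, target='Gender', features=['Age'])
print(result)
```

Averaging integer codes is only really meaningful for a two-category column like this one. With more categories, the rounded average can land on a label that none of the neighbours has.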

huangapple
  • Posted on March 15, 2023, 20:44:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75744885.html