March 15, 2023 20:44:09

KNN imputation of missing categorical string values in a specific DataFrame column in Python, returning the result as a DataFrame

Question

There are some missing values in the Gender column, and I would like to impute them using KNN imputation. But I am not getting a filled result. Can someone help with this?

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

data = {'ID': [1, 2, 3, 4, 5],
        'Age': [20, 25, 30, 35, 40],
        'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
df['Gendermap'] = pd.factorize(df['Gender'])[0]
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]
df

Output:

   ID  Age Gender  Gendermap  Gender_imputed_factorized Gender_imputed
0   1   20      M          0                        0.0              M
1   2   25      F          1                        1.0              F
2   3   30    NaN         -1                       -1.0            NaN
3   4   35      F          1                        1.0              F
4   5   40    NaN         -1                       -1.0            NaN

The "Gender_imputed" column shouldn't contain NaN values.

Answer 1

Score: 2

I believe the factorize function you are using is causing the issue. It replaces the NaN values with -1, so there is nothing left to impute when you call fit_transform.
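To make that concrete, here is a small check, offered as a sketch rather than a definitive diagnosis. It assumes the default KNNImputer setting, which looks for np.nan as the missing marker. The variable names (gender, codes, labels, unchanged, mapped) are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

gender = pd.Series(['M', 'F', np.nan, 'F', np.nan])

# factorize encodes missing values with the sentinel -1 instead of NaN
codes, labels = pd.factorize(gender)
print(codes.tolist())   # [0, 1, -1, 1, -1]
print(list(labels))     # ['M', 'F']

# KNNImputer only treats np.nan as missing by default,
# so the -1 entries pass through unchanged
unchanged = KNNImputer(n_neighbors=2).fit_transform(
    codes.reshape(-1, 1).astype(float))
print(unchanged.ravel().tolist())  # [0.0, 1.0, -1.0, 1.0, -1.0]

# map keeps NaN as NaN, so the imputer can see which rows are missing
mapped = gender.map({'M': 0, 'F': 1})
print(mapped.isna().tolist())  # [False, False, True, False, True]
```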

Try using map to convert gender to a numerical column like this:

import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

data = {'ID': [1, 2, 3, 4, 5],
        'Age': [20, 25, 30, 35, 40],
        'Gender': ['M', 'F', np.nan, 'F', np.nan]}
df = pd.DataFrame(data)
imputer = KNNImputer(n_neighbors=2)
df['Gendermap'] = df['Gender'].map({'M': 0, 'F': 1})  # the new map function
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
df['Gender_imputed'] = pd.unique(df['Gender'])[df['Gender_imputed_factorized'].astype(int)]
df
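One thing to note about the snippet above: with Gendermap as the only column passed to the imputer, there is no other feature to measure distances on, and as far as I understand the scikit-learn documentation, KNNImputer then falls back to the training-set mean of that column. The following is a minimal sketch, assuming Age is a reasonable feature for finding neighbours. The to_code and to_label dictionaries are illustrative names, and rounding the neighbour average to the nearest code is my own choice, not something from the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Age': [20, 25, 30, 35, 40],
                   'Gender': ['M', 'F', np.nan, 'F', np.nan]})

to_code = {'M': 0, 'F': 1}
# Explicit inverse mapping, so decoding does not depend on pd.unique ordering
to_label = {code: label for label, code in to_code.items()}

df['Gendermap'] = df['Gender'].map(to_code)  # NaN stays NaN

# Assumption: Age is a meaningful feature for finding neighbours
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(df[['Age', 'Gendermap']])

# Column 1 is Gendermap; round the neighbour average to the nearest code
df['Gender_imputed_factorized'] = imputed[:, 1]
df['Gender_imputed'] = (df['Gender_imputed_factorized']
                        .round().astype(int).map(to_label))
print(df[['Age', 'Gender', 'Gender_imputed']])
```

With this toy data both missing rows get 'F', because their two nearest neighbours by Age (ages 25 and 35) are both 'F'. On real data you would probably want to scale the numeric features first, since the distances depend on their units.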

Answer 2

Score: 0

Got a solution, thanks.

df['Gendermap'] = pd.factorize(df['Gender'])[0]
imputer = KNNImputer(n_neighbors=2)
df['Gender_imputed_factorized'] = imputer.fit_transform(df[['Gendermap']])
imputed_labels = pd.unique(df['Gender'].dropna())
df['Gender_imputed'] = [imputed_labels[int(i)] if not np.isnan(i) else np.nan for i in df['Gender_imputed_factorized']]

print(df)
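A caveat on the snippet above: factorize leaves the missing rows coded as -1, the imputer does not change them, and int(-1) then indexes the last entry of imputed_labels. So the filled value comes from negative indexing rather than from any neighbours. Below is a rough sketch of a helper that restores NaN before imputing and returns a new dataframe. knn_impute_categorical is a hypothetical name (not part of pandas or scikit-learn), and using Age as the neighbour feature is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute_categorical(df, column, feature_cols, n_neighbors=2):
    """Return a copy of df with `column` imputed into `<column>_imputed`.

    Hypothetical helper, not part of pandas or scikit-learn.
    Assumes `column` has at least one observed value.
    """
    out = df.copy()
    codes, labels = pd.factorize(out[column])
    # factorize uses -1 for missing values; restore NaN so KNNImputer imputes them
    numeric = np.where(codes == -1, np.nan, codes.astype(float))
    features = out[list(feature_cols)].to_numpy(dtype=float)
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(
        np.column_stack([features, numeric]))
    # Last column holds the codes; round neighbour averages to the nearest code
    # and clip so the index always stays inside the label array
    idx = np.clip(np.round(imputed[:, -1]), 0, len(labels) - 1).astype(int)
    out[column + '_imputed'] = np.asarray(labels, dtype=object)[idx]
    return out


df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Age': [20, 25, 30, 35, 40],
                   'Gender': ['M', 'F', np.nan, 'F', np.nan]})
result = knn_impute_categorical(df, 'Gender', ['Age'])
print(result)
```

Averaging integer codes only really makes sense for two categories. With more than two, the codes have no natural order, so a classifier-based or most-frequent-neighbour approach may be a better fit.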
