Replace missing values with the value of the column with the minimum sum of differences

Question

I have the dataframe below.

import numpy as np
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'Age': [np.nan, 31, 29, 43, np.nan],
                   'Weight': [np.nan, 100, 60, 75, np.nan],
                   'Height': [1.65, 1.64, 1.75, 1.70, 1.68],
                   'BMI': [19, 15, 10, 25, 30]})

These are the columns whose missing values I want to replace:

case_columns = ['Age', 'Weight']

I want an algorithm in Python that fills each missing value with the value from another row: the row whose sum of differences from the row containing the missing value is smallest.

In my example, in row 0 the age should be 31 and the weight 100, because row 1 has the minimum sum of differences ((1.65 - 1.64) + (19 - 15)). In row 4 the age should be 43 and the weight 75.

How can I do this in Python?

Answer 1

Score: 1

You can try writing a function and applying it row by row with df.apply():

def fill_missing(x):
    # work on a copy so the row handed in by apply() is not mutated
    x = x.copy()
    # if any value other than Height is missing (here: Age or Weight)
    if any(np.isnan(x.drop('Height'))):
        # absolute Height difference to every other row (exclude the current row);
        # note that only Height is compared here
        height_diff = np.abs(df.drop(x.name)['Height'] - x['Height'])
        # index of the row with the smallest difference
        row_idx = height_diff.idxmin()
        # substitute whatever is missing
        for feature in x.index:
            if np.isnan(x[feature]):
                x[feature] = df.loc[row_idx, feature]
    return x

# returns a filled copy; df itself is not modified
df.apply(fill_missing, axis=1)

# to update df, assign the result back
df = df.apply(fill_missing, axis=1)
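
The function above matches rows on Height alone, which happens to give the expected result for this sample. The question asks for the sum of differences over Height and BMI, so here is a minimal sketch of that variant. It is not part of the original answer: the name fill_from_nearest_row is made up for illustration, and it assumes absolute differences, that every column outside case_columns is numeric and complete, and that at least one row has no missing case values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 31, 29, 43, np.nan],
                   'Weight': [np.nan, 100, 60, 75, np.nan],
                   'Height': [1.65, 1.64, 1.75, 1.70, 1.68],
                   'BMI': [19, 15, 10, 25, 30]})

case_columns = ['Age', 'Weight']


def fill_from_nearest_row(df, case_columns):
    """Fill NaNs in case_columns from the complete row with the smallest
    sum of absolute differences over the remaining columns.

    Hypothetical helper, written for this example only.
    """
    # columns used to measure how similar two rows are (here: Height, BMI)
    feature_columns = [c for c in df.columns if c not in case_columns]
    result = df.copy()
    # only rows with no missing case values may donate values
    donors = df.dropna(subset=case_columns)
    # index labels of rows that have at least one missing case value
    incomplete = df.index[df[case_columns].isna().any(axis=1)]
    for idx in incomplete:
        # sum of absolute differences between this row and every donor row
        diffs = donors[feature_columns] - df.loc[idx, feature_columns]
        distances = diffs.abs().sum(axis=1)
        nearest = distances.idxmin()
        # copy over only the values that are actually missing
        for col in case_columns:
            if pd.isna(result.loc[idx, col]):
                result.loc[idx, col] = donors.loc[nearest, col]
    return result


filled = fill_from_nearest_row(df, case_columns)
print(filled)
```

Restricting the candidates to complete rows keeps one row with missing values from being filled from another. Because the differences are summed in raw units, BMI outweighs Height in this data; if that is not wanted, scaling the feature columns first may be worth considering.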


huangapple
  • Published on March 10, 2023, 01:40:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75688214.html