March 10, 2023 01:40:49

Replace missing values with the value of the column with the minimum sum of differences

Question

I have the dataframe below.

import numpy as np
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'Age': [np.nan, 31, 29, 43, np.nan],
                   'Weight': [np.nan, 100, 60, 75, np.nan],
                   'Height': [1.65, 1.64, 1.75, 1.70, 1.68],
                   'BMI': [19, 15, 10, 25, 30]})

and the columns I want to replace missing values for:

case_columns = ['Age', 'Weight']

I want an algorithm in Python that replaces each missing value with the value from the row that has the minimum sum of differences, over the remaining columns, to the row containing the missing value.

In my example, in row 0, the age should be 31 and the weight 100, because row 1 has the minimum difference ((1.65 - 1.64) + (19 - 15)). In row 4, the age should be 43 and the weight 75.
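The arithmetic in this example can be checked with a short sketch. It assumes that "difference" means the absolute difference, summed over the columns that are not being filled (Height and BMI). The names `feature_columns` and `distances` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, 31, 29, 43, np.nan],
                   'Weight': [np.nan, 100, 60, 75, np.nan],
                   'Height': [1.65, 1.64, 1.75, 1.70, 1.68],
                   'BMI': [19, 15, 10, 25, 30]})

# Assumption: rows are compared on the columns that are not being filled.
feature_columns = ['Height', 'BMI']

# Sum of absolute differences between row 0 and every other row.
diffs = (df[feature_columns] - df.loc[0, feature_columns]).abs()
distances = diffs.sum(axis=1).drop(0)

print(distances)           # one distance per candidate row
print(distances.idxmin())  # label of the closest row
```

For row 0 the smallest distance belongs to row 1, which agrees with the expected Age of 31 and Weight of 100.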

How can I do this in Python?

Answer 1

Score: 1

You can try creating a function and applying it row by row with df.apply(). Note that this version matches rows on Height only.

def fill_missing(x):
    # work on a copy so the row passed in by apply() is not mutated
    x = x.copy()
    # if age or weight are missing
    if x.drop('Height').isna().any():
        # absolute height difference to every other row (exclude current row)
        height_diff = (df.drop(x.name)['Height'] - x['Height']).abs()
        # get the index of the row with the minimum difference
        row_idx = height_diff.idxmin()
        # substitute whatever is missing
        for feature in x.index:
            if pd.isna(x[feature]):
                x[feature] = df.loc[row_idx, feature]
    return x

# returns a new DataFrame and leaves df unchanged
df.apply(fill_missing, axis=1)
# if you want to change the value of df
df = df.apply(fill_missing, axis=1)
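
The function above compares rows on Height alone, while the question asks for the sum of differences over both Height and BMI. A minimal sketch of that variant follows. It is not the answerer's code: the function name `fill_from_nearest_row` and the `feature_columns` parameter are hypothetical, and it assumes absolute differences, a unique index, no missing values in the feature columns, and donor rows restricted to those with no missing values in `case_columns`.

```python
import numpy as np
import pandas as pd

def fill_from_nearest_row(df, case_columns, feature_columns):
    """Fill NaNs in case_columns from the complete row with the smallest
    sum of absolute differences over feature_columns."""
    result = df.copy()
    # Donor candidates: rows with nothing missing in the columns to fill.
    donors = df.dropna(subset=case_columns)
    if donors.empty:
        return result
    # Rows that have at least one missing value in case_columns.
    incomplete = df.index[df[case_columns].isna().any(axis=1)]
    for idx in incomplete:
        # Sum of absolute differences between this row and every donor row.
        diffs = (donors[feature_columns] - df.loc[idx, feature_columns]).abs()
        nearest = diffs.sum(axis=1).idxmin()
        for col in case_columns:
            if pd.isna(result.loc[idx, col]):
                result.loc[idx, col] = donors.loc[nearest, col]
    return result

df = pd.DataFrame({'Age': [np.nan, 31, 29, 43, np.nan],
                   'Weight': [np.nan, 100, 60, 75, np.nan],
                   'Height': [1.65, 1.64, 1.75, 1.70, 1.68],
                   'BMI': [19, 15, 10, 25, 30]})

filled = fill_from_nearest_row(df, ['Age', 'Weight'], ['Height', 'BMI'])
print(filled)
```

On the sample data this fills row 0 from row 1 and row 4 from row 3, as the question expects. Because the columns are on different scales, BMI dominates the unscaled sum, so standardizing the feature columns first may be worth considering.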
