July 24, 2023, 16:51:02

How to create a function that replaces empty values with the most frequent value or the average value, depending on the column

Question

A CSV file contains more than 18 columns, and I want to clean the data without dropping rows that have empty values.

Some columns hold strings while others hold floats and ints. I want to create a function that automatically checks each column for empty values and fills them in, instead of doing it by hand: with the most frequent value for string columns, and with the average value for int and float columns.

|response|score|
|A       |25   |
|A       |     |
|B       |20   |
|C       |15   |
|        |25   |

My sample code:

df['response'] = df['response'].fillna('A').astype(str)
df['score'] = df['score'].fillna(df['score'].mean()).astype(float)

Answer 1

Score: 0

You can try this:

df['score'] = df['score'].fillna(df.groupby('response')['score'].transform('mean'))

This replaces the nulls with the mean of the values that share the same 'response'.

You can change the function to median, max, or min, depending on your use case.
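To see what the group-wise fill does on the table from the question, here is a small sketch. The DataFrame is rebuilt by hand from that table, and `agg` is only an illustrative variable name; changing it to `'mean'`, `'max'`, or `'min'` swaps the statistic.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table (empty cells as NaN).
df = pd.DataFrame({
    "response": ["A", "A", "B", "C", np.nan],
    "score": [25, np.nan, 20, 15, 25],
})

agg = "median"  # illustrative; "mean", "max" and "min" work the same way

# Per-group statistic, aligned to the original row index.
group_stat = df.groupby("response")["score"].transform(agg)

# Only the missing scores are replaced; existing values are untouched.
filled = df["score"].fillna(group_stat)
print(filled.tolist())
```

The missing score in row 1 belongs to group 'A', so it picks up that group's value rather than a statistic over the whole column.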

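If you want to do that for all the numeric columns:

# 'mean' only works on numeric columns, so skip string columns such as 'response'
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df.groupby('response')[col].transform('mean'))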
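One caveat with a per-group fill: a group whose values are all missing has no mean to offer, and a row whose 'response' is itself missing belongs to no group, so those cells stay empty. A possible workaround, assuming the column-wide mean is an acceptable fallback for your data, is sketched below. `fill_numeric_by_group` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def fill_numeric_by_group(df, key):
    """Fill NaNs in numeric columns with the per-group mean, then the column mean."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.select_dtypes(include="number").columns:
        if col == key:
            continue  # do not fill the grouping column from itself
        group_mean = out.groupby(key)[col].transform("mean")
        # First try the group mean, then fall back to the overall column mean.
        out[col] = out[col].fillna(group_mean).fillna(out[col].mean())
    return out

# Made-up data: group 'C' has no scores at all.
df = pd.DataFrame({
    "response": ["A", "A", "B", "C", "C"],
    "score": [25, np.nan, 20, np.nan, np.nan],
})
result = fill_numeric_by_group(df, "response")
print(result)
```

Here the missing 'A' score is filled from group 'A', while both 'C' rows fall back to the mean of the scores that were present in the column.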

Answer 2

Score: 0

Use a custom function with pandas.api.types.is_numeric_dtype:

import pandas as pd

def filler(s):
    if pd.api.types.is_numeric_dtype(s):
        fill_value = s.mean()     # numeric column: use the average
    else:                         # add other conditions if needed
        fill_value = s.mode()[0]  # otherwise: use the most frequent value
    return s.fillna(fill_value)

out = df.apply(filler)

As a one-liner:

out = df.apply(lambda s: s.fillna(s.mean() if pd.api.types.is_numeric_dtype(s)
                                  else s.mode()[0]))

Output:

  response  score
0        A  25.00
1        A  21.25
2        B  20.00
3        C  15.00
4        A  25.00
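Note that `s.mode()[0]` raises an error when a non-numeric column has no values at all, because `mode()` then returns an empty Series. If your CSV might contain such a column, a defensive variant could look like the sketch below. `safe_filler` and the `note` column are hypothetical names added for illustration, and leaving an all-empty column unchanged is an assumption about what you want.

```python
import numpy as np
import pandas as pd

def safe_filler(s):
    """Fill NaNs with the mean (numeric) or the most frequent value (otherwise)."""
    if pd.api.types.is_numeric_dtype(s):
        return s.fillna(s.mean())
    modes = s.mode()  # missing values are ignored by default
    if modes.empty:
        return s  # nothing to fill from; leave the column as it is
    return s.fillna(modes.iloc[0])

df = pd.DataFrame({
    "response": ["A", "A", "B", "C", np.nan],
    "score": [25, np.nan, 20, 15, 25],
    "note": pd.Series([None] * 5, dtype=object),  # hypothetical all-empty column
})
out = df.apply(safe_filler)
print(out)
```

On the question's two columns this gives the same result as the output above, and the all-empty column passes through without an exception.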
