How to create a function that replaces empty values with the most frequent value or the mean, depending on the column type

Question

A CSV file contains more than 18 columns, and I want to clean the data without dropping the rows that have empty values.

Some columns hold strings while others hold floats and ints, so I want a function that automatically checks each column for empty values and fills them in, instead of doing it by hand for each column: with the most frequent value if the column is string-based, or with the mean if the column is int- or float-based.

|response|score|
|A       |25   |
|A       |     |
|B       |20   |
|C       |15   |
|        |25   |

My sample code:

df['response'] = df['response'].fillna('A').astype(str)
df['score'] = df['score'].fillna(df['score'].mean()).astype(float)

Answer 1

Score: 0

You can try this:

df['score'] = df['score'].fillna(df.groupby('response')['score'].transform('mean'))

This replaces each missing score with the mean score of the rows that share the same 'response' value.

You can swap 'mean' for 'median', 'max', or 'min', depending on your use case.

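To do that for every numeric column (a mean cannot be computed for string columns such as 'response', so those are skipped):

for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df.groupby('response')[col].transform('mean'))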
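
The group-wise fill above can be sketched as a reusable function. This is a minimal sketch, not the answerer's code: the helper name fill_with_group_mean, its key parameter, and the fallback to the column-wide mean (for rows whose group has no usable mean, for example when the key itself is missing) are assumptions added for illustration. The sample frame mirrors the table in the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN marks the empty cells.
df = pd.DataFrame({
    "response": ["A", "A", "B", "C", np.nan],
    "score": [25, np.nan, 20, 15, 25],
})

def fill_with_group_mean(frame, key):
    """Fill NaNs in numeric columns with the mean of rows sharing the same `key`.

    Hypothetical helper: falls back to the column-wide mean when the
    group mean is unavailable (assumption, not part of the original answer).
    """
    out = frame.copy()
    for col in out.select_dtypes(include="number").columns:
        if col == key:
            continue  # never fill the grouping column from itself
        group_mean = out.groupby(key)[col].transform("mean")
        overall_mean = out[col].mean()
        out[col] = out[col].fillna(group_mean).fillna(overall_mean)
    return out

filled = fill_with_group_mean(df, "response")
print(filled)
```

Note that this only touches numeric columns; the missing 'response' value is left as it is and still needs a separate rule such as the most frequent value.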

Answer 2

Score: 0

Use a custom function with pandas.api.types.is_numeric_dtype:

import pandas as pd

def filler(s):
    if pd.api.types.is_numeric_dtype(s):
        fill_value = s.mean()        # numeric column: use the mean
    else:                            # add other conditions if needed
        fill_value = s.mode()[0]     # otherwise: use the most frequent value
    return s.fillna(fill_value)

out = df.apply(filler)

As a one-liner:

out = df.apply(lambda s: s.fillna(s.mean() if pd.api.types.is_numeric_dtype(s)
                                  else s.mode()[0]))

Output:

  response  score
0        A  25.00
1        A  21.25
2        B  20.00
3        C  15.00
4        A  25.00
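
For reference, here is a self-contained variant that rebuilds the sample table from the question and reproduces the output above. It is only a sketch under a few assumptions: the name fill_missing is made up for illustration, and the two guards (leaving an entirely empty column untouched, and treating boolean columns as categorical rather than numeric) are additions that the original answer does not include.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NaN marks the empty cells.
df = pd.DataFrame({
    "response": ["A", "A", "B", "C", np.nan],
    "score": [25, np.nan, 20, 15, 25],
})

def fill_missing(s):
    """Return `s` with NaNs replaced by its mean (numeric) or its mode (otherwise)."""
    if s.isna().all():
        # Assumption: with no observed values there is nothing to infer,
        # so the column is returned unchanged instead of raising on mode().
        return s
    if pd.api.types.is_numeric_dtype(s) and not pd.api.types.is_bool_dtype(s):
        return s.fillna(s.mean())
    # mode() can return several values on ties; the first one is taken.
    return s.fillna(s.mode().iloc[0])

out = df.apply(fill_missing)
print(out)
```

When the data comes from pd.read_csv, empty cells are normally parsed as NaN already, so the same function can be applied to the loaded frame directly.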
