How to create a function that replaces empty values with the most frequent value or the mean, depending on the column type
Question
A CSV file contains more than 18 columns, and I want to clean the data without dropping the rows that have empty values.
Some columns hold strings, while others hold floats and ints. Instead of filling each column by hand, I want a function that automatically checks each column for empty values and replaces them with the most frequent value if the column is string-based, or with the mean if it is int- or float-based.
| response | score |
| A        | 25    |
| A        |       |
| B        | 20    |
| C        | 15    |
|          | 25    |
My sample code:
df['response'] = df['response'].fillna('A').astype(str)
df['score'] = df['score'].fillna(df['score'].mean()).astype(float)
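Before choosing fill values, it can help to confirm which columns actually contain gaps and what dtype pandas inferred for each. The following is a minimal sketch, assuming the file is loaded with `pd.read_csv`; the inline `csv_text` is a stand-in for the real file and simply mirrors the table above.

```python
import io
import pandas as pd

# Stand-in for the real CSV file; the contents mirror the table above.
csv_text = "response,score\nA,25\nA,\nB,20\nC,15\n,25\n"
df = pd.read_csv(io.StringIO(csv_text))

# Empty cells are read as NaN, so isna() finds them in both column types.
missing = df.isna().sum()
print(missing)

# The inferred dtypes decide whether a column gets the mean or the mode.
print(df.dtypes)
```

With more than 18 columns, `missing[missing > 0]` narrows the report down to the columns that need filling.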
Answer 1
Score: 0
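You can try this:
df['score'] = df['score'].fillna(df.groupby('response')['score'].transform('mean'))
This replaces each missing score with the mean score of the rows that share the same 'response'. Rows whose 'response' is itself missing belong to no group, so a missing score in such a row stays empty.
You can swap 'mean' for 'median', 'max', or 'min', depending on your use case.
To do the same for every numeric column:
# Group means only make sense for numeric columns, so skip string columns such as 'response'.
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df.groupby('response')[col].transform('mean'))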
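A group-wise fill can leave gaps behind when a row has no 'response' or when every score in a group is missing. One possible way to handle that, sketched here on the sample data from the question, is to fall back to the overall column mean. The fallback step is an addition for illustration and is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "response": ["A", "A", "B", "C", np.nan],
    "score": [25, np.nan, 20, 15, 25],
})

# Mean score within each 'response' group, aligned to the original rows.
group_mean = df.groupby("response")["score"].transform("mean")

# Try the group mean first, then fall back to the overall column mean.
filled = df["score"].fillna(group_mean).fillna(df["score"].mean())
print(filled.tolist())  # [25.0, 25.0, 20.0, 15.0, 25.0]
```

Here the only missing score belongs to group A, whose one known score is 25, so the group mean fills it and the fallback is never used.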
Answer 2
Score: 0
Use a custom function with pandas.api.types.is_numeric_dtype:
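import pandas as pd

def filler(s):
    # Numeric columns (int, float) are filled with their mean.
    if pd.api.types.is_numeric_dtype(s):
        fill_value = s.mean()
    else:  # add other conditions if needed
        # Other columns are filled with their most frequent value.
        fill_value = s.mode()[0]
    return s.fillna(fill_value)

out = df.apply(filler)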
As a one-liner:
out = df.apply(lambda s: s.fillna(s.mean() if pd.api.types.is_numeric_dtype(s)
                                  else s.mode()[0]))
Output:
  response  score
0        A  25.00
1        A  21.25
2        B  20.00
3        C  15.00
4        A  25.00
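One edge case to keep in mind with a wide CSV: if a string column is entirely empty, `s.mode()` returns an empty Series and `s.mode()[0]` raises an error. The variant below is a sketch that guards against this. The name `fill_missing` and the `notes` column are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

def fill_missing(s: pd.Series) -> pd.Series:
    """Fill NaN with the mean (numeric columns) or the mode (all other columns)."""
    if s.isna().all():
        return s  # no observed values, so there is nothing to infer a fill from
    if pd.api.types.is_numeric_dtype(s):
        return s.fillna(s.mean())
    # mode() returns the most frequent values in sorted order; take the first.
    return s.fillna(s.mode().iloc[0])

df = pd.DataFrame({
    "response": ["A", "A", "B", "C", np.nan],
    "score": [25, np.nan, 20, 15, 25],
    "notes": pd.Series([None] * 5, dtype="object"),  # hypothetical all-empty column
})

out = df.apply(fill_missing)
print(out)
```

When two values tie for most frequent, `mode()` returns both in sorted order, so this sketch picks the first one. Whether that tie-break is acceptable depends on the data.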