Pandas value_counts() does not display the counts correctly when cleaning data
Question
When cleaning my data, I need to identify any typos in a particular column. Its values should be either 1 or 0, denoting Yes or No.
To see the typos, I ran print(df["Column Name"].value_counts())
The result is:
1 40
1 67
0 89
0 33
Y 3
I tried the replace command for Y, but it then adds the 3 to one set of 1s and displays only that set of 1s and a single set of 0s.
Why are identical values being categorised as two different types?
How can I convert the strings to numbers and get the following result, as it should be?
1 110
0 122
I tried:
df["Column Name"].str.strip()
df["Column Name"].replace(" 1", "1")
df["Column Name"].replace("Y", "1")
Answer 1
Score: 1
Try using pd.to_numeric:
df['Column Name'] = pd.to_numeric(df["Column Name"].str.strip().replace({'Y': 1, 'N': 0}))
df['Column Name'].value_counts()
You can also use np.unique to check your dataframe:
import numpy as np
np.unique(df['Column Name'], return_counts=True)
Without modification:
>>> df['Column Name'].value_counts(sort=False)
1 40
1 67
0 89
0 33
Y 3
Name: Column Name, dtype: int64
With modification:
>>> pd.to_numeric(df["Column Name"].str.strip().replace({'Y': 1, 'N': 0})).value_counts()
0 122
1 110
Name: Column Name, dtype: int64
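As a rough illustration of why the same value can show up twice, and why the attempts in the question appeared to do nothing, here is a minimal sketch. The sample data is hypothetical (the real column is not shown in the question), and it assumes the duplicates come from stray whitespace. It also relies on the fact that str.strip() and replace() return a new Series rather than modifying the column in place, so the result has to be assigned back.

```python
import pandas as pd

# Hypothetical sample data: the same logical values, some with stray whitespace.
df = pd.DataFrame({"Column Name": ["1", " 1", "0", "0 ", "Y", "1"]})

# Listing the raw unique values exposes the whitespace variants,
# which value_counts() treats as distinct keys.
raw_values = df["Column Name"].unique().tolist()
print(raw_values)

# Calling str.strip() without assignment leaves df untouched.
df["Column Name"].str.strip()
unchanged = df["Column Name"].unique().tolist()

# Assigning the cleaned Series back is what makes the change stick.
df["Column Name"] = df["Column Name"].str.strip().replace({"Y": "1", "N": "0"})
counts = df["Column Name"].value_counts()
print(counts)
```

Printing the list of unique values (rather than the value_counts() table) makes leading or trailing spaces visible, because each string is shown in quotes.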
Answer 2
Score: 1
A robust way to convert your data might be:
import pandas as pd
df = pd.DataFrame({'Column Name': [0, 1, '1', '1 ', ' 1 ', 'Y', 'N']})
mapper = {'Y': 1, 'N': 0}
# astype(str) lets .str.strip() handle the non-string entries too; append .astype(int) for an integer column
df['out'] = df['Column Name'].astype(str).str.strip().replace(mapper)
Output:
  Column Name out
0           0   0
1           1   1
2           1   1
3           1   1
4           1   1
5           Y   1
6           N   0
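To go one step further and end up with an integer column, one possible variation is sketched below. It is not part of the answer above: mapping to the strings "1" and "0", the errors="coerce" option, and the nullable "Int64" dtype are choices made for this illustration, and the names cleaned, numeric and leftovers are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Column Name": [0, 1, "1", "1 ", " 1 ", "Y", "N"]})
mapper = {"Y": "1", "N": "0"}  # map to strings so the column stays uniform

# Normalise everything to stripped strings, then map the letter codes.
cleaned = df["Column Name"].astype(str).str.strip().replace(mapper)

# errors="coerce" turns anything that is still not a number into NaN
# instead of raising, so unexpected typos can be inspected afterwards.
numeric = pd.to_numeric(cleaned, errors="coerce")
leftovers = cleaned[numeric.isna()].unique().tolist()

# The nullable Int64 dtype keeps integers even if some rows were coerced to NaN.
df["out"] = numeric.astype("Int64")
counts = df["out"].value_counts()
print(leftovers)
print(counts)
```

If leftovers is not empty, it lists the remaining typos that still need an entry in mapper.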