Pandas value_counts() does not display the counts correctly when cleaning data

Question

When cleaning the data, I need to identify any typos in a particular column. The values in that column should be either 1 or 0, denoting Yes or No.

To view the typos, I ran print(df["Column Name"].value_counts())

The result is:

1     40
1     67
0     89
0     33
Y      3

I tried the replace command for Y, but it then adds the 3 to one of the sets of 1s, and the output shows only that set of 1s and a single set of 0s.

Why are the same values being categorised as two different types?
How can I convert the strings to numbers and get the following result, as it should be?

1     110
0     122

I tried:

df["Column Name"].str.strip()
df["Column Name"].replace(" 1","1")
df["Column Name"].replace("Y","1")

Answer 1

Score: 1

Try to use pd.to_numeric:

df['Column Name'] = pd.to_numeric(df["Column Name"].str.strip().replace({'Y': 1, 'N': 0}))
df['Column Name'].value_counts()

You can also use np.unique to check the values in the column:

import numpy as np

np.unique(df['Column Name'], return_counts=True)
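
As a complementary check, here is a minimal sketch that makes the hidden differences visible. It assumes the duplicate-looking rows come from stray whitespace or from a mix of integers and strings; the sample Series below is hypothetical and is not the asker's data.

```python
import pandas as pd

# Hypothetical sample mixing integers, padded strings, and a stray "Y"
s = pd.Series([1, "1", " 1", 0, "0 ", "Y"], name="Column Name")

# repr() keeps the quotes and whitespace, so 1, '1' and ' 1' are shown as distinct
repr_counts = s.map(repr).value_counts()
print(repr_counts)

# Counting the Python type of each entry reveals a mixed int/str column
type_counts = s.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

Values that print identically in value_counts() but differ in repr() are the ones that need stripping or type conversion.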

Without modification:

>>> df['Column Name'].value_counts(sort=False)
1     40
1     67
0     89
0     33
Y      3
Name: Column Name, dtype: int64

With modification:

>>> pd.to_numeric(df["Column Name"].str.strip().replace({'Y': 1, 'N': 0})).value_counts()
0    122
1    110
Name: Column Name, dtype: int64
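
One likely reason the attempts in the question appeared to have no effect is that .str.strip() and .replace() return a new Series rather than modifying the column in place, so the result has to be assigned back. The sketch below illustrates this on a hypothetical, string-only sample; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical string-only sample (not the asker's data)
df = pd.DataFrame({"Column Name": ["1", " 1", "0", "0 ", "Y"]})

# Calling the method without assigning the result leaves df untouched
df["Column Name"].str.strip()
unchanged = df["Column Name"].tolist()
print(unchanged)  # the padded values are still present

# Assign each step back so the cleaning actually takes effect
col = df["Column Name"].str.strip()   # remove leading/trailing whitespace
col = col.replace("Y", "1")           # treat "Y" as 1
df["Column Name"] = pd.to_numeric(col)  # convert the strings to integers

counts = df["Column Name"].value_counts()
print(counts)
```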

Answer 2

Score: 1

A robust way to convert your data might be:

import pandas as pd

df = pd.DataFrame({'Column Name': [0, 1, '1', '1 ', ' 1 ', 'Y', 'N']})

# Map the Yes/No flags to their numeric equivalents
mapper = {'Y': 1, 'N': 0}

# Cast everything to str first so that .str.strip() also handles real integers;
# append .astype(int) to get an integer column
df['out'] = df['Column Name'].astype(str).str.strip().replace(mapper)#.astype(int)

Output:

  Column Name out
0           0   0
1           1   1
2           1   1
3           1   1
4           1   1
5           Y   1
6           N   0
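
If the column may contain other typos besides Y and N, the same idea can be extended to flag them instead of failing on conversion. This is only a sketch under stated assumptions: the sample values, the case-insensitive matching via .str.upper(), and the names raw, cleaned, numeric, unexpected and result are all illustrative and do not come from the answer above.

```python
import pandas as pd

# Hypothetical sample, including a lowercase "y" and an unrecognised token
raw = pd.Series([0, 1, "1", "1 ", " 1 ", "Y", "N", "y", "maybe"], name="Column Name")

# Map to strings so the intermediate column keeps a single type
mapper = {"Y": "1", "N": "0"}

# Normalise: cast to str, trim whitespace, upper-case, then map Y/N
cleaned = raw.astype(str).str.strip().str.upper().replace(mapper)

# errors="coerce" turns anything that is still not a number into NaN
numeric = pd.to_numeric(cleaned, errors="coerce")

# Original values that could not be converted, for manual review
unexpected = raw[numeric.isna()]
print(unexpected)

# Nullable integer dtype keeps the unconverted entries as <NA>
result = numeric.astype("Int64")
print(result.value_counts())
```

Reviewing unexpected before dropping or correcting those rows avoids silently counting a typo as a 0 or a 1.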



huangapple
  • Posted on April 10, 2023, 23:03:11
  • Source link: https://go.coder-hub.com/75978232.html