Pandas value_counts() does not display the counts correctly when cleaning data



When cleaning the data, I need to identify any typos in a particular column. Its values should be either 1 or 0, denoting Yes or No.

To view the typos, I run print(df["Column Name"].value_counts())

The result is:

    1    40
    1    67
    0    89
    0    33
    Y     3

I tried replace for Y, but that only adds 3 to one of the two sets of 1s; the output then shows just that set of 1s and a single set of 0s.

Why is the same value being counted as two separate categories?
How can I convert the strings to numbers and get the following result, as it should be?

    1    110
    0    122

I tried:

    df["Column Name"].str.strip()
    df["Column Name"].replace(" 1", "1")
    df["Column Name"].replace("Y", "1")


Score: 1

Try using pd.to_numeric:

    # Strip whitespace, map Y/N to 1/0, then convert to numbers and assign back.
    df['Column Name'] = pd.to_numeric(df["Column Name"].str.strip().replace({'Y': 1, 'N': 0}))
    df['Column Name'].value_counts()

Use np.unique to check the raw values in your dataframe:

    import numpy as np
    np.unique(df['Column Name'], return_counts=True)
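As a complement to the np.unique check above, here is a minimal sketch that makes stray whitespace visible by printing the repr() of each distinct label. The sample data is hypothetical and only assumes that the duplicates in the question come from leading or trailing spaces.

```python
import pandas as pd

# Hypothetical sample reproducing the symptom: some values carry stray whitespace.
df = pd.DataFrame({"Column Name": ["1", " 1", "0", "0 ", "Y"]})

counts = df["Column Name"].value_counts(sort=False)

# repr() wraps each label in quotes, so leading/trailing spaces become visible.
labels = [repr(value) for value in counts.index]
print(labels)
```

Labels such as ' 1' and '1' look identical in the normal value_counts() printout, which would explain why the same value appears twice.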

Without modification:

    >>> df['Column Name'].value_counts(sort=False)
    1    40
    1    67
    0    89
    0    33
    Y     3
    Name: Column Name, dtype: int64

With modification:

    >>> pd.to_numeric(df["Column Name"].str.strip().replace({'Y': 1, 'N': 0})).value_counts()
    0    122
    1    110
    Name: Column Name, dtype: int64
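One likely reason the attempts in the question seemed to have no effect is that str.strip() and replace() return a new Series rather than changing the column in place. The sketch below uses hypothetical sample data to show the difference between calling the method and assigning its result back.

```python
import pandas as pd

# Hypothetical sample: the same values with and without stray whitespace.
df = pd.DataFrame({"Column Name": ["1", " 1", "0", "0 ", "Y"]})

# Calling the method without assignment leaves the column unchanged.
df["Column Name"].str.strip()
unchanged = df["Column Name"].nunique()  # still 5 distinct labels

# Assigning the result back is what persists the cleanup.
df["Column Name"] = df["Column Name"].str.strip().replace({"Y": "1", "N": "0"})
cleaned = df["Column Name"].value_counts()
print(cleaned)
```

Note also that replace(" 1", "1") on a Series only matches cells whose whole value is exactly " 1", so it would miss values with trailing spaces or more than one leading space; str.strip() covers those cases.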


Score: 1


A robust method to convert your data might be:

    import pandas as pd

    df = pd.DataFrame({'Column Name': [0, 1, '1', '1 ', ' 1 ', 'Y', 'N']})

    # Map the Yes/No letters onto the numeric codes.
    mapper = {'Y': 1, 'N': 0}

    # Cast to str first so .str.strip() also works on non-string values.
    df['out'] = df['Column Name'].astype(str).str.strip().replace(mapper)#.astype(int)

Output:

      Column Name  out
    0           0    0
    1           1    1
    2           1    1
    3           1    1
    4           1    1
    5           Y    1
    6           N    0
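If the column might contain typos beyond Y and N, one possible extension is to convert with pd.to_numeric(..., errors="coerce") and inspect whatever fails to convert. This is only a sketch: the extra value "yes?" and the names cleaned, numeric and leftover are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the answer's sample plus one unexpected typo ("yes?").
df = pd.DataFrame({"Column Name": [0, 1, "1", "1 ", " 1 ", "Y", "N", "yes?"]})
mapper = {"Y": "1", "N": "0"}

# Normalise to stripped strings, then map the known Yes/No letters.
cleaned = df["Column Name"].astype(str).str.strip().replace(mapper)

# errors="coerce" turns anything non-numeric into NaN instead of raising.
numeric = pd.to_numeric(cleaned, errors="coerce")

# Rows that failed conversion are the remaining typos to inspect.
leftover = df.loc[numeric.isna(), "Column Name"]
print(leftover.tolist())
```

Once leftover is empty, the column can be cast to an integer type without losing rows silently.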

  • Posted on April 10, 2023, 23:03:11



