How to rank only duplicated rows, ignoring NaN?

Question

I have a table with data:

       Col1
    0   1.0
    1   1.0
    2   1.0
    3   2.0
    4   3.0
    5   4.0
    6   NaN

How can I rank only the duplicated values, leaving NaN unranked as well?
My current output, unfortunately, also ranks the unique values:

       Col1   Rn
    0   1.0  1.0
    1   1.0  2.0
    2   1.0  3.0
    3   2.0  1.0
    4   3.0  1.0
    5   4.0  1.0
    6   NaN  NaN

The output I need is:

       Col1   Rn
    0   1.0  1.0
    1   1.0  2.0
    2   1.0  3.0
    3   2.0  NaN
    4   3.0  NaN
    5   4.0  NaN
    6   NaN  NaN

Example of the code:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([[1],
                       [1],
                       [1],
                       [2],
                       [3],
                       [4],
                       [np.nan]], columns=['Col1'])
    print(df)

    # Adding row_number for each pair:
    df['Rn'] = df[df['Col1'].notnull()].groupby('Col1')['Col1'].rank(method="first", ascending=True)
    print(df)

    # I managed to select only the rows needed for the mask, but how can I apply it along with groupby?
    m = df.dropna().loc[df['Col1'].duplicated(keep=False)]
    print(m)

Thank you!

Answer 1

Score: 2

Try:

    m = df['Col1'].duplicated(keep=False)
    df['Rn'] = df[m].groupby('Col1')['Col1'].rank(method="first", ascending=True)
    print(df)

Prints:

       Col1   Rn
    0   1.0  1.0
    1   1.0  2.0
    2   1.0  3.0
    3   2.0  NaN
    4   3.0  NaN
    5   4.0  NaN
    6   NaN  NaN
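
One caveat worth keeping in mind: `duplicated` treats repeated NaN values as duplicates of each other, so a column with more than one NaN would have those rows included in the mask above. A minimal sketch that makes the NaN exclusion explicit could look like the following. The sample data (two duplicated groups and two NaN rows) is made up for illustration and is not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1.0 and 2.0 are duplicated, 3.0 is unique, NaN appears twice
df = pd.DataFrame({'Col1': [1, 1, 2, 3, 2, np.nan, np.nan]})

# Keep only rows whose value is duplicated AND not NaN
m = df['Col1'].duplicated(keep=False) & df['Col1'].notna()

# Rank within each group of equal values; rows outside the mask
# are absent from the result and become NaN on index alignment
df['Rn'] = df[m].groupby('Col1')['Col1'].rank(method='first', ascending=True)
print(df)
```

With this data, rows 0 and 1 get ranks 1 and 2, rows 2 and 4 get ranks 1 and 2, and the unique and NaN rows stay NaN.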

Answer 2

Score: 1

You can identify the duplicated values and only compute the rank for those:

    # identify duplicated rows
    m = df['Col1'].duplicated(keep=False)
    # compute the rank only for those
    df['Rn'] = df.loc[m, 'Col1'].rank(method='first', ascending=True)

Note that if you want to increment a count of the duplicates, you can use groupby.cumcount:

    m = df['Col1'].duplicated(keep=False)
    df['Rn'] = df.loc[m, ['Col1']].groupby('Col1').cumcount().add(1)

Output:

       Col1   Rn
    0   1.0  1.0
    1   1.0  2.0
    2   1.0  3.0
    3   2.0  NaN
    4   3.0  NaN
    5   4.0  NaN
    6   NaN  NaN
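
The two snippets in this answer give the same result on the question's data only because it contains a single duplicated value. As a hedged illustration with made-up data containing two duplicated groups: `rank` on the masked column ranks all duplicated rows together, while `groupby.cumcount` restarts the count within each group of equal values. The column names `rank_all` and `count_in_group` are placeholders chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1.0 and 2.0 each appear twice, 3.0 is unique, one NaN
df = pd.DataFrame({'Col1': [1, 1, 2, 2, 3, np.nan]})
m = df['Col1'].duplicated(keep=False)

# Ranks every duplicated row in one pass (no grouping)
df['rank_all'] = df.loc[m, 'Col1'].rank(method='first', ascending=True)

# Numbers the occurrences within each group of equal values, starting at 1
df['count_in_group'] = df.loc[m, ['Col1']].groupby('Col1').cumcount().add(1)
print(df)
```

Here `rank_all` runs 1, 2, 3, 4 across the four duplicated rows, whereas `count_in_group` runs 1, 2, 1, 2. If a per-value row number is what is wanted, the grouped form is the one to use.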

huangapple
  • Posted on June 15, 2023, 02:15:45
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76476469.html