How to rank only duplicated rows, excluding NaN?

Question

I have a table with data:

   Col1
0   1.0
1   1.0
2   1.0
3   2.0
4   3.0
5   4.0
6   NaN

How can I rank only the duplicated values, leaving unique values and NaN unranked?
Unfortunately, my current output ranks the unique values as well:

   Col1   Rn
0   1.0  1.0
1   1.0  2.0
2   1.0  3.0
3   2.0  1.0
4   3.0  1.0
5   4.0  1.0
6   NaN  NaN

The output I need is:

   Col1   Rn
0   1.0  1.0
1   1.0  2.0
2   1.0  3.0
3   2.0  NaN
4   3.0  NaN
5   4.0  NaN
6   NaN  NaN

Example of the code:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1],
                   [1],
                   [1],
                   [2],
                   [3],
                   [4],
                   [np.nan]], columns=['Col1'])
print(df)


# Add a row number within each group of equal values:
df['Rn'] = df[df['Col1'].notnull()].groupby('Col1')['Col1'].rank(method="first", ascending=True)
print(df)

# I managed to select only the rows needed for the mask, but how can I apply it together with groupby?
m = df.dropna().loc[df['Col1'].duplicated(keep=False)]
print(m)

Thank you!

Answer 1

Score: 2

Try:

m = df['Col1'].duplicated(keep=False)
df['Rn'] = df[m].groupby('Col1')['Col1'].rank(method="first", ascending=True)
print(df)

Prints:

   Col1   Rn
0   1.0  1.0
1   1.0  2.0
2   1.0  3.0
3   2.0  NaN
4   3.0  NaN
5   4.0  NaN
6   NaN  NaN
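
The mask works on the sample data because a single NaN is not a duplicate of anything. However, duplicated treats repeated NaN values as duplicates of each other, so with more than one NaN in the column the mask would select those rows too. A minimal sketch, assuming hypothetical data with two duplicated groups and two NaN values, that excludes missing values from the mask explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two duplicated groups (1.0 and 2.0), one unique value, two NaNs
df = pd.DataFrame({'Col1': [1.0, 1.0, 2.0, 2.0, 3.0, np.nan, np.nan]})

# Rows whose value occurs more than once, with missing values excluded explicitly
m = df['Col1'].duplicated(keep=False) & df['Col1'].notna()

# Rank within each group of equal values; rows outside the mask
# stay NaN because the assignment aligns on the index
df['Rn'] = df[m].groupby('Col1')['Col1'].rank(method='first', ascending=True)
print(df)
```

With this mask the ranking restarts for each duplicated value, and both the unique value and the NaN rows are left as NaN in Rn.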

Answer 2

Score: 1

You can identify the duplicated values and only compute the rank for those:

# identify duplicated rows
m = df['Col1'].duplicated(keep=False)

# compute the rank only for those
df['Rn'] = df.loc[m, 'Col1'].rank(method='first', ascending=True)

Note that if you want an incrementing count that restarts within each group of duplicates, you can use groupby.cumcount:

m = df['Col1'].duplicated(keep=False)

df['Rn'] = df.loc[m, ['Col1']].groupby('Col1').cumcount().add(1)

Output:

   Col1   Rn
0   1.0  1.0
1   1.0  2.0
2   1.0  3.0
3   2.0  NaN
4   3.0  NaN
5   4.0  NaN
6   NaN  NaN
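
The two snippets give the same result on the sample data only because it contains a single duplicated value. A small sketch, assuming hypothetical data with two duplicated groups (the column names global_rank and group_count are made up for illustration), that shows where the plain rank and the per-group count differ:

```python
import pandas as pd

# Hypothetical data: 1.0 and 2.0 are each duplicated, 3.0 is unique
df = pd.DataFrame({'Col1': [1.0, 1.0, 2.0, 2.0, 3.0]})
m = df['Col1'].duplicated(keep=False)

# Plain rank: numbers all selected rows together, so the count does not restart
df['global_rank'] = df.loc[m, 'Col1'].rank(method='first', ascending=True)

# groupby.cumcount: restarts at 1 for every group of equal values
df['group_count'] = df.loc[m, ['Col1']].groupby('Col1').cumcount().add(1)
print(df)
```

If the numbering should restart for every duplicated value, the groupby-based variants are probably the ones to use; the plain rank fits when a single ordering across all duplicated rows is wanted.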

huangapple
  • Posted on June 15, 2023 at 02:15:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76476469.html