Applying within group rankings for Pandas DataFrame

Question

The problem statement has people choosing 3 items and ranking them 1 to 3. My Pandas DataFrame contains one row for each item selected, where the first column is the person's name and the second column is the ranking the person has given to the selection.

Something like:

df = pd.DataFrame.from_dict({
  'Name':['Alice','Alice','Alice','Bob','Bob','Bob','Charlie','Charlie','Charlie'], 
  'Item':[...], 
  'Rank':[1,2,3,3,1,2,None,None,None]
})

The issue is that some people did not assign rankings to their items, for example Charlie in the DataFrame above. For these people, I want to fill in their rankings with a valid 'random' ranking, that is, give each of their items a unique value from 1 to 3. Also, some people selected only 2 items and also forgot to rank them, so I need to be able to handle a variable number of selected items.

I was attempting to do a cumsum, where I first fill each null value with 1 and then run the cumsum within each group of a groupby on the names.

Something like (although I think this is far from correct):

df.groupby('Name')['Rank'].cumsum()

Also, I understand this may be easy to do by iterating over the rows. However, since this is Pandas, I am looking for a more efficient, vectorized solution.
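To make the problem with the cumsum idea concrete, here is a minimal sketch of the attempt described above (fill nulls with 1, then take a cumulative sum per name). The Item column is left out because its values are not given in the question. It produces a valid ranking only for people who ranked nothing, and it corrupts the rankings that were already filled in:

```python
import pandas as pd

# Same Name/Rank data as in the question; the Item column is omitted.
df = pd.DataFrame({
    "Name": ["Alice"] * 3 + ["Bob"] * 3 + ["Charlie"] * 3,
    "Rank": [1, 2, 3, 3, 1, 2, None, None, None],
})

# Fill missing ranks with 1, then accumulate within each person.
attempt = df["Rank"].fillna(1).groupby(df["Name"]).cumsum()

print(attempt.tolist())
# [1.0, 3.0, 6.0, 3.0, 4.0, 6.0, 1.0, 2.0, 3.0]
# Charlie gets 1, 2, 3, but Alice's and Bob's existing ranks are overwritten
# by running totals.
```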

Answer 1

Score: 0

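You can use groupby(...).cumcount(), which numbers the rows within each group starting from 0. Adding 1 gives positions 1 to n, and fillna applies them only where Rank is missing: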

df['Rank'] = df['Rank'].fillna(df.groupby('Name').cumcount().add(1))
print(df)

# Output
      Name  Rank
0    Alice   1.0
1    Alice   2.0
2    Alice   3.0
3      Bob   3.0
4      Bob   1.0
5      Bob   2.0
6  Charlie   1.0
7  Charlie   2.0
8  Charlie   3.0
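The same idea also covers the case from the question where someone picked only 2 items. The following is a small self-contained sketch; the person 'Dana' and the cast back to integers are additions for illustration and are not part of the original data:

```python
import pandas as pd

# 'Dana' is a hypothetical person who chose only two items and ranked neither.
df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Alice",
             "Charlie", "Charlie", "Charlie",
             "Dana", "Dana"],
    "Rank": [1, 2, 3, None, None, None, None, None],
})

# Position of each row within its group, counted from 1.
position = df.groupby("Name").cumcount().add(1)

# Use the position only where Rank is missing, then drop the float dtype
# that the NaN values forced on the column.
df["Rank"] = df["Rank"].fillna(position).astype(int)

print(df["Rank"].tolist())
# [1, 2, 3, 1, 2, 3, 1, 2]
```

Because cumcount restarts for every name, each group is numbered up to its own size, so no group length has to be hard-coded.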

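To get a random ranking instead, shuffle the rows with sample before grouping. fillna aligns on the index, so each shuffled position lands back on its original row: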

df['Rank'] = df['Rank'].fillna(df.sample(frac=1).groupby('Name').cumcount().add(1))
print(df)

# Output (one possible result; the filled ranks vary between runs)
      Name  Rank
0    Alice   1.0
1    Alice   2.0
2    Alice   3.0
3      Bob   3.0
4      Bob   1.0
5      Bob   2.0
6  Charlie   3.0
7  Charlie   1.0
8  Charlie   2.0
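For a repeatable shuffle, sample accepts a random_state argument. The sketch below fixes the seed (the value 0 is arbitrary) and then checks that every person ends up with a permutation of 1 to n. It assumes, as in the question, that a person has either ranked all of their items or none of them; with a partially ranked group the filled values could collide with existing ranks.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice"] * 3 + ["Bob"] * 3 + ["Charlie"] * 3,
    "Rank": [1, 2, 3, 3, 1, 2, None, None, None],
})

# Shuffle all rows reproducibly; the original index labels are kept.
shuffled = df.sample(frac=1, random_state=0)

# Number the rows within each person in shuffled order, starting at 1.
random_position = shuffled.groupby("Name").cumcount().add(1)

# fillna aligns on the index, so only the missing ranks are filled.
df["Rank"] = df["Rank"].fillna(random_position).astype(int)

# Sanity check: every person's ranks are exactly 1..n in some order.
for name, ranks in df.groupby("Name")["Rank"]:
    assert sorted(ranks) == list(range(1, len(ranks) + 1)), name
```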

huangapple
  • Posted on May 30, 2023, 03:03:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76359784.html