pandas add a ranking column based on another column

Question

I have the following DataFrame:

    import pandas as pd

    df = pd.DataFrame({'feature': ['a', 'b', 'c', 'd', 'e'],
                       'importance': [0.1, 0.5, 0.4, 0.2, 0.8]})
    df

      feature  importance
    0       a         0.1
    1       b         0.5
    2       c         0.4
    3       d         0.2
    4       e         0.8

I want to add a column, ranking, that ranks each feature by its share of the total importance:

    feature_rank = feature's importance / sum of all features' importances

So for each feature:

    a -> 0.1 / (0.1 + 0.5 + 0.4 + 0.2 + 0.8) = 0.05
    b -> 0.5 / (0.1 + 0.5 + 0.4 + 0.2 + 0.8) = 0.25
    c -> 0.4 / (0.1 + 0.5 + 0.4 + 0.2 + 0.8) = 0.2
    d -> 0.2 / (0.1 + 0.5 + 0.4 + 0.2 + 0.8) = 0.1
    e -> 0.8 / (0.1 + 0.5 + 0.4 + 0.2 + 0.8) = 0.4
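The arithmetic above can be reproduced in pandas with a single vectorized division. This is a minimal sketch; the column name `share` is an assumption introduced here for illustration and does not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({'feature': ['a', 'b', 'c', 'd', 'e'],
                   'importance': [0.1, 0.5, 0.4, 0.2, 0.8]})

# 'share' is a hypothetical column name used only for this illustration.
# Each importance is divided by the column total, so the shares sum to 1.
df['share'] = df['importance'] / df['importance'].sum()

print(df)
```

The shares only rescale the importances, so ordering the rows by `share` gives the same order as ordering them by `importance`.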

Expected result:

The final df, with rank 1 for the largest share, should therefore be:

      feature  importance  ranking
    0       a         0.1        5
    1       b         0.5        2
    2       c         0.4        3
    3       d         0.2        4
    4       e         0.8        1

Answer 1

Score: 2

You can use rank after normalizing by the Series' sum:

    df['ranking'] = (df['importance'].div(df['importance'].sum())
                     .rank(method='dense', ascending=False)
                     .astype(int)  # optional: rank returns floats
                     )

Note that dividing by a strictly positive number does not change the ordering, so if the sum is positive you can simplify to:

    df['ranking'] = df['importance'].rank(method='dense', ascending=False)

Output (with the integer cast):

      feature  importance  ranking
    0       a         0.1        5
    1       b         0.5        2
    2       c         0.4        3
    3       d         0.2        4
    4       e         0.8        1
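To check the claim about the sign of the sum, the sketch below compares ranks computed on the raw and on the normalized values. The second Series, `negative`, is made-up data added here only to show what happens when the sum is negative.

```python
import pandas as pd

importance = pd.Series([0.1, 0.5, 0.4, 0.2, 0.8])

# Ranks of the raw values and of the values divided by their (positive) sum.
raw_rank = importance.rank(method='dense', ascending=False).astype(int)
norm_rank = (importance / importance.sum()).rank(method='dense', ascending=False).astype(int)
print(raw_rank.equals(norm_rank))  # True

# Hypothetical data with a negative sum: the division reverses the order.
negative = pd.Series([-1.0, -3.0])
neg_raw = negative.rank(method='dense', ascending=False).astype(int)
neg_norm = (negative / negative.sum()).rank(method='dense', ascending=False).astype(int)
print(neg_raw.tolist(), neg_norm.tolist())  # [1, 2] [2, 1]
```

So the shortcut of ranking `importance` directly is safe for non-negative importances with a positive total, which is the case in the question.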

Answer 2

Score: 1

This may not be the most efficient approach, but it is another way to get the same result: sort by importance, number the rows, then restore the original order.

    import pandas as pd

    df = pd.DataFrame({'feature': ['a', 'b', 'c', 'd', 'e'],
                       'importance': [0.1, 0.5, 0.4, 0.2, 0.8]})
    # Sort from most to least important
    df = df.sort_values(by='importance', ascending=False)
    # Number the sorted rows 1..n
    df['ranking'] = range(1, len(df) + 1)
    # Restore the original row order
    df = df.sort_index()
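One difference from `rank(method='dense')` shows up when two features have the same importance. The sketch below uses made-up data with a tie; the column names `by_position` and `dense` are assumptions chosen for this comparison, and `kind='stable'` is added here so that the sort is a stable one.

```python
import pandas as pd

# Hypothetical data in which 'a' and 'c' are tied.
df = pd.DataFrame({'feature': ['a', 'b', 'c'],
                   'importance': [0.5, 0.2, 0.5]})

# Sort-and-number approach: every row gets its own position, even tied rows.
ordered = df.sort_values(by='importance', ascending=False, kind='stable')
ordered['by_position'] = range(1, len(ordered) + 1)
df = ordered.sort_index()

# rank(method='dense'): tied rows share the same rank.
df['dense'] = df['importance'].rank(method='dense', ascending=False).astype(int)

print(df)
```

With this data the tied rows receive positions 1 and 2 under the sort-based approach but share rank 1 under the dense ranking, so which one to use depends on how ties should be treated.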

Answer 3

Score: 1

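Another possible solution:

    df.assign(ranking=df.sort_values('importance', ascending=False).index + 1)

Note that this puts the index label of the k-th most important row (plus one) into row k, so it assumes the default 0..n-1 index. It coincides with the rank for this data, but not in general.

Output:

      feature  importance  ranking
    0       a         0.1        5
    1       b         0.5        2
    2       c         0.4        3
    3       d         0.2        4
    4       e         0.8        1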
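The index-based expression happens to match the expected output because, for the question's data, the sort order is its own inverse. A small sketch with different, made-up data shows the two can diverge. Using `rank(method='first')` for comparison is a substitution made here, not part of the answer.

```python
import pandas as pd

# Hypothetical data chosen so that the sort order is not its own inverse.
df = pd.DataFrame({'feature': ['a', 'b', 'c'],
                   'importance': [0.2, 0.1, 0.3]})

# Index-based expression: row k receives (label of the k-th largest row) + 1.
by_index = df.assign(
    ranking=df.sort_values('importance', ascending=False).index + 1
)

# rank(method='first'): each row receives its own position in descending order.
by_rank = df.assign(
    ranking=df['importance'].rank(method='first', ascending=False).astype(int)
)

print(by_index['ranking'].tolist())  # [3, 1, 2]
print(by_rank['ranking'].tolist())   # [2, 3, 1]
```

Here 'c' has the largest importance and should get rank 1, which only the `rank`-based version produces.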

huangapple
  • Posted on June 15, 2023, 20:40:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76482580.html