Aggregating a dataframe column based on simple majority


Question

I have a `dataframe` from my model's predictions, similar to the one below:

```python
import pandas as pd

df = pd.DataFrame({
    'trip-id': [8,8,8,8,8,8,8,8,4,4,4,4,4,4,4,4,4,4,4,4],
    'segment-id': [1,1,1,1,1,1,1,1,0,0,0,0,0,0,5,5,5,5,5,5],
    'true_label': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'prediction': [3, 3, 3, 1, 2, 4, 0, 0, 3, 3, 3, 0, 1, 2, 3, 3, 1, 1, 2, 2]})
df
```

```
    trip-id  segment-id  true_label  prediction
0         8           1           3           3
1         8           1           3           3
2         8           1           3           3
3         8           1           3           1
4         8           1           3           2
5         8           1           3           4
6         8           1           3           0
7         8           1           3           0
8         4           0           3           3
9         4           0           3           3
10        4           0           3           3
11        4           0           3           0
12        4           0           3           1
13        4           0           3           2
14        4           5           3           3
15        4           5           3           3
16        4           5           3           1
17        4           5           3           1
18        4           5           3           2
19        4           5           3           2
```

The sample holds the predicted and true labels (instances in [0, 1, ..., 4]) for the segments of each trip.

I would like to generate a summary of each segment's prediction based on simple majority:

  • The segment's predicted value is the predicted instance [0, 1, ..., 4] that has the simple majority within the segment.
  • Where there is a tie for the majority among predicted instances, the value matching the true_label is taken as the segment's prediction.
  • If there is a tie for the majority and none of the tied instances matches the true_label, then the tied instance that appears first in the df is taken as the segment's predicted value.
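The three rules above can be sketched for a single segment in plain Python. This is only an illustration of the rules as stated, not code from the question; the function name `segment_prediction` is hypothetical.

```python
from collections import Counter

def segment_prediction(predictions, true_label):
    """Return one segment's prediction under the three rules (hypothetical helper)."""
    counts = Counter(predictions)
    top = max(counts.values())
    # Values sharing the highest count, in order of first appearance
    # (Counter keeps insertion order on Python 3.7+).
    tied = [value for value, n in counts.items() if n == top]
    if len(tied) == 1:
        return tied[0]       # rule 1: a single value has the simple majority
    if true_label in tied:
        return true_label    # rule 2: tie resolved in favour of the true label
    return tied[0]           # rule 3: tie resolved by first appearance

print(segment_prediction([3, 3, 3, 1, 2, 4, 0, 0], 3))  # trip 8, segment 1 -> 3
print(segment_prediction([3, 3, 1, 1, 2, 2], 3))        # trip 4, segment 5 -> 3
print(segment_prediction([1, 1, 2, 2, 0], 3))           # tie without the true label -> 1
```

The last call uses made-up predictions (not from the sample) to exercise the third rule.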

Currently I can do this:

```python
segments_summary = (
    df['true_label'].eq(df['prediction'])
    .groupby([df['true_label'], df['trip-id'], df['segment-id']]).mean()
    .ge(0.5)
    .groupby(level='true_label').agg(['size', 'sum'])
    .rename(columns={'size': 'total-segments', 'sum': 'correctly-predicted'})
    .assign(recall=lambda x: round(x['correctly-predicted'] / x['total-segments'], 2))
    .reindex(range(5), fill_value='-')
    .reset_index()
)
```

Which produces:

```
segments_summary
   true_label  total-segments  correctly-predicted  recall
0           0               -                    -       -
1           1               -                    -       -
2           2               -                    -       -
3           3               3                    1    0.33
4           4               -                    -       -
```

But this is not exactly what I wanted. Going by the conditions above, all 3 segments should have been predicted correctly.

  • trip 8, segment 1: 3 has the simple majority, so that segment should be considered as predicted 3.
  • trip 4, segment 0: 3 has the simple majority, so that segment is predicted as 3.
  • trip 4, segment 5: there is a tie, so the prediction matching true_label should be the segment's prediction -> 3.
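One way to see why the current attempt counts only one segment: `.mean().ge(0.5)` asks whether at least half of a segment's rows are correct, which is a stricter test than simple majority. The sketch below recomputes that share per segment in plain Python, with the predictions copied from the sample; the variable names are illustrative only.

```python
# (trip-id, segment-id) -> predictions, copied from the sample dataframe.
segments = {
    (8, 1): [3, 3, 3, 1, 2, 4, 0, 0],
    (4, 0): [3, 3, 3, 0, 1, 2],
    (4, 5): [3, 3, 1, 1, 2, 2],
}
TRUE_LABEL = 3  # true_label is 3 for every row in the sample

# Share of rows in each segment whose prediction equals the true label.
share_correct = {
    key: sum(p == TRUE_LABEL for p in preds) / len(preds)
    for key, preds in segments.items()
}
# The question's test: is at least half of the segment correct?
passes_half = {key: share >= 0.5 for key, share in share_correct.items()}

for key in segments:
    print(key, round(share_correct[key], 3), passes_half[key])
# (8, 1) 0.375 False
# (4, 0) 0.5 True
# (4, 5) 0.333 False
```

Only trip 4, segment 0 reaches the 0.5 threshold, which matches the `correctly-predicted` value of 1 and the recall of 0.33 shown above.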

Expected result:

```
   true_label  total-segments  correctly-predicted  recall
0           0               -                    -       -
1           1               -                    -       -
2           2               -                    -       -
3           3               3                    3     1.0
4           4               -                    -       -
```

Answer 1

Score: 2

I would use:

```python
out = (df
   # get the top prediction
   .value_counts(sort=False).reset_index(name='count')
   .assign(flag=lambda d: d['true_label'].eq(d['prediction']))
   .sort_values(by=['trip-id', 'segment-id', 'count', 'flag'],
                ascending=[True, True, False, False],
                kind='stable'
               )
   .groupby(['trip-id', 'segment-id']).first()
   # check if correctly predicted
   .assign(**{'correctly-predicted': lambda d: d['true_label'].eq(d['prediction'])})
   # aggregate per prediction
   .groupby('prediction')
   .agg(**{'total-segments': ('prediction', 'count'),
           'correctly-predicted': ('correctly-predicted', 'sum')
          })
   .assign(**{'recall': lambda d: d['correctly-predicted'].div(d['total-segments'])})
   .reindex(range(5), fill_value='-')
   .reset_index()
)
```

Output:

```
   prediction  total-segments  correctly-predicted  recall
0           0               -                    -       -
1           1               -                    -       -
2           2               -                    -       -
3           3               3                    3     1.0
4           4               -                    -       -
```
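The sort-then-take-first idea in the answer can be illustrated for one segment without pandas. This is a simplified sketch, not the answer's code: it sorts the distinct predicted values by descending count, then by whether they match the true label, and keeps the first row. The name `top_by_sort` is hypothetical.

```python
from collections import Counter

def top_by_sort(predictions, true_label):
    """Pick a segment's prediction by sorting (count desc, matches-true-label first)."""
    counts = Counter(predictions)
    # One row per distinct value: (value, count, flag), in order of first appearance.
    rows = [(value, n, value == true_label) for value, n in counts.items()]
    # Python's sort is stable, so rows that tie on both keys keep their original order.
    rows.sort(key=lambda r: (-r[1], not r[2]))
    return rows[0][0]

print(top_by_sort([3, 3, 1, 1, 2, 2], 3))  # tie broken by the true label -> 3
print(top_by_sort([1, 1, 2, 2, 0], 3))     # tie broken by first appearance -> 1
```

Sorting on a flag column after the count column is what lets a single `first()` per group apply the tie-break in favour of the true label.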

huangapple
  • Posted on May 24, 2023, 22:52:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76324830.html