dataframe column's aggregate based on simple majority
Question
I have a `dataframe` from my model's predictions, similar to the one below:
```python
import pandas as pd

df = pd.DataFrame({
    'trip-id': [8,8,8,8,8,8,8,8,4,4,4,4,4,4,4,4,4,4,4,4],
    'segment-id': [1,1,1,1,1,1,1,1,0,0,0,0,0,0,5,5,5,5,5,5],
    'true_label': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'prediction': [3, 3, 3, 1, 2, 4, 0, 0, 3, 3, 3, 0, 1, 2, 3, 3, 1, 1, 2, 2]})
df
```
```
    trip-id  segment-id  true_label  prediction
0         8           1           3           3
1         8           1           3           3
2         8           1           3           3
3         8           1           3           1
4         8           1           3           2
5         8           1           3           4
6         8           1           3           0
7         8           1           3           0
8         4           0           3           3
9         4           0           3           3
10        4           0           3           3
11        4           0           3           0
12        4           0           3           1
13        4           0           3           2
14        4           5           3           3
15        4           5           3           3
16        4           5           3           1
17        4           5           3           1
18        4           5           3           2
19        4           5           3           2
```
The sample holds the per-instance predictions and true labels (classes `[0,1,..4]`) for the segments of each trip.
I would like to generate a summary of the segments' predictions based on simple majority:
- The segment's predicted value is the predicted class (one of `[0,1,..4]`) that has the simple majority among the segment's instances.
- Where there is a tie for the majority, the value matching `true_label` is taken as the segment's prediction.
- If there is a tie for the majority and none of the tied values matches `true_label`, then the tied value that appears first in `df` is taken as the segment's predicted value.
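The three rules above can be sketched for a single segment in plain Python. This is only an illustration of the intended behaviour, not code from the question: the helper name `segment_prediction` is an assumption, and the third call uses made-up predictions to exercise the last rule.

```python
from collections import Counter

def segment_prediction(predictions, true_label):
    """Pick one label for a segment from its per-instance predictions."""
    counts = Counter(predictions)
    top = max(counts.values())
    # Labels sharing the highest count, in order of first appearance
    # (Counter keeps insertion order on Python 3.7+).
    tied = [label for label, n in counts.items() if n == top]
    # Rule 2: on a tie, prefer the label matching true_label.
    if true_label in tied:
        return true_label
    # Rules 1 and 3: the single winner, or the tied label seen first.
    return tied[0]

print(segment_prediction([3, 3, 3, 1, 2, 4, 0, 0], 3))  # trip 8, segment 1
print(segment_prediction([3, 3, 1, 1, 2, 2], 3))        # trip 4, segment 5
print(segment_prediction([1, 1, 2, 2, 0], 3))           # hypothetical tie without the true label
```

With the sample data the first two calls return 3; the hypothetical third call returns 1, the tied label that appears first.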
Currently I can do this:
```python
segments_summary = (
    df['true_label'].eq(df['prediction'])
    .groupby([df['true_label'], df['trip-id'], df['segment-id']]).mean()
    .ge(0.5)
    .groupby(level='true_label').agg(['size', 'sum'])
    .rename(columns={'size': 'total-segments', 'sum': 'correctly-predicted'})
    .assign(recall=lambda x: round(x['correctly-predicted']/x['total-segments'], 2))
    .reindex(range(5), fill_value='-')
    .reset_index())
```
Which produces:
```
segments_summary
   true_label  total-segments  correctly-predicted  recall
0           0               -                    -       -
1           1               -                    -       -
2           2               -                    -       -
3           3               3                    1    0.33
4           4               -                    -       -
```
But this is not exactly what I want. Going by the conditions above, all 3 segments should have been predicted correctly:
- trip 8, segment 1: `3` has the simple majority, so the segment should be considered as predicted `3`.
- trip 4, segment 0: `3` has the simple majority, so the segment is predicted as `3`.
- trip 4, segment 5: there is a tie, so the prediction matching `true_label` should be the segment's prediction -> `3`.
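A quick way to check these three cases, and to see why the `.mean().ge(0.5)` approach counts only one segment, is to tabulate the votes per segment. The sketch below rebuilds the sample `df`; the names `votes` and `frac_correct` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'trip-id': [8,8,8,8,8,8,8,8,4,4,4,4,4,4,4,4,4,4,4,4],
    'segment-id': [1,1,1,1,1,1,1,1,0,0,0,0,0,0,5,5,5,5,5,5],
    'true_label': [3] * 20,
    'prediction': [3, 3, 3, 1, 2, 4, 0, 0, 3, 3, 3, 0, 1, 2, 3, 3, 1, 1, 2, 2]})

# Number of votes for each predicted class within every (trip, segment).
votes = (df.groupby(['trip-id', 'segment-id', 'prediction'])
           .size()
           .unstack(fill_value=0))
print(votes)

# Fraction of instances predicted correctly per segment; the `.ge(0.5)`
# test in the current code is applied to exactly this value.
frac_correct = (df['prediction'].eq(df['true_label'])
                  .groupby([df['trip-id'], df['segment-id']])
                  .mean())
print(frac_correct)
```

In trip 8, segment 1, class 3 has the most votes (3 of 8) but only 37.5% of the instances are correct, so a 0.5 threshold rejects it. The threshold checks whether at least half of the instances are correct, which is stricter than asking which class received the most votes.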
Expected result:
```
   true_label  total-segments  correctly-predicted  recall
0           0               -                    -       -
1           1               -                    -       -
2           2               -                    -       -
3           3               3                    3     1.0
4           4               -                    -       -
```
Answer 1
Score: 2
I would use:
```python
out = (df
   # get the top prediction
   .value_counts(sort=False).reset_index(name='count')
   .assign(flag=lambda d: d['true_label'].eq(d['prediction']))
   .sort_values(by=['trip-id', 'segment-id', 'count', 'flag'],
                ascending=[True, True, False, False],
                kind='stable'
               )
   .groupby(['trip-id', 'segment-id']).first()
   # check if correctly predicted
   .assign(**{'correctly-predicted': lambda d: d['true_label'].eq(d['prediction'])})
   # aggregate per prediction
   .groupby('prediction')
   .agg(**{'total-segments': ('prediction', 'count'),
           'correctly-predicted': ('correctly-predicted', 'sum')
          })
   .assign(**{'recall': lambda d: d['correctly-predicted'].div(d['total-segments'])})
   .reindex(range(5), fill_value='-')
   .reset_index()
)
```
Output:
```
   prediction  total-segments  correctly-predicted  recall
0           0               -                    -       -
1           1               -                    -       -
2           2               -                    -       -
3           3               3                    3     1.0
4           4               -                    -       -
```
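As a cross-check, here is a hedged variant that spells out all three tie-breaking rules explicitly and groups the final summary by `true_label`, as in the question's expected result. It is an alternative sketch, not the method above: the helper columns `first_row` and `match` and the names `votes`, `winners` and `summary` are assumptions introduced for illustration. Tracking `first_row` avoids depending on the row order returned by `value_counts(sort=False)`, which I have not verified to follow order of appearance in every pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'trip-id': [8,8,8,8,8,8,8,8,4,4,4,4,4,4,4,4,4,4,4,4],
    'segment-id': [1,1,1,1,1,1,1,1,0,0,0,0,0,0,5,5,5,5,5,5],
    'true_label': [3] * 20,
    'prediction': [3, 3, 3, 1, 2, 4, 0, 0, 3, 3, 3, 0, 1, 2, 3, 3, 1, 1, 2, 2]})

keys = ['trip-id', 'segment-id']

# One row per (segment, predicted class): vote count and the position
# at which that class first appears in df.
votes = (df.assign(first_row=range(len(df)))
           .groupby(keys + ['true_label', 'prediction'], as_index=False)
           .agg(count=('first_row', 'size'), first_row=('first_row', 'min')))
votes['match'] = votes['prediction'].eq(votes['true_label'])

# Winner per segment: most votes, then a match with true_label,
# then the earliest first appearance.
winners = (votes.sort_values(keys + ['count', 'match', 'first_row'],
                             ascending=[True, True, False, False, True])
                .drop_duplicates(keys))

summary = (winners.groupby('true_label')
                  .agg(**{'total-segments': ('match', 'size'),
                          'correctly-predicted': ('match', 'sum')}))
summary['recall'] = (summary['correctly-predicted']
                     / summary['total-segments']).round(2)
summary = summary.reindex(range(5), fill_value='-').reset_index()
print(summary)
```

On the sample data both groupings give the same table, because every segment's winning prediction equals its `true_label`; they would differ as soon as a segment is mispredicted.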