How can I get weights considered in DataFrame.describe?

Question

I have a sample with students' scores and the population at each score:

import numpy as np
import pandas as pd

# Create the DataFrame
sample = pd.DataFrame({
    'score': [595, 594, 593, 592, 591, 590, 589, 588, 587, 586,
              585, 584, 583, 582, 581, 580, 579, 578, 577, 576],
    'population': [705, 745, 716, 742, 722, 746, 796, 750, 816, 809,
                   815, 821, 820, 865, 876, 886, 947, 949, 1018, 967]})

Then I calculate the weighted average of the scores:

np.average(sample['score'], weights=sample['population'])
# 584.9062443219672
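
As a cross-check on what np.average does with weights, the same number can be obtained by hand as sum(score * population) / sum(population). The sketch below rebuilds the sample so that it runs on its own; the variable names manual and builtin are only illustrative.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'score': [595, 594, 593, 592, 591, 590, 589, 588, 587, 586,
              585, 584, 583, 582, 581, 580, 579, 578, 577, 576],
    'population': [705, 745, 716, 742, 722, 746, 796, 750, 816, 809,
                   815, 821, 820, 865, 876, 886, 947, 949, 1018, 967]})

# Weighted mean written out: each score counts `population` times
manual = (sample['score'] * sample['population']).sum() / sample['population'].sum()

# The same quantity via NumPy
builtin = np.average(sample['score'], weights=sample['population'])

print(manual, builtin)
```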

However, when I run sample.describe(), the weights are not taken into account:

sample.describe()

           score   population
count   20.00000    20.000000
mean   585.50000   825.550000
std      5.91608    91.465539
min    576.00000   705.000000
25%    580.75000   745.750000
50%    585.50000   815.500000
75%    590.25000   878.500000
max    595.00000  1018.000000

How can I get the weights included in sample.describe()?

Answer 1

Score: 1

You need a custom function. Because the weighted average is a scalar, the new row gets the same value in every column:

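def describe(df, stats):
    # Start from the regular, unweighted summary
    d = df.describe()
    # Add a row labelled `stats` with the weighted average of the scores;
    # the scalar is broadcast to every column
    d.loc[stats] = np.average(df['score'], weights=df['population'])
    return d

out = describe(sample, 'wa')
print(out)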
            score   population
count   20.000000    20.000000
mean   585.500000   825.550000
std      5.916080    91.465539
min    576.000000   705.000000
25%    580.750000   745.750000
50%    585.500000   815.500000
75%    590.250000   878.500000
max    595.000000  1018.000000
wa     584.906244   584.906244
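
The function above only appends a weighted mean to the unweighted summary. If the other rows should respect the weights too, one possible approach is sketched below. It is not part of the original answer and not a pandas feature: weighted_describe is a hypothetical helper, and it rests on two assumptions that should be checked against your use case. First, the weights are treated as frequency counts, so count is the total weight and std divides by (total weight - 1). Second, each quantile is taken as the smallest value whose cumulative weight share reaches q, with no interpolation, so results can differ slightly from the interpolated quantiles describe() reports.

```python
import numpy as np
import pandas as pd


def weighted_describe(values, weights, quantiles=(0.25, 0.5, 0.75)):
    """Summary statistics for `values`, treating `weights` as frequency counts.

    Hypothetical helper for illustration; not a pandas API.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Sort values (and their weights) so cumulative weights make sense
    order = np.argsort(x)
    x, w = x[order], w[order]

    total = w.sum()
    mean = np.average(x, weights=w)
    # Frequency-weighted sample variance (assumption: ddof=1 on total weight)
    var = (w * (x - mean) ** 2).sum() / (total - 1)

    # Share of the total weight at or below each sorted value
    cum_share = np.cumsum(w) / total

    stats = {'count': total, 'mean': mean, 'std': np.sqrt(var), 'min': x[0]}
    for q in quantiles:
        # First position where the cumulative share reaches q
        idx = min(np.searchsorted(cum_share, q), len(x) - 1)
        stats[f'{q:.0%}'] = x[idx]
    stats['max'] = x[-1]
    return pd.Series(stats, name=getattr(values, 'name', None))


sample = pd.DataFrame({
    'score': [595, 594, 593, 592, 591, 590, 589, 588, 587, 586,
              585, 584, 583, 582, 581, 580, 579, 578, 577, 576],
    'population': [705, 745, 716, 742, 722, 746, 796, 750, 816, 809,
                   815, 821, 820, 865, 876, 886, 947, 949, 1018, 967]})

result = weighted_describe(sample['score'], sample['population'])
print(result)
```

Under the frequency-count assumption, the mean and std should match what you would get by repeating each score population times with np.repeat and summarizing the expanded array, which is a convenient way to validate the helper on small data.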

huangapple
  • Published on July 13, 2023 at 18:16:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76678275.html