Finding mean/SD of a group of population and mean/SD of remaining population within a data frame

huangapple go评论65阅读模式
英文:

Finding mean/SD of a group of population and mean/SD of remaining population within a data frame

问题

我有一个看起来像这样的pandas数据框架:

id  age  weight  group
1    12    45    [10-20]
1    18    110   [10-20]
1    25    25    [20-30]
1    29    85    [20-30]
1    32    49    [30-40]
1    31    70    [30-40]
1    37    39    [30-40] 

我正在寻找一个看起来像这样的数据框架:(sd=标准差)

  group   group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
 [10-20]                       
 [20-30] 
 [30-40]
英文:

I have a pandas data frame that looks like this:

id  age  weight  group
1    12    45    [10-20]
1    18    110   [10-20]
1    25    25    [20-30]
1    29    85    [20-30]
1    32    49    [30-40]
1    31    70    [30-40]
1    37    39    [30-40] 

I am looking for a data frame that would look like this: (sd=standard deviation)

  group   group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
 [10-20]                       
 [20-30] 
 [30-40] 

Here the second/third columns are mean and SD for that group. columns third and fourth are mean and SD for the rest of the groups combined.

答案1

得分: 1

以下是代码的翻译部分,如您所请求,不包含其他内容:

这是一种方法来做这件事
```python
res = df.group.to_frame().groupby('group').count()
for group in res.index:
    mask = df.group == group
    srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
    res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
        srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
res = res.reset_index()

输出:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
0  [10-20]          77.500000        45.961941             53.60       24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596

另一种获得相同结果的方法是:

res = ( pd.DataFrame(
    df.group.drop_duplicates().to_frame()
        .apply(lambda x: [
            df.loc[df.group==x.group,'weight'].mean(), 
            df.loc[df.group==x.group,'weight'].std(), 
            df.loc[df.group!=x.group,'weight'].mean(), 
            df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
        .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
    .reset_index().rename(columns={'index':'group'}) )

输出:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
0  [10-20]          77.500000        45.961941             53.60       24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596

更新:
原帖中问到: "如果我有多个权重列怎么办?如果我有大约10个不同的权重列,我想要所有权重列的标准差怎么办?"

为了说明,我创建了两个权重列(weightweight2),并为每个权重列提供了所有4个聚合值(均值、标准差、其他列的均值和其他列的标准差)。

wgtCols = ['weight','weight2']
res = ( pd.concat([ pd.DataFrame(
    df.group.drop_duplicates().to_frame()
        .apply(lambda x: [
            df.loc[df.group==x.group,wgtCol].mean(), 
            df.loc[df.group==x.group,wgtCol].std(), 
            df.loc[df.group!=x.group,wgtCol].mean(), 
            df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
        .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
    for wgtCol in wgtCols], axis=1)
    .reset_index().rename(columns={'index':'group'}) )

输入:

   id  age  weight  weight2    group
0   1   12      45       55  [10-20]
1   1   18     110      120  [10-20]
2   1   25      25       35  [20-30]
3   1   29      85       95  [20-30]
4   1   32      49       59  [30-40]
5   1   31      70       80  [30-40]
6   1   37      39       49  [30-40]

输出:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight  group_mean_weight2  group_sd_weight2  rest_mean_weight2  rest_sd_weight2
0  [10-20]          77.500000        45.961941             53.60       24.016661           87.500000         45.961941              63.60        24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411           65.000000         42.426407              72.60        28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596           62.666667         15.821926              76.25        38.378596
英文:

Here's a way to do it:

res = df.group.to_frame().groupby('group').count()
for group in res.index:
    mask = df.group==group
    srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
    res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
        srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
res = res.reset_index()

Output:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
0  [10-20]          77.500000        45.961941             53.60       24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596

An alternative way to get the same result is:

res = ( pd.DataFrame(
    df.group.drop_duplicates().to_frame()
        .apply(lambda x: [
            df.loc[df.group==x.group,'weight'].mean(), 
            df.loc[df.group==x.group,'weight'].std(), 
            df.loc[df.group!=x.group,'weight'].mean(), 
            df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
        .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
    .reset_index().rename(columns={'index':'group'}) )

Output:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
0  [10-20]          77.500000        45.961941             53.60       24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596

UPDATE:
OP asked in a comment: "what if I have more than one weight column? what if I have around 10 different weight columns and I want sd for all weight columns?"

To illustrate below, I have created two weight columns (weight and weight2) and have simply provided all 4 aggregates (mean, sd, mean of other, sd of other) for each weight column.

wgtCols = ['weight','weight2']
res = ( pd.concat([ pd.DataFrame(
    df.group.drop_duplicates().to_frame()
        .apply(lambda x: [
            df.loc[df.group==x.group,wgtCol].mean(), 
            df.loc[df.group==x.group,wgtCol].std(), 
            df.loc[df.group!=x.group,wgtCol].mean(), 
            df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
        .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
    for wgtCol in wgtCols], axis=1)
    .reset_index().rename(columns={'index':'group'}) )

Input:

   id  age  weight  weight2    group
0   1   12      45       55  [10-20]
1   1   18     110      120  [10-20]
2   1   25      25       35  [20-30]
3   1   29      85       95  [20-30]
4   1   32      49       59  [30-40]
5   1   31      70       80  [30-40]
6   1   37      39       49  [30-40]

Output:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight  group_mean_weight2  group_sd_weight2  rest_mean_weight2  rest_sd_weight2
0  [10-20]          77.500000        45.961941             53.60       24.016661           87.500000         45.961941              63.60        24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411           65.000000         42.426407              72.60        28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596           62.666667         15.821926              76.25        38.378596

huangapple
  • 本文由 发表于 2023年2月14日 07:44:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/75442206.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定