Finding mean/SD of a group of population and mean/SD of remaining population within a data frame

huangapple go评论86阅读模式
英文:

Finding mean/SD of a group of population and mean/SD of remaining population within a data frame

问题

我有一个看起来像这样的pandas数据框架:

  1. id age weight group
  2. 1 12 45 [10-20]
  3. 1 18 110 [10-20]
  4. 1 25 25 [20-30]
  5. 1 29 85 [20-30]
  6. 1 32 49 [30-40]
  7. 1 31 70 [30-40]
  8. 1 37 39 [30-40]

我正在寻找一个看起来像这样的数据框架:(sd=标准差)

  1. group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
  2. [10-20]
  3. [20-30]
  4. [30-40]
英文:

I have a pandas data frame that looks like this:

  1. id age weight group
  2. 1 12 45 [10-20]
  3. 1 18 110 [10-20]
  4. 1 25 25 [20-30]
  5. 1 29 85 [20-30]
  6. 1 32 49 [30-40]
  7. 1 31 70 [30-40]
  8. 1 37 39 [30-40]

I am looking for a data frame that would look like this: (sd=standard deviation)

  1. group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
  2. [10-20]
  3. [20-30]
  4. [30-40]

Here the second/third columns are mean and SD for that group. columns third and fourth are mean and SD for the rest of the groups combined.

答案1

得分: 1

以下是代码的翻译部分,如您所请求,不包含其他内容:

  1. 这是一种方法来做这件事
  2. ```python
  3. res = df.group.to_frame().groupby('group').count()
  4. for group in res.index:
  5. mask = df.group == group
  6. srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
  7. res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
  8. srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
  9. res = res.reset_index()

输出:

  1. group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
  2. 0 [10-20] 77.500000 45.961941 53.60 24.016661
  3. 1 [20-30] 55.000000 42.426407 62.60 28.953411
  4. 2 [30-40] 52.666667 15.821926 66.25 38.378596

另一种获得相同结果的方法是:

  1. res = ( pd.DataFrame(
  2. df.group.drop_duplicates().to_frame()
  3. .apply(lambda x: [
  4. df.loc[df.group==x.group,'weight'].mean(),
  5. df.loc[df.group==x.group,'weight'].std(),
  6. df.loc[df.group!=x.group,'weight'].mean(),
  7. df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
  8. .to_numpy(),
  9. index=list(df.group.drop_duplicates()),
  10. columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
  11. .reset_index().rename(columns={'index':'group'}) )

输出:

  1. group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
  2. 0 [10-20] 77.500000 45.961941 53.60 24.016661
  3. 1 [20-30] 55.000000 42.426407 62.60 28.953411
  4. 2 [30-40] 52.666667 15.821926 66.25 38.378596

更新:
原帖中问到: "如果我有多个权重列怎么办?如果我有大约10个不同的权重列,我想要所有权重列的标准差怎么办?"

为了说明,我创建了两个权重列(weightweight2),并为每个权重列提供了所有4个聚合值(均值、标准差、其他列的均值和其他列的标准差)。

  1. wgtCols = ['weight','weight2']
  2. res = ( pd.concat([ pd.DataFrame(
  3. df.group.drop_duplicates().to_frame()
  4. .apply(lambda x: [
  5. df.loc[df.group==x.group,wgtCol].mean(),
  6. df.loc[df.group==x.group,wgtCol].std(),
  7. df.loc[df.group!=x.group,wgtCol].mean(),
  8. df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
  9. .to_numpy(),
  10. index=list(df.group.drop_duplicates()),
  11. columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
  12. for wgtCol in wgtCols], axis=1)
  13. .reset_index().rename(columns={'index':'group'}) )

输入:

  1. id age weight weight2 group
  2. 0 1 12 45 55 [10-20]
  3. 1 1 18 110 120 [10-20]
  4. 2 1 25 25 35 [20-30]
  5. 3 1 29 85 95 [20-30]
  6. 4 1 32 49 59 [30-40]
  7. 5 1 31 70 80 [30-40]
  8. 6 1 37 39 49 [30-40]

输出:

  1. group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight group_mean_weight2 group_sd_weight2 rest_mean_weight2 rest_sd_weight2
  2. 0 [10-20] 77.500000 45.961941 53.60 24.016661 87.500000 45.961941 63.60 24.016661
  3. 1 [20-30] 55.000000 42.426407 62.60 28.953411 65.000000 42.426407 72.60 28.953411
  4. 2 [30-40] 52.666667 15.821926 66.25 38.378596 62.666667 15.821926 76.25 38.378596
英文:

Here's a way to do it:

  1. res = df.group.to_frame().groupby('group').count()
  2. for group in res.index:
  3. mask = df.group==group
  4. srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
  5. res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
  6. srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
  7. res = res.reset_index()

Output:

  1. group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
  2. 0 [10-20] 77.500000 45.961941 53.60 24.016661
  3. 1 [20-30] 55.000000 42.426407 62.60 28.953411
  4. 2 [30-40] 52.666667 15.821926 66.25 38.378596

An alternative way to get the same result is:

  1. res = ( pd.DataFrame(
  2. df.group.drop_duplicates().to_frame()
  3. .apply(lambda x: [
  4. df.loc[df.group==x.group,'weight'].mean(),
  5. df.loc[df.group==x.group,'weight'].std(),
  6. df.loc[df.group!=x.group,'weight'].mean(),
  7. df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
  8. .to_numpy(),
  9. index=list(df.group.drop_duplicates()),
  10. columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
  11. .reset_index().rename(columns={'index':'group'}) )

Output:

  1. group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
  2. 0 [10-20] 77.500000 45.961941 53.60 24.016661
  3. 1 [20-30] 55.000000 42.426407 62.60 28.953411
  4. 2 [30-40] 52.666667 15.821926 66.25 38.378596

UPDATE:
OP asked in a comment: "what if I have more than one weight column? what if I have around 10 different weight columns and I want sd for all weight columns?"

To illustrate below, I have created two weight columns (weight and weight2) and have simply provided all 4 aggregates (mean, sd, mean of other, sd of other) for each weight column.

  1. wgtCols = ['weight','weight2']
  2. res = ( pd.concat([ pd.DataFrame(
  3. df.group.drop_duplicates().to_frame()
  4. .apply(lambda x: [
  5. df.loc[df.group==x.group,wgtCol].mean(),
  6. df.loc[df.group==x.group,wgtCol].std(),
  7. df.loc[df.group!=x.group,wgtCol].mean(),
  8. df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
  9. .to_numpy(),
  10. index=list(df.group.drop_duplicates()),
  11. columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
  12. for wgtCol in wgtCols], axis=1)
  13. .reset_index().rename(columns={'index':'group'}) )

Input:

  1. id age weight weight2 group
  2. 0 1 12 45 55 [10-20]
  3. 1 1 18 110 120 [10-20]
  4. 2 1 25 25 35 [20-30]
  5. 3 1 29 85 95 [20-30]
  6. 4 1 32 49 59 [30-40]
  7. 5 1 31 70 80 [30-40]
  8. 6 1 37 39 49 [30-40]

Output:

  1. group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight group_mean_weight2 group_sd_weight2 rest_mean_weight2 rest_sd_weight2
  2. 0 [10-20] 77.500000 45.961941 53.60 24.016661 87.500000 45.961941 63.60 24.016661
  3. 1 [20-30] 55.000000 42.426407 62.60 28.953411 65.000000 42.426407 72.60 28.953411
  4. 2 [30-40] 52.666667 15.821926 66.25 38.378596 62.666667 15.821926 76.25 38.378596

huangapple
  • 本文由 发表于 2023年2月14日 07:44:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/75442206.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定