February 14, 2023, 07:44:48

Finding mean/SD of a group of population and mean/SD of remaining population within a data frame

Question

I have a pandas data frame that looks like this:

id  age  weight  group
1    12    45    [10-20]
1    18    110   [10-20]
1    25    25    [20-30]
1    29    85    [20-30]
1    32    49    [30-40]
1    31    70    [30-40]
1    37    39    [30-40]

I am looking for a data frame that would look like this: (sd=standard deviation)

  group   group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
 [10-20]                       
 [20-30] 
 [30-40]

Here the second and third columns are the mean and SD of weight for that group. The fourth and fifth columns are the mean and SD of weight for all the other groups combined.

Answer 1

Score: 1

Here's a way to do it:

# Start from an empty frame indexed by the unique group labels.
res = df.group.to_frame().groupby('group').count()
for group in res.index:
    mask = df.group == group
    # Weights inside the group vs. weights in all other groups combined.
    srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
    res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
        srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
res = res.reset_index()

Output:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
0  [10-20]          77.500000        45.961941             53.60       24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596
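
The loop above assumes `df` already exists. As a minimal sketch for reproducing the numbers, the sample frame from the question can be rebuilt and the mask/complement step wrapped in a small helper. The helper name `group_vs_rest` is hypothetical and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1] * 7,
    "age": [12, 18, 25, 29, 32, 31, 37],
    "weight": [45, 110, 25, 85, 49, 70, 39],
    "group": ["[10-20]", "[10-20]", "[20-30]", "[20-30]",
              "[30-40]", "[30-40]", "[30-40]"],
})


def group_vs_rest(frame, group, col="weight"):
    """Return (group mean, group SD, rest mean, rest SD) for one group label.

    Hypothetical helper for illustration only.
    """
    mask = frame["group"] == group
    inside, outside = frame.loc[mask, col], frame.loc[~mask, col]
    # Series.std() defaults to ddof=1, i.e. the sample standard deviation.
    return inside.mean(), inside.std(), outside.mean(), outside.std()


stats = group_vs_rest(df, "[10-20]")
print(stats)
```

Because `Series.std()` uses `ddof=1` by default, the SD columns in the output are sample standard deviations; pass `ddof=0` if the population SD is wanted instead.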

An alternative way to get the same result is:

import pandas as pd

# One row per unique group; each row holds [group mean, group SD, rest mean, rest SD].
res = ( pd.DataFrame(
    df.group.drop_duplicates().to_frame()
        .apply(lambda x: [
            df.loc[df.group==x.group,'weight'].mean(), 
            df.loc[df.group==x.group,'weight'].std(), 
            df.loc[df.group!=x.group,'weight'].mean(), 
            df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
        .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
    .reset_index().rename(columns={'index':'group'}) )

Output:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
0  [10-20]          77.500000        45.961941             53.60       24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596
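
Both versions above rescan `df` once per group. A different technique, not from the original answer, is to aggregate each group's count, sum and sum of squares once, then obtain the "rest" statistics by subtracting those from the grand totals. The sketch below assumes the sample data from the question; the names `g`, `rn`, `rs`, `rss` and `sample_sd` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (only the columns needed here).
df = pd.DataFrame({
    "weight": [45, 110, 25, 85, 49, 70, 39],
    "group": ["[10-20]", "[10-20]", "[20-30]", "[20-30]",
              "[30-40]", "[30-40]", "[30-40]"],
})

# Per-group count, sum and sum of squares of weight.
g = (df.assign(sq=df["weight"] ** 2)
       .groupby("group")
       .agg(n=("weight", "count"), s=("weight", "sum"), ss=("sq", "sum")))

# Totals for "everything except this group": grand total minus group total.
rn = g["n"].sum() - g["n"]
rs = g["s"].sum() - g["s"]
rss = g["ss"].sum() - g["ss"]


def sample_sd(n, s, ss):
    # Sample variance from running sums: (ss - s^2 / n) / (n - 1).
    return np.sqrt((ss - s ** 2 / n) / (n - 1))


res = pd.DataFrame({
    "group_mean_weight": g["s"] / g["n"],
    "group_sd_weight": sample_sd(g["n"], g["s"], g["ss"]),
    "rest_mean_weight": rs / rn,
    "rest_sd_weight": sample_sd(rn, rs, rss),
}).reset_index()
print(res)
```

The sum-of-squares formula can lose precision when values are large relative to their spread, so the mask-based versions remain the safer choice when accuracy matters more than speed.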

UPDATE:
OP asked in a comment: "what if I have more than one weight column? what if I have around 10 different weight columns and I want sd for all weight columns?"

To illustrate below, I have created two weight columns (weight and weight2) and have simply provided all 4 aggregates (mean, sd, mean of other, sd of other) for each weight column.

wgtCols = ['weight','weight2']
res = ( pd.concat([ pd.DataFrame(
    df.group.drop_duplicates().to_frame()
        .apply(lambda x: [
            df.loc[df.group==x.group,wgtCol].mean(), 
            df.loc[df.group==x.group,wgtCol].std(), 
            df.loc[df.group!=x.group,wgtCol].mean(), 
            df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
        .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
    for wgtCol in wgtCols], axis=1)
    .reset_index().rename(columns={'index':'group'}) )

Input:

   id  age  weight  weight2    group
0   1   12      45       55  [10-20]
1   1   18     110      120  [10-20]
2   1   25      25       35  [20-30]
3   1   29      85       95  [20-30]
4   1   32      49       59  [30-40]
5   1   31      70       80  [30-40]
6   1   37      39       49  [30-40]

Output:

     group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight  group_mean_weight2  group_sd_weight2  rest_mean_weight2  rest_sd_weight2
0  [10-20]          77.500000        45.961941             53.60       24.016661           87.500000         45.961941              63.60        24.016661
1  [20-30]          55.000000        42.426407             62.60       28.953411           65.000000         42.426407              72.60        28.953411
2  [30-40]          52.666667        15.821926             66.25       38.378596           62.666667         15.821926              76.25        38.378596
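
With around 10 weight columns, the nested expression above can get hard to read. One possible restructuring, offered as a sketch and not as the answerer's code, builds a single mask per group and aggregates all weight columns at once with `DataFrame.agg`. The variable names `wgt_cols`, `rows`, `inside` and `outside` are illustrative.

```python
import pandas as pd

# Two-weight-column input from the update above.
df = pd.DataFrame({
    "weight": [45, 110, 25, 85, 49, 70, 39],
    "weight2": [55, 120, 35, 95, 59, 80, 49],
    "group": ["[10-20]", "[10-20]", "[20-30]", "[20-30]",
              "[30-40]", "[30-40]", "[30-40]"],
})

wgt_cols = ["weight", "weight2"]  # extend this list for more weight columns
rows = []
for group in df["group"].drop_duplicates():
    mask = df["group"] == group
    # Each agg result has rows "mean"/"std" and one column per weight column.
    inside = df.loc[mask, wgt_cols].agg(["mean", "std"])
    outside = df.loc[~mask, wgt_cols].agg(["mean", "std"])
    row = {"group": group}
    for col in wgt_cols:
        row[f"group_mean_{col}"] = inside.loc["mean", col]
        row[f"group_sd_{col}"] = inside.loc["std", col]
        row[f"rest_mean_{col}"] = outside.loc["mean", col]
        row[f"rest_sd_{col}"] = outside.loc["std", col]
    rows.append(row)

res = pd.DataFrame(rows)
print(res)
```

The column order matches the output shown above: the four aggregates for `weight`, followed by the same four for `weight2`.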
