Posted: 2023-02-18 04:42:09

Creating sub columns in Pandas Dataframes for Summary Statistics

Question

I am working with water quality data for both surface water locations and groundwater well locations. I would like to create a summary statistics table for all three of my parameters (pH, Temp, salinity) grouped by the location the samples were taken from (surface water vs. Groundwater) as shown below:

     |       'Surface Water'       |        'Groundwater'        |
     |  min |  max |  mean |  std  |  min |  max |  mean |  std  |
'pH' |

The way I set up my Excel Sheet for data collection includes the following columns: Date, Monitoring ID (Either Surface Water or Groundwater), pH, Temp, and Salinity.

How can I tell Python to do this? I am familiar with groupby and the describe() function, but I don't know how to organize the output the way I want. Any help would be appreciated!

I have tried using the groupby function for each descriptive statistic, for example:


mean = df.\
    groupby('Monitoring ID')\
    [['pH', 'SAL (ppt)', 'Temperature (°C)', 'DO (mg/L)']].mean()
min = df.\
    groupby('Monitoring ID')\
    [['pH', 'SAL (ppt)', 'Temperature (°C)', 'DO (mg/L)']].min()

and so on, but I don't know how to combine them all into one nice table.

Answer 1

Score: 1

You can use groupby and describe as you suggest, then stack the statistics and transpose the result:

metrics = ['count', 'mean', 'std', 'min', 'max']
out = df.groupby('Monitoring ID').describe().stack().T.loc[:, (slice(None), metrics)]

>>> out
Monitoring ID    Groundwater                                   Surface Water
                       count       mean       std   min    max         count       mean       std   min    max
pH                     159.0   6.979182  0.587316  6.00   7.98         141.0   6.991135  0.564097  6.00   7.99
SAL (ppt)              159.0   1.976226  0.577557  1.02   2.99         141.0   1.917589  0.576650  1.01   2.99
Temperature (°C)       159.0  13.466101  4.805317  4.13  21.78         141.0  13.099645  4.989240  4.03  21.61
DO (mg/L)              159.0   1.984277  0.609071  1.00   2.99         141.0   1.939433  0.577651  1.00   2.96
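
If only min, max, mean and std are wanted, in the order sketched in the question, a variation on the same idea is to aggregate just those statistics and then move the group level to the top of the columns. The following is a minimal sketch, not part of the original answer: the small DataFrame is made-up sample data, and the names per_group, wanted and summary are placeholders chosen for illustration.

```python
import pandas as pd

# Made-up sample data standing in for the real spreadsheet (assumption)
df = pd.DataFrame({
    "Monitoring ID": ["Surface Water", "Surface Water", "Surface Water",
                      "Groundwater", "Groundwater", "Groundwater"],
    "pH": [7.1, 7.2, 7.5, 7.8, 7.6, 7.4],
    "SAL (ppt)": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "Temperature (°C)": [10.0, 12.0, 9.0, 15.0, 13.0, 14.0],
})

params = ["pH", "SAL (ppt)", "Temperature (°C)"]
groups = ["Surface Water", "Groundwater"]
stats = ["min", "max", "mean", "std"]

# One row per group; columns are (parameter, statistic) pairs
per_group = df.groupby("Monitoring ID")[params].agg(stats)

# Transpose so rows are (parameter, statistic), then move the statistic
# level into the columns: rows become parameters, columns (group, statistic)
summary = per_group.T.unstack(level=1)

# unstack sorts labels alphabetically, so restore the requested order
wanted = pd.MultiIndex.from_product([groups, stats])
summary = summary.reindex(index=params, columns=wanted)

print(summary)
```

Reindexing against an explicit MultiIndex is used here so that both the group order and the statistic order are stated in one place; any other ordering can be obtained by changing the groups and stats lists.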

Answer 2

Score: 0

You can use agg along with groupby:

import pandas as pd
# Sample data
data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-01', '2022-01-02', '2022-01-03'],
        'Monitoring ID': ['Surface Water', 'Surface Water', 'Surface Water', 'Groundwater', 'Groundwater', 'Groundwater'],
        'pH': [7.1, 7.2, 7.5, 7.8, 7.6, 7.4],
        'Temp': [10, 12, 9, 15, 13, 14],
        'Salinity': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
df = pd.DataFrame(data)
# Group by 'Monitoring ID' and calculate summary statistics
summary_stats = df.groupby('Monitoring ID').agg({'pH': ['min', 'max', 'mean', 'std'],
                                                'Temp': ['min', 'max', 'mean', 'std'],
                                                'Salinity': ['min', 'max', 'mean', 'std']})
# Flatten the two-level columns into single names such as 'pH_min'
summary_stats.columns = ['_'.join(col).strip() for col in summary_stats.columns.values]
# Summary table
print(summary_stats)

Pardon me, I'm still trying to figure out how to show the output of the code here, but I hope this helps.
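
To illustrate the renaming step in the code above: after agg, the columns form a two-level MultiIndex of (parameter, statistic) pairs, and joining each pair with '_' produces flat names. The sketch below is an illustration rather than the answerer's own output; it uses a cut-down, made-up data set, and the names before and after are placeholders.

```python
import pandas as pd

# Cut-down, made-up sample data (assumption)
df = pd.DataFrame({
    "Monitoring ID": ["Surface Water", "Surface Water", "Groundwater", "Groundwater"],
    "pH": [7.1, 7.5, 7.8, 7.4],
})

summary_stats = df.groupby("Monitoring ID").agg({"pH": ["min", "max"]})

# Column labels before flattening: tuples of (parameter, statistic)
before = list(summary_stats.columns)   # [('pH', 'min'), ('pH', 'max')]

# Join each tuple into a single string label
summary_stats.columns = ["_".join(col) for col in summary_stats.columns]
after = list(summary_stats.columns)    # ['pH_min', 'pH_max']

print(before)
print(after)
print(summary_stats)
```

Note that this layout keeps one row per monitoring group, with the statistics spread across flat columns, so it differs from the grouped-header table sketched in the question.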
