January 9, 2023, 17:42:35


Condition to group rows in dataframes python into an interval

Question


I managed to extract my data into a DataFrame, but I don't know how to detect values that fall in the same range and average them.

Here is a simplified version of my data:

import pandas as pd

matrix = [[0, 11, 0], [0.99, 25, 1], [1.99, 17, 1], [0.1, 6, 0], [0.88, 9, 1]]

data = pd.DataFrame(matrix, columns=['distance in mm', 'Lithium', 'Calcium'])

Then I want to group, in a separate DataFrame, the rows whose distances lie within +/- 0.2 mm of each other, and calculate the mean and standard deviation of each group.

I would like output similar to this:

Should I create a dictionary to hold the mean and standard deviation values?

In the DataFrame called data, the first column is named 'distance in mm'. I would like to assign each row to an interval and merge the rows that share an interval by calculating their mean and standard deviation. I tried:

data['distance_bins'] = pd.cut(data['distance in mm'], np.arange(0, data['distance in mm'].max() + 0.2, 0.2))

The idea is to group the data (analytical replicates) by relative position on the sample rather than by label.

Then I tried: groups = data.groupby('distance_bins'). But after that I'm completely lost, since I am not familiar with the object that groupby creates (<class 'pandas.core.series.Series'>).
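For reference, the objects involved at this point can be inspected with a short sketch. It reuses the sample data and the pd.cut call from the question; the observed=True argument and the loop are additions for illustration, not part of the original attempt. It suggests that the Series type belongs to the binned column returned by pd.cut, while groupby returns a DataFrameGroupBy object that can be iterated as (interval, sub-DataFrame) pairs. Note also that pd.cut uses right-closed bins by default, so the row with distance 0 falls outside the first bin (0.0, 0.2] and is left as NaN.

```python
import numpy as np
import pandas as pd

matrix = [[0, 11, 0], [0.99, 25, 1], [1.99, 17, 1], [0.1, 6, 0], [0.88, 9, 1]]
data = pd.DataFrame(matrix, columns=['distance in mm', 'Lithium', 'Calcium'])

# pd.cut returns a Series of categorical intervals, one entry per row.
edges = np.arange(0, data['distance in mm'].max() + 0.2, 0.2)
data['distance_bins'] = pd.cut(data['distance in mm'], edges)
print(type(data['distance_bins']).__name__)  # Series

# groupby returns a lazy grouping object, not a Series or DataFrame.
# observed=True (an assumption added here) skips bins that contain no rows.
groups = data.groupby('distance_bins', observed=True)
print(type(groups).__name__)  # DataFrameGroupBy

# Iterating yields one (interval, sub-DataFrame) pair per non-empty bin.
for interval, rows in groups:
    print(interval, rows['Lithium'].tolist())

# Rows whose distance fell outside every bin are NaN and are dropped by groupby.
print(data['distance_bins'].isna().sum())
```

Nothing is computed until an aggregation such as .mean(), .std(), or .agg() is called on the grouping object.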

Answer 1

Score: 1


Try this:

First, create your bins from the minimum and maximum values of the first column.
Next, use pd.cut to cut that column into the bins. In this case each bin includes its left edge and excludes its right edge: [left, right).
Then, define your aggregations in a dictionary and group your data by the groups defined above, passing the aggregations to the .agg() method.
Finally (optionally), filter out the rows that have NaN values in the first column. You can also rename the index for readability.

import numpy as np
import pandas as pd

# Create bins of width 0.4 (+/- 0.2 around each bin centre)
mini = data['distance in mm'].min()
maxi = data['distance in mm'].max()
bins = np.arange(mini, maxi + 0.4, 0.4)

# Cut into groups; right=False gives half-open bins [left, right)
groups = pd.cut(data['distance in mm'], bins, right=False)

# Aggregate using groupby
aggr = {'distance in mm': ['mean', 'std'],
        'Lithium': ['mean', 'std'],
        'Calcium': ['mean', 'std']}
# observed=False keeps empty bins as NaN rows, which the filter below removes
grouped = data.groupby(groups, observed=False).agg(aggr)

# Filter out rows with NaNs in the first column (OPTIONAL)
filtered = grouped[grouped[('distance in mm', 'mean')].notna()]  # <-- Optional
filtered.index.name = 'bins'                                      # <-- Optional

print(filtered)

           distance in mm           Lithium            Calcium     
                     mean       std    mean        std    mean  std
bins                                                               
[0.0, 0.4)          0.050  0.070711     8.5   3.535534     0.0  0.0
[0.8, 1.2)          0.935  0.077782    17.0  11.313708     1.0  0.0
[1.6, 2.0)          1.990       NaN    17.0        NaN     1.0  NaN
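The result above has two-level column labels, which can be awkward to work with afterwards. As a minimal sketch (not part of the original answer), the levels could be joined into single names such as 'Lithium_mean' and the bins turned into an ordinary column. The names summary and flat, the underscore separator, and the use of observed=True to skip empty bins are illustrative assumptions.

```python
import numpy as np
import pandas as pd

matrix = [[0, 11, 0], [0.99, 25, 1], [1.99, 17, 1], [0.1, 6, 0], [0.88, 9, 1]]
data = pd.DataFrame(matrix, columns=['distance in mm', 'Lithium', 'Calcium'])

# Same binning as in the answer: half-open bins of width 0.4.
bins = np.arange(data['distance in mm'].min(),
                 data['distance in mm'].max() + 0.4, 0.4)
groups = pd.cut(data['distance in mm'], bins, right=False)

aggr = {'distance in mm': ['mean', 'std'],
        'Lithium': ['mean', 'std'],
        'Calcium': ['mean', 'std']}

# observed=True keeps only the bins that contain rows (assumption:
# this replaces the optional NaN filter used in the answer).
summary = data.groupby(groups, observed=True).agg(aggr)

# A single statistic is addressed with a (column, statistic) tuple.
print(summary[('Lithium', 'mean')])

# Flatten ('Lithium', 'mean') into 'Lithium_mean', and so on.
flat = summary.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
flat.index.name = 'bins'
flat = flat.reset_index()

print(flat)
```

With flat column names, individual statistics can be selected directly, for example flat['Lithium_std'], so no separate dictionary of means and standard deviations is needed.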
