Condition to group rows of a Python dataframe into intervals
Question
I managed to extract my data into a dataframe, but I don't know how to detect values that fall within the same range and average them.
Here is a simplified version of my data:
import numpy as np
import pandas as pd
matrix = [[0, 11, 0], [0.99, 25, 1], [1.99, 17, 1], [0.1, 6, 0], [0.88, 9, 1]]
data = pd.DataFrame(matrix, columns=['distance in mm', 'Lithium', 'Calcium'])
I then want to group, in another dataframe, the rows whose distances lie within +/- 0.2 mm of each other, and calculate the mean and standard deviation of each group.
I would like an output similar to this:
Should I create a dictionary that will contain the mean and standard deviation values?
In the dataframe called data, the first column is named 'distance in mm'. I would like to assign each row to an interval and merge the rows that share the same interval by calculating their mean and standard deviation. I tried:
data['distance_bins'] = pd.cut(data['distance in mm'], np.arange(0, data['distance in mm'].max() + 0.2, 0.2))
The idea is to group the data (analytical replicates) by relative position on the sample rather than by label.
Then I tried:
groups = data.groupby('distance_bins')
But after that I am completely lost, since I am not familiar with the object created by the groupby function (<class 'pandas.core.series.Series'>).
Answer 1
Score: 1
Try this:
- Create your bins based on the minimum and maximum values in the first column.
- Next, use pd.cut to cut your column based on the bins. In this case, you are including the left edge and excluding the right edge: [left, right).
- Then, define your aggregations in a dictionary and group your data by the groups defined above, passing the aggregations to the .agg() method.
- Finally (optionally), filter out the rows with NaN values based on the first column. You can also rename the index for readability.
import numpy as np
import pandas as pd

# Create bins of width 0.4 mm (+/- 0.2 mm)
mini = data['distance in mm'].min()
maxi = data['distance in mm'].max()
bins = np.arange(mini, maxi + 0.4, 0.4)

# Cut into groups; right=False makes each bin [left, right)
groups = pd.cut(data['distance in mm'], bins, right=False)

# Aggregate using groupby
aggr = {'distance in mm': ['mean', 'std'],
        'Lithium': ['mean', 'std'],
        'Calcium': ['mean', 'std']}
grouped = data.groupby(groups).agg(aggr)

# Filter out empty bins, which show up as NaN in the first column (OPTIONAL)
filtered = grouped[grouped[('distance in mm', 'mean')].notna()]  # <-- Optional
filtered.index.name = 'bins'  # <-- Optional
print(filtered)
           distance in mm             Lithium             Calcium
                     mean       std    mean        std    mean  std
bins
[0.0, 0.4)          0.050  0.070711     8.5   3.535534     0.0  0.0
[0.8, 1.2)          0.935  0.077782    17.0  11.313708     1.0  0.0
[1.6, 2.0)          1.990       NaN    17.0        NaN     1.0  NaN
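Since the question mentions being unsure what groupby returns, here is a minimal self-contained sketch of the same approach that also inspects the grouped object. It reuses the sample data and the 0.4 mm bins from above. The names low, high and summary are illustrative, and passing observed=True (an assumption about what is wanted here) drops empty bins up front instead of filtering NaN rows afterwards.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
matrix = [[0, 11, 0], [0.99, 25, 1], [1.99, 17, 1], [0.1, 6, 0], [0.88, 9, 1]]
data = pd.DataFrame(matrix, columns=['distance in mm', 'Lithium', 'Calcium'])

# Same 0.4 mm wide, left-closed bins as in the answer.
low = data['distance in mm'].min()
high = data['distance in mm'].max()
bins = np.arange(low, high + 0.4, 0.4)
groups = pd.cut(data['distance in mm'], bins, right=False)

# observed=True keeps only the bins that actually contain rows.
grouped = data.groupby(groups, observed=True)

# A GroupBy object can be iterated: each item is (bin label, sub-DataFrame).
for interval, rows in grouped:
    print(interval, len(rows))

# Apply the same two aggregations to every column.
summary = grouped.agg(['mean', 'std'])
summary.index.name = 'bins'
print(summary)
```

Note that fixed bins are only an approximation of a +/- 0.2 mm tolerance: two distances closer than 0.2 mm can still land in different bins if they straddle a bin edge.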