Condition to group rows of a pandas DataFrame into an interval

Question

I managed to extract my data into a DataFrame, but I don't know how to detect values that fall in the same range and average them.

Here is a simplified version of my data:

import pandas as pd

matrix = [[0, 11, 0], [0.99, 25, 1], [1.99, 17, 1], [0.1, 6, 0], [0.88, 9, 1]]

data = pd.DataFrame(matrix, columns=['distance in mm', 'Lithium', 'Calcium'])

Then I want to group, in another DataFrame, the rows whose distances lie within +/- 0.2 mm of each other, and calculate the mean and standard deviation of each group.

I would like an output similar to this:

[Image: expected output]

Should I create a dictionary containing the mean and standard deviation values?

In the DataFrame called data, the first column is named 'distance in mm'. I would like to determine the interval of each row and merge the rows that share the same interval by calculating their mean and standard deviation. I tried: data['distance_bins'] = pd.cut(data['distance in mm'], np.arange(0, data['distance in mm'].max() + 0.2, 0.2))

The idea is to group data (analytical replicates) by relative position on a sample rather than by label.

Then I tried: groups = data.groupby('distance_bins'). But after that I'm completely lost, since I am not familiar with the object created by the groupby function: <class 'pandas.core.series.Series'>.
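
For orientation, here is a minimal sketch of what the two calls in the attempt produce, assuming the sample data above and a reasonably recent pandas version. pd.cut returns a Series of interval categories, while DataFrame.groupby returns a DataFrameGroupBy object that can be iterated or aggregated. Passing include_lowest=True is an addition made here: with the default right-closed bins, the row at distance 0 would otherwise fall outside the first bin and be labelled NaN.

```python
import numpy as np
import pandas as pd

matrix = [[0, 11, 0], [0.99, 25, 1], [1.99, 17, 1], [0.1, 6, 0], [0.88, 9, 1]]
data = pd.DataFrame(matrix, columns=['distance in mm', 'Lithium', 'Calcium'])

# pd.cut labels every row with the interval its distance falls into.
edges = np.arange(0, data['distance in mm'].max() + 0.2, 0.2)
bins = pd.cut(data['distance in mm'], edges, include_lowest=True)
print(type(bins))    # a pandas Series with a categorical dtype

# groupby returns a lazy DataFrameGroupBy object; nothing is computed yet.
groups = data.groupby(bins, observed=True)
print(type(groups))

# Iterating yields (interval, sub-DataFrame) pairs, one per non-empty bin.
for interval, rows in groups:
    print(interval, len(rows))

# Aggregating turns the groups back into an ordinary DataFrame.
summary = groups[['Lithium', 'Calcium']].agg(['mean', 'std'])
print(summary)
```

A group that holds a single row has an undefined sample standard deviation, so its std entries come out as NaN.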

Answer 1

Score: 1

Try this:

  1. Create your bins based on the minimum and maximum values in the first column.
  2. Next, use pd.cut to cut your column based on the bins. In this case, each bin includes its left edge and excludes its right edge: [left, right).
  3. Then, define your aggregations in a dictionary and group your data by the groups defined above, passing the aggregations to the .agg() method.
  4. Finally (optionally), filter out the rows with NaN values in the first column. You can also rename the index for readability.
import numpy as np
import pandas as pd

# Create bins 0.4 mm wide (+/- 0.2 mm), starting at the smallest distance
mini = data['distance in mm'].min()
maxi = data['distance in mm'].max()
bins = np.arange(mini, maxi + 0.4, 0.4)

# Cut into groups; right=False makes each bin [left, right)
groups = pd.cut(data['distance in mm'], bins, right=False)

# Aggregate using groupby
aggr = {'distance in mm': ['mean', 'std'],
        'Lithium': ['mean', 'std'],
        'Calcium': ['mean', 'std']}
grouped = data.groupby(groups).agg(aggr)

# Filter out rows with NaNs in the first column, i.e. empty bins (OPTIONAL)
filtered = grouped[grouped[('distance in mm', 'mean')].notna()]  # <-- Optional
filtered.index.name = 'bins'                                     # <-- Optional

print(filtered)
           distance in mm           Lithium            Calcium     
                     mean       std    mean        std    mean  std
bins                                                               
[0.0, 0.4)          0.050  0.070711     8.5   3.535534     0.0  0.0
[0.8, 1.2)          0.935  0.077782    17.0  11.313708     1.0  0.0
[1.6, 2.0)          1.990       NaN    17.0        NaN     1.0  NaN
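
As a possible variation on the code above, groupby can be asked to skip empty bins with observed=True, which makes the NaN filter unnecessary, and named aggregation yields flat column names instead of a two-level header. This is only a sketch, assuming a pandas version that supports named aggregation (0.25 or later); the output column names such as lithium_mean and n_rows are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

matrix = [[0, 11, 0], [0.99, 25, 1], [1.99, 17, 1], [0.1, 6, 0], [0.88, 9, 1]]
data = pd.DataFrame(matrix, columns=['distance in mm', 'Lithium', 'Calcium'])

# Same 0.4 mm wide, left-closed bins as in the answer.
col = data['distance in mm']
edges = np.arange(col.min(), col.max() + 0.4, 0.4)
groups = pd.cut(col, edges, right=False)

# observed=True drops bins that contain no rows, so no NaN filtering is needed.
summary = data.groupby(groups, observed=True).agg(
    distance_mean=('distance in mm', 'mean'),
    lithium_mean=('Lithium', 'mean'),
    lithium_std=('Lithium', 'std'),
    calcium_mean=('Calcium', 'mean'),
    calcium_std=('Calcium', 'std'),
    n_rows=('distance in mm', 'count'),  # rows per bin (hypothetical extra column)
)
summary.index.name = 'bins'
print(summary)
```

The n_rows column is included because a bin with a single row, like the last one here, has no defined sample standard deviation; seeing the count next to the NaN makes that easier to interpret.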

huangapple
  • Posted on January 9, 2023, 17:42:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75055429.html