2023-05-11 00:24:45

Calculating weighted average by sorting and aggregating in a pandas dataframe

Question

I have a sample manufacturing dataset that contains parent batches and output batches of a product. Each parent batch belongs to a specific output batch within the dataset; we know this because the parent batches of an output batch share its process order number, which is a column in the data.

I want to calculate a weighted average for each output batch from its parent batches. That is, I want to go through each output batch, find the parent batches with the same process order number, aggregate their Quantity, use the sum as the denominator, apply the weighted average formula, and store the result in another column named "weighted feature". The other values needed for the weighted function are already in the dataframe, in the Value column.

The function to use is the weighted average, where Qi is Quantity and Qci is Value.
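The formula image did not survive in this copy of the post. Judging from the expected output further down (2.66 for process order 1) and from the answer, it is presumably the standard weighted mean, with Quantity as the weight and Value as the quantity being averaged. Treat the reconstruction below as an assumption rather than the original formula:

```latex
\overline{Qc} = \frac{\sum_i Q_i \, Qc_i}{\sum_i Q_i},
\qquad
\text{e.g. for process order 1:}\quad
\frac{10 \cdot 2 + 20 \cdot 3}{10 + 20} = \frac{80}{30} \approx 2.67
```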

Please have a look at the example diagram below. It covers one specific order number and shows the various parent (input) and output batches, which may help in understanding what I am trying to do.

Here is a sample dataframe:

import pandas as pd
A = pd.DataFrame({'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'], 'Process_Order_Number': [1, 1, 1, 2, 2, 2], 'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'], 'Quantity': [10, 20, 15, 5, 25, 50], 'Value': [2, 3, 1, 4, 0, 1]})

Batch_ID	Process_Order_Number	Batch_type	Quantity	Value
A	1	parent	10	2
B	1	parent	20	3
C	1	output	15	1
D	2	parent	5	4
E	2	parent	25	0
F	2	output	50	1

I wrote a function to calculate the weighted average, where the distributions are Quantity and the weights are Value in the dataset above:

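def weighted_average(distribution, weights):
    # sum(distribution[i] * weights[i]) / sum(weights), rounded to 2 decimals
    return round(sum(d * w for d, w in zip(distribution, weights)) / sum(weights), 2)

# distribution = Quantity, weights = Value
weighted_average(A['Quantity'], A['Value'])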
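As a quick check of this function, here is a minimal sketch that applies it to the two parent batches of process order 1 only. Note that reproducing the expected value seems to require passing Value as `distribution` and Quantity as `weights`; that argument order is my reading of the expected output, not something the post states. The lists `values` and `quantities` are illustrative names.

```python
def weighted_average(distribution, weights):
    # sum(distribution[i] * weights[i]) / sum(weights), rounded to 2 decimals
    return round(sum(d * w for d, w in zip(distribution, weights)) / sum(weights), 2)

# Parent batches A and B of process order 1, taken from the sample dataframe
values = [2, 3]        # Value column
quantities = [10, 20]  # Quantity column

result = weighted_average(values, quantities)
print(result)  # 2.67, i.e. (2*10 + 3*20) / (10 + 20)
```

The exact value is 2.666..., which rounds to 2.67, so the 2.66 shown in the expected output appears to be truncated rather than rounded.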

Next, I tried to aggregate the data using the following methods, but I wasn't able to isolate the specific group:

df1 = A[A.duplicated('Process_Order_Number', keep=False)].sort_values('Process_Order_Number')
df1.head()

A[A.groupby('Process_Order_Number')['Batch_type'].transform('nunique').ne(1)]

These sorted the data, but the result still wasn't in the form shown in the picture above. I am trying to bring the parent batches with the same process order number together and then use my weighted function to calculate the value and store it in another column. The code needs to traverse each process order number and, for every batch of type "output", find the batches of type "parent" so I can apply the weighted function. I am still trying to work out how to combine my weighted function with the sort so I don't have to do it separately. I looked at other Stack Overflow questions but couldn't find one that fits here. Any guidance is much appreciated.
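The traversal described above (for each process order number, collect the parent rows, compute the weighted average, and store it on the output row) can be sketched with an explicit loop over groups. This is only an illustration of the idea, assuming Quantity is the weight and using the column name `Weighted_Average` from the expected output; it is not necessarily the most efficient approach.

```python
import pandas as pd

A = pd.DataFrame({
    'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Process_Order_Number': [1, 1, 1, 2, 2, 2],
    'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'],
    'Quantity': [10, 20, 15, 5, 25, 50],
    'Value': [2, 3, 1, 4, 0, 1],
})

# Start with an empty (NaN) result column
A['Weighted_Average'] = float('nan')

for order, group in A.groupby('Process_Order_Number'):
    parents = group[group['Batch_type'] == 'parent']
    total_qty = parents['Quantity'].sum()
    if total_qty == 0:
        continue  # no parent rows (or zero total quantity): leave NaN
    wavg = (parents['Value'] * parents['Quantity']).sum() / total_qty
    # Write the result only to the output rows of this process order
    is_output = (A['Process_Order_Number'] == order) & (A['Batch_type'] == 'output')
    A.loc[is_output, 'Weighted_Average'] = wavg

print(A)
```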

Expected Output for the first three rows from the sample dataset:

Batch_ID	Process_Order_Number	Batch_type	Quantity	Value	Weighted_Average
A	1	parent	10	2
B	1	parent	20	3
C	1	output			2.66

Answer 1

Score: 1

We can first calculate the weighted average of the parent batches for each Process_Order_Number:

df = A  # the sample dataframe from the question
# Keep only parent rows, then compute sum(Value * Quantity) / sum(Quantity) per process order
mapper = df.loc[df['Batch_type'].eq('parent')]\
           .groupby('Process_Order_Number')\
           .apply(lambda s: (s['Value'] * s['Quantity']).sum() / s['Quantity'].sum())

which yields

Process_Order_Number
1    2.666667
2    0.666667
dtype: float64
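The same `mapper` can also be built without `groupby(...).apply` by aggregating the numerator and the denominator separately. The sketch below assumes `df` is the sample dataframe from the question; the intermediate names `parents`, `num`, and `den` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Process_Order_Number': [1, 1, 1, 2, 2, 2],
    'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'],
    'Quantity': [10, 20, 15, 5, 25, 50],
    'Value': [2, 3, 1, 4, 0, 1],
})

parents = df[df['Batch_type'].eq('parent')]

# Numerator: sum of Value * Quantity per process order
num = (parents['Value'] * parents['Quantity']).groupby(parents['Process_Order_Number']).sum()
# Denominator: total Quantity per process order
den = parents.groupby('Process_Order_Number')['Quantity'].sum()

mapper = num / den
print(mapper)
```

Both series are indexed by Process_Order_Number, so the division aligns them order by order.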

Then assign it to a new column, on the output rows only:

df.loc[df['Batch_type'].eq('output'), 'Weighted_Average'] = df['Process_Order_Number'].map(mapper)

  Batch_ID  Process_Order_Number Batch_type  Quantity  Value  Weighted_Average
0        A                     1     parent        10      2               NaN
1        B                     1     parent        20      3               NaN
2        C                     1     output        15      1          2.666667
3        D                     2     parent         5      4               NaN
4        E                     2     parent        25      0               NaN
5        F                     2     output        50      1          0.666667

If you prefer, you can also blank out the Quantity and Value entries of the output rows, since they are empty in your expected output:

import numpy as np

df.loc[df['Batch_type'].eq('output'), ['Quantity', 'Value']] = np.nan

  Batch_ID  Process_Order_Number Batch_type  Quantity  Value  Weighted_Average
0        A                     1     parent      10.0    2.0               NaN
1        B                     1     parent      20.0    3.0               NaN
2        C                     1     output       NaN    NaN          2.666667
3        D                     2     parent       5.0    4.0               NaN
4        E                     2     parent      25.0    0.0               NaN
5        F                     2     output       NaN    NaN          0.666667
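For reuse, the steps above could be wrapped in a single function that leaves the input dataframe untouched. This is only a sketch: the function name `add_weighted_average` and the `blank_output` flag are hypothetical, and it assumes the column names of the sample dataframe.

```python
import numpy as np
import pandas as pd

def add_weighted_average(df, blank_output=False):
    """Hypothetical helper: return a copy of df with a Weighted_Average column
    filled on output rows from the parent rows of the same process order."""
    out = df.copy()
    parents = out[out['Batch_type'].eq('parent')]
    # Per process order: sum(Value * Quantity) and sum(Quantity) over parent rows
    sums = (
        parents.assign(weighted=parents['Value'] * parents['Quantity'])
               .groupby('Process_Order_Number')[['weighted', 'Quantity']]
               .sum()
    )
    mapper = sums['weighted'] / sums['Quantity']
    is_output = out['Batch_type'].eq('output')
    # Map each row's process order to its weighted average, keep it on output rows only
    out['Weighted_Average'] = out['Process_Order_Number'].map(mapper).where(is_output)
    if blank_output:
        # Optionally blank Quantity and Value on output rows, as in the expected output
        out[['Quantity', 'Value']] = out[['Quantity', 'Value']].astype(float)
        out.loc[is_output, ['Quantity', 'Value']] = np.nan
    return out

df = pd.DataFrame({
    'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Process_Order_Number': [1, 1, 1, 2, 2, 2],
    'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'],
    'Quantity': [10, 20, 15, 5, 25, 50],
    'Value': [2, 3, 1, 4, 0, 1],
})

result = add_weighted_average(df, blank_output=True)
print(result)
```

Working on a copy keeps the original Quantity and Value of the output batches available in `df` in case they are needed later.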
