Calculating weighted average by sorting and aggregating in a pandas dataframe.

Question

I have a sample manufacturing dataset containing parent batches and output batches of a product.
Each parent batch belongs to a specific output batch in the dataset. We know this because the parent batches of an output batch share its process order number, which is a column in the data.
I want to calculate a weighted average for each output batch from its parent batches. That is, for every output batch I want to find the parent batches with the same process order number, sum their Quantity for the denominator, apply the weighted average formula, and store the result in a new column named "weighted feature". The other input to the formula, Value, is already part of the dataframe.

The formula to use is the weighted average $\frac{\sum_i Q_i \, Q_{ci}}{\sum_i Q_i}$, where $Q_i$ is Quantity and $Q_{ci}$ is Value.
Please have a look at the example diagram below. It shows the parent (input) and output batches for one specific order number, which may help in understanding what I am trying to do.

[Diagram: parent (input) batches and output batches for a single process order number]

Here is a sample dataframe that reproduces the setup:

    import pandas as pd

    A = pd.DataFrame({
        'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'],
        'Process_Order_Number': [1, 1, 1, 2, 2, 2],
        'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'],
        'Quantity': [10, 20, 15, 5, 25, 50],
        'Value': [2, 3, 1, 4, 0, 1],
    })

Batch_ID  Process_Order_Number  Batch_type  Quantity  Value
A         1                     parent      10        2
B         1                     parent      20        3
C         1                     output      15        1
D         2                     parent      5         4
E         2                     parent      25        0
F         2                     output      50        1

I wrote a function to calculate the weighted average,
where distribution is the Quantity column and weights is the Value column of the dataset above:

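    def weighted_average(distribution, weights):
        # Sum of distribution * weights, divided by the total weight, rounded to 2 places
        return round(sum(d * w for d, w in zip(distribution, weights)) / sum(weights), 2)

    # distribution and weights must be defined as equal-length sequences first
    weighted_average(distribution, weights)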
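
As a quick check against the sample data, here is a minimal sketch for process order 1. It assumes that Value is the quantity being averaged and Quantity supplies the weights, which is what the expected 2.66 implies. `numpy.average` stands in for the hand-written helper and is not part of the original post.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Process_Order_Number': [1, 1, 1, 2, 2, 2],
    'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'],
    'Quantity': [10, 20, 15, 5, 25, 50],
    'Value': [2, 3, 1, 4, 0, 1],
})

# Parent batches of process order 1 (batches A and B)
parents = A[(A['Process_Order_Number'] == 1) & (A['Batch_type'] == 'parent')]

# Assumption: average Value, weighted by Quantity -> (2*10 + 3*20) / (10 + 20)
result = np.average(parents['Value'].to_numpy(), weights=parents['Quantity'].to_numpy())
print(round(result, 2))
```

Passing the arguments the other way round, with Value as the weights, divides by 2 + 3 instead of 10 + 20 and gives 16.0, so the argument order matters here.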

Next, I tried to aggregate the data by using the following methods but I wasn't able to get the specific cluster:

    df1 = A[A.duplicated('Process_Order_Number', keep=False)].sort_values('Process_Order_Number')
    df1.head()

    A[A.groupby('Process_Order_Number')['Batch_type'].transform('nunique').ne(1)]

These sort the data, but the result still isn't in the form shown in the picture above. I am trying to bring the parent batches with the same process order number together and then use my weighted function to calculate a value and store it in another column. The code needs to go through each process order number and, for every batch of type "output", find the batches of type "parent" so that I can apply the weighted function. I am also still trying to see how to combine my weighted function with the sort, so I don't have to do them separately. I looked at other Stack Overflow questions but couldn't find one that fits here. Any guidance is much appreciated.
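
The traversal described above can be sketched as an explicit loop over process order numbers. This is only an illustration under the assumption that Value is averaged and Quantity provides the weights. The column name `Weighted_Average` follows the expected output, and the helper is a `zip`-based variant of the function from the question.

```python
import pandas as pd

def weighted_average(distribution, weights):
    # Sum of distribution * weights, divided by the total weight, rounded to 2 places
    return round(sum(d * w for d, w in zip(distribution, weights)) / sum(weights), 2)

A = pd.DataFrame({
    'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Process_Order_Number': [1, 1, 1, 2, 2, 2],
    'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'],
    'Quantity': [10, 20, 15, 5, 25, 50],
    'Value': [2, 3, 1, 4, 0, 1],
})

A['Weighted_Average'] = float('nan')
for order, group in A.groupby('Process_Order_Number'):
    parents = group[group['Batch_type'] == 'parent']
    outputs = group[group['Batch_type'] == 'output']
    # Assumption: Value is averaged, Quantity supplies the weights
    avg = weighted_average(parents['Value'], parents['Quantity'])
    # Store the parents' average on the output row(s) of this process order
    A.loc[outputs.index, 'Weighted_Average'] = avg

print(A)
```

A loop like this is easy to follow but scales poorly with many process orders. A grouped, vectorized computation does the same work in one pass.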

Expected Output for the first three rows from the sample dataset:

Batch_ID  Process_Order_Number  Batch_type  Quantity  Value  Weighted_Average
A         1                     parent      10        2
B         1                     parent      20        3
C         1                     output                       2.66

Answer 1

Score: 1

We can first calculate the weighted average of the parent batches within each Process_Order_Number:

    # df is the dataframe from the question (named A there)
    mapper = (
        df.loc[df['Batch_type'].eq('parent')]
        .groupby('Process_Order_Number')
        .apply(lambda s: (s['Value'] * s['Quantity']).sum() / s['Quantity'].sum())
    )

which yields

    Process_Order_Number
    1    2.666667
    2    0.666667
    dtype: float64
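
A possible alternative to `groupby(...).apply(...)`, sketched on the same sample data: multiply first, then divide two grouped sums. The intermediate names `parents`, `weighted_sum`, and `total_quantity` are illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Process_Order_Number': [1, 1, 1, 2, 2, 2],
    'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'],
    'Quantity': [10, 20, 15, 5, 25, 50],
    'Value': [2, 3, 1, 4, 0, 1],
})

parents = df[df['Batch_type'].eq('parent')]

# Numerator: sum of Value * Quantity per process order
weighted_sum = (
    (parents['Value'] * parents['Quantity'])
    .groupby(parents['Process_Order_Number'])
    .sum()
)
# Denominator: total Quantity per process order
total_quantity = parents.groupby('Process_Order_Number')['Quantity'].sum()

mapper = weighted_sum / total_quantity
print(mapper)
```

Both grouped sums are indexed by Process_Order_Number, so the division aligns them by order number and `mapper` can be used with `map` exactly like the `apply`-based version.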

And then just assign to a new column:

    df.loc[df['Batch_type'].eq('output'), 'Weighted_Average'] = df['Process_Order_Number'].map(mapper)

       Batch_ID  Process_Order_Number  Batch_type  Quantity  Value  Weighted_Average
    0         A                     1      parent        10      2               NaN
    1         B                     1      parent        20      3               NaN
    2         C                     1      output        15      1          2.666667
    3         D                     2      parent         5      4               NaN
    4         E                     2      parent        25      0               NaN
    5         F                     2      output        50      1          0.666667

If you prefer, you can also blank out the Quantity and Value entries of the output rows, since they are empty in your expected output.

    import numpy as np

    # Blank out Quantity and Value on the output rows
    df.loc[df['Batch_type'].eq('output'), ['Quantity', 'Value']] = np.nan

       Batch_ID  Process_Order_Number  Batch_type  Quantity  Value  Weighted_Average
    0         A                     1      parent      10.0    2.0               NaN
    1         B                     1      parent      20.0    3.0               NaN
    2         C                     1      output       NaN    NaN          2.666667
    3         D                     2      parent       5.0    4.0               NaN
    4         E                     2      parent      25.0    0.0               NaN
    5         F                     2      output       NaN    NaN          0.666667
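
Putting the steps together, one way to package the approach is a small function that returns a new dataframe instead of modifying the input. This is a sketch only: the function name `add_parent_weighted_average` is hypothetical, and it uses grouped sums in place of `apply`.

```python
import pandas as pd

def add_parent_weighted_average(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Weighted_Average filled in on output rows."""
    out = df.copy()
    parents = out[out['Batch_type'].eq('parent')]

    # Weighted average of Value (weights: Quantity) per process order
    weighted_sum = (
        (parents['Value'] * parents['Quantity'])
        .groupby(parents['Process_Order_Number'])
        .sum()
    )
    total_quantity = parents.groupby('Process_Order_Number')['Quantity'].sum()
    mapper = weighted_sum / total_quantity

    # Look up each row's process order, then keep the value only on output rows
    is_output = out['Batch_type'].eq('output')
    out['Weighted_Average'] = out['Process_Order_Number'].map(mapper).where(is_output)
    return out

df = pd.DataFrame({
    'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Process_Order_Number': [1, 1, 1, 2, 2, 2],
    'Batch_type': ['parent', 'parent', 'output', 'parent', 'parent', 'output'],
    'Quantity': [10, 20, 15, 5, 25, 50],
    'Value': [2, 3, 1, 4, 0, 1],
})

result = add_parent_weighted_average(df)
print(result)
```

Working on a copy leaves the original dataframe untouched, which makes the function safe to call repeatedly while experimenting.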

huangapple
  • Posted on 2023-05-11 00:24:45
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76220681.html