加速使用Pandas构建物料清单的函数。

huangapple go评论66阅读模式
英文:

speed up my function about build bill of materials with pandas

问题

  1. 我发现整个过程中最令人不满意且耗时最长的部分是我的代码使用的一个函数,该函数用于查找当前需求物料的原材料,并根据需求物料的数量按照BOM中的比例计算原材料的需求量。以下是该代码:
def calculate_demand_raw(row, df_demand):
    try:
       if np.isnan(row['quantity_out']):
           raise ValueError('为了避免包括那些具有负输出的再生材料')

       list_index = list(df_demand['item_id'].isin([row['item_id']]) &
                         df_demand['process_No.'].isin([row['process_No.']]) &
                         df_demand['IN/OUT'].isin(['IN']))
       index = [i for i, x in enumerate(list_index) if x==True]  
       # 搜索以找到所需的生成过程的索引

       df_demand.loc[index, 'count_demand'] = row['count_demand']/row['quantity_out']*\
                                               df_demand.loc[index, 'quantity_in']
       # 计算原材料的数量
       df_demand.loc[index, 'flag'] = 1

    except ValueError:
        pass  # 防止查询材料是基础材料,没有生成过程
    df_demand.loc[row.name, 'flag'] = 0

df_demand[df_demand['flag'].isin([1])].apply(lambda row: calculate_demand_raw(row, df_demand), axis=1)

timeit 告诉我,在该函数中,查找符合条件的行的索引比计算原材料的数量要花费三倍的时间,而 calculate_demand_raw 也是循环中最耗时的函数。是否有办法减少搜索索引的时间?

  1. 循环中的第二个耗时函数是一个函数,将聚合的原材料需求填充到 df_demand 的属性 count_demand 中,这些需求是为了生成过程而产生的。
def fill_demand(row, qty_sum_demand, df_demand):
    df_demand[row.name, 'count_demand'] += qty_sum_demand.loc[
                                qty_sum_demand['IN/OUT'].isin([row['IN/OUT']).tolist(),
                                'count_demand'].tolist()
    df_demand.loc[index, 'flag'] = 1

df_demand.loc[index_generated_process].apply(lambda row: 
                                fill_demand(row, qty_sum_demand, df_demand), axis=1)

是否是函数中的条件搜索使其花费如此长的时间,就像 calculate_demand_raw 一样?是否可能将此操作转换为更快的 numpy 向量化操作?

谢谢任何帮助和建议。

英文:

for me a Python fresh, I m working on a code for building bill of materials(BOM), which can get the customer's desired item id and purchase quantity from the order sheet in the excel file and use another BOM sheet(also in the same excel) to calculate the quantity of all raw materials needed.After subtracting the stock of raw materials from the inventory, the remaining demand is input to the dictionary output like{id_material: quantity}.
The format of BOM sheet as follow

item_id process_id process_No. IN/OUT material_id quantity_in quantity_out
A z420 1 IN 12125 100 Nan
A z420 1 OUT A-z512-2 Nan 100
A z512 2 IN A-z512-2 100 Nan
A z512 2 OUT A-z600-3 Nan 120
A z600 3 IN A-z600-3 120 Nan
A z600 3 OUT 14551 Nan -20
A z600 3 OUT A Nan 100
attr: processs_id : Process id used

attr: processs_No.: Process order in the process path. Not always consecutive or regular like natural numbers, such as (51, 60, 70, 100)

attr: IN/OUT : Indicate whether the material is a raw material or an output

the pandas.dataframe I use is also like this but two attr columns added: 'count_demand' for indicate the quantity needed and 'flag' for my function to identify the material that needs to be executed.Lets call it 'df_demand'.

I was able to complete the functions that would serve my purpose, but the speed was not satisfactory. I tested with moudles such as timeit and found some operations that took a lot of time, but I couldn't think of a way to optimize them, so I came here for help.

  1. The first thing I found unsatisfactory and the most time-consuming part of the whole process was a function that was used by my code to find the raw material of the current demand material and to calculate the demand quantity of the raw material in proportion to the BOM according to the quantity of the demand material.The code as follow
def calculate_demand_raw(row, df_demand):
    try:
       if np.isnan(row['quantity_out']):
           raise ValueError('To avoid including those recycled materials with negative outputs')

       list_index = list(df_demand['item_id'].isin([row['item_id']]) &
                         df_demand['process_No.'].isin([row['process_No.']]) &
                         df_demand['IN/OUT'].isin(['IN']))
       index = [i for i, x in enumerate(list_index) if x==True]  
       # Search to find the index of the required generation process

       df_demand.loc[index, 'count_demand'] = row['count_demand']/row['quantity_out']*
                                               df_demand.loc[index, 'quantity_in']
       # calculate quantity of raw materials.
       df_demand.loc[index, 'flag'] = 1

    except ValueError:
        pass  # Prevent the query material is the base material, no process generation
    df_demand.loc[row.name, 'flag'] = 0

df_demand[df_demand['flag'].isin([1])].apply(lambda row: calculate_demand_raw(row, df_demand), axis=1)

timeit told me that, in the function, it takes three times as long to find the index of the eligible rows as it does to calculate the quantity of the raw material, and calculate_demand_raw is also the most time-consuming function in a loop.So could anyone have a deal of search index time reduction?

  1. The second time-consuming function in a loop is a function that fill the aggregated raw material requirements into the df_demand's attr'count_demand' where they are generated for the process.
def fill_demand(row, qty_sum_demand, df_demand):
    df_demand[row.name, 'count_demand'] += qty_sum_demand.loc[
                                qty_sum_demand['IN/OUT'].isin([row['IN/OUT']).tolist(),
                                'count_demand'].tolist()
    df_demand.loc[index, 'flag'] = 1

df_demand.loc[index_generated_process].apply(lambda row: 
                                fill_demand(row, qty_sum_demand, df_demand), axis=1)

Is it the conditional search in the function that makes it take so long, just like calculate_demand_raw? Is it possible to turn this operation into a faster numpy vectorization operation?

Thx a lot for any help and suggestions

答案1

得分: 1

Here's the translated code:

# 导入所需库
import pandas as pd
import io

# 创建数据框
df = pd.read_csv(io.StringIO(
"""
item_id,process_id,process_No.,IN/OUT,material_id,quantity_in,quantity_out,flag
A,z420,1,IN,12125,100.0,,0
A,z420,1,OUT,A-z512-2,,100.0,0
A,z512,2,IN,A-z512-2,100.0,,0
A,z512,2,OUT,A-z600-3,,120.0,0
A,z600,3,IN,A-z600-2,,400,1
A,z600,3,IN,A-z600-3,120.0,200,1
A,z600,3,IN,14551,,-20.0,0
A,z600,3,OUT,A,,100.0,0
""".strip()
))
# 输出数据框
print(df)

And here's the code with the comments translated:

# 导入所需库
import pandas as pd
import io

# 创建数据框
df = pd.read_csv(io.StringIO(
"""
item_id,process_id,process_No.,IN/OUT,material_id,quantity_in,quantity_out,flag
A,z420,1,IN,12125,100.0,,0
A,z420,1,OUT,A-z512-2,,100.0,0
A,z512,2,IN,A-z512-2,100.0,,0
A,z512,2,OUT,A-z600-3,,120.0,0
A,z600,3,IN,A-z600-2,,400,1
A,z600,3,IN,A-z600-3,120.0,200,1
A,z600,3,IN,14551,,-20.0,0
A,z600,3,OUT,A,,100.0,0
""".strip()
))

# 输出数据框
print(df)

Please note that I've provided the translation without additional content, as requested.

英文:

A modified version of your example to show some functionality:

df = pd.read_csv(io.StringIO(
"""
item_id,process_id,process_No.,IN/OUT,material_id,quantity_in,quantity_out,flag
A,z420,1,IN,12125,100.0,,0
A,z420,1,OUT,A-z512-2,,100.0,0
A,z512,2,IN,A-z512-2,100.0,,0
A,z512,2,OUT,A-z600-3,,120.0,0
A,z600,3,IN,A-z600-2,,400,1
A,z600,3,IN,A-z600-3,120.0,200,1
A,z600,3,IN,14551,,-20.0,0
A,z600,3,OUT,A,,100.0,0
""".strip()
))
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag
0       A       z420            1     IN       12125        100.0           NaN     0
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0
2       A       z512            2     IN    A-z512-2        100.0           NaN     0
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0
4       A       z600            3     IN    A-z600-2          NaN         400.0     1
5       A       z600            3     IN    A-z600-3        120.0         200.0     1 # <- only valid flag (non-na quantity_in)
6       A       z600            3     IN       14551          NaN         -20.0     0
7       A       z600            3    OUT           A          NaN         100.0     0

Here's one way you may implement calculate_demand_raw without using .apply + .loc lookups.

Generally the you want to .merge in cases like this so you have all the data "side-by-side" which allows you to then work in a "vectorized manner".

flags = df[df['flag'] == 1].dropna(subset='quantity_in')

df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))

df_m.loc[ 
   (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'), 
   df.columns.difference(['item_id', 'process_No.']) + '_y' 
] = float('nan')

rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']

df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1

A breakdown of the steps:

Find all flag rows.

flags = df[df['flag'] == 1].dropna(subset='quantity_in')

The .dropna() is to emulate the if np.isnan(row['quantity_out']) line in your code.

  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag
5       A       z600            3     IN    A-z600-3        120.0         200.0     1

left-merge with flags:

df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
5       A       z600            3     IN    A-z600-3        120.0         200.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
6       A       z600            3     IN       14551          NaN         -20.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0         z600       IN      A-z600-3          120.0           200.0     1.0

You want to discard OUT rows, it's unclear if you want to compare the flag rows with themselves so I've discarded them here.

You can reset the _y columns on the rows you want to discard back to NaN

df_m.loc[ 
   (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'), 
   df.columns.difference(['item_id', 'process_No.']) + '_y' 
] = float('nan')
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     1          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN         -20.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN

You can then perform your calculations on the rows with nonna _y values:

rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     1          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN       -2400.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN

Toggle the flag values:

df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     0          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     0          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN       -2400.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN

答案2

得分: 0

我不知道为什么我在编辑模式下仔细编辑的数据表会变得这么糟糕,哈哈。为了弥补这个问题,我在回复中包含了表格和预览中的属性的截图。

英文:

I dont know why the datatable I edit carefully in edit mode become such a shit lol.To compensate for this, I've included a screenshot of the table and properties from the preview in this reply
加速使用Pandas构建物料清单的函数。

huangapple
  • 本文由 发表于 2023年6月15日 10:02:51
  • 转载请务必保留本文链接:https://go.coder-hub.com/76478616.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定