Speed up my function for building a bill of materials with pandas
As a Python beginner, I'm working on code for building a bill of materials (BOM). It reads the customer's desired item id and purchase quantity from the order sheet of an Excel file, and uses a BOM sheet (in the same file) to calculate the quantity of every raw material needed. After subtracting the raw-material stock held in inventory, the remaining demand is written to a dictionary like {id_material: quantity}.
The format of the BOM sheet is as follows:
item_id | process_id | process_No. | IN/OUT | material_id | quantity_in | quantity_out |
---|---|---|---|---|---|---|
A | z420 | 1 | IN | 12125 | 100 | NaN |
A | z420 | 1 | OUT | A-z512-2 | NaN | 100 |
A | z512 | 2 | IN | A-z512-2 | 100 | NaN |
A | z512 | 2 | OUT | A-z600-3 | NaN | 120 |
A | z600 | 3 | IN | A-z600-3 | 120 | NaN |
A | z600 | 3 | OUT | 14551 | NaN | -20 |
A | z600 | 3 | OUT | A | NaN | 100 |
attr: process_id : Process id used
attr: process_No. : Process order in the process path; not always consecutive or regular like natural numbers, e.g. (51, 60, 70, 100)
attr: IN/OUT : Indicates whether the material is a raw material (IN) or an output (OUT)
The pandas DataFrame I use has the same layout, with two attribute columns added: 'count_demand' to indicate the quantity needed, and 'flag' for my function to identify the material that still needs to be processed. Let's call it 'df_demand'.
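For concreteness, the extra columns might be initialized like this (a minimal sketch; the inline BOM data here is a stand-in for the real Excel sheet, which would normally come from pd.read_excel):

```python
import pandas as pd

# Stand-in for the BOM sheet (normally read with pd.read_excel)
bom = pd.DataFrame({
    'item_id':      ['A', 'A'],
    'process_id':   ['z420', 'z420'],
    'process_No.':  [1, 1],
    'IN/OUT':       ['IN', 'OUT'],
    'material_id':  ['12125', 'A-z512-2'],
    'quantity_in':  [100.0, None],
    'quantity_out': [None, 100.0],
})

df_demand = bom.copy()
df_demand['count_demand'] = 0.0  # quantity needed, filled in as demand propagates
df_demand['flag'] = 0            # 1 = row still needs to be processed
```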
I was able to complete the functions that serve my purpose, but the speed was not satisfactory. I tested with modules such as timeit and found some operations that took a lot of time, but I couldn't think of a way to optimize them, so I came here for help.
- The first thing I found unsatisfactory, and the most time-consuming part of the whole process, was a function my code uses to find the raw materials of the current demand material and to calculate their required quantity in proportion to the BOM, according to the quantity of the demand material. The code is as follows:
def calculate_demand_raw(row, df_demand):
    try:
        if np.isnan(row['quantity_out']):
            raise ValueError('To avoid including those recycled materials with negative outputs')
        list_index = list(df_demand['item_id'].isin([row['item_id']]) &
                          df_demand['process_No.'].isin([row['process_No.']]) &
                          df_demand['IN/OUT'].isin(['IN']))
        index = [i for i, x in enumerate(list_index) if x]
        # Search to find the index of the required generating process
        df_demand.loc[index, 'count_demand'] = (row['count_demand'] / row['quantity_out']
                                                * df_demand.loc[index, 'quantity_in'])
        # Calculate the quantity of raw materials
        df_demand.loc[index, 'flag'] = 1
    except ValueError:
        pass  # The queried material may be a base material with no generating process
    df_demand.loc[row.name, 'flag'] = 0

df_demand[df_demand['flag'].isin([1])].apply(lambda row: calculate_demand_raw(row, df_demand), axis=1)
timeit told me that, inside the function, finding the indices of the eligible rows takes three times as long as calculating the raw-material quantity, and calculate_demand_raw is also the most time-consuming function in the loop. Is there any way to reduce the index-search time?
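As a side note on the index search: the enumerate list comprehension can be replaced by keeping the condition as a boolean mask and converting it with numpy, avoiding the Python-level loop entirely (a sketch on toy data, not the real BOM):

```python
import numpy as np
import pandas as pd

df_demand = pd.DataFrame({
    'item_id':     ['A', 'A', 'B'],
    'process_No.': [1, 1, 2],
    'IN/OUT':      ['IN', 'OUT', 'IN'],
})
row = {'item_id': 'A', 'process_No.': 1}

# Same condition as in calculate_demand_raw, kept as a boolean Series
mask = (df_demand['item_id'].eq(row['item_id'])
        & df_demand['process_No.'].eq(row['process_No.'])
        & df_demand['IN/OUT'].eq('IN'))

index = np.flatnonzero(mask.to_numpy())  # positional indices, no enumerate()
# df_demand.loc[mask, ...] also works directly, skipping the index list entirely
```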
- The second time-consuming function in the loop fills the aggregated raw-material requirements into df_demand's attribute 'count_demand', at the rows of the process that generates them:
def fill_demand(row, qty_sum_demand, df_demand):
    df_demand.loc[row.name, 'count_demand'] += qty_sum_demand.loc[
        qty_sum_demand['IN/OUT'].isin([row['IN/OUT']]),
        'count_demand'].tolist()
    df_demand.loc[index, 'flag'] = 1  # index is computed outside this snippet

df_demand.loc[index_generated_process].apply(lambda row:
    fill_demand(row, qty_sum_demand, df_demand), axis=1)
Is it the conditional search in the function that makes it take so long, just like in calculate_demand_raw? Is it possible to turn this operation into a faster numpy vectorized operation?
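One possible vectorized shape for such a fill, sketched on toy data (this assumes the aggregation key is material_id, which may not match the real layout):

```python
import pandas as pd

df_demand = pd.DataFrame({
    'material_id':  ['x', 'z'],
    'count_demand': [1.0, 2.0],
})
qty_sum_demand = pd.DataFrame({
    'material_id':  ['x', 'x', 'y'],
    'count_demand': [5.0, 2.0, 3.0],
})

# Aggregate demand per material, then add it back in one vectorized step
agg = qty_sum_demand.groupby('material_id')['count_demand'].sum()
df_demand['count_demand'] += df_demand['material_id'].map(agg).fillna(0.0)
```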
Thanks a lot for any help and suggestions.
Answer 1
Score: 1
A modified version of your example to show some functionality:
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
"""
item_id,process_id,process_No.,IN/OUT,material_id,quantity_in,quantity_out,flag
A,z420,1,IN,12125,100.0,,0
A,z420,1,OUT,A-z512-2,,100.0,0
A,z512,2,IN,A-z512-2,100.0,,0
A,z512,2,OUT,A-z600-3,,120.0,0
A,z600,3,IN,A-z600-2,,400,1
A,z600,3,IN,A-z600-3,120.0,200,1
A,z600,3,IN,14551,,-20.0,0
A,z600,3,OUT,A,,100.0,0
""".strip()
))
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag
0 A z420 1 IN 12125 100.0 NaN 0
1 A z420 1 OUT A-z512-2 NaN 100.0 0
2 A z512 2 IN A-z512-2 100.0 NaN 0
3 A z512 2 OUT A-z600-3 NaN 120.0 0
4 A z600 3 IN A-z600-2 NaN 400.0 1
5 A z600 3 IN A-z600-3 120.0 200.0 1 # <- only valid flag (non-na quantity_in)
6 A z600 3 IN 14551 NaN -20.0 0
7 A z600 3 OUT A NaN 100.0 0
Here's one way you may implement calculate_demand_raw without using .apply + .loc lookups. Generally you want to .merge in cases like this, so that all the data sits "side-by-side", which then lets you work in a vectorized manner:
flags = df[df['flag'] == 1].dropna(subset='quantity_in')
df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))
df_m.loc[
    (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'),
    df.columns.difference(['item_id', 'process_No.']) + '_y'
] = float('nan')
rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']
df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1
A breakdown of the steps:
Find all flag rows:
flags = df[df['flag'] == 1].dropna(subset='quantity_in')
The .dropna() is there to emulate the if np.isnan(row['quantity_out']) line in your code.
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag
5 A z600 3 IN A-z600-3 120.0 200.0 1
left-merge with flags:
df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag process_id_y IN/OUT_y material_id_y quantity_in_y quantity_out_y flag_y
0 A z420 1 IN 12125 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
1 A z420 1 OUT A-z512-2 NaN 100.0 0 NaN NaN NaN NaN NaN NaN
2 A z512 2 IN A-z512-2 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
3 A z512 2 OUT A-z600-3 NaN 120.0 0 NaN NaN NaN NaN NaN NaN
4 A z600 3 IN A-z600-2 NaN 400.0 1 z600 IN A-z600-3 120.0 200.0 1.0
5 A z600 3 IN A-z600-3 120.0 200.0 1 z600 IN A-z600-3 120.0 200.0 1.0
6 A z600 3 IN 14551 NaN -20.0 0 z600 IN A-z600-3 120.0 200.0 1.0
7 A z600 3 OUT A NaN 100.0 0 z600 IN A-z600-3 120.0 200.0 1.0
You want to discard OUT rows; it's unclear whether you want to compare the flag rows with themselves, so I've discarded them here as well. You can reset the _y columns on the rows you want to discard back to NaN:
df_m.loc[
    (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'),
    df.columns.difference(['item_id', 'process_No.']) + '_y'
] = float('nan')
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag process_id_y IN/OUT_y material_id_y quantity_in_y quantity_out_y flag_y
0 A z420 1 IN 12125 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
1 A z420 1 OUT A-z512-2 NaN 100.0 0 NaN NaN NaN NaN NaN NaN
2 A z512 2 IN A-z512-2 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
3 A z512 2 OUT A-z600-3 NaN 120.0 0 NaN NaN NaN NaN NaN NaN
4 A z600 3 IN A-z600-2 NaN 400.0 1 NaN NaN NaN NaN NaN NaN
5 A z600 3 IN A-z600-3 120.0 200.0 1 NaN NaN NaN NaN NaN NaN
6 A z600 3 IN 14551 NaN -20.0 0 z600 IN A-z600-3 120.0 200.0 1.0
7 A z600 3 OUT A NaN 100.0 0 NaN NaN NaN NaN NaN NaN
You can then perform your calculations on the rows with non-NaN _y values:
rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag process_id_y IN/OUT_y material_id_y quantity_in_y quantity_out_y flag_y
0 A z420 1 IN 12125 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
1 A z420 1 OUT A-z512-2 NaN 100.0 0 NaN NaN NaN NaN NaN NaN
2 A z512 2 IN A-z512-2 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
3 A z512 2 OUT A-z600-3 NaN 120.0 0 NaN NaN NaN NaN NaN NaN
4 A z600 3 IN A-z600-2 NaN 400.0 1 NaN NaN NaN NaN NaN NaN
5 A z600 3 IN A-z600-3 120.0 200.0 1 NaN NaN NaN NaN NaN NaN
6 A z600 3 IN 14551 NaN -2400.0 0 z600 IN A-z600-3 120.0 200.0 1.0
7 A z600 3 OUT A NaN 100.0 0 NaN NaN NaN NaN NaN NaN
Toggle the flag values:
df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag process_id_y IN/OUT_y material_id_y quantity_in_y quantity_out_y flag_y
0 A z420 1 IN 12125 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
1 A z420 1 OUT A-z512-2 NaN 100.0 0 NaN NaN NaN NaN NaN NaN
2 A z512 2 IN A-z512-2 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
3 A z512 2 OUT A-z600-3 NaN 120.0 0 NaN NaN NaN NaN NaN NaN
4 A z600 3 IN A-z600-2 NaN 400.0 0 NaN NaN NaN NaN NaN NaN
5 A z600 3 IN A-z600-3 120.0 200.0 0 NaN NaN NaN NaN NaN NaN
6 A z600 3 IN 14551 NaN -2400.0 1 z600 IN A-z600-3 120.0 200.0 1.0
7 A z600 3 OUT A NaN 100.0 0 NaN NaN NaN NaN NaN NaN
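Putting the answer's steps together, here is the whole pipeline as one runnable script (same toy data as above; `subset` is passed as a list for compatibility with older pandas):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO(
"""
item_id,process_id,process_No.,IN/OUT,material_id,quantity_in,quantity_out,flag
A,z420,1,IN,12125,100.0,,0
A,z420,1,OUT,A-z512-2,,100.0,0
A,z512,2,IN,A-z512-2,100.0,,0
A,z512,2,OUT,A-z600-3,,120.0,0
A,z600,3,IN,A-z600-2,,400,1
A,z600,3,IN,A-z600-3,120.0,200,1
A,z600,3,IN,14551,,-20.0,0
A,z600,3,OUT,A,,100.0,0
""".strip()
))

# 1. Rows flagged for processing that actually have a quantity_in
flags = df[df['flag'] == 1].dropna(subset=['quantity_in'])

# 2. Left-merge so the flag-row data sits next to every row of the same process
df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))

# 3. Blank out the merged _y columns on rows we don't want to touch
df_m.loc[
    (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'),
    df.columns.difference(['item_id', 'process_No.']) + '_y'
] = float('nan')

# 4. Vectorized calculation on the remaining matched rows, then toggle flags
rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']
df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1
```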
Answer 2
Score: 0
I don't know why the table I so carefully edited in edit mode came out this badly, haha. To make up for it, I've included screenshots of the table and its attributes from the preview in my reply.