dataframe merge on left adding extra rows
Question
I create an invoice DataFrame and a number of master DataFrames from CSV files:
import pandas as pd

invoice = pd.read_csv('rocaInv4.csv')
soMstr = pd.read_csv('salesOfficeMstr.csv')
custFreightMstr = pd.read_csv('customerCodeFreightMstr.csv')
ratesMstr = pd.read_csv('freightMstr.csv')
pfep = pd.read_csv('pfepMstr.csv')
I then drop rows that have no match in the material, customer, or sales office masters, resetting the index each time:
# keep only invoice rows whose material exists in the material master
invoice = invoice[invoice['Material'].isin(pfep['Material'])]
invoice = invoice.reset_index(drop=True)
# keep only invoice rows whose customer exists in the customer master
invoice = invoice[invoice['Ship to Party'].isin(custFreightMstr['Cust No'])]
invoice = invoice.reset_index(drop=True)
# keep only invoice rows whose sales office code is valid
invoice = invoice[invoice['Sales Office'].isin(soMstr['Code'])]
invoice = invoice.reset_index(drop=True)
invoice.shape
# (384, 22)
I then need to copy data from the masters into the final, clean invoice DataFrame. Instead of looping over two DataFrames, I thought I would do a merge on selected columns:
invoice1 = invoice.merge(
    custFreightMstr[['Cust No', 'City', 'Customer Frgt Code']],
    left_on='Ship to Party',
    right_on='Cust No',
    how='left',
).drop_duplicates()
invoice1.shape
# (388, 25)
I end up with 4 extra rows even though I am doing a left merge. I can identify which rows have been repeated, but I can't identify why. What am I doing wrong here?
Answer 1
Score: 2
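The merge in your code is equivalent to a left outer join. As discussed, you have more than one matching 'Cust No' key for some values of 'Ship to Party'. A left join keeps every row of the left frame, but it emits one output row for each matching row on the right, so a key that appears twice in the master produces two rows for every invoice line that uses it. Remove the duplicate keys in the master DataFrame; that might help.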
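To illustrate the point, here is a minimal sketch on made-up data. The column names mirror the question, but the values and the names `cust_master`, `dup_keys`, `merged` and `deduped` are assumptions for illustration only. It shows how to list the repeated keys in the master, and why calling `.drop_duplicates()` on the merged result does not help when the repeated master rows differ in another column.

```python
import pandas as pd

# Hypothetical toy data: column names follow the question, values are invented.
invoice = pd.DataFrame({
    "Ship to Party": [100, 200, 300],
    "Amount": [10.0, 20.0, 30.0],
})
cust_master = pd.DataFrame({
    "Cust No": [100, 200, 200, 300],  # 200 appears twice
    "City": ["X", "Y", "Y", "Z"],
    "Customer Frgt Code": ["A1", "B1", "B2", "C1"],
})

# Show every master row whose key occurs more than once.
dup_keys = cust_master[cust_master["Cust No"].duplicated(keep=False)]
print(dup_keys)

# The left merge emits one row per matching master row,
# so the invoice line for customer 200 comes out twice.
merged = invoice.merge(
    cust_master, left_on="Ship to Party", right_on="Cust No", how="left"
)
print(len(invoice), len(merged))

# Full-row deduplication keeps both copies because they differ
# in 'Customer Frgt Code'.
deduped = merged.drop_duplicates()
print(len(deduped))
```

Running the `duplicated(keep=False)` filter against the real `custFreightMstr` should point to the customer numbers responsible for the extra rows.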
Answer 2
Score: 1
I have no way of knowing which of the repeated 'Cust No' entries in the master frame is correct. To get the code working, I ran the following:
# drop duplicate customer numbers in the master, keeping the last entry
custUnique = custFreightMstr.drop_duplicates('Cust No', keep='last')
invoice1 = invoice.merge(
    custUnique[['Cust No', 'City', 'Customer Frgt Code']],
    left_on='Ship to Party',
    right_on='Cust No',
    how='left',
    validate='m:1',
)
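Calling drop_duplicates on 'Cust No' removes the repeated rows for each customer number, keeping only the last entry.
The validate='m:1' keyword makes merge check that each 'Cust No' appears at most once in the right-hand frame; if that is not the case, it raises a MergeError instead of silently adding rows.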
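As a rough sketch of how the two pieces behave together, the example below uses invented data (the values and the names `cust_master`, `master_unique` and `raised` are hypothetical, not from the original post). It first shows `validate='m:1'` rejecting a master with repeated keys, then the same merge succeeding once the master has been deduplicated.

```python
import pandas as pd

# Hypothetical toy data: column names follow the question, values are invented.
invoice = pd.DataFrame({
    "Ship to Party": [100, 200, 300],
    "Amount": [10.0, 20.0, 30.0],
})
cust_master = pd.DataFrame({
    "Cust No": [100, 200, 200, 300],  # 200 appears twice
    "City": ["X", "Y", "Y", "Z"],
    "Customer Frgt Code": ["A1", "B1", "B2", "C1"],
})

# With repeated keys on the right, validate='m:1' raises instead of merging.
try:
    invoice.merge(
        cust_master,
        left_on="Ship to Party",
        right_on="Cust No",
        how="left",
        validate="m:1",
    )
    raised = False
except pd.errors.MergeError:
    raised = True
print("MergeError raised:", raised)

# Keep only the last row for each customer number, then merge again.
master_unique = cust_master.drop_duplicates(subset="Cust No", keep="last")
invoice1 = invoice.merge(
    master_unique,
    left_on="Ship to Party",
    right_on="Cust No",
    how="left",
    validate="m:1",
)
print(len(invoice), len(invoice1))
```

Note that `keep='last'` is an arbitrary choice here, as in the answer: it decides which freight code wins for a repeated customer, so it is worth checking the master data before relying on it.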