数据框左合并时添加额外行

huangapple go评论84阅读模式
英文:

dataframe merge on left adding extra rows

问题

我从CSV文件创建了一个发票数据框和多个主数据数据框。

invoice = pd.read_csv('rocaInv4.csv')
soMstr = pd.read_csv('salesOfficeMstr.csv')
custFreightMstr = pd.read_csv('customerCodeFreightMstr.csv')
ratesMstr = pd.read_csv('freightMstr.csv')
pfep = pd.read_csv('pfepMstr.csv')

然后,我根据物料主数据和客户主数据的可用性删除了一些行,并每次重新索引。

# 检查物料的可用性
invoice = invoice[invoice['Material'].isin(pfep['Material'])]
invoice = invoice.reset_index(drop=True)

# 检查客户详情的可用性
invoice = invoice[invoice['Ship to Party'].isin(custFreightMstr['Cust No'])]
invoice = invoice.reset_index(drop=True)

# 检查销售代码的有效性
invoice = invoice[invoice['Sales Office'].isin(soMstr['Code'])]
invoice = invoice.reset_index(drop=True)

invoice.shape
# (384, 22)

然后,我需要将主数据的数据复制到最终的清洁发票数据框中。我选择在选择的列上进行合并,而不是在两个数据框上进行循环操作。

invoice1 = invoice.merge(custFreightMstr[['Cust No', 'City', 'Customer Frgt Code']], left_on='Ship to Party', right_on='Cust No', how='left').drop_duplicates()

invoice1.shape
# (388, 25)

尽管我是在左侧进行合并,但最终多出了4行。我可以识别重复的行,但无法确定原因。我在这里做错了什么?

英文:

I create a invoice dataframe and a number of master dataframes from csv files

invoice=pd.read_csv('rocaInv4.csv')

soMstr=pd.read_csv('salesOfficeMstr.csv')
custFreightMstr=pd.read_csv('customerCodeFreightMstr.csv')
ratesMstr=pd.read_csv('freightMstr.csv')
pfep=pd.read_csv('pfepMstr.csv')

I drop a number of rows depending on availability in material masters and customer masters. I reindex each time.

#checking availability of material
invoice=invoice[invoice['Material'].isin(pfep['Material'])]
invoice=invoice.reset_index(drop=True)

#checking availability of customer details
invoice=invoice[invoice['Ship to Party'].isin(custFreightMstr['Cust No'])]
invoice=invoice.reset_index(drop=True)

#checking validity of sales code
invoice=invoice[invoice['Sales Office'].isin(soMstr['Code'])]
invoice=invoice.reset_index(drop=True)

invoice.shape
#(384, 22)

I then need to copy data from the masters to the final, clean Invoice DataFrame. Instead of doing a for loop over two data frames, I thought I would do a merge on select columns.

invoice1=invoice.merge(custFreightMstr[['Cust No','City','Customer Frgt Code']],left_on='Ship to Party',right_on='Cust No', how='left').drop_duplicates()

invoice1.shape
#(388, 25)

I end up with 4 extra rows even though I am merging on the left. I can identify which rows have been repeated. But I cant identify why. What am I doing wrong here?

答案1

得分: 2

代码部分不要翻译,只返回翻译好的部分:

你的代码中的合并等同于“左外连接”。如讨论所述,对于“Ship to Party”的某个值,你有多个匹配键“Cust No”。在主数据框中删除重复的键可能会有所帮助。

英文:

The merge in your code is equivalent to left outer join. As discussed you have more than one matching keys Cust No for a value of Ship to Party. Remove the duplicate keys in the master dataframe. That might help.

答案2

得分: 1

我不知道主框架中重复的Cust No哪一个是正确的。为了编码目的,我执行了以下操作:

# 在主框架中删除重复的客户编号
invoice1 = invoice.merge(custFreightMstr.drop_duplicates('Cust No', keep='last')[['Cust No', 'City', 'Customer Frgt Code']], left_on='Ship to Party', right_on='Cust No', how='left', validate='m:1')

'Cust No'上使用drop_duplicate会删除所有重复项,仅保留最后一个条目。

关键字validate确认在实际合并过程中每个客户代码只有一个。

英文:

I have no clue which of the repeated Cust No in the master frame is correct. For coding purposes, I executed the following:

#drop duplicate cust no in the master
invoice1=invoice.merge(custFreightMstr.drop_duplicates('Cust No',keep='last')[['Cust No','City','Customer Frgt Code']],left_on='Ship to Party',right_on='Cust No', how='left',validate = 'm:1')

drop_duplicate on 'Cust No'removes all duplicates, retaining the last entry alone.

The validate keyword confirms there is only one of each cust code during actual merge.

huangapple
  • 本文由 发表于 2020年1月3日 13:37:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/59573674.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定