January 3, 2020, 13:37:47

dataframe merge on left adding extra rows

Question

I create an invoice dataframe and several master dataframes from CSV files.

invoice = pd.read_csv('rocaInv4.csv')
soMstr = pd.read_csv('salesOfficeMstr.csv')
custFreightMstr = pd.read_csv('customerCodeFreightMstr.csv')
ratesMstr = pd.read_csv('freightMstr.csv')
pfep = pd.read_csv('pfepMstr.csv')

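I then drop rows depending on whether they are available in the material master and the customer master, resetting the index each time.

# keep only invoices whose material exists in the material master
invoice = invoice[invoice['Material'].isin(pfep['Material'])]
invoice = invoice.reset_index(drop=True)
# keep only invoices whose customer exists in the customer master
invoice = invoice[invoice['Ship to Party'].isin(custFreightMstr['Cust No'])]
invoice = invoice.reset_index(drop=True)
# keep only invoices with a valid sales office code
invoice = invoice[invoice['Sales Office'].isin(soMstr['Code'])]
invoice = invoice.reset_index(drop=True)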
invoice.shape
# (384, 22)

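Next, I need to copy data from the masters into the final, clean invoice dataframe. Instead of looping over the two dataframes, I decided to merge on selected columns.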

invoice1 = invoice.merge(custFreightMstr[['Cust No', 'City', 'Customer Frgt Code']], left_on='Ship to Party', right_on='Cust No', how='left').drop_duplicates()
invoice1.shape
# (388, 25)

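Even though I am doing a left merge, I end up with 4 extra rows. I can identify which rows are repeated, but I can't work out why. What am I doing wrong here?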

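Answer 1

Score: 2

The merge in your code is equivalent to a left outer join. As discussed, some values of 'Ship to Party' match more than one 'Cust No' key in the master, so each of those invoice rows is repeated once per matching master row. Removing the duplicate keys from the master dataframe should help.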
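
To illustrate the point above, here is a minimal sketch using made-up toy data (the customer numbers, cities, and freight codes are hypothetical, not taken from the asker's files). It shows a left merge growing when the right-hand frame repeats a key, and one way to list the offending master rows with `duplicated(keep=False)`.

```python
import pandas as pd

# Hypothetical toy data standing in for the invoice and customer master frames.
invoice = pd.DataFrame({
    "Ship to Party": [101, 102, 103],
    "Amount": [250.0, 400.0, 125.0],
})
custFreightMstr = pd.DataFrame({
    "Cust No": [101, 102, 102, 103],  # 102 appears twice in the master
    "City": ["Pune", "Delhi", "New Delhi", "Chennai"],
    "Customer Frgt Code": ["F1", "F2", "F3", "F4"],
})

# Left merge: every invoice row is kept, but a row is emitted once per match.
merged = invoice.merge(
    custFreightMstr, left_on="Ship to Party", right_on="Cust No", how="left"
)
extra_rows = len(merged) - len(invoice)

# All master rows whose key occurs more than once.
dup_keys = custFreightMstr[custFreightMstr["Cust No"].duplicated(keep=False)]

print(extra_rows)
print(dup_keys)
```

Note that calling `.drop_duplicates()` on the merged result does not remove the extra rows in this sketch, because the repeated invoice rows differ in the columns brought in from the master.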

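Answer 2

Score: 1

I have no idea which of the repeated 'Cust No' rows in the master frame is the correct one. For coding purposes, I did the following:

# drop duplicate customer numbers in the master before merging
invoice1 = invoice.merge(
    custFreightMstr.drop_duplicates('Cust No', keep='last')[['Cust No', 'City', 'Customer Frgt Code']],
    left_on='Ship to Party', right_on='Cust No', how='left', validate='m:1')

Calling drop_duplicates on 'Cust No' with keep='last' keeps only the last row for each repeated customer number.

The validate='m:1' keyword makes the merge raise an error if any customer number appears more than once in the master, so a successful merge confirms that each key is unique on the right-hand side.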
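
As a rough sketch of how this behaves, the example below uses hypothetical toy data (not the asker's files). With duplicate keys in the master, `validate='m:1'` causes `merge` to raise `pandas.errors.MergeError`; after `drop_duplicates(..., keep='last')` the same merge succeeds and the row count is unchanged.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy data; customer 102 is duplicated in the master.
invoice = pd.DataFrame({"Ship to Party": [101, 102, 103]})
master = pd.DataFrame({
    "Cust No": [101, 102, 102, 103],
    "City": ["Pune", "Delhi", "New Delhi", "Chennai"],
})

# With duplicate keys on the right, validate="m:1" rejects the merge.
try:
    invoice.merge(master, left_on="Ship to Party", right_on="Cust No",
                  how="left", validate="m:1")
    raised = False
except MergeError:
    raised = True

# Keep only the last master row per customer number, then merge again.
deduped = master.drop_duplicates(subset="Cust No", keep="last")
invoice1 = invoice.merge(deduped, left_on="Ship to Party", right_on="Cust No",
                         how="left", validate="m:1")

print(raised, invoice1.shape)
```

Keeping the last row is an arbitrary choice here, as the answer itself admits; if the master has a column that indicates which record is current, sorting on it before dropping duplicates would make the choice deliberate.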
