英文:
Analyzing restaurant data using python. Need help merging two datasets on both check # and date
问题
我正在进行差异分析,研究了一段时间之前餐厅菜单格式的变化对业务的影响。我们改变了用于小型聚会的菜单,而用于大型聚会的菜单在整个分析期间保持不变。我基本上是将大型聚会用作比较的基准,以查看小型聚会菜单的变化是否降低了我们的人均收入。
我已经创建了每个时间段和不同菜单的数据框如下:
- dfLargePre = 菜单变更前的大型聚会
- dfSmallPre = 菜单变更前的小型聚会
- dfLargePost = 菜单变更后的大型聚会
- dfSmallPost = 菜单变更后的小型聚会
这些数据框具有"日期"、"账单号"和"总支付"列。
然而,我只想分析食品销售,不包括酒水销售(这在每个账单中占了很大比例)。由于我们的销售报告系统的工作方式,我不得不创建一个名为dfFood的新数据框,它具有"日期"、"账单号"和"总食品"列。这个数据框非常庞大,包含比我需要分析的数据更多的账单,但我需要"总食品"列在每个不同聚会大小和时间段的数据框中。
"账单号"在任何给定日期内是唯一的,但在分析的多个月时间范围内,有多个重复的账单号。因此,我需要基于"日期"和"账单号"进行合并。
有人知道我应该如何最好地完成这些合并吗?
df_merged = pd.merge(dfLargePre, dfFood, how='left', on=['Check_number', 'Date'])
我尝试了上面的代码行,但收到一个错误,说我需要使用pd.concat,但我不确定这是否是最佳行动方案,也不熟悉pd.concat的编码程序。
英文:
I am running a diff in diff analysis for my restaurant that changed the format of their menus a while back. The menus we changed are only used for smaller parties and the menus we use for larger parties have remained the same for the entire period of analysis. I am essentially using the larger parties as a point of comparison to see if the change in the smaller party menu has reduced our per person revenue.
I have made a df for each of the time frames and distinct menus used as follows:
dfLargePre = Large party in the period before menus were changed
dfSmallPre = Small party in the period before menus were changed
dfLargePost = Large party after menus were changed
dfSmallPost = Small party after menus were changed
These dfs have columns for 'Date', 'Check_number', and 'Total_payment'
However, I only want to analyze the food sales and no alcohol (which makes up a great deal of each check). Because of how our sales reporting system works, I had to create a new df called dfFood that has columns 'Date', 'Check_number', and 'Total_food'. This data frame is huge and has way more checks than the ones I need to analyze but I need the 'Total_food' column to be in each of the party size and time distinguished data frames.
'Check_number' is unique within any given day but there are several repeat check numbers within the many months time frame being analyzed. Therefore, I need to merge based on 'Date' and 'Check_number'.
Does anyone know how I can best complete these merges?
df_merged = pd.merge(dfLargePre, dfFood, how='left', on=['Check_Number', 'Date'])
I tried the above line and I got an error saying I need to use pd.concat but I am unsure if that is the best plan of action and I am not familiar with the pd.concat coding procedures.
############ EDIT ############# (Adding traceback for attempted merge)
答案1
得分: 1
你正在合并一对数据框架。
每个都有Check_Number和Date列。
现在看看每个数据框架的.dtype
。
日期应该是datetime64[ns]类型。
可能Check_Number应该是整数,
或者如果你将其建模为字符串,则可能是"object"类型。
两个数据框架中的Check_Number类型必须相同。
两个数据框架中的Date类型必须相同。
如果它们不匹配,你的.merge()
将无法成功。
英文:
You are merging a pair of dataframes.
Each has Check_Number and Date columns.
Now look at the .dtype
of each dataframe.
The Date should be of datetime64[ns].
Presumably Check_Number should be an integer,
or perhaps "object" if you're modeling that as a string.
The Check_Number type in both dataframes must be identical.
The Date type in both dataframes must be identical.
If they don't match, your .merge() won't succeed.
答案2
得分: 0
谢谢大家。我找到了错误。我在加载其中一个csv文件时没有使用parse_dates。我修复了csv上传,原始的代码行就可以完美运行了。
英文:
Thank you all. I found the error. I had not used parse_dates when loading one of my csv files. I fixed the csv upload and the original line of code worked perfectly.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论