Analyzing restaurant data using python. Need help merging two datasets on both check # and date

huangapple go评论70阅读模式
英文:

Analyzing restaurant data using python. Need help merging two datasets on both check # and date

问题

我正在进行差异分析,研究了一段时间之前餐厅菜单格式的变化对业务的影响。我们改变了用于小型聚会的菜单,而用于大型聚会的菜单在整个分析期间保持不变。我基本上是将大型聚会用作比较的基准,以查看小型聚会菜单的变化是否降低了我们的人均收入。

我已经创建了每个时间段和不同菜单的数据框如下:

  • dfLargePre = 菜单变更前的大型聚会
  • dfSmallPre = 菜单变更前的小型聚会
  • dfLargePost = 菜单变更后的大型聚会
  • dfSmallPost = 菜单变更后的小型聚会

这些数据框具有"日期"、"账单号"和"总支付"列。

然而,我只想分析食品销售,不包括酒水销售(这在每个账单中占了很大比例)。由于我们的销售报告系统的工作方式,我不得不创建一个名为dfFood的新数据框,它具有"日期"、"账单号"和"总食品"列。这个数据框非常庞大,包含比我需要分析的数据更多的账单,但我需要"总食品"列在每个不同聚会大小和时间段的数据框中。

"账单号"在任何给定日期内是唯一的,但在分析的多个月时间范围内,有多个重复的账单号。因此,我需要基于"日期"和"账单号"进行合并。

有人知道我应该如何最好地完成这些合并吗?

df_merged = pd.merge(dfLargePre, dfFood, how='left', on=['Check_number', 'Date'])

我尝试了上面的代码行,但收到一个错误,说我需要使用pd.concat,但我不确定这是否是最佳行动方案,也不熟悉pd.concat的编码程序。

英文:

I am running a diff in diff analysis for my restaurant that changed the format of their menus a while back. The menus we changed are only used for smaller parties and the menus we use for larger parties have remained the same for the entire period of analysis. I am essentially using the larger parties as a point of comparison to see if the change in the smaller party menu has reduced our per person revenue.

I have made a df for each of the time frames and distinct menus used as follows:

dfLargePre = Large party in the period before menus were changed
dfSmallPre = Small party in the period before menus were changed
dfLargePost = Large party after menus were changed
dfSmallPost = Small party after menus were changed

These dfs have columns for 'Date', 'Check_number', and 'Total_payment'

However, I only want to analyze the food sales and no alcohol (which makes up a great deal of each check). Because of how our sales reporting system works, I had to create a new df called dfFood that has columns 'Date', 'Check_number', and 'Total_food'. This data frame is huge and has way more checks than the ones I need to analyze but I need the 'Total_food' column to be in each of the party size and time distinguished data frames.

'Check_number' is unique within any given day but there are several repeat check numbers within the many months time frame being analyzed. Therefore, I need to merge based on 'Date' and 'Check_number'.

Does anyone know how I can best complete these merges?

df_merged = pd.merge(dfLargePre, dfFood, how='left', on=['Check_Number', 'Date'])

I tried the above line and I got an error saying I need to use pd.concat but I am unsure if that is the best plan of action and I am not familiar with the pd.concat coding procedures.

############ EDIT ############# (Adding traceback for attempted merge)

Traceback Screenshot

答案1

得分: 1

你正在合并一对数据框架。
每个都有Check_Number和Date列。

现在看看每个数据框架的.dtype
日期应该是datetime64[ns]类型。
可能Check_Number应该是整数,
或者如果你将其建模为字符串,则可能是"object"类型。

两个数据框架中的Check_Number类型必须相同。
两个数据框架中的Date类型必须相同。
如果它们不匹配,你的.merge() 将无法成功。

英文:

You are merging a pair of dataframes.
Each has Check_Number and Date columns.

Now look at the .dtype of each dataframe.
The Date should be of datetime64[ns].
Presumably Check_Number should be an integer,
or perhaps "object" if you're modeling that as a string.

The Check_Number type in both dataframes must be identical.
The Date type in both dataframes must be identical.
If they don't match, your .merge() won't succeed.

答案2

得分: 0

谢谢大家。我找到了错误。我在加载其中一个csv文件时没有使用parse_dates。我修复了csv上传,原始的代码行就可以完美运行了。

英文:

Thank you all. I found the error. I had not used parse_dates when loading one of my csv files. I fixed the csv upload and the original line of code worked perfectly.

huangapple
  • 本文由 发表于 2023年6月16日 02:07:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/76484419.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定