Ideal Dataset Structure for Time Series Forecasting

Question

I'm trying to do time series forecasting in Python.

Before I start, I have some doubts about how to prepare the source dataset.

I just want to understand how the data should be structured.

Let's say I have several departments, and each department has multiple teams. I want to do time series forecasting on total sales by department.

I can prepare the data in either of the following ways:

[Image not preserved: the two candidate dataset layouts, Option 1 and Option 2]

Most of the tutorials I have seen online use Option 2, but I prefer Option 1.

If new departments are added in the future, with Option 1 they can simply be added as rows, whereas with Option 2 I would need to add more and more columns each time.

My questions are:

  1. Can I use the structure in Option-1 for preparing my dataset?

  2. If yes: in the Date column, 1st June has 3 records, one for each team in a department. Is there any requirement that a given date should appear in only one row?

  3. In Option 1, let's say I want to predict total sales by department. Will having an additional column such as Team Name have any impact when preparing models for time series forecasting?

I would be really glad if someone could help. Thanks in advance.

Answer 1

Score: 0

When making a forecast, how you prepare your data depends on the question you are trying to answer (don't get me wrong, I'm not saying you should manipulate your preparation to get the answers you want). You say, "I want to time series forecasting on Total Sales By each department". This implies you don't care about the teams within a department. In that case Option 1 is not ideal as the direct input to a model, because to get the total sales of any department you first have to calculate it, instead of simply reading the value you need.

However, it is very common for source data to be stored at a more detailed level than the one at which you are going to use it. The key takeaway here is that you are going to read this data with Python. Aggregating the data to the level you need should be done in Python, and it is absolutely fine to store it in more detail in, for example, a .csv file.
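As a rough illustration of that idea, here is a minimal sketch with pandas. The column names (`date`, `department`, `team`, `sales`) and all values are made up for the example, since the original layout image is not available; an in-memory string stands in for the .csv file.

```python
import io

import pandas as pd

# Hypothetical detailed data: one row per date, department, and team.
# Column names and numbers are assumptions for illustration only.
csv_text = """date,department,team,sales
2023-06-01,Dept A,Team 1,100
2023-06-01,Dept A,Team 2,150
2023-06-01,Dept A,Team 3,50
2023-06-01,Dept B,Team 4,200
2023-06-02,Dept A,Team 1,110
2023-06-02,Dept A,Team 2,140
2023-06-02,Dept A,Team 3,60
2023-06-02,Dept B,Team 4,210
"""

# With a real file this would be pd.read_csv("your_file.csv", parse_dates=["date"]).
detailed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Collapse the team level: one total per date and department.
dept_sales = detailed.groupby(["date", "department"], as_index=False)["sales"].sum()

print(dept_sales)
```

The stored data keeps the team detail, while `dept_sales` holds only the level the forecast needs.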

To answer your questions:

  1. Yes, you can definitely use Option 1 to store your data; it would also be my preferred way.
  2. There is no limit on how many columns, rows, or duplicates you can have in your stored data. Moreover, the more detailed your data is (the more columns it has), the more rows will probably share the same date value.
  3. If you only intend to distinguish by department and not by team, you can, for example, use the pandas library to aggregate your data by department after you read it. There is no use in keeping the detailed team information at that point.
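To make points 2 and 3 concrete, the sketch below starts from a row-per-team table and reshapes it so that each date appears once, with one column per department. I am assuming that this wide shape is roughly what Option 2 in the tutorials looks like; the column names and values are again hypothetical.

```python
import pandas as pd

# Hypothetical row-per-team data; names and numbers are illustrative only.
detailed = pd.DataFrame(
    {
        "date": pd.to_datetime(["2023-06-01"] * 4 + ["2023-06-02"] * 4),
        "department": ["Dept A", "Dept A", "Dept A", "Dept B"] * 2,
        "team": ["Team 1", "Team 2", "Team 3", "Team 4"] * 2,
        "sales": [100, 150, 50, 200, 110, 140, 60, 210],
    }
)

# Dates repeat in the stored table (once per team), which is fine for storage.
has_repeated_dates = bool(detailed["date"].duplicated().any())

# Sum over teams and spread departments into columns.
wide = detailed.pivot_table(
    index="date", columns="department", values="sales", aggfunc="sum"
)

# After reshaping, each date appears exactly once in the index,
# and each column is the sales series of one department.
dept_a_series = wide["Dept A"]

print(has_repeated_dates)
print(wide)
```

A new department added as rows in the stored table would show up as a new column in `wide` without any change to the code, which is the maintenance benefit the question describes.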

You have good questions, but it is difficult to give a clear and complete answer to all of them. My advice would be to get some kind of result as quickly as possible while being clear about the choices you make along the way. Then, once you have a result, you can fine-tune and review your earlier decisions. No forecasting model is ever perfect, or done in one try.

huangapple
  • Posted on June 8, 2023 at 13:32:42
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76428866.html