2023年6月22日 04:12:55go评论67阅读模式

英文:

Difference between Stagining Area and ODS in Datawarehouse Modeling

问题

我正在进行一个BI项目，我有一个很大的困惑。

我的数据源分为两个类别：

包含一些表格的SQL脚本
一些Excel文件

这些数据应该位于一个单一的数据仓库中。

根据BI数据建模流程，我应该经过以下步骤：

数据源 => 分段区域 (Staging Area) => 操作数据存储 (ODS) => 数据仓库

对于我的情况，我没有使用ODS，而是按照以下方式操作：

对于SQL脚本中的表格，我创建了一个名为"SA"的新的空数据库连接，然后在Oracle SQL Developer中执行了SQL脚本，生成了源表格在我的分段区域中。当然，我没有进行任何转换或其他操作。

然后，我在Talend中做了一个作业，加载数据仓库的每个维度，我将分段区域的表格作为我的源，进行了一些转换，创建了派生列等... 然后将转换后的数据加载到我的维度中。

对于Excel文件，我做了两个Talend作业。
- 第一个作业，将Excel文件加载到分段区域（没有任何转换）
- 然后，我在我的第二个作业中使用分段区域加载的数据作为源，进行一些转换和清理。

我想知道我是否一切都做得正确，是否遵循最佳实践，因为我感觉自己从未理解应该在ODS和分段区域中进行的具体转换，以及ODS是否是强制性的，以及我何时应该使用它？

英文:

I'm working on a BI project and I have a big confusion.

My data source is split into two categories:

An SQL Script containing some tables
Some Excel files

These datas should be in a single datawarehouse

According to the BI Data Modeling Pipeling , I should pass by:

Data Source => Staging Area => ODS => Datawarehouse

For my case , I didn't use ODS and I did as following:

For the tables in the SQL script , I created a new empty database connection called "SA" in which I then
executed the SQL script in the Oracle SQL developer which generated the source tables in my staging area.
Of course , I didn't do any transformation or anything.

Then I did one job in Talend , to load each dimension of the datawarehouse , I put the Staging Area tables
as my sources , did some transformations , created derived columns ect... and loaded the transformed data
in my dimensions.

For the Excel files , I did two Talend jobs.
- First One , to load Excel File into ==> Staging Area (without any transfomration)
- Then , The Stagining I loaded will be the source in my second job where I'll be performing some transformation and cleaning

I was wondering if I'm doing everything right and following best practices because I feel like I never understodd about the specific transformation that should be done in ODS & Staging Area.
And if ODS is mandatory or not , like when do I exactly use it?

答案1

得分: 0

在BI数据建模管道中，典型的流程确实是数据源 -> 阶段性区域 -> 运营数据存储 (ODS) -> 数据仓库。然而，此管道的具体实现可以根据项目的需求和复杂性而变化。

看起来您采取了一种简化的方法，将阶段性区域和ODS合并为一个单一的阶段性区域。这在较小的项目或数据不需要在ODS层进行广泛转换时并不罕见。通过直接将SQL脚本表加载到阶段性区域并在那里执行转换，您有效地绕过了ODS层。

关于您处理Excel文件的方法，通常最佳实践是将数据加载到阶段性区域与转换过程分开。这允许更好地控制和重复使用数据加载步骤。首先将Excel文件加载到阶段性区域，然后在一个单独的作业中执行转换，这是一个合理的流程。

虽然ODS层不是强制性的，但它充当了阶段性区域和数据仓库之间的中间层。其目的是在将数据加载到数据仓库之前，整合和精炼来自不同来源的数据。ODS层可以根据需要执行附加的数据清洗、数据验证、合并和聚合。它充当了精炼数据的阶段性区域，然后进一步转换并加载到数据仓库中。

在您的情况下，由于您直接从阶段性区域转换数据并加载到维度中，您跳过了ODS层。如果您的转换和数据精炼过程很少并且可以在阶段性区域内有效执行，这种方法是可以接受的。

然而，在更复杂的情况下或者当涉及多个数据源时，加入ODS层可以提供额外的好处，例如更全面的数据整合、合并和验证，从而确保在将数据加载到数据仓库之前确保数据质量和一致性。

英文:

In the BI Data Modeling Pipeline, the typical flow is indeed Data Source -> Staging Area -> Operational Data Store (ODS) -> Data Warehouse. However, the specific implementation of this pipeline can vary depending on the requirements and complexity of your project.

It seems like you have taken a simplified approach by combining the Staging Area and ODS into a single staging area. This is not uncommon, especially in smaller projects or when the data does not require extensive transformations in the ODS layer. By directly loading the SQL script tables into the staging area and performing transformations there, you have effectively bypassed the ODS layer.

Regarding your approach for the Excel files, it is generally a good practice to separate the loading of data into the staging area from the transformation process. This allows for better control and reusability of the data loading step. By first loading the Excel files into the staging area and then performing transformations in a separate job, you have followed a logical flow.

While the ODS layer is not mandatory, it serves as an intermediate layer between the staging area and the data warehouse. Its purpose is to integrate and refine data from different sources before loading it into the data warehouse. The ODS layer can perform additional data cleansing, data validation, consolidation, and aggregation as required. It acts as a staging area for refined data before it is further transformed and loaded into the data warehouse.

In your case, since you have directly transformed the data from the staging area and loaded it into your dimensions, you have skipped the ODS layer. This approach can be acceptable if your transformations and data refinement processes are minimal and can be efficiently performed within the staging area itself.

However, in more complex scenarios or when dealing with multiple data sources, incorporating an ODS layer can provide additional benefits as allowing more comprehensive data integration, consolidation, and validation, enabling you to ensure data quality and consistency before loading it into the data warehouse.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Difference between Stagining Area and ODS in Datawarehouse Modeling

问题

答案1

在Django中未通过外键关系获取特定对象。

How to use ENV vars in golang to pass DB connection values dynamically when deployed in kubernetes

为什么在 Hibernate 的 @OneToMany 注释的一侧需要一个辅助方法？

MariaDB错误 4161 (HY000): 未知的数据类型

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论