I'm trying to build a data lake using GCS, an orchestrator, and Dataflow. I have Push and Pull APIs from ServiceNow. How should I build it?


I'm trying to build a data lake using GCS, an orchestrator, and Dataflow. I have Push and Pull APIs from ServiceNow. How should I build it? I may use BigQuery or GCS for the data lake. Can anyone help me with a detailed workflow?

I tried this model; I'm looking for architecture diagrams.

Answer 1

Score: 1


Storage: You can use Cloud Storage as the data lake. Cloud Storage is well suited to serve as the central storage repository for many reasons: performance and durability, strong consistency, cost efficiency, flexible processing, and security.

Data ingestion: You can use Pub/Sub and Dataflow.

You can ingest real-time data and store it directly in Cloud Storage, scaling in and out in response to data volume.
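For illustration, here is a minimal Apache Beam (Python) sketch of that ingestion path, assuming hypothetical project, topic, and bucket names. It reads ServiceNow events pushed to a Pub/Sub topic, groups them into fixed windows, and writes each window to Cloud Storage:

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def run():
    # Streaming pipeline options; the project, region, and bucket are placeholders.
    options = PipelineOptions(
        streaming=True,
        runner="DataflowRunner",
        project="my-project",                      # hypothetical project ID
        region="us-central1",
        temp_location="gs://my-lake-bucket/tmp",   # hypothetical bucket
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            # Each Pub/Sub message arrives as raw bytes.
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/servicenow-events")
            # Group the unbounded stream into five-minute windows so that
            # bounded files can be emitted.
            | "Window" >> beam.WindowInto(FixedWindows(300))
            # Write one newline-delimited file per window/shard into the lake.
            | "WriteToGCS" >> fileio.WriteToFiles(
                path="gs://my-lake-bucket/raw/servicenow",
                sink=lambda dest: fileio.TextSink(),
                shards=1,
            )
        )


if __name__ == "__main__":
    run()
```

The five-minute window size is only an example; you would tune it to your data volume, since it controls how often files land in the lake.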

Processing and analytics: After you have ingested and stored data, the next step is to make it available for analysis.

For instance, if you store incoming data in Avro format in Cloud Storage, you can do the following:

  • Use Hive on Dataproc to issue SQL queries against the data.
  • Issue queries directly against the data from BigQuery.
  • Load the data into BigQuery and then query it (see the sketch after this list).
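As a sketch of the third option, here is how loading and querying could look with the google-cloud-bigquery Python client; the bucket path, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Load Avro files from the lake into a BigQuery table. Avro carries its own
# schema, so no explicit schema definition is needed for the load job.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = client.load_table_from_uri(
    "gs://my-lake-bucket/raw/servicenow/*.avro",   # hypothetical path
    "my-project.lake.servicenow_raw",              # hypothetical table
    job_config=job_config,
)
load_job.result()  # block until the load completes

# Query the loaded table with standard SQL.
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-project.lake.servicenow_raw`")
for row in rows.result():
    print(row.n)
```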

Data mining and exploration:

Because a large portion of the data stored in the lake is not ready for immediate consumption, you must first mine this data for latent value.

For powerful SQL-based analysis, you can transform raw data with Dataprep by Trifacta and load it into BigQuery.

Design and deploy workflows:

Making a data subset more widely available means creating focused data marts. You can keep these data marts up to date by using orchestrated data pipelines that take raw data and transform it into a format that downstream processes and users can consume.
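A minimal Cloud Composer (Airflow) DAG sketch of such an orchestrated pipeline, assuming a pre-built Dataflow template and hypothetical bucket, dataset, and table names; it runs the transformation job and then refreshes the data mart once a day:

```python
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="servicenow_data_mart_refresh",
    schedule="@daily",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    # Launch a Dataflow job from a template that transforms the raw files
    # into a curated, analysis-ready layout.
    transform = DataflowTemplatedJobStartOperator(
        task_id="transform_raw_data",
        template="gs://my-lake-bucket/templates/servicenow_transform",  # hypothetical
        location="us-central1",
        parameters={
            "input": "gs://my-lake-bucket/raw/servicenow/*",
            "output": "gs://my-lake-bucket/curated/servicenow/",
        },
    )
    # Load the curated output into the BigQuery data mart, replacing the
    # previous day's contents.
    load = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="my-lake-bucket",
        source_objects=["curated/servicenow/*.avro"],
        destination_project_dataset_table="my-project.marts.servicenow",
        source_format="AVRO",
        write_disposition="WRITE_TRUNCATE",
    )
    transform >> load
```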

  • Transform the raw data and load it into BigQuery.

You use an extract, transform, and load (ETL) process to ingest data into a BigQuery data warehouse. You can then query the data by using SQL. Dataprep, a visual tool for cleansing and preparing data, is well suited for simple ETL jobs, while Dataflow with Apache Beam provides additional flexibility for more involved ETL jobs.
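As a sketch of that Dataflow/Beam ETL path, here is a batch pipeline with hypothetical paths, field names, and table IDs; it reads raw JSON lines from Cloud Storage, reshapes each record, and loads the result into BigQuery:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def to_row(line):
    """Parse one raw JSON record and keep only the fields the mart needs.

    The field names here are hypothetical stand-ins for ServiceNow fields.
    """
    rec = json.loads(line)
    return {
        "ticket_id": rec.get("sys_id"),
        "state": rec.get("state"),
        "updated_at": rec.get("sys_updated_on"),
    }


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                      # hypothetical project ID
        region="us-central1",
        temp_location="gs://my-lake-bucket/tmp",   # hypothetical bucket
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadRaw" >> beam.io.ReadFromText(
                "gs://my-lake-bucket/raw/servicenow/*")
            | "Transform" >> beam.Map(to_row)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:marts.servicenow_tickets",
                schema="ticket_id:STRING,state:STRING,updated_at:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()
```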

For more information, you can refer to this documentation.
