Snowflake: Recommendation how to integrate streaming data from Azure Event Hub

Question

I'm trying to figure out how best to integrate streaming data from Azure Event Hub into Snowflake, i.e. with low latency and as cost-efficiently as possible.

Brainstorming with myself I came up with 2 possibilities that both have some quite big disadvantages:

Option 1: Setting up "Event Hub Capture" to export Event Hub data to blob storage and import to Snowflake

This seems to be the most straightforward way. The data from Event Hub would be exported to Blob Storage automatically by enabling Event Hub Capture (https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview). The generated Avro files are then imported to Snowflake via Snowpipe (external tables, orchestration using Snowflake tasks).
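
As far as I understand it, the Snowflake side would then roughly need something like the following sketch (hypothetical object names, storage account and SAS token; the statements could be run once from Python via snowflake-connector-python, or just in a worksheet):

    # Sketch of the Snowflake-side setup for Option 1 (run once).
    # All names and credentials below are hypothetical placeholders.
    import snowflake.connector

    ddl_statements = [
        # Landing table for the raw Avro records (one VARIANT column).
        "CREATE TABLE IF NOT EXISTS RAW_EVENTS (RECORD VARIANT)",

        # External stage pointing at the container Event Hub Capture writes to.
        """CREATE STAGE IF NOT EXISTS EVENTHUB_CAPTURE_STAGE
             URL = 'azure://<storage-account>.blob.core.windows.net/<capture-container>'
             CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>')
             FILE_FORMAT = (TYPE = AVRO)""",

        # Task that loads newly captured files every minute; this 1-minute
        # schedule (plus Capture's 1-minute window) is where the ~2 minutes
        # of delay mentioned below comes from.
        """CREATE TASK IF NOT EXISTS LOAD_EVENTHUB_CAPTURE
             WAREHOUSE = LOAD_WH
             SCHEDULE = '1 MINUTE'
           AS
             COPY INTO RAW_EVENTS
             FROM @EVENTHUB_CAPTURE_STAGE
             FILE_FORMAT = (TYPE = AVRO)""",

        "ALTER TASK LOAD_EVENTHUB_CAPTURE RESUME",
    ]

    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        role="<role>", warehouse="LOAD_WH", database="<db>", schema="<schema>",
    )
    try:
        for stmt in ddl_statements:
            conn.cursor().execute(stmt)
    finally:
        conn.close()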

Disadvantages:

  • The minimum time window for export using Event Hub Capture is 1 minute. Snowflake tasks can also be scheduled to run every 1 minute. This means a potential max. delay of ca. 2 minutes (Event Hub -> Snowflake).
  • A Snowflake warehouse would be running basically 24/7 (and this would produce quite large costs in Snowflake).

Option 2: Creating an Azure Function that listens to Event Hub events and writes the data directly to Snowflake via Snowflake API

Azure function would listen to Event Hub events and write the event data directly to Snowflake via Snowflake API (basically: send SQL INSERTs to DB via Snowflake API).
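
A rough sketch of what I imagine such a function could look like in Python (v1 programming model, with the Event Hub trigger declared in function.json, assuming the events carry JSON payloads); the table, warehouse and database names are hypothetical, and credentials are assumed to come from application settings:

    # Sketch of an Event Hub-triggered Azure Function that inserts events into
    # Snowflake via snowflake-connector-python. Names are hypothetical.
    import os

    import azure.functions as func
    import snowflake.connector


    def main(event: func.EventHubEvent) -> None:
        payload = event.get_body().decode("utf-8")

        conn = snowflake.connector.connect(
            account=os.environ["SNOWFLAKE_ACCOUNT"],
            user=os.environ["SNOWFLAKE_USER"],
            password=os.environ["SNOWFLAKE_PASSWORD"],
            warehouse="INGEST_WH",      # hypothetical warehouse
            database="STREAMING_DB",    # hypothetical database
            schema="RAW",               # hypothetical schema
        )
        try:
            # "Send SQL INSERTs to DB": load the event payload into a VARIANT column.
            conn.cursor().execute(
                "INSERT INTO RAW_EVENTS (RECORD) SELECT PARSE_JSON(%s)",
                (payload,),
            )
        finally:
            conn.close()

(In practice one would probably batch events and reuse the connection instead of inserting row by row, but the warehouse cost issue below remains.)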

Advantages:

  • No blob storage required
  • "quite" low-latency

Disadvantages:

  • As above: a Snowflake warehouse would be running 24/7 (costs)
  • More development needed (Azure function)

Then I read the Snowflake documentation regarding streaming / Ingest SDK, so based on this:

Option 3: Using Snowflake Streaming (?)

Here I'm a bit confused, after reading the documentation. Disclaimer: I have no experience with Kafka or any other streaming technologies.

The Azure Event Hub documentation states:

> "Azure Event Hubs provides an Apache Kafka endpoint on an event hub,
> which enables users to connect to the event hub using the Kafka
> protocol."

Snowflake has a Snowflake Kafka connector. Can the Snowflake Kafka connector be used to connect to the Azure Event Hub to stream the data directly to Snowflake?
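
I have not tried this myself, but as far as I understand the quote above, the Event Hubs namespace simply behaves like a Kafka broker, so any Kafka client (and, presumably, the Kafka Connect worker hosting the Snowflake connector) would point at it with settings roughly like this sketch (hypothetical names; Python/confluent-kafka is used only to illustrate the endpoint and auth shape):

    # Sketch: connecting a plain Kafka client to the Event Hubs Kafka endpoint.
    # The Snowflake Kafka connector itself runs inside Kafka Connect, but the
    # bootstrap server and SASL settings would be the equivalent ones.
    from confluent_kafka import Consumer

    consumer = Consumer({
        # Event Hubs namespace exposed as a Kafka bootstrap server on port 9093.
        "bootstrap.servers": "<namespace>.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        # Event Hubs accepts the literal username "$ConnectionString" with the
        # namespace connection string as the password.
        "sasl.username": "$ConnectionString",
        "sasl.password": "<event-hubs-connection-string>",
        "group.id": "test-consumer",       # hypothetical consumer group
        "auto.offset.reset": "earliest",
    })

    consumer.subscribe(["<event-hub-name>"])   # the event hub appears as a Kafka topic
    msg = consumer.poll(10.0)
    if msg is not None and msg.error() is None:
        print(msg.value())
    consumer.close()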

There is also a "Snowflake Ingest SDK API" available for Java. Unfortunately I don't know Java. Is there any way to use this with any other language, like Python?

I would be happy to get some feedback / best practices / real-life experience on how best to implement streaming data from Azure Event Hub into Snowflake.

Answer 1

Score: 1

I have another thought in mind: I want to refine your 3rd approach.

If you want to achieve low latency, then Kafka could be the best choice for handling real-time events as soon as a new row arrives, once the data from Azure Event Hub is available in a Kafka topic. Instead of Java/Python, you can also try the Kafka Connector for Snowflake with the Snowpipe Streaming configuration setting in the Kafka connector config file:

    snowflake.ingestion.method: SNOWPIPE_STREAMING

In this approach, more effort goes into setting up the configuration, but it is only a one-time effort to set up your real-time streaming infrastructure.

Please check the linked Snowflake documentation. This might give you a quick start.
