处理Flink广播流中的大数据

huangapple go评论89阅读模式
英文:

Process large data in flink broadcast stream

问题

我正在使用一个 Flink 流处理的 Java 应用程序,输入源是 Kafka。我的应用程序中总共有 4 个流。其中一个是主数据流,另外三个用作广播流。

流 A 是主要的流,它持续地从 Kafka 中获取数据。

流 B 是一个丰富数据的数据集。流 B 是流 C、流 D 和流 E 的组合流。它很大(流 C、流 D 和流 E 的大小都很大)。

流 C、流 D 和流 E 的对象类型是不同的。(例如,一个流的类型是 Employee,另一个类型是 AttendanceDetails,另一个是 SalaryDetails 等等)。

我使用了 Either 类型将这三个广播流连接起来。我将流 B 设置为广播流,并且能够在 Broadcast Process Function 的上下文状态中接收它(即在 processBroadcastElement() 中)。

我的问题是:

  1. 是否可以在广播状态中存储大量数据?

  2. 是否可以广播大量数据?

如果可以存储大量数据,那么在广播状态中可以存储多少数据(即数据大小),并且可以应用容错和 Flink 检查点?我的 Flink 系统内存和存储大小分别为:

       内存:8 GB
       磁盘大小:20-25 GB

如何为 Flink 中的广播状态配置内存大小?

注意:根据我的理解,Flink 广播状态在运行时存储在内存中(这意味着广播状态不会存储在 rocksdb 中),并且广播流被用作低吞吐量的事件流。由于当前运算符状态不支持 RocksDB 状态后端。

英文:

I am using a Flink streaming Java application with input source as Kafka. Totally 4 streams are used in my application. One is the main data stream and another 3 three are used for a broadcast stream.

Stream A is the main stream, it flows continuously from Kafka.

Stream B is a dataset of enrichment data. Stream B is a Combined stream of Stream C , Stream D, Stream E. It's a big one (All the 3 stream size is large).

Stream C, Stream D, Stream E streams Object type is different. (For example, one stream type is Employee, Another one type is AttendanceDetails, another one is SalaryDetails, etc...).

I was joined the three broadcast streams using Either type. I have broadcast as the Stream B and able to receive in Broadcast Process Function context state (i.e in processBroadcastElement() ).

My questions are,

  1. Is it possible to store large data in Broadcast state?

  2. Is it possible for Broadcast large data?

If possible for store large data means, how much data(i.e data size) can able to store in Broadcast state and can able to apply Fault tolerance and Flink checkpoints? My Flink system memory and storage size are:

       Memory: 8 GB
       Disk Size: 20-25 GB

How to configure memory size for the Broadcast state in Flink?

Note: As per my understanding, Flink Broadcast State is kept in memory at runtime (it mean broadcast state will not be stored at rocksdb) and the broadcast stream is used as a low-throughput event stream. Since currently, the RocksDB state backend is not available for the operator state.

答案1

得分: 1

广播状态的工作副本始终位于堆上,而不是RocksDB中。因此,它的大小必须足够小,适合放入内存。此外,每个实例都将所有广播状态复制到其检查点中,因此所有检查点和保存点都将有广播状态的_n_个副本(其中_n_是并行度)。

如果您能够对此数据进行键分区,那么您可能不需要进行广播。听起来这可能是按员工ID分组的每位员工的数据。但如果不能,那么您将不得不确保其大小足够小,适合放入内存。

英文:

The working copy of broadcast state is always on the heap; not in RocksDB. So, it has to be small enough to fit in memory. Furthermore, each instance will copy all of the broadcast state into its checkpoints, so all checkpoints and savepoints will have n copies of the broadcast state (where n is the parallelism).

If you are able to key partition this data, then you might not need to broadcast it. It sounds like it might be per-employee data that could be keyed by the employeeId. But if not, then you'll have to keep it small enough to fit into memory.

huangapple
  • 本文由 发表于 2020年7月24日 14:22:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/63067982.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定