Flink job becomes too busy with a small amount of data
Question
I am configuring a Flink job that should handle almost 1 million records per second. I started with the configuration below:
CPU: 4 cores
Memory: 2GB
Task Slots: 4
with only 30k logs per second. But my job still gets far too busy and shows a lot of backpressure. From what I have read, Flink can handle very large amounts of data, so there seems to be a contradiction here; I might have missed some configuration. If anybody can help me figure this out, it would be highly appreciated.
Thank you in advance.
I have tried increasing the memory and the parallelism, but it didn't work for me. I want to understand whether this result is expected with this configuration, or whether I should configure the job in some other way.
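For reference, resources like these are normally declared in flink-conf.yaml. A minimal sketch, where the key names are standard Flink options, the values mirror the setup above, and parallelism.default matching the slot count is an assumption:

```yaml
# Sketch: the setup above expressed as standard Flink configuration keys.
taskmanager.numberOfTaskSlots: 4       # task slots per TaskManager
taskmanager.memory.process.size: 2g    # total TaskManager process memory
parallelism.default: 4                 # assumed to match the slot count
```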
Answer 1
Score: 0
For a workflow reading from Kafka, doing broadcast-stream based enrichment, and writing to Hudi, I got a rate of about 13K records/sec/core. This was with optimizations such as using a faster serde for deserializing records from Kafka.
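A minimal sketch of that kind of pipeline, assuming hypothetical broker address, topic, and rule format (a print sink stands in for the Hudi sink, and the inline rules stream stands in for a real side input):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class EnrichmentJob {
    // Broadcast state holding the enrichment data, keyed by lookup key.
    static final MapStateDescriptor<String, String> RULES = new MapStateDescriptor<>(
        "rules", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // match the number of available cores

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("kafka:9092")  // hypothetical address
            .setTopics("logs")                  // hypothetical topic
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        DataStream<String> events =
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-logs");

        // Enrichment side input, broadcast to every parallel subtask.
        BroadcastStream<String> rules =
            env.fromElements("key=value").broadcast(RULES); // stand-in for a real rules source

        events.connect(rules)
            .process(new BroadcastProcessFunction<String, String, String>() {
                @Override
                public void processElement(String event, ReadOnlyContext ctx,
                                           Collector<String> out) throws Exception {
                    // May be null until the first broadcast element arrives.
                    String rule = ctx.getBroadcastState(RULES).get("key");
                    out.collect(event + "|" + rule); // enrich the event
                }

                @Override
                public void processBroadcastElement(String rule, Context ctx,
                                                    Collector<String> out) throws Exception {
                    String[] kv = rule.split("=", 2);
                    ctx.getBroadcastState(RULES).put(kv[0], kv[1]);
                }
            })
            .print(); // stand-in for the Hudi sink

        env.execute("broadcast-enrichment");
    }
}
```

The broadcast state is replicated to every parallel subtask, so each record can be enriched locally without a per-record external lookup.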
So with 4 cores, 30K records/second is in the right ballpark.
Note that increasing parallelism without increasing the number of cores available won't help, and typically hurts your throughput.
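Concretely, that means keeping the job parallelism in line with the cores you actually have:

```java
// With 4 cores and 4 task slots, parallelism 4 is the natural fit; raising it
// (e.g. to 8) only multiplexes more subtasks onto the same CPUs.
env.setParallelism(4);
```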