Can a checkpoint complete in Flink while a task is down?


Question

I am using Flink with five task managers, each with a single task slot.

The source is Kafka; after reading and processing the data, it is written to a DynamoDB sink.

Note that I chain all the operators, so source, process, and sink are combined into a single operator chain.

I also enable checkpointing with exactly-once semantics and a 5-minute interval (a configuration sketch follows below).
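For reference, a minimal sketch of such a setup using the DataStream API; the class name and job name are placeholders I introduced, not from the post:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToDynamoJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once checkpoints every 5 minutes, as described above.
        env.enableCheckpointing(5 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        // Operator chaining is enabled by default: a source -> process -> sink
        // pipeline with uniform parallelism collapses into one task per slot.
        // env.disableOperatorChaining(); // uncomment to break the chain apart

        // ... build the Kafka source -> process -> DynamoDB sink pipeline here ...

        env.execute("kafka-to-dynamodb"); // job name is a placeholder
    }
}
```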

In this setup, the Kafka cluster went through a rolling update and the group coordinator changed.

During the update, Flink logged warnings that committing offsets back to the Kafka source cluster failed.

This leads to the following questions.

  1. Is it possible for a checkpoint to complete successfully in this situation? The Flink dashboard shows no failed checkpoints.

  2. If so, could it be that a task was down while its task manager stayed up, allowing the checkpoint to complete? (i.e. the task failure was trivial enough not to affect checkpointing)

  3. With the operators chained together, is data loss possible even with checkpointing? Since the operators are combined from source to sink, when the source is affected by the external Kafka cluster, the sink is affected too because of the chaining. If the task then goes down, is the in-flight data discarded?


Answer 1 (score: 1)


  1. Is it possible for a checkpoint to complete successfully? The Flink dashboard shows no failed checkpoints.

Yes. The message you saw appears when the Kafka consumer used by Flink tries to commit offsets back to the Kafka cluster, but those commits are not used by checkpointing. Flink saves offsets in its own snapshots and uses them when restoring from a checkpoint or savepoint, so it does not depend on the offsets committed to the Kafka cluster. A sketch of this distinction follows below.
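For illustration only, a minimal sketch using the newer KafkaSource connector; the broker address, topic, and group id are made-up placeholders, not from the original post:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")   // placeholder address
        .setTopics("events")                  // placeholder topic
        .setGroupId("flink-consumer")         // placeholder group id
        // Offsets committed to Kafka are only consulted on a fresh start; a
        // restore from a checkpoint/savepoint uses the offsets in Flink state.
        .setStartingOffsets(OffsetsInitializer.committedOffsets())
        // Committing back to Kafka is best-effort bookkeeping; a failed commit
        // (e.g. during a broker rolling update) does not fail the checkpoint.
        .setProperty("commit.offsets.on.checkpoint", "true")
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();
```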

  2. If so, could it be that a task was down but the task manager was not, allowing the checkpoint to complete [snip]

I assume you're asking whether the job can be in a failed state while the Task Manager is still running. Yes, but that has nothing to do with your first question (it's not why a checkpoint could complete even though you see errors).

  3. With the operators chained together, is data loss possible even with checkpointing? Since the operators are combined from source to sink, when the source is affected by the external Kafka cluster, the sink is affected too; if the task then goes down, is the in-flight data discarded?

You always "lose" in-flight data when a job fails. But because Flink saves the offsets for all Kafka partitions as part of the last successful checkpoint, when the job is restarted from that checkpoint, you will wind up replaying all of the in-flight data. So there is no data loss.
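To make the replay behavior concrete, a minimal sketch assuming Flink 1.15+ (for the externalized-checkpoint API); the restart count and delay are arbitrary placeholder values:

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Inside your job setup:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// On task failure, restart the job automatically; the Kafka source rewinds to
// the offsets stored in the last successful checkpoint, so records that were
// in flight between that checkpoint and the failure are simply re-read.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

// Retain the latest completed checkpoint even on cancellation, so the job can
// also be restarted from it manually.
env.getCheckpointConfig()
   .setExternalizedCheckpointCleanup(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
```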
