Large File Processing - Resume from the point of failure

Question

We have to process large CSV files. We use Apache Camel to read the files from an SFTP location (but we are open to Java-based solutions if there are better approaches).

One of the requirements is to resume processing from the point of failure. That is, if an exception occurs while processing line 1000, we should resume from line 1000 rather than start from the beginning. We should also not process any record twice.

We are using Apache ActiveMQ to hold the records in queues and to manage the pipeline, but the initial loading of the file from the SFTP location can also fail.

To track the state, we use a database that Apache Camel updates at every step.

We are open to ideas and suggestions. Thanks in advance.

Answer 1

Score: 3

As far as I know, the Camel File component cannot resume from the point of failure.

Whether a failed file is moved away or reprocessed on the next attempt (from the beginning, though) depends on your configuration; see the moveFailed option.
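
For illustration, a consumer endpoint along these lines would park failed files in an error directory instead of retrying them forever (host, credentials, and directory names are placeholders):

```java
import org.apache.camel.builder.RouteBuilder;

public class ImportRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Placeholder host, credentials, and directories.
        from("sftp://user@sftp.example.com/inbound?password=secret"
                + "&move=.done"          // archive files that were processed successfully
                + "&moveFailed=.error")  // park failed files instead of retrying them
            .routeId("csv-import")
            .to("direct:split-lines");
    }
}
```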

To read a CSV file, you need to split it into individual lines. Because your files are big, you should use the streaming option of the Splitter. Otherwise, the whole file is read into memory before splitting!

To decrease the probability of failures, and therefore of reprocessing the whole file, you can simply send every single CSV line to ActiveMQ without parsing it. The simpler the splitter route, the lower the chance that a problem in a single record forces you to reprocess the whole file.
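
A minimal sketch of such a route, assuming the direct endpoint from the previous sketch and a queue named csv.lines:

```java
import org.apache.camel.builder.RouteBuilder;

public class SplitRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:split-lines")
            // streaming() avoids reading the whole file into memory
            .split(body().tokenize("\n")).streaming()
                // no CSV parsing here: raw lines go straight to the queue,
                // so a malformed record cannot break the file import
                .to("activemq:queue:csv.lines")
            .end();
    }
}
```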

A decoupled consumer of the queue can then parse and process the CSV records without affecting the file import. This way, you can handle errors for each individual record.
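
Such a consumer could look roughly like this (the queue names and the naive comma split are assumptions for illustration; use a proper CSV library or Camel's CSV data format in practice):

```java
import org.apache.camel.builder.RouteBuilder;

public class LineConsumerRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Records that keep failing go to a dead-letter queue
        // instead of blocking or aborting the rest of the import.
        errorHandler(deadLetterChannel("activemq:queue:csv.lines.dlq")
            .maximumRedeliveries(3));

        from("activemq:queue:csv.lines")
            .process(exchange -> {
                String line = exchange.getIn().getBody(String.class);
                // naive split, just for the sketch
                String[] fields = line.split(",");
                exchange.getIn().setBody(fields);
            })
            .to("direct:store-record");
    }
}
```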

If you nevertheless hit a failure during the file import, the file is reprocessed from the beginning. Therefore you should design your processing pipeline to be idempotent: for example, check whether a record already exists and, if it does, update it instead of blindly inserting every record.
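
With a relational database, for instance, an upsert keyed on a natural record ID makes a replayed insert harmless. A minimal sketch, assuming a records table with a record_id key and PostgreSQL's ON CONFLICT syntax:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class RecordStore {
    private final DataSource dataSource;

    public RecordStore(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Upsert keyed on record_id: processing the same line twice is harmless. */
    public void save(String recordId, String payload) throws SQLException {
        // PostgreSQL syntax; use MERGE or an equivalent on other databases.
        String sql = "INSERT INTO records (record_id, payload) VALUES (?, ?) "
                   + "ON CONFLICT (record_id) DO UPDATE SET payload = EXCLUDED.payload";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, recordId);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}
```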

In a messaging environment you have to deal with at-least-once delivery semantics. The only solution is to use idempotent components. Even if Camel tried to resume at the point of failure, it could not guarantee that every record is read exactly once.
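
Camel's Idempotent Consumer EIP implements exactly this deduplication on the consuming side. A sketch, assuming each message carries a recordId header; the in-memory repository is shown for brevity only, and a persistent (e.g. JDBC-backed) repository is needed to survive restarts:

```java
import org.apache.camel.builder.RouteBuilder;
// Camel 3.x package; in Camel 2.x the class lives in
// org.apache.camel.processor.idempotent
import org.apache.camel.support.processor.idempotent.MemoryIdempotentRepository;

public class DedupRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("activemq:queue:csv.lines")
            // skip any message whose recordId was already seen
            .idempotentConsumer(
                    header("recordId"),
                    MemoryIdempotentRepository.memoryIdempotentRepository())
                .to("direct:store-record")
            .end();
    }
}
```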
