Kotlin retry scheduled cron task if failed

Question

I have a Kotlin scheduling config file below. It has an important task scheduled to run at 11am each Monday.

What do I need to do for building resiliency or retry attempts in case the service is down at 11am?

Can these Spring Boot and Kotlin @Scheduled jobs be configured for enterprise level resiliency or do I need to look to use something like Kubernetes CronJobs to achieve this?

I am also looking into Spring Boot Quartz scheduler with a JobStore as an option. Any alternative setup suggestions are welcome.

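import org.slf4j.LoggerFactory
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

// Note: @Scheduled only fires if scheduling is enabled (@EnableScheduling on a configuration class).
@Component
class CronConfig {

    private val logger = LoggerFactory.getLogger(CronConfig::class.java)

    // Run every Monday at 11:00 (Spring cron fields: second minute hour day-of-month month day-of-week)
    @Scheduled(cron = "0 0 11 * * MON")
    fun doSomething() {
        logger.info("Doing something")
    }
}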

Answer 1

Score: 3

It's good that you're thinking about what might go wrong. (Too often we developers assume everything will go right, and don't consider and handle all the ways our code could fail!)

Unfortunately, I don't think there's a standard practice for this; the right approach probably depends on your exact situation.

Perhaps the simplest approach is just to ensure that your function cannot fail, by doing error-handling, and if needed waiting and retrying, within it. You could split the actual processing out to a separate method if that makes it more readable, e.g.:

@Scheduled(cron = "0 0 11 * * MON")
fun doSomethingDriver() {
    while (true) { // Keep trying until successful…
        try {
            doSomething()
            return // It worked!
        } catch (x: Exception) {
            logger.error("Can't doSomething: {}.  Will retry…", x.message)
            TimeUnit.SECONDS.sleep(10L)
        }
    }
}

fun doSomething() {
    logger.info("Doing something")
    // …
}

That's pretty straightforward.  One disadvantage is that it keeps the thread waiting between retries; since Spring uses a single-threaded scheduler by default (see these questions), that means it could delay any other scheduled jobs.
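One variation on the loop above is to cap the number of attempts and lengthen the wait each time, so a long outage does not hold the scheduler thread forever. The following is only a sketch under assumptions: `retryWithBackoff`, its parameters, and the doubling delay are illustrative names and choices, not part of Spring or of the original answer.

```kotlin
/**
 * Runs [action], retrying on failure up to [maxAttempts] times in total.
 * The wait starts at [initialDelayMillis] and doubles after every failed attempt.
 * [sleep] is injectable so the helper can be exercised without real waiting.
 */
fun <T> retryWithBackoff(
    maxAttempts: Int,
    initialDelayMillis: Long,
    sleep: (Long) -> Unit = { Thread.sleep(it) },
    action: () -> T,
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delay = initialDelayMillis
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return action() // success: stop retrying
        } catch (e: Exception) {
            lastError = e
            if (attempt < maxAttempts) {
                println("Attempt $attempt failed (${e.message}); retrying in $delay ms")
                sleep(delay)
                delay *= 2
            }
        }
    }
    // Every attempt failed: surface the last error to the caller.
    throw lastError!!
}
```

Inside the scheduled method this could be called as `retryWithBackoff(maxAttempts = 5, initialDelayMillis = 10_000) { doSomething() }`. The thread is still blocked while waiting, so the disadvantage described above still applies; the cap only bounds how long.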

Alternatively, if your scheduled function doesn't keep retrying, then you'll need some other way to trigger a retry.

You could ‘poll’: store the time of the last successful run, change the scheduled function to run much more frequently, and have it check whether another run is needed (i.e. whether there's been no successful run since the last 11am Monday). This will be more complex — especially as it needs to maintain state and do date/time processing. (You shouldn't need to worry about concurrency, though, unless you've made the function @Async or set up your own scheduling config.) It's also a little less efficient, due to all the extra scheduled wake-ups.
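The date/time part of the polling idea might look roughly like this. It is a sketch, assuming the last successful run time is stored somewhere as a `LocalDateTime` in the server's time zone; `mostRecentSlot` and `isRunDue` are hypothetical helper names.

```kotlin
import java.time.DayOfWeek
import java.time.LocalDateTime
import java.time.LocalTime
import java.time.temporal.TemporalAdjusters

/** Returns the latest "Monday 11:00" that is not after [now]. */
fun mostRecentSlot(now: LocalDateTime): LocalDateTime {
    val slotThisWeek = now.toLocalDate()
        .with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY))
        .atTime(LocalTime.of(11, 0))
    // On a Monday before 11:00 this week's slot is still in the future, so use last week's.
    return if (slotThisWeek.isAfter(now)) slotThisWeek.minusWeeks(1) else slotThisWeek
}

/** True if there has been no successful run since the most recent Monday 11:00 slot. */
fun isRunDue(lastSuccess: LocalDateTime?, now: LocalDateTime): Boolean =
    lastSuccess == null || lastSuccess.isBefore(mostRecentSlot(now))
```

The frequently scheduled function would then call `isRunDue(...)` on each wake-up, run the task only when it returns true, and record the new success time afterwards.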

Or you could trap errors (like the code above) but instead of waiting and retrying, manually schedule a retry for a future time, e.g. using your own TaskExecutor. This would also be more complex.
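A rough sketch of that idea follows. Note the swap: it uses the JDK's `ScheduledExecutorService` rather than a Spring `TaskExecutor`, because a plain executor runs tasks immediately and has no notion of "later"; in a Spring application a `TaskScheduler` bean would presumably fill the same role. `RetryingJob` and its parameters are made-up names for illustration.

```kotlin
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.TimeUnit

/**
 * Runs [action]; on failure, schedules another attempt after [retryDelayMillis]
 * instead of sleeping, so the calling thread is released straight away.
 */
class RetryingJob(
    private val scheduler: ScheduledExecutorService,
    private val retryDelayMillis: Long,
    private val maxAttempts: Int,
    private val action: () -> Unit,
) {
    fun run(attempt: Int = 1) {
        try {
            action()
        } catch (e: Exception) {
            if (attempt >= maxAttempts) {
                println("Giving up after $attempt attempts: ${e.message}")
                return
            }
            println("Attempt $attempt failed (${e.message}); retry scheduled in $retryDelayMillis ms")
            // Explicit Runnable avoids ambiguity with the Callable overload of schedule().
            scheduler.schedule(Runnable { run(attempt + 1) }, retryDelayMillis, TimeUnit.MILLISECONDS)
        }
    }
}
```

The `@Scheduled` method would simply call `run()` and return. The trade-off is that pending retries live only in memory, so they are lost if the service restarts before they fire.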

Answer 2

Score: 1

If you want to ensure you never miss that Monday task, my experience is that only some of the ways your solution can fail are caused by the code itself. Confining the solution to try/catch/retry misses wider causes, e.g.:

  • the running environment runs out of some resource (disk, memory) and the service is not alive at the time the cron schedule tries to run (Kubernetes will generally help minimise these cases, I grant you)
  • someone chooses just that special time to deploy a new container, so the cron schedule is missed
  • you later evolve the service to run more than one instance of the process, so you get multiple executions

As great as Kubernetes is, once the Pod restarts, you typically lose the log files so you cannot easily know what happened and whether your important process ran.

For these cases I suggest two approaches. (One of these matches @gidds's suggestions.)

1. Maintain state outside the application in a trusted backing store.

The application keeps the @Scheduled method to run at the nominated time, but on startup it also looks up a nextRunAt datetime in the external store. If a run has been missed, it is easy for both the process and humans to know and take action. You can have Spring call a method on startup in this way:

@Bean
fun startUp() = CommandLineRunner {
    ...
}

Of course, the process needs to update the nextRunAt.
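The check itself could be sketched as below, to be called from both the startup hook and the `@Scheduled` method. Everything here is an assumption made for illustration: `RunStateStore` stands in for whatever trusted external store you use (the in-memory version exists only so the sketch is self-contained), and `nextSlotAfter` and `runIfDue` are hypothetical names. The choice to record a first nextRunAt without running when the store is empty is also just one possible policy.

```kotlin
import java.time.DayOfWeek
import java.time.LocalDateTime
import java.time.LocalTime
import java.time.temporal.TemporalAdjusters

/** Hypothetical abstraction over the external store (database row, key-value entry, ...). */
interface RunStateStore {
    fun loadNextRunAt(): LocalDateTime?
    fun saveNextRunAt(value: LocalDateTime)
}

/** In-memory stand-in, for demonstration only; it does not survive a restart. */
class InMemoryRunStateStore(private var nextRunAt: LocalDateTime? = null) : RunStateStore {
    override fun loadNextRunAt(): LocalDateTime? = nextRunAt
    override fun saveNextRunAt(value: LocalDateTime) { nextRunAt = value }
}

/** First "Monday 11:00" strictly after [t]. */
fun nextSlotAfter(t: LocalDateTime): LocalDateTime {
    val candidate = t.toLocalDate()
        .with(TemporalAdjusters.nextOrSame(DayOfWeek.MONDAY))
        .atTime(LocalTime.of(11, 0))
    return if (candidate.isAfter(t)) candidate else candidate.plusWeeks(1)
}

/** Runs [task] if nextRunAt has passed, then advances nextRunAt. Returns true if the task ran. */
fun runIfDue(store: RunStateStore, now: LocalDateTime, task: () -> Unit): Boolean {
    val nextRunAt = store.loadNextRunAt()
    if (nextRunAt == null) {
        // Assumed policy: on first ever start, just record the next slot and do not run.
        store.saveNextRunAt(nextSlotAfter(now))
        return false
    }
    if (now.isBefore(nextRunAt)) return false
    task() // if this throws, nextRunAt is left unchanged, so the run still counts as missed
    store.saveNextRunAt(nextSlotAfter(now))
    return true
}
```

Because nextRunAt is only advanced after the task completes, a crash halfway through leaves the run visibly outstanding for the next startup.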

2. Use a messaging system and simple scheduler

This more complex solution depends on what other infrastructure you also have in your mix. If you have a resilient message queuing system and use transactional messaging correctly, a "command" message is placed on a Queue at the run time§. One or more worker nodes subscribe to this Queue. The first to acquire the message will process it, and that worker needs to properly acknowledge the message as processed. If that worker does not, e.g. if the worker's processing thread dies, or the whole JVM/etc. dies, then the Queue Manager will offer the message to another subscriber after a suitable timeout (you need to manage that timeout carefully so you don't get a double execution just because the process is still running). This approach works even if you only ever intend to have one worker: as soon as it comes back online, the message is there for it and the process will run.
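Since a queue may redeliver a message, the worker side is usually safer if handling is idempotent. The sketch below is not tied to any real queue client. It assumes, unlike the blank message mentioned later in this answer, that the command carries the date of the run it stands for, and that the return value of `handle` is what the caller translates into an acknowledgement. `CommandMessage` and `IdempotentWorker` are hypothetical names.

```kotlin
import java.time.LocalDate

/** Assumed message shape: the command identifies which scheduled run it represents. */
data class CommandMessage(val runDate: LocalDate)

/**
 * Processes each run date at most once. Returns true when the message can be
 * acknowledged, false when it should be left on the queue for redelivery.
 * The completed set is in memory and not thread-safe here; with several workers
 * it would need to live in a shared store with an atomic "claim" operation.
 */
class IdempotentWorker(private val task: (LocalDate) -> Unit) {
    private val completed = mutableSetOf<LocalDate>()

    fun handle(message: CommandMessage): Boolean {
        if (message.runDate in completed) return true // duplicate delivery: acknowledge, do nothing
        return try {
            task(message.runDate)
            completed.add(message.runDate)
            true
        } catch (e: Exception) {
            println("Run for ${message.runDate} failed: ${e.message}")
            false // no acknowledgement, so the queue can offer the message again
        }
    }
}
```

With this shape, a redelivery caused by a too-short timeout is acknowledged without running the task a second time, provided the first run has already been recorded as complete.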

Most Queue Managers will have a management interface where you can see if there is a message waiting.

§ Of course, you still need a process to place the message on the queue at the right time. The Queue approach gives you a very resilient processing solution, BUT there is still a single point of failure: the scheduler. So this part of the design should use the simplest technology you can get hold of, one you can reason has a very low chance of failure.

That "command" message can be just a blank message in a particular Queue; that's enough. Most Queue systems have an HTTP entry point to create a simple message, so you can imagine:

  1. a Kubernetes CronJob (the Kube people have made this reliable)
  2. that calls a shell script (easy to reason this won't fail)
  3. that uses curl to publish a message on a Queue over HTTP (this too should be simple enough to be sure it won't fail)
  4. The Queue system won't lose your message - that's its job!

huangapple
  • Posted on June 1, 2023, 22:19:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76382898.html