Uber Cadence - Timer started

Question

I am new to the Uber Cadence framework and am currently working on a workflow management project that uses Cadence. I am seeing strange behavior: every workflow has a timer of ~270 hours as part of its flow, and I am not sure how that number is calculated or where the timer comes from.

The other issue is that once the timer fires, the workflows fail (they are not terminated) with an UNHANDLED_DECISION error. The exception keeps being thrown and spams the logs. Here is the stack trace:

    com.uber.cadence.internal.replay.NonDeterminisicWorkflowError: Unknown DecisionId{decisionTarget=TIMER, decisionEventId=11}. The possible causes are a nondeterministic workflow definition code or an incompatible change in the workflow definition.
        at com.uber.cadence.internal.replay.DecisionsHelper.getDecision(DecisionsHelper.java:733)
        at com.uber.cadence.internal.replay.DecisionsHelper.handleTimerStarted(DecisionsHelper.java:451)
        at com.uber.cadence.internal.replay.ReplayDecider.processEvent(ReplayDecider.java:229)
        at com.uber.cadence.internal.replay.ReplayDecider.decideImpl(ReplayDecider.java:452)
        at com.uber.cadence.internal.replay.ReplayDecider.decide(ReplayDecider.java:385)
        at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.processDecision(ReplayDecisionTaskHandler.java:145)
        at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.handleDecisionTaskImpl(ReplayDecisionTaskHandler.java:125)
        at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.handleDecisionTask(ReplayDecisionTaskHandler.java:86)
        at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:257)
        at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:229)
        at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

Can somebody explain what is happening here and what this timer is? Is there a way to handle this error gracefully and avoid spamming the logs? There are thousands of workflows like this in the test environment; is there a way to terminate them all using Cadence Web or some other tool? Thanks in advance.

Edit:
I have two code blocks where I use LocalDateTime and Workflow.sleep for waiting.

  1. If the workflow startTime is in the future, calculate the wait time and then sleep:

    if (LocalDateTime.now(CampaignConstants.ZONE_ID_UTC).isBefore(myWorkflow.getStartDateTime())) {
        Duration waitDuration = Duration.between(LocalDateTime.now(), myWorkflow.getStartDateTime());
        Workflow.sleep(waitDuration);
    }

  2. If the workflow step is to wait for a specified time period, call Workflow.sleep with the specified time:

    Integer waitPeriod = Integer.parseInt((String) props.get("waitPeriod"));
    ChronoUnit chronoUnit = ChronoUnit.valueOf((String) props.get("waitPeriodType"));
    Workflow.sleep(Duration.of(waitPeriod, chronoUnit));

Is this the right way to implement this? It does not seem so; what is the proper way to implement this functionality? Thanks.

Answer 1

Score: 1

Not a Cadence expert, but since you also tagged this with "temporal-workflow", I can give it a shot :)

The TimerStarted -> TimerFired events seem to come from your workflow code. Check whether your code calls workflow.Sleep (Workflow.sleep in the Java client), or creates a timer and waits for it to complete.

After the timer fires, you have a decision task that times out on the ScheduleToStart timeout, meaning the task was placed on a task queue but was not picked up by any of your workers.

This task is then placed again on the global ("normal") task queue partition, and most likely another one of your workers picked it up (check the identity field on your WorkflowTaskStarted events). That worker did not have the execution history in its in-memory cache, so it had to pull the whole history from the service and replay the workflow internally, which led to the non-deterministic error. I would check your code to see whether you are using the system clock to calculate sleep durations, or doing anything else non-deterministic in workflow code. If you can share your code, I could take a look. Hope this helps.
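To make the replay problem concrete, here is a simplified model in plain Java. It is not Cadence's actual replay code: the `decide` method and the command strings are hypothetical stand-ins for workflow code that starts a timer only when the current time is before the start time. It only illustrates why reading the wall clock can make a replay produce different commands than the recorded history.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class ReplayMismatchSketch {

    // Hypothetical stand-in for workflow code: emits a timer command
    // only when 'now' is before 'start', like the snippet in the question.
    static List<String> decide(Instant now, Instant start) {
        List<String> commands = new ArrayList<>();
        if (now.isBefore(start)) {
            long seconds = Duration.between(now, start).getSeconds();
            commands.add("StartTimer(" + seconds + "s)");
        }
        return commands;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2023-03-10T00:00:00Z");
        // First execution happens one hour before the start time.
        Instant firstRun = Instant.parse("2023-03-09T23:00:00Z");
        // Replay happens on another worker, after the timer has fired.
        Instant replayWallClock = Instant.parse("2023-03-10T00:05:00Z");

        // Commands produced by the first execution (what ends up in history).
        List<String> recorded = decide(firstRun, start);

        // Replay that reads the wall clock: the branch is skipped, no timer command.
        List<String> replayWithWallClock = decide(replayWallClock, start);

        // Replay that reuses the time from the first execution: same commands.
        List<String> replayWithRecordedTime = decide(firstRun, start);

        System.out.println("recorded:              " + recorded);
        System.out.println("replay (wall clock):   " + replayWithWallClock
                + " matches=" + recorded.equals(replayWithWallClock));
        System.out.println("replay (recorded time): " + replayWithRecordedTime
                + " matches=" + recorded.equals(replayWithRecordedTime));
    }
}
```

In this model the wall-clock replay produces no timer command, so the timer recorded in history has nothing to match against. That is the same kind of mismatch the "Unknown DecisionId{decisionTarget=TIMER, ...}" message points at.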

Answer 2

Score: 1

Your code is nondeterministic because it uses LocalDateTime.now(CampaignConstants.ZONE_ID_UTC) instead of Workflow.currentTimeMillis. By design, the workflow doesn't fail; it is paused (by retrying the workflow task, which is what you see in the logs).

The first workflow task timeout (SCHEDULE_TO_START) is caused by dispatching the workflow task to a sticky task queue (used for workflow caching). This issue is already fixed in Temporal.
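As a rough sketch of the first point, the wait could be computed from a single time reading that is passed in as epoch milliseconds. The class and method names below are hypothetical. In real workflow code the value is assumed to come from Workflow.currentTimeMillis(), and the result would be handed to Workflow.sleep. The sketch also uses one UTC reading for both the comparison and the duration, whereas the snippet in the question reads the clock twice, once with a zone and once without.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class DeterministicWaitSketch {

    // workflowNowMillis: assumed to be the value of Workflow.currentTimeMillis()
    // when called from workflow code; here it is a plain parameter so the
    // sketch runs without the Cadence client on the classpath.
    static Duration waitUntilStart(long workflowNowMillis, LocalDateTime startUtc) {
        LocalDateTime nowUtc =
                LocalDateTime.ofInstant(Instant.ofEpochMilli(workflowNowMillis), ZoneOffset.UTC);
        Duration wait = Duration.between(nowUtc, startUtc);
        // A start time in the past means there is nothing to wait for.
        return wait.isNegative() ? Duration.ZERO : wait;
    }

    public static void main(String[] args) {
        long now = Instant.parse("2023-03-03T12:00:00Z").toEpochMilli();
        LocalDateTime start = LocalDateTime.of(2023, 3, 3, 14, 30);

        Duration wait = waitUntilStart(now, start);
        System.out.println("wait = " + wait);

        // In workflow code (not executed here), the next step would be roughly:
        //   if (!wait.isZero()) { Workflow.sleep(wait); }
    }
}
```

Because the same recorded time is returned during replay, the comparison and the computed duration come out the same on every worker.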

huangapple
  • Published on March 3, 2023, 19:48:07
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75626709.html