Dataflow data freshness metric definition

Question

I'm struggling to pin down the exact definition of data freshness, because it is defined differently in several places.

This is the hint from the Dataflow UI:

"The number of seconds since the most recent watermark." That is easy to understand: it is the current time minus the output watermark.

But the extended documentation says:

"The data freshness metric shows the difference in seconds between the timestamp on the data element and the time that the event is processed in your pipeline. The data element receives a timestamp when an event occurs on the element, such as a click event on a website or ingestion by Pub/Sub. The output watermark is the time that the data is processed."

This is confusing. The statement "The output watermark is the time that the data is processed" does not seem logical: if the output watermark equals the processing time, it simply means the queue is empty.

More generally, the time at which an event is processed minus the output watermark says little about the output data and more about the backlog.
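
The "now minus the output watermark" reading can be sketched in a few lines. This is a toy model, not Dataflow's implementation: it assumes the output watermark is the oldest event time still waiting to be processed, and that it catches up to the current time when the backlog is empty. The function names are hypothetical.

```python
from typing import Iterable


def output_watermark(pending_event_times: Iterable[float], now: float) -> float:
    """Toy model (assumption): the watermark is the oldest event time still
    waiting to be processed; with an empty backlog it equals the current time."""
    pending = list(pending_event_times)
    return min(pending) if pending else now


def data_freshness(pending_event_times: Iterable[float], now: float) -> float:
    """Seconds since the most recent watermark, i.e. now minus the watermark."""
    return now - output_watermark(pending_event_times, now)


now = 1_000.0
print(data_freshness([], now))              # empty queue: 0.0
print(data_freshness([940.0, 990.0], now))  # oldest pending event is 60 s old: 60.0
```

Under this model the metric is zero exactly when nothing is pending, which is why it reads as a statement about the backlog rather than about the data already emitted.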

I thought I had misunderstood, but a little further down the documentation gives an example of a data freshness spike and describes it as follows:

"In the preceding image, the highlighted area shows a substantial difference between the event time and the output watermark time, indicating a slow operation."

Does anybody have a clear understanding of it?

Answer 1

Score: 1

The data freshness metric is essentially a graph in the Dataflow UI that shows the latency of your data in the pipeline. It correlates closely with system latency and is mainly used as an indicator of the pipeline's performance and health. High spikes in the data freshness graph indicate a delay (slow operations) or a bottleneck in the pipeline, which can increase the data backlog.

The output watermark is the processing time of a transformation stage or stages in a Dataflow pipeline. A detailed explanation can be found in this talk given by a member of the Dataflow Eng team.

One similar thread you can refer to is here.
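
To see how a slow stage could produce a spike in the graph, here is a small simulation. It is only an illustration under stated assumptions: events arrive at a fixed rate and carry their arrival second as event time, the stage has a per-second processing capacity, and the watermark is modeled as the oldest unprocessed event time (or the current time if the backlog is empty). None of the names below come from Dataflow.

```python
from collections import deque
from typing import List


def simulate_freshness(arrivals_per_sec: int, capacity_per_sec: List[int]) -> List[float]:
    """Return the modeled data freshness (in seconds) at the end of each second."""
    backlog: deque = deque()  # event times of elements not yet processed
    freshness = []
    for t, capacity in enumerate(capacity_per_sec):
        # Events arriving during second t carry event time t.
        backlog.extend([float(t)] * arrivals_per_sec)
        # The stage processes up to `capacity` of the oldest elements.
        for _ in range(min(capacity, len(backlog))):
            backlog.popleft()
        now = float(t + 1)  # measure at the end of the second
        watermark = backlog[0] if backlog else now
        freshness.append(now - watermark)
    return freshness


# The stage stalls for two seconds (capacity 0), then catches up.
print(simulate_freshness(2, [2, 2, 0, 0, 2, 4, 4]))
# [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
```

The metric rises while the stage is stalled and falls back to zero once the backlog is drained, matching the description of spikes as a sign of slow operations or a bottleneck.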

Posted by huangapple on July 3, 2023, 17:18:25
Source: https://go.coder-hub.com/76603414.html