Watermark not showing correct output in Spark
Question
I am sending streaming data to Spark using a netcat server:
nc -lk 9999
I am sending data in the following format:
Time,number
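For example, the lines typed into the nc session look like this (these are the values used later in the question):

10:00:00,5
09:48:00,10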
In Spark, I am splitting them and performing a groupBy operation. Here is my code:
package org.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.streaming.Trigger;

import java.util.concurrent.TimeoutException;

import static org.apache.spark.sql.functions.*;

public class SampleProgram {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Spark-Kafka-Integration")
                .config("spark.master", "local")
                .getOrCreate();
        spark.sparkContext().setLogLevel("ERROR");

        // Read raw "Time,number" lines from the netcat socket
        Dataset<Row> lines = spark
                .readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load();
        lines.printSchema();

        // Split each line into its two fields and cast them
        Dataset<Row> temp_data = lines.selectExpr(
                "split(value,',')[0] as timestamp",
                "split(value,',')[1] as value");
        Dataset<Row> data = temp_data.selectExpr(
                "CAST(timestamp AS TIMESTAMP)",
                "CAST(value AS INT)");

        // 10-minute watermark on event time, 5-minute tumbling windows
        Dataset<Row> windowedCounts = data
                .withWatermark("timestamp", "10 minutes")
                .groupBy(
                        functions.window(data.col("timestamp"), "5 minutes"),
                        col("value"))
                .count();

        StreamingQuery query = null;
        try {
            query = windowedCounts.writeStream()
                    .outputMode("update")
                    .option("truncate", "false")
                    .format("console")
                    .trigger(Trigger.ProcessingTime("45 seconds"))
                    .start();
        } catch (TimeoutException e) {
            throw new RuntimeException(e);
        }

        try {
            query.awaitTermination();
        } catch (StreamingQueryException e) {
            throw new RuntimeException(e);
        }
    }
}
The issue I am facing is this:
When I give the input, say, 10:00:00,5, it gives this output.
Now, at this point in time, the max event time is 10:00:00 and I have specified a watermark of 10 minutes, so any event before (10:00:00 - 00:10:00), i.e. 09:50:00, should be rejected. However, when I give the input, say, 09:48:00,10, it gives this output:
This seems incorrect to me: the data is already too late and should be rejected by Spark, but Spark is still counting it. What am I missing here?
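To spell out the arithmetic behind my expectation:

watermark = max event time - watermark delay = 10:00:00 - 00:10:00 = 09:50:00
event time 09:48:00 < watermark 09:50:00, so I expect the row to be dropped, not counted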
Answer 1
Score: 0
Write the groupBy in this way:
.groupBy(
    window(col("timestamp"), "5 minutes"),
    col("value")
).count();
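For reference, a minimal sketch of how this suggestion slots into the pipeline from the question, reusing the question's data Dataset and the static import of org.apache.spark.sql.functions.* (the watermark and window sizes are unchanged from the question):

// Sketch only: the answer's groupBy applied to the question's pipeline,
// using the statically imported window()/col() helpers instead of
// functions.window(data.col(...)).
Dataset<Row> windowedCounts = data
        .withWatermark("timestamp", "10 minutes")
        .groupBy(
                window(col("timestamp"), "5 minutes"),
                col("value"))
        .count();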