问题

import avro.shaded.com.google.common.collect.ImmutableMap;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

import java.util.ArrayList;
import java.util.List;

public class KafkaBeamSqlTest {

    private static DateTimeFormatter dtf = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss");

    public static void main(String[] args) {
        PipelineOptions kafkaOption = PipelineOptionsFactory.fromArgs(args)
                .withoutStrictParsing()
                .as(PipelineOptions.class);
        Pipeline pipeline = Pipeline.create(kafkaOption);

        KafkaIO.Read<String, String> kafkaRead =
                KafkaIO.<String, String>read()
                        .withBootstrapServers("127.0.0.1:9092")
                        .withTopic("beamKafkaTest")
                        .withConsumerConfigUpdates(ImmutableMap.of("group.id", "client-1"))
                        .withReadCommitted()
                        .withKeyDeserializer(StringDeserializer.class)
                        .withValueDeserializer(StringDeserializer.class)
                        .commitOffsetsInFinalize();

        PCollection<KV<String, String>> messages = pipeline.apply(kafkaRead.withoutMetadata());

        Schema inputSchema = Schema.builder()
                .addStringField("word")
                .addDateTimeField("event_time")
                .addInt32Field("cnt")
                .build();

        PCollection<Row> rows = messages.apply(ParDo.of(new DoFn<KV<String, String>, Row>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String jsonData = c.element().getValue();

                JSONObject jsonObject = JSON.parseObject(jsonData);

                List<Object> list = new ArrayList<>();
                list.add(jsonObject.get("word"));
                list.add(dtf.parseDateTime((String) jsonObject.get("event_time")));
                list.add(jsonObject.get("cnt"));
                Row row = Row.withSchema(inputSchema)
                        .addValues(list)
                        .build();

                c.output(row);
            }
        }))
                .setRowSchema(inputSchema);

        PCollection<Row> result = PCollectionTuple.of(new TupleTag<>("t_count_stats"), rows)
                .apply(
                        SqlTransform.query(
                                "SELECT word, SUM(cnt), TUMBLE_START(event_time, INTERVAL '1' MINUTE) as window_start\n" +
                                        "FROM t_count_stats\n" +
                                        "GROUP BY word, TUMBLE(event_time, INTERVAL '1' MINUTE)"
                        )
                );

        result.apply(MapElements.via(new SimpleFunction<Row, KV<String, String>>() {
            @Override
            public KV<String, String> apply(Row input) {
                return KV.of(input.getString("word"), "result=" + input.getValues());
            }
        }))
                .apply(KafkaIO.<String, String>write()
                        .withBootstrapServers("127.0.0.1:9092")
                        .withTopic("beamPrint")
                        .withKeySerializer(StringSerializer.class)
                        .withValueSerializer(StringSerializer.class));

        pipeline.run();
    }
}

(Note: This is the code portion you provided with the HTML entities like " replaced with actual quotation marks " for improved code readability. However, please be aware that this code alone might not work standalone as-is, as it's part of a larger system and could require proper setup and integration with Kafka, Beam, and any necessary dependencies.)

英文:

First, we have a kafka input source in JSON format:

{&quot;event_time&quot;: &quot;2020-08-23 18:36:10&quot;, &quot;word&quot;: &quot;apple&quot;, &quot;cnt&quot;: 1}
{&quot;event_time&quot;: &quot;2020-08-23 18:36:20&quot;, &quot;word&quot;: &quot;banana&quot;, &quot;cnt&quot;: 1}
{&quot;event_time&quot;: &quot;2020-08-23 18:37:30&quot;, &quot;word&quot;: &quot;apple&quot;, &quot;cnt&quot;: 2}
{&quot;event_time&quot;: &quot;2020-08-23 18:37:40&quot;, &quot;word&quot;: &quot;apple&quot;, &quot;cnt&quot;: 1}
... ...

What I'm trying to do is to aggregate the sum of the count of each word by every minute:

+---------+----------+---------------------+
| word    | SUM(cnt) |   window_start      |
+---------+----------+---------------------+
| apple   |    1     | 2020-08-23 18:36:00 |
+---------+----------+---------------------+
| banana  |    1     | 2020-08-23 18:36:00 |
+---------+----------+---------------------+
| apple   |    3     | 2020-08-23 18:37:00 |
+---------+----------+---------------------+

So this case would be perfectly fit the following Beam SQL statement:

SELECT word, SUM(cnt), TUMBLE_START(event_time, INTERVAL &#39;1&#39; MINUTE) as window_start
FROM t_count_stats
GROUP BY word, TUMBLE(event_time, INTERVAL &#39;1&#39; MINUTE)

And below is my current working code using Beam's Java SDK to execute this streaming SQL query:

import avro.shaded.com.google.common.collect.ImmutableMap;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
public class KafkaBeamSqlTest {
private static DateTimeFormatter dtf = DateTimeFormat.forPattern(&quot;yyyy-MM-dd HH:mm:ss&quot;);
public static void main(String[] args) {
// create pipeline
PipelineOptions kafkaOption = PipelineOptionsFactory.fromArgs(args)
.withoutStrictParsing()
.as(PipelineOptions.class);
Pipeline pipeline = Pipeline.create(kafkaOption);
// create kafka IO
KafkaIO.Read&lt;String, String&gt; kafkaRead =
KafkaIO.&lt;String, String&gt;read()
.withBootstrapServers(&quot;127.0.0.1:9092&quot;)
.withTopic(&quot;beamKafkaTest&quot;)
.withConsumerConfigUpdates(ImmutableMap.of(&quot;group.id&quot;, &quot;client-1&quot;))
.withReadCommitted()
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.commitOffsetsInFinalize();
// read from kafka
PCollection&lt;KV&lt;String, String&gt;&gt; messages = pipeline.apply(kafkaRead.withoutMetadata());
// build input schema
Schema inputSchema = Schema.builder()
.addStringField(&quot;word&quot;)
.addDateTimeField(&quot;event_time&quot;)
.addInt32Field(&quot;cnt&quot;)
.build();
// convert kafka message to Row
PCollection&lt;Row&gt; rows = messages.apply(ParDo.of(new DoFn&lt;KV&lt;String, String&gt;, Row&gt;(){
@ProcessElement
public void processElement(ProcessContext c) {
String jsonData = c.element().getValue();
// parse json
JSONObject jsonObject = JSON.parseObject(jsonData);
// build row
List&lt;Object&gt; list = new ArrayList&lt;&gt;();
list.add(jsonObject.get(&quot;word&quot;));
list.add(dtf.parseDateTime((String) jsonObject.get(&quot;event_time&quot;)));
list.add(jsonObject.get(&quot;cnt&quot;));
Row row = Row.withSchema(inputSchema)
.addValues(list)
.build();
System.out.println(row);
// emit row
c.output(row);
}
}))
.setRowSchema(inputSchema);
// sql query
PCollection&lt;Row&gt; result = PCollectionTuple.of(new TupleTag&lt;&gt;(&quot;t_count_stats&quot;), rows)
.apply(
SqlTransform.query(
&quot;SELECT word, SUM(cnt), TUMBLE_START(event_time, INTERVAL &#39;1&#39; MINUTE) as window_start\n&quot; +
&quot;FROM t_count_stats\n&quot; +
&quot;GROUP BY word, TUMBLE(event_time, INTERVAL &#39;1&#39; MINUTE)&quot;
)
);
// sink results back to another kafka topic
result.apply(MapElements.via(new SimpleFunction&lt;Row, KV&lt;String, String&gt;&gt;() {
@Override
public KV&lt;String, String&gt; apply(Row input) {
System.out.println(&quot;result: &quot; + input.getValues());
return KV.of(input.getValue(&quot;word&quot;), &quot;result=&quot; + input.getValues());
}
}))
.apply(KafkaIO.&lt;String, String&gt;write()
.withBootstrapServers(&quot;127.0.0.1:9092&quot;)
.withTopic(&quot;beamPrint&quot;)
.withKeySerializer(StringSerializer.class)
.withValueSerializer(StringSerializer.class));
// run
pipeline.run();
}
}

My problem is that when I run this code and feed some messages into Kafka, there's no exception throwing out and it has received some messages from Kafka, but I cannot see it trigger the process for the windowing aggregation. And there's no result coming out as expected (like the table I shown before).

So does Beam SQL currently support the window syntax on unbounded Kafka input source? If it does, what's wrong with my current code? How can I debug and fix it? And is there any code example that integrates Beam SQL with KafkaIO?

Please help me! Thanks a lot!!

答案1

得分: 1

似乎在 https://lists.apache.org/thread.html/rea75c0eb665f90b8483e64bee96740ebb01942c606f065066c2ecc56%40%3Cuser.beam.apache.org%3E 得到了答复。

英文:

Looks like this was answered at https://lists.apache.org/thread.html/rea75c0eb665f90b8483e64bee96740ebb01942c606f065066c2ecc56%40%3Cuser.beam.apache.org%3E

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何将Beam SQL窗口查询与KafkaIO集成？

问题

答案1

问题出在`.requestMatchers().permitAll()`。

Arch测试失败STANDARD_STREAMS。

How can I handle the response from the retrofit (Here my response not showing the data, it only shows the code and status of the call)

如何在滚动时为列表视图上方的图像添加动画效果？

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论