How can we read historical data using Databricks from Kinesis or Kafka by specifying starting and ending timestamps?
Question
Let's say I'd like to read the data that arrived in the period from 8 Mar 2023 to 14 Mar 2023.
Is there a way to define an ending position along with initialPosition in the query below?
spark.readStream.format("kinesis") \
    .option("streamName", kinesisStreamName) \
    .option("region", kinesisRegion) \
    .option("initialPosition", '{"at_timestamp": "03/08/2023 00:00:00 PDT", "format": "MM/dd/yyyy HH:mm:ss ZZZ"}') \
    .option("awsAccessKey", awsAccessKeyId) \
    .option("awsSecretKey", awsSecretKey) \
    .load()
Answer 1
Score: 2
I think what you are looking for is batch processing, not stream processing, since what you describe is essentially a backfill job.
Unfortunately, there is no end-position option you can set on a Spark Structured Streaming source for Kafka or Kinesis.
Some suggestions:
1- If you have the option of switching from Kinesis to Kafka, you can use a batch read with spark.read.format("kafka") instead of spark.readStream.format("kafka"), and bound it with the parameters below (see the first sketch at the end of this answer).
.option("startingOffsets", start_offset) \
.option("endingOffsets", end_offset) \
2- If Kinesis usage is required, you can land the Kinesis stream's data in an S3 path and then consume those data files with Spark, using a start/end condition in a where clause. (I would recommend a push-down predicate, such as AWS Glue's push_down_predicate, so you don't read all the data; see the second sketch at the end of this answer.)
Thanks.
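Below is a minimal sketch of suggestion 1: a bounded batch read from Kafka. The broker address, topic name, per-partition offset JSON, and the timestamp filter are placeholders rather than values from the question; recent Spark versions also document startingOffsetsByTimestamp / endingOffsetsByTimestamp if bounding by time is more convenient than by offset.

# Sketch of suggestion 1: a bounded *batch* read from Kafka instead of a stream.
# The broker, topic, and offset JSON are placeholders -- adjust them to your setup.
# Per-partition offsets are given as JSON: {"topic": {"partition": offset}}.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-backfill").getOrCreate()

start_offset = '{"events": {"0": 120000}}'   # first offset to read, per partition
end_offset = '{"events": {"0": 250000}}'     # offset to stop at, per partition

df = (spark.read                             # batch read, not readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
      .option("subscribe", "events")                       # placeholder topic
      .option("startingOffsets", start_offset)
      .option("endingOffsets", end_offset)
      .load())

# Kafka records arrive as binary key/value plus metadata such as `timestamp`;
# cast the payload and, if needed, filter again on the record timestamp.
result = (df.selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .where(col("timestamp").between("2023-03-08", "2023-03-15")))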
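And a minimal sketch of suggestion 2, assuming the Kinesis stream's records are already being delivered to S3 (for example via Kinesis Data Firehose). The bucket path, the event_time column, and the year/month/day partition layout are assumptions; match them to however your delivery stream actually writes the data.

# Sketch of suggestion 2: read the files the Kinesis stream has landed in S3 and
# keep only the 8-14 March 2023 window. Paths and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kinesis-s3-backfill").getOrCreate()

# Plain filter on an assumed event_time column carried in each record.
df = (spark.read
      .json("s3://my-bucket/kinesis-landing/")             # placeholder path
      .where(col("event_time").between("2023-03-08 00:00:00",
                                       "2023-03-14 23:59:59")))

# If the files are laid out under year=/month=/day= partitions, filtering on the
# partition columns lets Spark prune whole prefixes instead of scanning everything,
# which is the same idea as the Glue push-down predicate mentioned above.
df_pruned = (spark.read
             .parquet("s3://my-bucket/kinesis-landing-parquet/")   # placeholder path
             .where("year = 2023 AND month = 3 AND day BETWEEN 8 AND 14"))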