Pyspark sequence equivalent in Spark 2.3

Question

I am able to generate a time series of dates between two dates using the sequence function (available from Spark 2.4 onwards).
My production system runs Spark 2.3. How can I achieve the same thing in Spark 2.3?
Below is the code snippet using the sequence function.

from pyspark.sql import functions as SF

data1 = [
    (1, "2022-09-01", "2023-01-01", 1),
    (1, "2022-09-01", "2023-02-01", 1),
    (1, "2022-09-11", "2023-01-01", 2),
    (1, "2022-09-01", "2023-01-01", 2),
    (1, "2022-09-21", "2023-01-01", 1),
]
df1 = spark.createDataFrame(
    data1, ["item", "start_d", "activation_d", "dept_id"]
)
# sequence() expects date/timestamp arguments, so cast the string columns first.
df1 = df1.withColumn(
    "week_start",
    SF.explode(SF.expr("sequence(to_date(start_d), to_date(activation_d), interval 7 day)")),
)

How can I do the same thing without using the sequence function?
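
For clarity, this is the behaviour a Spark 2.3 replacement has to reproduce: each input row fans out into one row per 7-day step, and the last generated date never passes activation_d. For example, the first input row (start_d = 2022-09-01, activation_d = 2023-01-01) should expand to week_start values 2022-09-01, 2022-09-08, 2022-09-15, ..., 2022-12-29 (18 rows). A quick sanity check on the sequence-based version:

# Inspect the expanded rows produced by the sequence version.
df1.orderBy("item", "start_d", "week_start").show(truncate=False)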

Answer 1

Score: 0

I was able to achieve the same thing using a UDF. Here is the UDF and its usage for reference:

from datetime import timedelta

from pyspark.sql import functions as SF
from pyspark.sql.types import ArrayType, DateType

def generate_dates(start_date, end_date, interval):
    # Build the list of dates from start_date to end_date (inclusive),
    # stepping `interval` days at a time.
    dates = []
    current_date = start_date
    while current_date <= end_date:
        dates.append(current_date)
        current_date += timedelta(days=interval)
    return dates

spark.udf.register("generate_dates", generate_dates, ArrayType(DateType()))

data1 = [
    (1, "2022-09-01", "2023-01-01", 1),
    (1, "2022-09-01", "2023-02-01", 1),
    (1, "2022-09-11", "2023-01-01", 2),
    (1, "2022-09-01", "2023-01-01", 2),
    (1, "2022-09-21", "2023-01-01", 1),
]
df1 = spark.createDataFrame(
    data1, ["item", "start_d", "activation_d", "dept_id"]
)

# The columns are strings, so cast them to dates before handing them to the UDF;
# otherwise the UDF receives str values and timedelta arithmetic fails.
df1 = df1.withColumn(
    "week_start",
    SF.explode(SF.expr("generate_dates(to_date(start_d), to_date(activation_d), 7)")),
)
