Pyspark sequence equivalent in spark 2.3
Question
I am able to generate a time series of dates between two dates using the sequence function (available from Spark 2.4).
My production system runs Spark 2.3. How can I achieve the same thing in Spark 2.3?
Below is the code snippet using the sequence function.
from pyspark.sql import functions as SF

data1 = [
    (1, "2022-09-01", "2023-01-01", 1),
    (1, "2022-09-01", "2023-02-01", 1),
    (1, "2022-09-11", "2023-01-01", 2),
    (1, "2022-09-01", "2023-01-01", 2),
    (1, "2022-09-21", "2023-01-01", 1),
]
df1 = spark.createDataFrame(
    data1, ["item", "start_d", "activation_d", "dept_id"]
)
# Cast the string columns to dates so sequence() can step over them.
df1 = df1.withColumn(
    "week_start",
    SF.explode(
        SF.expr("sequence(to_date(start_d), to_date(activation_d), interval 7 day)")
    ),
)
How can I do the same thing without using the sequence function?
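For reference, Spark's sequence with a 7-day interval yields an inclusive list of dates from the start value up to the stop value. A plain-Python sketch of that semantics (weekly_sequence is a hypothetical helper used only for illustration, not a Spark API), applied to the last sample row:

```python
from datetime import date, timedelta

def weekly_sequence(start, stop):
    """Mimic sequence(start, stop, interval 7 day): inclusive bounds, 7-day step."""
    out = []
    d = start
    while d <= stop:
        out.append(d)
        d += timedelta(days=7)
    return out

# Last sample row: start_d = 2022-09-21, activation_d = 2023-01-01.
weeks = weekly_sequence(date(2022, 9, 21), date(2023, 1, 1))
print(weeks[:3])
```

Each source row is then exploded into one output row per element of this list.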
Answer 1
Score: 0
I was able to achieve the same thing using a UDF. Here is the UDF and its usage for reference:
from datetime import timedelta

from pyspark.sql import functions as SF
from pyspark.sql.types import ArrayType, DateType

def generate_dates(start_date, end_date, interval):
    """Return every date from start_date to end_date (inclusive), stepping by interval days."""
    dates = []
    current_date = start_date
    while current_date <= end_date:
        dates.append(current_date)
        current_date += timedelta(days=interval)
    return dates

spark.udf.register("generate_dates", generate_dates, ArrayType(DateType()))

data1 = [
    (1, "2022-09-01", "2023-01-01", 1),
    (1, "2022-09-01", "2023-02-01", 1),
    (1, "2022-09-11", "2023-01-01", 2),
    (1, "2022-09-01", "2023-01-01", 2),
    (1, "2022-09-21", "2023-01-01", 1),
]
df1 = spark.createDataFrame(
    data1, ["item", "start_date", "activation_date", "dept_id"]
)
# Cast the string columns to dates so the UDF receives datetime.date objects.
df1 = df1.withColumn(
    "week_start",
    SF.explode(SF.expr("generate_dates(to_date(start_date), to_date(activation_date), 7)")),
)
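The UDF's stepping logic can be sanity-checked in plain Python, without a Spark session. Assuming the same generate_dates body as above and the first sample row:

```python
from datetime import date, timedelta

def generate_dates(start_date, end_date, interval):
    # Same body as the registered UDF: inclusive range, fixed-day step.
    dates = []
    current_date = start_date
    while current_date <= end_date:
        dates.append(current_date)
        current_date += timedelta(days=interval)
    return dates

# First sample row: 2022-09-01 to 2023-01-01 spans 122 days -> 18 weekly dates.
dates = generate_dates(date(2022, 9, 1), date(2023, 1, 1), 7)
print(len(dates), dates[0], dates[-1])
```

Since the UDF runs per row in Python, it will be slower than a native expression on large data, but it is a straightforward drop-in for Spark 2.3.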