Spark is unable to handle a particular date format
Question
I am trying to cast a string-type field containing multiple date formats using PySpark. With the date formats below, it works fine.
from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # Try each format in turn and keep the first one that parses.
    formats = ("MM/dd/yyyy", "yyyy-MM-dd", "dd/MM/yyyy", "MM/yy", "dd/M/yyyy")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", custom_to_date("dt")).show()
The above code gives the correct output. But when the month or day is a single digit, as below, the code fails.
df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-1"),(3,"24/7/2006")], ("id", "dt"))
I get the following error message.
org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 2 in stage 2.0 failed 4 times, most recent failure:
Lost task 2.3 in stage 2.0 (TID 10) (10.13.82.55 executor 0):
org.apache.spark.SparkUpgradeException:
[INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER]
You may get a different result due to the upgrading to Spark >= 3.0:
Answer 1
Score: 2
Adding an answer since the comments and the other answer don't cover the behaviour. The solution is not to add new formats; the formats themselves can be defined better.
With Spark 3.0, the pattern letter M supports 01, 1, January, and Jan (depending on how many letters are used), so you don't need MM.
Spark reference - https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
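As a quick, hedged spot-check of that claim (the DataFrame name probe is made up here, and an active SparkSession named spark is assumed, as in the question), a single-letter M/d pattern parses both padded and unpadded values:

from pyspark.sql.functions import col, to_date

# Hypothetical probe data: the same date written unpadded and padded.
probe = spark.createDataFrame([("7/4/2006",), ("07/04/2006",)], ("dt",))
# One pattern with single-letter fields handles both rows; a strict "MM/dd/yyyy"
# fails for the unpadded row under the Spark 3 parser, as in the question's error.
probe.select(col("dt"), to_date(col("dt"), "M/d/yyyy").alias("parsed")).show()

The revised helper below relies on the same single-letter behaviour: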
from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # Single-letter patterns (M, d, y) accept both padded and unpadded values.
    formats = ("M/d/yyyy", "yyyy-M-d", "d/M/y", "M/y")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-1"), (3, "12/2023")], ("id", "dt"))
df.withColumn("pdt", custom_to_date("dt")).show()
Results -
Alternatively, if you want the legacy behaviour, you can use
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
or
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
Answer 2
Score: 0
I think what you need to do is to put the single-digit patterns before their two-digit counterparts; otherwise the parse fails when it tries a two-digit pattern first.
from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # Single-digit variants come before their two-digit counterparts.
    formats = ("MM/dd/yyyy", "yyyy-MM-d", "yyyy-MM-dd", "dd/M/yyyy", "dd/MM/yyyy", "MM/yy")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01"), (3, "01/22/2010"), (4, "2018-12-1"), (5, "24/7/2006")], ("id", "dt"))
df.withColumn("pdt", custom_to_date("dt")).show()
+---+----------+----------+
| id| dt| pdt|
+---+----------+----------+
| 1|01/22/2010|2010-01-22|
| 2|2018-12-01|2018-12-01|
| 3|01/22/2010|2010-01-22|
| 4| 2018-12-1|2018-12-01|
| 5| 24/7/2006|2006-07-24|
+---+----------+----------+