Spark is unable to handle a particular date format

Question

I am trying to parse multiple date formats from a string-type field using PySpark. With the date formats below, it works fine.

from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # Try each format in turn; coalesce() keeps the first successful parse.
    formats = ("MM/dd/yyyy", "yyyy-MM-dd", "dd/MM/yyyy", "MM/yy", "dd/M/yyyy")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", custom_to_date("dt")).show()

The code above gives the correct output, but when the month or day is a single digit, as below, the code fails.

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-1"), (3, "24/7/2006")], ("id", "dt"))

I get the following error message:

org.apache.spark.SparkException:
  Job aborted due to stage failure:
    Task 2 in stage 2.0 failed 4 times, most recent failure: 
      Lost task 2.3 in stage 2.0 (TID 10) (10.13.82.55 executor 0): 
        org.apache.spark.SparkUpgradeException: 
        [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] 
        You may get a different result due to the upgrading to Spark >= 3.0:

Answer 1

Score: 2

Adding an answer since the comments and the other answer don't cover the behaviour. The solution is not to add more formats; the formats themselves can be defined better.

Since Spark 3.0, the pattern letter M matches 01, 1, January, and Jan, so you don't need MM.

Spark reference: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
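
As a quick illustration (a minimal sketch, assuming a running SparkSession named spark), a single-letter numeric field matches both one- and two-digit values:

from pyspark.sql.functions import to_date

# "d/M/yyyy" parses both "1/7/2006" and "24/07/2006":
# single-letter fields accept one or two digits.
df = spark.createDataFrame([("1/7/2006",), ("24/07/2006",)], ["dt"])
df.withColumn("pdt", to_date("dt", "d/M/yyyy")).show()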

from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # Single-letter pattern fields accept both one- and two-digit values.
    formats = ("M/d/yyyy", "yyyy-M-d", "d/M/y", "M/y")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-1"), (3, "12/2023")], ("id", "dt"))
df.withColumn("pdt", custom_to_date("dt")).show()

Results: (the original answer included a screenshot of the output here; all three input strings parse to dates)

Alternatively, if you want the legacy (pre-3.0) parsing behaviour, you can use

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

or

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

Answer 2

Score: 0


I think what you need to do is place the single-digit patterns ahead of the corresponding two-digit ones; otherwise parsing fails when Spark tries the two-digit pattern first.

from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # List single-digit variants (e.g. "yyyy-MM-d") before their
    # two-digit counterparts so they are tried first.
    formats = ("MM/dd/yyyy", "yyyy-MM-d", "yyyy-MM-dd", "dd/M/yyyy", "dd/MM/yyyy", "MM/yy")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01"), (3, "01/22/2010"), (4, "2018-12-1"), (5, "24/7/2006")], ("id", "dt"))

df.withColumn("pdt", custom_to_date("dt")).show()
+---+----------+----------+
| id|        dt|       pdt|
+---+----------+----------+
|  1|01/22/2010|2010-01-22|
|  2|2018-12-01|2018-12-01|
|  3|01/22/2010|2010-01-22|
|  4| 2018-12-1|2018-12-01|
|  5| 24/7/2006|2006-07-24|
+---+----------+----------+
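
This works because coalesce returns the first non-null column, so once the single-digit pattern matches a row, the stricter two-digit pattern no longer gets a chance to raise. A minimal sketch (assuming a running SparkSession named spark):

from pyspark.sql.functions import coalesce, to_date

# "yyyy-MM-d" matches "2018-12-1" first, so the combined expression
# returns a date instead of failing on the stricter "yyyy-MM-dd".
df = spark.createDataFrame([(1, "2018-12-1")], ("id", "dt"))
df.withColumn("pdt", coalesce(to_date("dt", "yyyy-MM-d"),
                              to_date("dt", "yyyy-MM-dd"))).show()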
