Spark is unable to handle a particular date format

Question

I am trying to parse multiple date formats from a string-type field using PySpark. With the date formats below, it works fine.

from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # Try each format in turn; coalesce() keeps the first successful parse.
    formats = ("MM/dd/yyyy", "yyyy-MM-dd", "dd/MM/yyyy", "MM/yy", "dd/M/yyyy")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01")], ("id", "dt"))
df.withColumn("pdt", custom_to_date("dt")).show()

The code above gives the correct output, but when the month or day is a single digit, as below, the code fails.

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-1"), (3, "24/7/2006")], ("id", "dt"))

I get the following error message:

org.apache.spark.SparkException:
  Job aborted due to stage failure:
    Task 2 in stage 2.0 failed 4 times, most recent failure: 
      Lost task 2.3 in stage 2.0 (TID 10) (10.13.82.55 executor 0): 
        org.apache.spark.SparkUpgradeException: 
        [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] 
        You may get a different result due to the upgrading to Spark >= 3.0:

Answer 1

Score: 2

Adding an answer since the comments and the other answer don't cover the behaviour. The solution is not to add more formats; the formats themselves can be defined better.

Since Spark 3.0, the pattern letter M matches 01, 1, January, and Jan, so you don't need MM.

Spark reference: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
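
As a quick illustration (a minimal sketch, assuming a running SparkSession named spark), a single-letter numeric field matches both one- and two-digit values:

from pyspark.sql.functions import to_date

# "d/M/yyyy" parses both "1/7/2006" and "24/07/2006":
# single-letter fields accept one or two digits.
df = spark.createDataFrame([("1/7/2006",), ("24/07/2006",)], ["dt"])
df.withColumn("pdt", to_date("dt", "d/M/yyyy")).show()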

from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # Single-letter pattern fields accept both one- and two-digit values.
    formats = ("M/d/yyyy", "yyyy-M-d", "d/M/y", "M/y")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-1"), (3, "12/2023")], ("id", "dt"))
df.withColumn("pdt", custom_to_date("dt")).show()

Results: (the original answer included a screenshot of the output here; all three input strings parse to dates)

Alternatively, if you want the legacy (pre-3.0) parsing behaviour, you can use

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

or

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

Answer 2

Score: 0


I think what you need to do is place the single-digit patterns ahead of the corresponding two-digit ones; otherwise parsing fails when Spark tries the two-digit pattern first.

from pyspark.sql.functions import coalesce, to_date

def custom_to_date(col):
    # List single-digit variants (e.g. "yyyy-MM-d") before their
    # two-digit counterparts so they are tried first.
    formats = ("MM/dd/yyyy", "yyyy-MM-d", "yyyy-MM-dd", "dd/M/yyyy", "dd/MM/yyyy", "MM/yy")
    return coalesce(*[to_date(col, f) for f in formats])

df = spark.createDataFrame([(1, "01/22/2010"), (2, "2018-12-01"), (3, "01/22/2010"), (4, "2018-12-1"), (5, "24/7/2006")], ("id", "dt"))

df.withColumn("pdt", custom_to_date("dt")).show()
+---+----------+----------+
| id|        dt|       pdt|
+---+----------+----------+
|  1|01/22/2010|2010-01-22|
|  2|2018-12-01|2018-12-01|
|  3|01/22/2010|2010-01-22|
|  4| 2018-12-1|2018-12-01|
|  5| 24/7/2006|2006-07-24|
+---+----------+----------+
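
This works because coalesce returns the first non-null column, so once the single-digit pattern matches a row, the stricter two-digit pattern no longer gets a chance to raise. A minimal sketch (assuming a running SparkSession named spark):

from pyspark.sql.functions import coalesce, to_date

# "yyyy-MM-d" matches "2018-12-1" first, so the combined expression
# returns a date instead of failing on the stricter "yyyy-MM-dd".
df = spark.createDataFrame([(1, "2018-12-1")], ("id", "dt"))
df.withColumn("pdt", coalesce(to_date("dt", "yyyy-MM-d"),
                              to_date("dt", "yyyy-MM-dd"))).show()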
