How to convert string like "yyyy-MM-ddThh:mm:ss+XXXX" to proper date_format in Spark?


Question

I have a column containing string data like "**2023-03-13T15:18:14+0700**". My final goal is to convert it to a proper date format like "**2023-03-13 15:18:14**". Ideally the time would be converted to GMT+7 (my location) and the "T" and "+XXXX" parts dropped. If that is too hard or impossible, simply removing the "T" and "+0700" would do, since most of my data is "+0700".

I have read many posts on Stack Overflow without luck so far, for example [here](https://stackoverflow.com/questions/70394354/why-i-cant-parse-this-date-format-yyyy-mm-ddthhmmss-sssz) and [here](https://stackoverflow.com/questions/60857315/how-to-handle-t-and-z-in-the-date-format-using-pyspark-functions). The closest one is [this](https://stackoverflow.com/questions/54690305/spark-javahow-to-convert-dataset-string-column-of-format-yyyy-mm-ddthhmmss-s), but its format is slightly different from mine.

Below is what I got from the latest post:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_timestamp}

    object test extends App {
      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._
      val df = Seq("2023-03-13T15:18:14+0700").toDF("time")
      // Attempt 1: pattern with fractional seconds and a colon-separated offset
      val result = df.select(to_timestamp(col("time"), "yyyy-MM-dd'T'hh:mm:ss.SSSXXX").alias("newtime"))
      result.show(truncate = false) // Null
      // Attempt 2: the same pattern without fractional seconds
      val result1 = df.select(to_timestamp(col("time"), "yyyy-MM-dd'T'hh:mm:ssXXX").alias("newtime"))
      result1.show(truncate = false) // Null
    }
# Answer 1

**Score**: 1

Use the ```cast()``` transformation:

    from pyspark.sql.functions import col
    from pyspark.sql.types import TimestampType

    # Cast the string column to a timestamp column
    df = df.withColumn("time", col("time").cast(TimestampType()))

    df.show()

***Output***

    +-------------------+
    |               time|
    +-------------------+
    |2023-03-13 08:18:14|
    +-------------------+

***Schema***

    root
     |-- time: timestamp (nullable = true)

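As a quick sanity check of the arithmetic behind the `08:18:14` value, here is a minimal plain-Python sketch (standard library only, no Spark). It parses the sample string with its offset and renders the same instant in UTC and in GMT+7. The variable names are illustrative and not part of the original answer.

```python
from datetime import datetime, timedelta, timezone

raw = "2023-03-13T15:18:14+0700"

# %z accepts an offset written without a colon, such as +0700
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")

# Render the same instant in two different zones
utc_str = parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
gmt7_str = parsed.astimezone(timezone(timedelta(hours=7))).strftime("%Y-%m-%d %H:%M:%S")

print(utc_str)   # 2023-03-13 08:18:14
print(gmt7_str)  # 2023-03-13 15:18:14
```

The instant is the same in both cases; only the zone used for display changes. This is consistent with the output above having been produced in a session whose time zone is UTC.
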
# Answer 2

**Score**: 1

You are not using the right format. Your string is an ISO 8601 date-time, and the pattern that matches it is `yyyy-MM-dd'T'HH:mm:ssZ`. `HH` is the 24-hour field (`hh` is the 12-hour clock, so `15` cannot match), and `Z` accepts an offset written without a colon such as `+0700` (`XXX` expects `+07:00`). Here is how to use it with the `to_timestamp` function:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([["2023-03-13T15:18:14+0700"]], ['time'])
# HH = hour of day (0-23), Z = zone offset without a colon (+0700)
df = df.withColumn("timestamp_utc", to_timestamp("time", "yyyy-MM-dd'T'HH:mm:ssZ"))
df.show(truncate=False)
df.printSchema()
```

The output is:

    +------------------------+-------------------+
    |time                    |timestamp_utc      |
    +------------------------+-------------------+
    |2023-03-13T15:18:14+0700|2023-03-13 09:18:14|
    +------------------------+-------------------+

    root
     |-- time: string (nullable = true)
     |-- timestamp_utc: timestamp (nullable = true)

Note that `show()` renders timestamps in the Spark session time zone. 15:18:14 at +0700 is 08:18:14 UTC, so the `09:18:14` shown here is consistent with a session running at UTC+1, despite the column name.
