How to convert string like "yyyy-MM-ddThh:mm:ss+XXXX" to proper date_format in Spark?


Question

I have a column containing string data like "**2023-03-13T15:18:14+0700**". My final goal is to convert it to a proper date format like "**2023-03-13 15:18:14**". Ideally, the time would be converted to GMT+7 (my location) and the "T" and "+XXXX" parts removed. If that is too hard or impossible, I just need to remove the "T" and "+0700", since most of my data is "+0700".

I have read many posts on Stack Overflow but had no luck so far. For example, [here][1] and [here][2]; the closest one is [this][3], but it did not work because their format is slightly different from mine.

Below is what I got from the latest post:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_timestamp}

    object test extends App {
      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._
      val df = Seq("2023-03-13T15:18:14+0700").toDF("time")

      // First attempt: pattern with fractional seconds
      val result = df.select(to_timestamp(col("time"), "yyyy-MM-dd'T'hh:mm:ss.SSSXXX").alias("newtime"))
      result.show(truncate = false) // null

      // Second attempt: pattern without fractional seconds
      val result1 = df.select(to_timestamp(col("time"), "yyyy-MM-dd'T'hh:mm:ssXXX").alias("newtime"))
      result1.show(truncate = false) // null
    }

  [1]: https://stackoverflow.com/questions/70394354/why-i-cant-parse-this-date-format-yyyy-mm-ddthhmmss-sssz
  [2]: https://stackoverflow.com/questions/60857315/how-to-handle-t-and-z-in-the-date-format-using-pyspark-functions
  [3]: https://stackoverflow.com/questions/54690305/spark-javahow-to-convert-dataset-string-column-of-format-yyyy-mm-ddthhmmss-s

# Answer 1
**Score**: 1

Use the `cast()` transformation:

    from pyspark.sql.functions import col
    from pyspark.sql.types import TimestampType

    # Cast the string column directly to a timestamp
    df = df.withColumn("time", col("time").cast(TimestampType()))

    df.show()

***Output***

    +-------------------+
    |               time|
    +-------------------+
    |2023-03-13 08:18:14|
    +-------------------+

***Schema***

    root
     |-- time: timestamp (nullable = true)
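The `08:18:14` in that output is the same instant as `15:18:14+0700`, rendered in UTC; in Spark, the zone used for display is governed by the `spark.sql.session.timeZone` setting. As a rough check of the arithmetic, here is a minimal sketch in plain Python (standard library only, not the Spark API) that parses the sample string and renders it in both UTC and GMT+7. The names `RAW`, `OUT_FORMAT`, `as_utc`, and `as_gmt7` are illustrative and do not come from the original posts.

```python
from datetime import datetime, timedelta, timezone

RAW = "2023-03-13T15:18:14+0700"
OUT_FORMAT = "%Y-%m-%d %H:%M:%S"

# Parse the string into a timezone-aware instant (%z accepts "+0700").
instant = datetime.strptime(RAW, "%Y-%m-%dT%H:%M:%S%z")

# Render the same instant in two different zones.
as_utc = instant.astimezone(timezone.utc).strftime(OUT_FORMAT)
as_gmt7 = instant.astimezone(timezone(timedelta(hours=7))).strftime(OUT_FORMAT)

print(as_utc)   # 2023-03-13 08:18:14
print(as_gmt7)  # 2023-03-13 15:18:14
```

The second value is the wall-clock string the question asks for, so the remaining step on the Spark side is choosing the zone in which the timestamp is displayed or formatted.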



# Answer 2
**Score**: 1

You are not using the right format. Your date is in ISO 8601 form, and the right pattern is `yyyy-MM-dd'T'HH:mm:ssZ`: `hh` is the 12-hour field and cannot hold hour 15, and `XXX` expects an offset with a colon such as `+07:00`. Here is how to use the pattern with the `to_timestamp` function:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([["2023-03-13T15:18:14+0700"]], ['time'])
# HH is the 24-hour field; Z matches an offset without a colon, such as +0700
df = df.withColumn("timestamp_utc", to_timestamp("time", "yyyy-MM-dd'T'HH:mm:ssZ"))
df.show(truncate=False)
df.printSchema()
```

The output is:

    +------------------------+-------------------+
    |time                    |timestamp_utc      |
    +------------------------+-------------------+
    |2023-03-13T15:18:14+0700|2023-03-13 09:18:14|
    +------------------------+-------------------+

    root
     |-- time: string (nullable = true)
     |-- timestamp_utc: timestamp (nullable = true)
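
The 12-hour versus 24-hour distinction behind `hh` and `HH` is not specific to Spark. As a loose analogy only, Python's `strptime` has the same split between `%I` (12-hour) and `%H` (24-hour), which makes the failure easy to reproduce without a Spark session. The sketch below uses the standard library, not Spark's pattern syntax, and `try_parse` is a hypothetical helper introduced for illustration.

```python
from datetime import datetime

RAW = "2023-03-13T15:18:14+0700"


def try_parse(value, fmt):
    """Return the parsed datetime, or None if the value does not match fmt."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


# 12-hour field (%I, comparable to hh): hour 15 is out of range, so parsing fails.
twelve_hour = try_parse(RAW, "%Y-%m-%dT%I:%M:%S%z")

# 24-hour field (%H, comparable to HH): the same string parses.
twenty_four_hour = try_parse(RAW, "%Y-%m-%dT%H:%M:%S%z")

print(twelve_hour)       # None
print(twenty_four_hour)  # 2023-03-13 15:18:14+07:00
```

Returning `None` on a mismatch mirrors the null result shown in the question's attempts; the analogy covers only the hour field, since Python's `%z` is more lenient about offset formats than Spark's `XXX`.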

Posted on March 15, 2023, 18:10:14 · Source: https://go.coder-hub.com/75743254.html