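How to convert string like "yyyy-MM-ddThh:mm:ss+XXXX" to proper date_format in Spark?
Question
I have a column of strings such as "**2023-03-13T15:18:14+0700**". My goal is to convert it to a proper date format such as "**2023-03-13 15:18:14**". Ideally the time would be converted to GMT+7 (my location) before the "T" and "+XXXX" parts are dropped. If that is too hard or impossible, simply removing the "T" and "+0700" is enough, since most of my data has the "+0700" offset.
I have read many posts on Stack Overflow without success, for example [here][1] and [here][2]. The closest is [this one][3], but it did not work because its format is slightly different from mine.
Below is what I tried, based on the last of those posts:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, to_timestamp}

    object test extends App {
      val spark = SparkSession.builder().master("local[*]").getOrCreate()
      import spark.implicits._
      val df = Seq("2023-03-13T15:18:14+0700").toDF("time")
      // Attempt 1: pattern with fractional seconds (.SSS)
      val result = df.select(to_timestamp(col("time"), "yyyy-MM-dd'T'hh:mm:ss.SSSXXX").alias("newtime"))
      result.show(truncate = false) // Null
      // Attempt 2: same pattern without fractional seconds
      val result1 = df.select(to_timestamp(col("time"), "yyyy-MM-dd'T'hh:mm:ssXXX").alias("newtime"))
      result1.show(truncate = false) // Null
    }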
[1]: https://stackoverflow.com/questions/70394354/why-i-cant-parse-this-date-format-yyyy-mm-ddthhmmss-sssz
[2]: https://stackoverflow.com/questions/60857315/how-to-handle-t-and-z-in-the-date-format-using-pyspark-functions
[3]: https://stackoverflow.com/questions/54690305/spark-javahow-to-convert-dataset-string-column-of-format-yyyy-mm-ddthhmmss-s
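For the fallback mentioned in the question (just dropping the "T" and the trailing offset), the string edit itself is small. The sketch below uses plain Python rather than Spark, and `strip_t_and_offset` is a hypothetical helper name, not something from the linked posts. It keeps the wall-clock time exactly as written, so it only yields GMT+7 times for rows whose offset is already +0700.

```python
import re


# Hypothetical helper for illustration: a purely textual cleanup,
# with no time zone conversion.
def strip_t_and_offset(value: str) -> str:
    # Drop a trailing numeric offset such as "+0700" or "-0530".
    without_offset = re.sub(r"[+-]\d{4}$", "", value)
    # Replace the ISO 8601 "T" separator with a space.
    return without_offset.replace("T", " ", 1)


print(strip_t_and_offset("2023-03-13T15:18:14+0700"))  # 2023-03-13 15:18:14
```

In Spark, the same regular expression could probably be passed to `regexp_replace`, but that is not tested here.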
# Answer 1
**Score**: 1
Use a `cast()` transformation:

    from pyspark.sql.functions import col
    from pyspark.sql.types import TimestampType

    # Cast the ISO 8601 string column to a timestamp; the "+0700" offset is applied during the cast
    df = df.withColumn("time", col("time").cast(TimestampType()))
    df.show()

***Output***

    +-------------------+
    |               time|
    +-------------------+
    |2023-03-13 08:18:14|
    +-------------------+

***Schema***

    root
     |-- time: timestamp (nullable = true)
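The 08:18:14 in the output above is the same instant as 15:18:14+0700, just displayed in a different zone; Spark renders timestamps in the session time zone (`spark.sql.session.timeZone`). The plain-Python sketch below, which does not use Spark, checks the arithmetic. `render_at_offset` is a hypothetical helper, and the reading that this answer ran in a UTC session while the `to_timestamp` answer further down (which shows 09:18:14) ran in a UTC+1 session is an assumption, since neither answer states its time zone.

```python
from datetime import datetime, timedelta, timezone


# Hypothetical helper: parse an ISO 8601 string with a "+hhmm" offset and
# format the same instant as wall-clock time at a fixed UTC offset.
def render_at_offset(value: str, offset_hours: int) -> str:
    instant = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    target = timezone(timedelta(hours=offset_hours))
    return instant.astimezone(target).strftime("%Y-%m-%d %H:%M:%S")


raw = "2023-03-13T15:18:14+0700"
print(render_at_offset(raw, 0))  # 2023-03-13 08:18:14 (UTC)
print(render_at_offset(raw, 1))  # 2023-03-13 09:18:14 (UTC+1)
print(render_at_offset(raw, 7))  # 2023-03-13 15:18:14 (GMT+7, what the asker wants)
```

If that reading is right, setting the session time zone to a GMT+7 zone before displaying or formatting the column should give the 15:18:14 the asker wants, though neither answer tried this.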
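# Answer 2
**Score**: 1
You are not using the right format. Your date is in ISO 8601 form, and the matching pattern is `yyyy-MM-dd'T'HH:mm:ssZ`. Compared with the patterns in the question, `HH` is the 24-hour hour-of-day (`hh` is the 12-hour clock hour, which cannot hold 15), there is no `.SSS` fraction, and `Z` matches an offset written without a colon such as `+0700` (`XXX` expects `+07:00`). Here is how to use it with the `to_timestamp` function:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([["2023-03-13T15:18:14+0700"]], ['time'])
# Parse the string with a 24-hour pattern and a colon-less offset
df = df.withColumn("timestamp_utc", to_timestamp("time", "yyyy-MM-dd'T'HH:mm:ssZ"))
df.show(truncate=False)
df.printSchema()
```
The output is:

    +------------------------+-------------------+
    |time                    |timestamp_utc      |
    +------------------------+-------------------+
    |2023-03-13T15:18:14+0700|2023-03-13 09:18:14|
    +------------------------+-------------------+

    root
     |-- time: string (nullable = true)
     |-- timestamp_utc: timestamp (nullable = true)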