PySpark cannot infer timestamp even with timestampFormat
Question
I have this JSON file:

```json
{"created_at":"2022-01-02 12:17:43.399 UTC","updated_at":"2022-01-02 12:17:43.399 UTC"}
```

I am trying to read it with:

```python
read_df = spark \
    .read \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS 'UTC'") \
    .option("inferSchema", "true") \
    .json(path)
```

but the inferred schema comes back as:

```
root
 |-- created_at: string (nullable = true)
 |-- updated_at: string (nullable = true)
```

I tried forcing the conversion with `withColumn("timestamp", to_timestamp(col("created_at"), "yyyy-MM-dd HH:mm:ss.SSS 'UTC'"))`, and that works.

I don't want to provide the schema myself. I want Spark to infer it, because I have different files with different schemas and want to reuse the same reading function for all of them.

I'm not sure what's wrong.

Spark version: 3.3.2
Answer 1
Score: 4
Timestamp inference has to be enabled explicitly (see the docs and the source code):

> Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option inferTimestamp to true to enable such type inference.

```python
read_df = spark \
    .read \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS 'UTC'") \
    .option("inferSchema", "true") \
    .option("inferTimestamp", "true") \
    .json(path)
```

With this option set, both columns are inferred as timestamps.
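Since the question asks for one reading function that can be reused across files with different schemas, the same options could be wrapped in a small helper. This is only a sketch: the function name `read_json_inferring_timestamps` and its default format are illustrative assumptions, not part of the Spark API, and `spark` is assumed to be an already created `SparkSession`.

```python
def read_json_inferring_timestamps(
    spark,
    path,
    timestamp_format="yyyy-MM-dd HH:mm:ss.SSS 'UTC'",
):
    """Read JSON and let Spark infer timestamp columns.

    Hypothetical helper: `spark` is an existing SparkSession, `path` is the
    JSON location, and `timestamp_format` is the pattern that string values
    must match to be inferred as timestamps.
    """
    return (
        spark.read
        .option("timestampFormat", timestamp_format)
        .option("inferSchema", "true")
        # Timestamp inference is disabled by default since Spark 3.0.1.
        .option("inferTimestamp", "true")
        .json(path)
    )
```

Calling `read_json_inferring_timestamps(spark, path).printSchema()` on the file from the question should then show the schema Spark inferred. Files whose timestamps use a different layout would need their own `timestamp_format` argument.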