PySpark cannot infer timestamp even with timestampFormat

Question

I have this JSON file:

```json
{"created_at":"2022-01-02 12:17:43.399 UTC","updated_at":"2022-01-02 12:17:43.399 UTC"}
```

I'm trying to read it with:

```python
read_df = spark \
            .read \
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS 'UTC'") \
            .option("inferSchema", "true") \
            .json(path)
```

but the inferred schema comes back as:

```
root
 |-- created_at: string (nullable = true)
 |-- updated_at: string (nullable = true)
```

I tried forcing it with `withColumn("timestamp", to_timestamp(col("created_at"), "yyyy-MM-dd HH:mm:ss.SSS 'UTC'"))`, and that works.

I don't want to provide the schema myself; I'd rather have it inferred, because I have different files with different schemas and want to reuse the same read function.

I'm not sure what's wrong.

Spark version: 3.3.2

Answer 1

Score: 4

The inference of timestamps has to be enabled explicitly (see the Spark docs and source code):
> Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option inferTimestamp to true to enable such type inference.

```python
read_df = spark \
            .read \
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS 'UTC'") \
            .option("inferSchema", "true") \
            .option("inferTimestamp", "true") \
            .json(path)
```

This returns two timestamp columns.
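
Since the question asks for one read function that can be reused across files with different schemas, here is a minimal sketch of how those options might be wrapped. The helper names (`json_read_options`, `read_json_with_timestamps`, `timestamp_columns`) and the `DEFAULT_TIMESTAMP_FORMAT` constant are made up for illustration and are not part of PySpark. The sketch assumes `spark` is an existing `SparkSession` and that every file uses the same timestamp pattern unless a different one is passed in.

```python
from typing import Dict, List

# Pattern copied from the question; the trailing 'UTC' is a quoted literal.
DEFAULT_TIMESTAMP_FORMAT = "yyyy-MM-dd HH:mm:ss.SSS 'UTC'"


def json_read_options(timestamp_format: str = DEFAULT_TIMESTAMP_FORMAT) -> Dict[str, str]:
    """Build the same reader options the answer passes via .option(...)."""
    return {
        "timestampFormat": timestamp_format,
        "inferSchema": "true",
        # Timestamp inference is off by default since Spark 3.0.1.
        "inferTimestamp": "true",
    }


def read_json_with_timestamps(spark, path: str,
                              timestamp_format: str = DEFAULT_TIMESTAMP_FORMAT):
    """Read JSON from `path` with timestamp inference enabled.

    `spark` is assumed to be an existing SparkSession (hypothetical helper).
    """
    return spark.read.options(**json_read_options(timestamp_format)).json(path)


def timestamp_columns(df) -> List[str]:
    """Return the names of columns whose inferred type is timestamp."""
    return [name for name, dtype in df.dtypes if dtype == "timestamp"]
```

With the file from the question, `read_df = read_json_with_timestamps(spark, path)` followed by `timestamp_columns(read_df)` should list `created_at` and `updated_at` if inference behaves as the answer describes. Columns that do not match the pattern would be expected to stay as strings.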

Posted by huangapple on April 17, 2023, 03:12:03
Source: https://go.coder-hub.com/76029862.html