pyarrow timestamp datatype error on Parquet file

Question

I get the error below when I read a Parquet file with pyarrow and count its records in pandas. I do not want pyarrow to convert the column to timestamp[ns]; it can stay as timestamp[us]. Is there an option to keep the timestamp as is? I am using pyarrow 11.0.0 and Python 3.10. Please advise.

Code:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pandas as pd

# Read the Parquet file into a PyArrow Table, then convert it to a pandas DataFrame
table = pq.read_table('/Users/abc/Downloads/LOAD.parquet').to_pandas()

print(len(table))

Error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000

Answer 1

Score: 1

> I do not want pyarrow to convert to timestamp[ns]; it can stay as timestamp[us]. Is there an option to keep the timestamp as is?

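At the moment (pandas 1.x), pandas only supports nanosecond-resolution timestamps, which cover roughly 1677-09-21 to 2262-04-11. The value in your error message corresponds to 5202-04-02, far outside that range, so the conversion from timestamp[us] to timestamp[ns] fails.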
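To see why the cast is rejected, it can help to redo the arithmetic by hand. The sketch below uses only the standard library and assumes both types are signed 64-bit counts since 1970-01-01; the variable names are illustrative and not part of any library.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1

# Value reported in the error message, in microseconds since the epoch.
value_us = 101_999_952_000_000_000

# Largest microsecond count that still fits in int64 once it is
# multiplied by 1000 to express it in nanoseconds.
max_us_for_ns = INT64_MAX // 1000

value_as_date = EPOCH + timedelta(microseconds=value_us)
ns_upper_bound = EPOCH + timedelta(microseconds=max_us_for_ns)

print(value_as_date)             # 5202-04-02 00:00:00
print(ns_upper_bound)            # 2262-04-11 23:47:16.854775
print(value_us > max_us_for_ns)  # True, so the value cannot be stored as ns
```

The stored value is perfectly valid as microseconds; it only becomes a problem when it has to be expressed in nanoseconds.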

If you insist on keeping microsecond precision, you have a few options:

  1. Do not use pandas; stick to pyarrow, which supports microseconds:

import pyarrow.parquet as pq

table = pq.read_table("data.parquet")
len(table)

  2. Use datetime.datetime instead of pd.Timestamp in your dataframe (very slow):

table = pq.read_table("data.parquet")
df = table.to_pandas(timestamp_as_object=True)

  3. Disable the safety check and accept wrong values for the timestamps that are out of range:

table = pq.read_table("data.parquet")
df = table.to_pandas(safe=False)

Note that out-of-range values are silently corrupted: the original timestamp 5202-04-02 becomes 1694-12-04.

  4. If you're feeling intrepid, use pandas 2.0 with pyarrow as the backend for pandas:

pip install pandas==2.0.0rc1

import pandas as pd
df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")

  5. Fix the data using pyarrow.

Surely 5202-04-02 is a typo. See this question.

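The 1694-12-04 result under option 3 looks like ordinary 64-bit overflow. The sketch below reproduces it with plain integers, assuming the unchecked cast multiplies the microsecond count by 1000 and keeps only the low 64 bits as a signed value. That is an assumption about the behavior, not a description of pyarrow's internals.

```python
from datetime import datetime, timedelta

# 5202-04-02 expressed in microseconds since the epoch.
value_us = 101_999_952_000_000_000

# Multiply to nanoseconds, then wrap into the signed 64-bit range.
value_ns = value_us * 1000
wrapped_ns = (value_ns + 2**63) % 2**64 - 2**63

# datetime only has microsecond resolution, so drop the last three digits.
wrapped = datetime(1970, 1, 1) + timedelta(microseconds=wrapped_ns // 1000)
print(wrapped.date())  # 1694-12-04
```
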
huangapple
  • Posted on March 31, 2023, 19:14:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75897897.html