pyarrow timestamp datatype error on parquet file
Question
I get this error when I read and count records in pandas using pyarrow. I do not want pyarrow to convert to timestamp[ns]; it can stay in timestamp[us]. Is there an option to keep the timestamp as is? I am using pyarrow 11.0.0 and Python 3.10. Please advise.
code:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pandas as pd
# Read the Parquet file into a PyArrow Table, then convert it to a pandas DataFrame
table = pq.read_table('/Users/abc/Downloads/LOAD.parquet').to_pandas()
print(len(table))
error:
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000
Answer 1
Score: 1
> I do not want pyarrow to convert to timestamp[ns], it can keep in timestamp[us], is there an option to keep timestamp as is ?
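At the moment, pandas only supports nanosecond timestamps (timestamp[ns]), which cover roughly 1677-09-21 to 2262-04-11, so the value in your file cannot be represented.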
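To see why the cast fails, here is a small sketch using only the standard library. It takes the value from the error message, treats it as microseconds since the Unix epoch, and checks whether the same instant still fits in a signed 64-bit integer once expressed in nanoseconds. The variable names are only illustrative.

```python
import datetime as dt

INT64_MAX = 2**63 - 1
EPOCH = dt.datetime(1970, 1, 1)

# Value from the error message, in microseconds since the Unix epoch.
value_us = 101_999_952_000_000_000

# The same instant expressed in nanoseconds.
value_ns = value_us * 1_000

# Latest instant a signed 64-bit nanosecond counter can hold
# (truncated to microseconds, since datetime has no nanosecond field).
max_ns_datetime = EPOCH + dt.timedelta(microseconds=INT64_MAX // 1_000)

# The instant actually stored in the file.
stored_datetime = EPOCH + dt.timedelta(microseconds=value_us)

print(value_ns > INT64_MAX)  # True: the nanosecond value does not fit in int64
print(max_ns_datetime)       # upper bound of the nanosecond range (year 2262)
print(stored_datetime)       # the stored value (year 5202)
```

The stored value is almost three thousand years past the upper bound, which is why pyarrow refuses the conversion by default.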
If you insist on keeping microsecond (us) precision, you have a few options:
- Don't use pandas; stick to pyarrow, which supports microseconds:
table = pq.read_table("data.parquet")
len(table)
- Use datetime.datetime instead of pd.Timestamp in your dataframe (very slow)
table = pq.read_table("data.parquet")
df = table.to_pandas(timestamp_as_object=True)
- Ignore the overflow for the timestamps that are out of range
table = pq.read_table("data.parquet")
df = table.to_pandas(safe=False)
Be aware that the out-of-range value silently wraps around: the original timestamp 5202-04-02 becomes 1694-12-04.
- If you're feeling intrepid, use pandas 2.0 with pyarrow as the backend for pandas
pip install pandas==2.0.0rc1
pd.read_parquet("data.parquet", dtype_backend="pyarrow")
- Fix the data using pyarrow
Surely 5202-04-02 is a typo. See this question