March 31, 2023, 19:14:50

pyarrow timestamp datatype error on parquet file

Question

I get this error when I read a Parquet file with pyarrow and count the records in pandas. I do not want pyarrow to convert to timestamp[ns]; it can stay as timestamp[us]. Is there an option to keep the timestamp as is? I am using pyarrow 11.0.0 and Python 3.10. Please advise.

Code:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pandas as pd

# Read the Parquet file into a PyArrow Table, then convert it to a pandas DataFrame
df = pq.read_table('/Users/abc/Downloads/LOAD.parquet').to_pandas()

print(len(df))

Error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000
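
For context, the number in the error message is a count of microseconds since the Unix epoch. A minimal standard-library sketch (the variable names are illustrative and not from the original post) decodes it and shows why it cannot fit in the signed 64-bit nanosecond count that timestamp[ns] uses:

```python
from datetime import datetime, timedelta

# Value copied from the error message: microseconds since 1970-01-01.
value_us = 101_999_952_000_000_000

# Python's datetime has microsecond resolution and supports years up to 9999.
decoded = datetime(1970, 1, 1) + timedelta(microseconds=value_us)
print(decoded)  # 5202-04-02 00:00:00

# timestamp[ns] stores nanoseconds in a signed 64-bit integer.
INT64_MAX = 2**63 - 1
value_ns = value_us * 1000
fits_in_ns = value_ns <= INT64_MAX
print(fits_in_ns)  # False

# Latest instant representable at nanosecond resolution (truncated to microseconds).
latest_ns = datetime(1970, 1, 1) + timedelta(microseconds=INT64_MAX // 1000)
print(latest_ns.date())  # 2262-04-11
```

So the cast fails because the file contains a date in the year 5202, roughly three thousand years past the nanosecond limit.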

Answer 1

Score: 1

> I do not want pyarrow to convert to timestamp[ns]; it can stay as timestamp[us]. Is there an option to keep the timestamp as is?

At the moment, pandas only supports nanosecond-resolution timestamps, which cover roughly 1677-09-21 to 2262-04-11. Your value falls outside that range.

If you insist on keeping microsecond precision, you have a few options:

Don't use pandas; stick to pyarrow, which supports microseconds:

table = pq.read_table("data.parquet")
len(table)

Use datetime.datetime instead of pd.Timestamp in your DataFrame (very slow):

table = pq.read_table("data.parquet")
df = table.to_pandas(timestamp_as_object=True)

Skip the bounds check and accept that out-of-range timestamps come out wrong:

table = pq.read_table("data.parquet")
df = table.to_pandas(safe=False)

With this option the original timestamp, 5202-04-02, silently becomes 1694-12-04.
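
The 1694 date is what signed 64-bit overflow produces, not a rounding error. Here is a rough standard-library sketch of the arithmetic, assuming the unchecked cast simply multiplies by 1000 and wraps in two's complement (an assumption, but one that reproduces the date quoted above):

```python
from datetime import datetime, timedelta

value_us = 101_999_952_000_000_000  # 5202-04-02 in microseconds since the epoch

# Multiply to nanoseconds, then wrap the result into the signed 64-bit range.
wrapped_ns = (value_us * 1000 + 2**63) % 2**64 - 2**63

# datetime only has microsecond resolution, so drop the sub-microsecond part.
wrapped = datetime(1970, 1, 1) + timedelta(microseconds=wrapped_ns // 1000)
print(wrapped.date())  # 1694-12-04
```

This is why safe=False is only reasonable when the out-of-range rows do not matter or will be filtered out afterwards.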

If you're feeling intrepid, use pandas 2.0 with pyarrow as the backend for pandas:

pip install pandas==2.0.0rc1

pd.read_parquet("data.parquet", dtype_backend="pyarrow")

Fix the data using pyarrow:

Surely 5202-04-02 is a typo. See this question

