PyArrow schema with timestamp unit 's' changes to 'ms' when written to Parquet and reloaded
Question
As seen below, the "dob" field was of type timestamp[s] when written to Parquet format with pq.write_metadata. But upon rereading the metadata, the type changed to timestamp[ms]:
Python 3.11.1 (main, Jan 26 2023, 10:38:20) [GCC 8.5.0 20210514 (Red Hat 8.5.0-18)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa, pyarrow.parquet as pq
>>> schema = pa.schema([ pa.field("dob", pa.timestamp('s')) ])
>>> schema
dob: timestamp[s]
>>> pq.write_metadata(schema, '_common_schema')
>>> reloaded_schema = pq.read_schema('_common_schema')
>>> reloaded_schema
dob: timestamp[ms]
>>>
Is this because Parquet format does not support Timestamp of unit second?
How can I make the schema exactly the same in this case?
Answer 1
Score: 0
The behavior you're observing comes from the Parquet format itself. Parquet's TIMESTAMP logical type can only be stored with millisecond, microsecond, or nanosecond precision, so it does not support a timestamp of unit second (s). When you write a PyArrow schema with a timestamp unit of s to Parquet, PyArrow casts the field to ms, the coarsest unit Parquet can hold. When you reload the file, the stored ms unit is used, so the schema comes back as timestamp[ms].
To get exactly the same schema on both sides, declare the field as timestamp('ms') in PyArrow from the start, so that what you write is also what Parquet stores and what you read back.
You can use:
import pyarrow as pa
import pyarrow.parquet as pq
# Specify the Timestamp unit as milliseconds
schema = pa.schema([ pa.field("dob", pa.timestamp('ms')) ])
# Write the schema to a Parquet metadata file
pq.write_metadata(schema, '_common_schema')
# Read the schema back from the metadata file
reloaded_schema = pq.read_schema('_common_schema')
# The reloaded schema should now show the Timestamp field as having a unit of milliseconds
print(reloaded_schema)
This should result in the expected behavior, where the Timestamp field is correctly represented as having a unit of milliseconds, both when written to the Parquet file and when reloaded from the file.
There are a few other data types whose representation is worth checking when moving between Arrow and Parquet. Here are some to be aware of:
- Decimal: Arrow's decimal128 maps onto Parquet's DECIMAL logical type, which records both precision and scale, so a decimal128 column is expected to come back with the same precision and scale and without rounding.
- Timestamp: As we have seen, Parquet has no second-resolution timestamp, so timestamp[s] is stored as milliseconds. Nanosecond timestamps may also be cast to microseconds, depending on the Parquet format version being written; the version and coerce_timestamps arguments of pq.write_table control this.
- Time: Parquet's TIME logical type likewise covers only milliseconds, microseconds, and nanoseconds, so a time32('s') field should be expected to come back as time32('ms').
- Nested structures: Both formats support nested data. Arrow structs, lists, and maps are written as nested Parquet columns and reconstructed on read, so no manual flattening is needed.
These are some of the main differences to be aware of when converting between Arrow and Parquet data formats. It's important to ensure that the data is correctly represented in both formats to avoid unexpected behavior and data loss.