Computing date features using PyArrow on mixed timezone data

Question

I have timeseries data stored as (series_id,timestamp,value) in postgres. The timestamp is stored in UTC and there's a separate metadata table containing (series_id,timezone).

I need to compute date features (e.g. weekday, weekend, holiday) that require the timestamp to be converted into local time. I'm investigating options to scale this, one of which is pyarrow, but I'm having issues with time zones. For example, suppose I have three timestamps in different time zones:

import pandas as pd
df = pd.DataFrame.from_dict({
    "timestamp":[
        pd.Timestamp(year=2023, month=3, day=7, tz='Australia/Brisbane'),
        pd.Timestamp(year=2023, month=3, day=7, tz='Australia/Sydney'),
        pd.Timestamp(year=2023, month=3, day=7, tz='Australia/Adelaide'),
    ]
})

Pandas stores this as dtype('O') (i.e. object) and each Timestamp retains its associated timezone.

But Arrow appears to represent a timestamp column as having a consistent timezone (or none).

import pyarrow as pa
table = pa.Table.from_pandas(df)

pyarrow.Table
timestamp: timestamp[us, tz=Australia/Brisbane]
----
timestamp: [[2023-03-06 14:00:00.000000,2023-03-06 13:00:00.000000,2023-03-06 13:30:00.000000]]

It's storing everything as UTC but assigning the data type based on the time zone of the first entry (i.e. Brisbane). The resulting computations are then incorrect, as they are all done relative to Australia/Brisbane:

import pyarrow.compute as pc
pc.day(table['timestamp'])

<pyarrow.lib.ChunkedArray object at 0x7f381a04e070>
[
  [
    7,
    6,
    6
  ]
]

The correct result should be [7,7,7].

I cannot find much on best practices for performing computation on arrow timestamps - I understand best practice for storage is usually to convert to UTC and store the tzinfo, but even if I have a timestamp and tzinfo array, something like pa.compute.day(table['timestamp_utc'], table['tzinfo']) doesn't seem to exist.

My pandas solution simply processes the data series by series in a loop. All timestamps for a single series have the same tzinfo, so I just convert to localtime. Do I have to do the same with Arrow, or is there another approach?

Answer 1

Score: 2

I think you're right that Arrow is not currently able to help much here. Arrow defines the time zone at the "column" level (i.e. in the data type of the array) and not at the "cell" level, so all values in an array must have the same time zone.

You could convert all values to UTC as you import them, but this would cause the time zone information to be lost. If all you want to do is sort and compare "instants", that is fine.

However, if you want "features" (e.g. how many events occurred on a Saturday/Sunday) then you will need the time zone for each event. The "correct" (in my opinion, there is no official stance on this as far as I know) way to store this in Arrow would be to have one column of "instants" (UTC timestamps) and one column of time zone strings.

Unfortunately, to process this data you would need a function like get_day_of_week(instants, time_zones), which does not exist in Arrow today. The only function of this kind takes the instants alone (in pyarrow, pyarrow.compute.day_of_week) and uses the time zone of the column.

EDIT: To be clear, my answer is specific to pyarrow (full disclosure: I work on pyarrow). There are other Arrow compute engines (e.g. datafusion, duckdb) and they may offer different capabilities, though there is no Arrow array type for "timestamps with individual time zones", so any engine that supports this will need at least two columns.

huangapple
  • Published on 2023-03-07 07:07:17
  • When reposting, please retain the link to this article: https://go.coder-hub.com/75656639.html