Offset when filtering timezone-aware datetimes in Python Polars

Question

I have a DataFrame with timezone-aware datetimes in the column "datetime". The original timezone is UTC.

from datetime import datetime
from zoneinfo import ZoneInfo

import polars as pl

(
    pl.DataFrame(
        {
            "datetime": [
                "[01/Aug/2023:00:00:02 +0200]",
                "[01/Aug/2023:02:00:02 +0200]",
                "[03/Aug/2023:01:00:02 +0200]",
            ]
        }
    )
    .with_columns(pl.col("datetime").str.to_datetime("[%d/%b/%Y:%H:%M:%S %z]"))
    .with_columns(pl.col("datetime").dt.convert_time_zone("Europe/Berlin"))
    .filter(
        pl.col("datetime")
        .cast(pl.Date)
        .is_between(
            datetime(2023, 8, 1, tzinfo=ZoneInfo("Europe/Berlin")),
            datetime(2023, 8, 3, tzinfo=ZoneInfo("Europe/Berlin")),
        )
    )
)

The conversion to CEST ("Europe/Berlin") works. However, when I filter on the datetime column, there is a 2-hour offset.

datetime
datetime[μs, Europe/Berlin]
2023-08-01 02:00:02 CEST
2023-08-03 01:00:02 CEST

The first row of the original dataset is not in the result, but it should be. The third row is in the result, but it should not be.

This looks like the difference between UTC and CEST. The result is the same if the Python datetime object is naive (e.g. just datetime(2023, 8, 1)).

How do I get Polars to take the timezone into account when filtering?

Answer 1

Score: 1

Since you correctly convert your datetimes to a time zone, your filter should reflect that: compare the aware column directly against aware datetimes, without casting back to a naive date with .cast(pl.Date):

from datetime import datetime
from zoneinfo import ZoneInfo

import polars as pl

df = (
    pl.DataFrame(
        {
            "datetime": [
                "[01/Aug/2023:00:00:02 +0200]",
                "[01/Aug/2023:02:00:02 +0200]",
                "[03/Aug/2023:01:00:02 +0200]",
            ]
        }
    )
    .with_columns(pl.col("datetime").str.to_datetime("[%d/%b/%Y:%H:%M:%S %z]"))
    .with_columns(pl.col("datetime").dt.convert_time_zone("Europe/Berlin"))
)

print(
    df.filter(
        pl.col("datetime")
        .is_between(
            datetime(2023, 8, 1, tzinfo=ZoneInfo("Europe/Berlin")),
            datetime(2023, 8, 3, tzinfo=ZoneInfo("Europe/Berlin")),
        )
    )
)
┌─────────────────────────────┐
│ datetime                    │
│ ---                         │
│ datetime[μs, Europe/Berlin] │
╞═════════════════════════════╡
│ 2023-08-01 00:00:02 CEST    │
│ 2023-08-01 02:00:02 CEST    │
└─────────────────────────────┘

To observe the effect of casting back to pl.Date (or pl.Datetime, for better illustration), run for example:

df = (
    pl.DataFrame(
        {
            "datetime": [
                "[01/Aug/2023:00:00:02 +0200]",
                "[01/Aug/2023:02:00:02 +0200]",
                "[03/Aug/2023:01:00:02 +0200]",
            ]
        }
    )
    .with_columns(pl.col("datetime").str.to_datetime("[%d/%b/%Y:%H:%M:%S %z]"))
    .with_columns(pl.col("datetime").dt.convert_time_zone("Europe/Berlin"))
    .with_columns(pl.col("datetime").cast(pl.Datetime))
)

print(df["datetime"])
[
	2023-07-31 22:00:02
	2023-08-01 00:00:02
	2023-08-02 23:00:02
]

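As the output shows, the cast drops the time zone and leaves the UTC wall-clock time, not the Berlin one: a naive datetime in Polars resembles UTC. That is why .cast(pl.Date) in the question moved the first row to 2023-07-31 and the third row to 2023-08-02.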

huangapple
  • Posted on 2023-08-09 14:52:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76865272.html