Offset when filtering timezone aware datetimes in python polars
Question
I have a DataFrame with timezone-aware datetimes in the column "datetime". The original timezone is UTC.
    from datetime import datetime
    from zoneinfo import ZoneInfo

    import polars as pl

    (
        pl.DataFrame(
            {
                "datetime": [
                    "[01/Aug/2023:00:00:02 +0200]",
                    "[01/Aug/2023:02:00:02 +0200]",
                    "[03/Aug/2023:01:00:02 +0200]",
                ]
            }
        )
        # Parse the log-style strings, including their UTC offset.
        .with_columns(pl.col("datetime").str.to_datetime("[%d/%b/%Y:%H:%M:%S %z]"))
        # Display the values in local (Berlin) time.
        .with_columns(pl.col("datetime").dt.convert_time_zone("Europe/Berlin"))
        .filter(
            pl.col("datetime")
            .cast(pl.Date)  # this cast is where the offset appears
            .is_between(
                datetime(2023, 8, 1, tzinfo=ZoneInfo("Europe/Berlin")),
                datetime(2023, 8, 3, tzinfo=ZoneInfo("Europe/Berlin")),
            )
        )
    )
The conversion to CEST ("Europe/Berlin") works. However, when I filter on the datetime column, the result is off by 2 hours:
datetime
datetime[μs, Europe/Berlin]
2023-08-01 02:00:02 CEST
2023-08-03 01:00:02 CEST
The first row of the original dataset is not in the result, but it should be. The third row is in the result, but it should not be.
This looks like the difference between UTC and CEST. The result is the same if the Python datetime object is naive (e.g. just datetime(2023, 8, 1)).
How do I get polars to take the timezone into account when filtering?
Answer 1
Score: 1
Since you correctly localize your datetimes to a time zone, your filter should reflect that: compare the aware column directly against aware datetimes, without first casting it back to a naive date with .cast(pl.Date):

    from datetime import datetime
    from zoneinfo import ZoneInfo

    import polars as pl

    df = (
        pl.DataFrame(
            {
                "datetime": [
                    "[01/Aug/2023:00:00:02 +0200]",
                    "[01/Aug/2023:02:00:02 +0200]",
                    "[03/Aug/2023:01:00:02 +0200]",
                ]
            }
        )
        .with_columns(pl.col("datetime").str.to_datetime("[%d/%b/%Y:%H:%M:%S %z]"))
        .with_columns(pl.col("datetime").dt.convert_time_zone("Europe/Berlin"))
    )

    print(
        df.filter(
            # No cast: the aware column is compared with aware bounds.
            pl.col("datetime").is_between(
                datetime(2023, 8, 1, tzinfo=ZoneInfo("Europe/Berlin")),
                datetime(2023, 8, 3, tzinfo=ZoneInfo("Europe/Berlin")),
            )
        )
    )

    ┌─────────────────────────────┐
    │ datetime                    │
    │ ---                         │
    │ datetime[μs, Europe/Berlin] │
    ╞═════════════════════════════╡
    │ 2023-08-01 00:00:02 CEST    │
    │ 2023-08-01 02:00:02 CEST    │
    └─────────────────────────────┘
To observe the effect of casting back to pl.Date (or to pl.Datetime, which illustrates it better), run for example:

    df = (
        pl.DataFrame(
            {
                "datetime": [
                    "[01/Aug/2023:00:00:02 +0200]",
                    "[01/Aug/2023:02:00:02 +0200]",
                    "[03/Aug/2023:01:00:02 +0200]",
                ]
            }
        )
        .with_columns(pl.col("datetime").str.to_datetime("[%d/%b/%Y:%H:%M:%S %z]"))
        .with_columns(pl.col("datetime").dt.convert_time_zone("Europe/Berlin"))
        # Cast the aware column to a naive Datetime to see what the cast does.
        .with_columns(pl.col("datetime").cast(pl.Datetime))
    )
    print(df["datetime"])

[
2023-07-31 22:00:02
2023-08-01 00:00:02
2023-08-02 23:00:02
]
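In this output the cast has dropped the time zone and kept the underlying UTC values, so each naive datetime is two hours behind its CEST counterpart. In other words, a naive datetime in polars behaves like UTC here, which is why the date-based filter shifts by one row.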
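
The same effect can be reproduced with the standard library alone. The sketch below is an illustration rather than a description of how polars is implemented: it assumes a fixed +02:00 offset as a stand-in for Europe/Berlin in August 2023, and the names `FMT`, `CEST`, `kept_by_utc_date` and `kept_by_instant` are made up for the example. It compares filtering on the UTC calendar date with filtering on the aware instants.

```python
from datetime import date, datetime, timedelta, timezone

FMT = "[%d/%b/%Y:%H:%M:%S %z]"
raw = [
    "[01/Aug/2023:00:00:02 +0200]",
    "[01/Aug/2023:02:00:02 +0200]",
    "[03/Aug/2023:01:00:02 +0200]",
]

# Assumption: a fixed +02:00 offset stands in for Europe/Berlin (CEST).
CEST = timezone(timedelta(hours=2))

stamps = [datetime.strptime(s, FMT) for s in raw]

# Calendar date of each instant as seen locally and as seen in UTC.
local_dates = [t.astimezone(CEST).date() for t in stamps]
utc_dates = [t.astimezone(timezone.utc).date() for t in stamps]

# Filtering on the UTC date mimics the shifted result from the question.
first_day, last_day = date(2023, 8, 1), date(2023, 8, 3)
kept_by_utc_date = [
    t for t, d in zip(stamps, utc_dates) if first_day <= d <= last_day
]

# Filtering on the aware instants gives the intended result.
lower = datetime(2023, 8, 1, tzinfo=CEST)
upper = datetime(2023, 8, 3, tzinfo=CEST)
kept_by_instant = [t for t in stamps if lower <= t <= upper]

print(local_dates)
print(utc_dates)
print(kept_by_utc_date)
print(kept_by_instant)
```

The UTC dates are 31 July, 1 August and 2 August, so the date-based filter drops the first row and keeps the third, while the comparison on aware instants keeps the first two rows.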