Polars replace_time_zone function throws error of "no such local time"
Question
Here's our test data to work with:
import polars as pl
import pandas as pd
from datetime import date, time, datetime
df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 1, 3),
        high=date(2022, 9, 30),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)
I specifically need replace_time_zone to actually change the underlying timestamp, but the same time zone works with convert_time_zone and fails with replace_time_zone.
df.select(
    pl.col("UTC").dt.convert_time_zone(time_zone="America/New_York").alias("US")
)
# output
shape: (77761, 1)
┌────────────────────────────────┐
│ US │
│ --- │
│ datetime[ns, America/New_York] │
╞════════════════════════════════╡
│ 2022-01-02 19:00:00 EST │
│ 2022-01-02 19:05:00 EST │
│ 2022-01-02 19:10:00 EST │
│ 2022-01-02 19:15:00 EST │
│ … │
│ 2022-09-29 19:45:00 EDT │
│ 2022-09-29 19:50:00 EDT │
│ 2022-09-29 19:55:00 EDT │
│ 2022-09-29 20:00:00 EDT │
df.select(
    pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
)
# error output
thread '<unnamed>' panicked at 'No such local time', /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/chrono-0.4.23/src/offset/mod.rs:186:34
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Cell In[78], line 1
----> 1 df.select(
2 pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
3 )
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/polars/dataframe/frame.py:6432, in DataFrame.select(self, exprs, *more_exprs, **named_exprs)
6324 def select(
6325 self,
6326 exprs: IntoExpr | Iterable[IntoExpr] | None = None,
6327 *more_exprs: IntoExpr,
6328 **named_exprs: IntoExpr,
6329 ) -> Self:
6330 """
6331 Select columns from this DataFrame.
6332
(...)
6429
6430 """
6431 return self._from_pydf(
-> 6432 self.lazy()
6433 .select(exprs, *more_exprs, **named_exprs)
6434 .collect(no_optimization=True)
6435 ._df
6436 )
File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/polars/lazyframe/frame.py:1443, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
1432 common_subplan_elimination = False
1434 ldf = self._ldf.optimization_toggle(
1435 type_coercion,
1436 predicate_pushdown,
(...)
1441 streaming,
1442 )
-> 1443 return pli.wrap_df(ldf.collect())
PanicException: No such local time
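For readers comparing the two operations, the difference can be sketched with the standard library alone. This is only an illustration of the concept, not polars code: `astimezone` plays the role of convert_time_zone (same instant, new wall clock) and `replace(tzinfo=...)` plays the role of replace_time_zone (same wall clock, new instant). The variable names are made up for the example.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")
utc_stamp = datetime(2022, 1, 3, 0, 0, tzinfo=timezone.utc)

# "Convert": keep the instant, change how it is displayed.
converted = utc_stamp.astimezone(ny)

# "Replace": keep the wall-clock reading, change the instant it refers to.
replaced = utc_stamp.replace(tzinfo=ny)

print(converted.isoformat())  # 2022-01-02T19:00:00-05:00
print(replaced.isoformat())   # 2022-01-03T00:00:00-05:00
```

Because replacing reinterprets the wall-clock reading in the new zone, it can only succeed when that reading actually occurs there.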
Answer 1
Score: 2
You cannot replace the time zone of a UTC time series with a time zone that has DST transitions: you'll end up with non-existent and/or ambiguous datetimes. The error could be a bit more informative, but I do not think this is specific to polars.
Here's an illustration. "America/New_York" had a DST transition on March 13, 2022, so 2 am did not exist on that day. A range that ends before the transition works fine:
import polars as pl
from datetime import date
df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 3, 11),
        high=date(2022, 3, 13),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)
print(
    df.select(
        pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
    )
)
# shape: (289, 1)
# ┌────────────────────────────────┐
# │ US │
# │ --- │
# │ datetime[ns, America/New_York] │
# ╞════════════════════════════════╡
# │ 2022-03-11 00:00:00 EST │
# │ 2022-03-11 00:05:00 EST │
# │ 2022-03-11 00:10:00 EST │
# │ 2022-03-11 00:15:00 EST │
# │ … │
while this doesn't:
df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 3, 13),
        high=date(2022, 3, 15),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)
print(
    df.select(
        pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
    )
)
# PanicException: No such local time
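To see which wall-clock times would trigger this error, one can check whether a naive time survives a round trip through UTC. The following is a minimal standard-library sketch, not what polars or chrono do internally; `local_time_exists` is a hypothetical helper name introduced only for this illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_time_exists(naive: datetime, zone: str) -> bool:
    """Return True if the naive wall-clock time occurs in the given zone."""
    tz = ZoneInfo(zone)
    localized = naive.replace(tzinfo=tz)
    # A wall time inside a DST gap comes back shifted after a UTC round trip.
    round_trip = localized.astimezone(timezone.utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive


print(local_time_exists(datetime(2022, 3, 13, 1, 55), "America/New_York"))  # True
print(local_time_exists(datetime(2022, 3, 13, 2, 0), "America/New_York"))   # False
print(local_time_exists(datetime(2022, 3, 13, 3, 0), "America/New_York"))   # True
```

The first range above stops at 2022-03-13 00:00, before the gap, while the second range includes 02:00 through 02:55 on that day.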
A workaround is to convert UTC to the desired time zone, then add its UTC offset. For example:
df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 1, 3),
        high=date(2022, 9, 30),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)
# Same instants, displayed in New York time.
df = df.with_columns(
    pl.col("UTC").dt.convert_time_zone(time_zone="America/New_York").alias("US")
)
# Shift each instant by the zone's UTC offset at that instant.
df = df.with_columns(
    (pl.col("US") + (pl.col("UTC") - pl.col("US").dt.replace_time_zone(time_zone="UTC")))
    .alias("US_fakeUTC")
)
print(df.select(pl.col("US_fakeUTC")))
# shape: (77761, 1)
# ┌────────────────────────────────┐
# │ US_fakeUTC │
# │ --- │
# │ datetime[ns, America/New_York] │
# ╞════════════════════════════════╡
# │ 2022-01-03 00:00:00 EST │
# │ 2022-01-03 00:05:00 EST │
# │ 2022-01-03 00:10:00 EST │
# │ 2022-01-03 00:15:00 EST │
# │ … │
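The arithmetic behind this workaround can be followed for a single timestamp using only the standard library. This is a sketch under the assumption that the polars expression above amounts to "shift the instant by the zone's offset at that instant"; `relabel_wall_clock` is a hypothetical name, not a polars function.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def relabel_wall_clock(utc_dt: datetime, zone: str) -> datetime:
    """Shift a UTC instant so its reading in `zone` matches the UTC reading."""
    tz = ZoneInfo(zone)
    local = utc_dt.astimezone(tz)
    offset = local.utcoffset()  # e.g. -5 hours for EST, -4 hours for EDT
    # Subtracting the (negative) offset moves the instant forward.
    return (utc_dt - offset).astimezone(tz)


winter = relabel_wall_clock(datetime(2022, 1, 3, 0, 0, tzinfo=timezone.utc), "America/New_York")
summer = relabel_wall_clock(datetime(2022, 7, 1, 12, 0, tzinfo=timezone.utc), "America/New_York")
in_gap = relabel_wall_clock(datetime(2022, 3, 13, 2, 30, tzinfo=timezone.utc), "America/New_York")

print(winter.strftime("%Y-%m-%d %H:%M %Z"))  # 2022-01-03 00:00 EST
print(summer.strftime("%Y-%m-%d %H:%M %Z"))  # 2022-07-01 12:00 EDT
print(in_gap.strftime("%Y-%m-%d %H:%M %Z"))  # 2022-03-13 03:30 EDT
```

Note the last line: because the offset is looked up at the original instant, a UTC reading that falls in the skipped hour ends up one hour later instead of raising an error, so values close to a transition deserve a second look.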
Answer 2
Score: 1
You need to pass the time zone to date_range directly:
In [4]: import polars as pl
   ...: import pandas as pd
   ...: from datetime import date, time, datetime
   ...:
   ...: df = pl.DataFrame(
   ...:     pl.date_range(
   ...:         low=date(2022, 1, 3),
   ...:         high=date(2022, 9, 30),
   ...:         interval="5m",
   ...:         time_unit="ns",
   ...:         time_zone="America/New_York",
   ...:     ).alias("America/New_York")
   ...: )
In [5]: df
Out[5]:
shape: (77749, 1)
┌────────────────────────────────┐
│ America/New_York │
│ --- │
│ datetime[ns, America/New_York] │
╞════════════════════════════════╡
│ 2022-01-03 00:00:00 EST │
│ 2022-01-03 00:05:00 EST │
│ 2022-01-03 00:10:00 EST │
│ 2022-01-03 00:15:00 EST │
│ … │
│ 2022-09-29 23:45:00 EDT │
│ 2022-09-29 23:50:00 EDT │
│ 2022-09-29 23:55:00 EDT │
│ 2022-09-30 00:00:00 EDT │
└────────────────────────────────┘
Then it'll work, because polars can just start at the start time and keep adding 5 minutes, which is always well-defined.
If you first make a UTC date range and then replace the time zone, you will end up with ambiguous or non-existent datetimes (due to DST).
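The "keep adding 5 minutes" idea can be sketched with the standard library by stepping through instants in UTC and converting each one for display. `local_range` is a hypothetical helper and only illustrates the principle; it is not how polars implements date_range.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def local_range(start_local, end_local, step, zone):
    """Build zone-aware datetimes from naive local endpoints by stepping in UTC."""
    tz = ZoneInfo(zone)
    current = start_local.replace(tzinfo=tz).astimezone(timezone.utc)
    end = end_local.replace(tzinfo=tz).astimezone(timezone.utc)
    stamps = []
    while current <= end:
        # Every UTC instant maps to exactly one local reading.
        stamps.append(current.astimezone(tz))
        current += step
    return stamps


around_gap = local_range(
    datetime(2022, 3, 13, 1, 50),
    datetime(2022, 3, 13, 3, 5),
    timedelta(minutes=5),
    "America/New_York",
)
print([s.strftime("%H:%M %Z") for s in around_gap])
# ['01:50 EST', '01:55 EST', '03:00 EDT', '03:05 EDT']
```

The wall clock simply jumps from 01:55 EST to 03:00 EDT and no skipped time is ever produced. Over the question's full range (2022-01-03 to 2022-09-30) the same loop yields 77749 values, 12 fewer than the 77761 in the UTC range, because the skipped hour contains twelve 5-minute slots.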