Polars replace_time_zone function throws error of "no such local time"

Question

Here's our test data to work with:

import polars as pl
import pandas as pd
from datetime import date, time, datetime

df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 1, 3),
        high=date(2022, 9, 30),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)

I specifically need replace_time_zone, because I want to actually change the underlying timestamp. The same time zone works with convert_time_zone but fails with replace_time_zone.

df.select(
    pl.col("UTC").dt.convert_time_zone(time_zone="America/New_York").alias("US")
)

# output
shape: (77761, 1)
┌────────────────────────────────┐
│ US                             │
│ ---                            │
│ datetime[ns, America/New_York] │
╞════════════════════════════════╡
│ 2022-01-02 19:00:00 EST        │
│ 2022-01-02 19:05:00 EST        │
│ 2022-01-02 19:10:00 EST        │
│ 2022-01-02 19:15:00 EST        │
│ …                              │
│ 2022-09-29 19:45:00 EDT        │
│ 2022-09-29 19:50:00 EDT        │
│ 2022-09-29 19:55:00 EDT        │
│ 2022-09-29 20:00:00 EDT        │

df.select(
    pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
)

  # error output
  thread '<unnamed>' panicked at 'No such local time', /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/chrono-0.4.23/src/offset/mod.rs:186:34
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[78], line 1
----> 1 df.select(
      2     pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
      3     )

File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/polars/dataframe/frame.py:6432, in DataFrame.select(self, exprs, *more_exprs, **named_exprs)
   6324 def select(
   6325     self,
   6326     exprs: IntoExpr | Iterable[IntoExpr] | None = None,
   6327     *more_exprs: IntoExpr,
   6328     **named_exprs: IntoExpr,
   6329 ) -> Self:
   6330     """
   6331     Select columns from this DataFrame.
   6332 
   (...)
   6429 
   6430     """
   6431     return self._from_pydf(
-> 6432         self.lazy()
   6433         .select(exprs, *more_exprs, **named_exprs)
   6434         .collect(no_optimization=True)
   6435         ._df
   6436     )

File ~/Live-usb-storage/projects/python/alpha/lib/python3.10/site-packages/polars/lazyframe/frame.py:1443, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1432     common_subplan_elimination = False
   1434 ldf = self._ldf.optimization_toggle(
   1435     type_coercion,
   1436     predicate_pushdown,
   (...)
   1441     streaming,
   1442 )
-> 1443 return pli.wrap_df(ldf.collect())

PanicException: No such local time

Answer 1

Score: 2

You cannot replace the timezone in a UTC time series with a timezone that has DST transitions - you'll end up with non-existing and/or missing datetimes. The error could be a bit more informative, but I do not think this is specific to polars.

Here's an illustration. "America/New_York" had a DST transition on Mar 13, 2022, so 2 am did not exist on that day. This range ends at midnight on Mar 13, before the gap, so it works fine:

import polars as pl
from datetime import date

df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 3, 11),
        high=date(2022, 3, 13),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)

print(
    df.select(
        pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
    )
)
# shape: (289, 1)
# ┌────────────────────────────────┐
# │ US                             │
# │ ---                            │
# │ datetime[ns, America/New_York] │
# ╞════════════════════════════════╡
# │ 2022-03-11 00:00:00 EST        │
# │ 2022-03-11 00:05:00 EST        │
# │ 2022-03-11 00:10:00 EST        │
# │ 2022-03-11 00:15:00 EST        │
# │ …                              │

while this range, which covers 2 am on Mar 13, doesn't:

df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 3, 13),
        high=date(2022, 3, 15),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)

print(
    df.select(
        pl.col("UTC").dt.replace_time_zone(time_zone="America/New_York").alias("US")
    )
)
# PanicException: No such local time
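
To see which wall-clock times trigger the error, here is a minimal sketch that uses only the Python standard library (`zoneinfo`, Python 3.9+), not polars. The helper name `exists_in_zone` is hypothetical. It attaches the zone to a naive datetime, round-trips through UTC, and checks whether the wall-clock time survives; a time inside a DST gap does not.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def exists_in_zone(naive: datetime, tz: ZoneInfo) -> bool:
    """Return True if the naive wall-clock time actually occurs in tz."""
    local = naive.replace(tzinfo=tz)
    # A time inside a DST gap comes back shifted after a UTC round trip.
    round_trip = local.astimezone(timezone.utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive


# Wall-clock times around the 2022-03-13 spring-forward transition.
for candidate in (
    datetime(2022, 3, 13, 1, 55),
    datetime(2022, 3, 13, 2, 0),
    datetime(2022, 3, 13, 3, 0),
):
    print(candidate, exists_in_zone(candidate, NY))
```

A UTC series at 5-minute steps contains 02:00 through 02:55 on that date, and those are the values that cannot be relabelled as New York time.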

A workaround is to convert UTC to the desired time zone, then add its UTC offset. For example:

df = pl.DataFrame(
    pl.date_range(
        low=date(2022, 1, 3),
        high=date(2022, 9, 30),
        interval="5m",
        time_unit="ns",
        time_zone="UTC",
    ).alias("UTC")
)

df = df.with_columns(
    pl.col("UTC").dt.convert_time_zone(time_zone="America/New_York").alias("US")
)

df = df.with_columns(
    (pl.col("US") + (pl.col("UTC") - pl.col("US").dt.replace_time_zone(time_zone="UTC")))
    .alias("US_fakeUTC")
)

print(df.select(pl.col("US_fakeUTC")))
# shape: (77761, 1)
# ┌────────────────────────────────┐
# │ US_fakeUTC                     │
# │ ---                            │
# │ datetime[ns, America/New_York] │
# ╞════════════════════════════════╡
# │ 2022-01-03 00:00:00 EST        │
# │ 2022-01-03 00:05:00 EST        │
# │ 2022-01-03 00:10:00 EST        │
# │ 2022-01-03 00:15:00 EST        │
# │ …                              │
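
The arithmetic behind this workaround can be imitated with the standard library to check individual values. This is only a sketch: `workaround_like` is a hypothetical helper, and it assumes the polars addition acts on the underlying instant, which is not verified against polars here.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def workaround_like(utc_wall: datetime, tz: ZoneInfo) -> datetime:
    """Convert a UTC wall-clock time to tz, then add the zone's offset at that instant."""
    utc_dt = utc_wall.replace(tzinfo=timezone.utc)
    local = utc_dt.astimezone(tz)
    # Mirrors UTC - US.replace_time_zone("UTC"): 5 h during EST, 4 h during EDT.
    offset = utc_dt - local.replace(tzinfo=timezone.utc)
    return (utc_dt + offset).astimezone(tz)


far_from_dst = workaround_like(datetime(2022, 1, 3, 0, 0), NY)
near_dst = workaround_like(datetime(2022, 3, 13, 3, 30), NY)
print(far_from_dst)
print(near_dst)
```

In this imitation the January value keeps its wall-clock time (00:00 EST), but 03:30 on Mar 13 comes out as 04:30 EDT, because the offset is taken at the converted instant (still EST) rather than at the target wall-clock time. If the same holds in polars, values within a few hours of a DST transition are worth checking separately.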

Answer 2

Score: 1

You need to pass the time zone to date_range directly:

In [4]: import polars as pl
   ...: import pandas as pd
   ...: from datetime import date, time, datetime
   ...:
   ...: df = pl.DataFrame(
   ...:     pl.date_range(
   ...:         low=date(2022, 1, 3),
   ...:         high=date(2022, 9, 30),
   ...:         interval="5m",
   ...:         time_unit="ns",
   ...:         time_zone="America/New_York",
   ...:     ).alias("America/New_York")
   ...: )

In [5]: df
Out[5]:
shape: (77749, 1)
┌────────────────────────────────┐
│ America/New_York               │
│ ---                            │
│ datetime[ns, America/New_York] │
╞════════════════════════════════╡
│ 2022-01-03 00:00:00 EST        │
│ 2022-01-03 00:05:00 EST        │
│ 2022-01-03 00:10:00 EST        │
│ 2022-01-03 00:15:00 EST        │
│ …                              │
│ 2022-09-29 23:45:00 EDT        │
│ 2022-09-29 23:50:00 EDT        │
│ 2022-09-29 23:55:00 EDT        │
│ 2022-09-30 00:00:00 EDT        │
└────────────────────────────────┘

Then, it'll work, because polars can just start at the start time and keep adding 5 minutes, which is always well-defined.

If you first make a UTC date range and then replace the time zone, you will end up with ambiguous or non-existent datetimes (due to DST).
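
The "keep adding 5 minutes" idea can be sketched with the standard library alone. This is an illustration under assumptions, not polars' implementation: `local_range` is a hypothetical helper that steps in elapsed (UTC) time and converts each instant back to the local zone, so it never produces a wall-clock time that does not exist.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")
STEP = timedelta(minutes=5)


def local_range(start: datetime, end: datetime, step: timedelta):
    """Yield aware datetimes from start to end inclusive, stepping in elapsed time."""
    current = start.astimezone(timezone.utc)
    stop = end.astimezone(timezone.utc)
    while current <= stop:
        yield current.astimezone(start.tzinfo)
        current += step


# Around the 2022-03-13 transition the labels jump over the 02:00 hour.
around_gap = list(
    local_range(
        datetime(2022, 3, 13, 1, 50, tzinfo=NY),
        datetime(2022, 3, 13, 3, 5, tzinfo=NY),
        STEP,
    )
)
labels = [dt.strftime("%H:%M %Z") for dt in around_gap]
print(labels)

# Full range from the question, generated directly in New York time.
count = sum(
    1
    for _ in local_range(
        datetime(2022, 1, 3, tzinfo=NY), datetime(2022, 9, 30, tzinfo=NY), STEP
    )
)
print(count)
```

The count matches the 77749 rows shown above: 12 fewer than the 77761 rows of the UTC range, because the New York range spans one hour less of elapsed time.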

Posted by huangapple on March 21, 2023, 01:01:52
Please keep the link to this article when reposting: https://go.coder-hub.com/75793219.html