DST temporal feature from timestamp using polars
Question
I'm migrating code from pandas to polars. I have time-series data consisting of a timestamp column and a value column, and I need to compute a number of features. For example:
from datetime import datetime, timedelta
import numpy as np
import polars as pl

df = pl.DataFrame({
    "timestamp": pl.date_range(
        datetime(2017, 1, 1),
        datetime(2018, 1, 1),
        timedelta(minutes=15),
        time_zone="Australia/Sydney",
        time_unit="ms", eager=True),
})
value = np.random.normal(0, 1, len(df))
df = df.with_columns([pl.Series(value).alias("value")])
I need to generate a column indicating whether each timestamp is in standard or daylight time. I'm currently using apply because, as far as I can see, there isn't a temporal expression for this. My current code is:
def dst(timestamp: datetime):
    return int(timestamp.dst().total_seconds() != 0)

df = df.with_columns(pl.struct(["timestamp"]).apply(lambda x: dst(**x)).alias("dst"))
(This uses a trick that effectively checks whether the tzinfo.dst(dt) offset is zero.)
Is there a (fast) way of doing this using polars expressions rather than the (slow) apply?
Answer 1
Score: 1
You can exploit strftime for this.
(
    df
    .with_columns(
        dst=pl.when(pl.col('timestamp').dt.strftime("%Z").str.contains("(DT$)"))
        .then(True)
        .otherwise(False)
    )
)
It relies on the time zone abbreviation ending in "DT" to determine the DST status. That works here and would work for US time zones (e.g. EST/EDT, CST/CDT), but there are many time zones whose daylight-time abbreviation does not end in "DT", and for those it fails.
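To see where the "DT" suffix heuristic holds and where it breaks, here is a minimal sketch that prints the January and July abbreviations for a few zones. It uses Python's standard zoneinfo module as a stand-in for the polars strftime("%Z") call; the helper name abbreviations and the choice of zones are illustrative assumptions, not part of the original answer.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def abbreviations(zone: str, year: int = 2017) -> tuple[str, str]:
    """Return the (1 January, 1 July) time zone abbreviations for `zone`."""
    tz = ZoneInfo(zone)
    return (
        datetime(year, 1, 1, 12, tzinfo=tz).strftime("%Z"),
        datetime(year, 7, 1, 12, tzinfo=tz).strftime("%Z"),
    )


# Sydney and New York use "...DT" for daylight time; London and Berlin do not.
for zone in ("Australia/Sydney", "America/New_York", "Europe/London", "Europe/Berlin"):
    print(zone, abbreviations(zone))
```

For Europe/London the daylight-time abbreviation is BST and for Europe/Berlin it is CEST, so a regex anchored on "DT$" would mark every row in those zones as standard time.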
Alternatively, you could use the UTC offset, but it's a lot more convoluted.
(
    df
    .with_columns(
        tzoff=pl.col('timestamp').dt.strftime("%z").cast(pl.Int64())
    )
    .join(
        df
        .select(
            tzoff=pl.col('timestamp').dt.strftime("%z").cast(pl.Int64())
        )
        .unique('tzoff')
        .sort('tzoff')
        .with_columns(
            dst=pl.lit([False, True])
        ),
        on='tzoff')
    .drop('tzoff')
)
This one assumes that the time zone has exactly two UTC offsets in the data, and that the smaller of the two is standard time and the larger is daylight saving time.
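Before relying on the two-offset assumption, it may be worth checking it for the zone and date range in question. The following is a rough sketch using only the standard library (not polars): it samples one timestamp per day of 2017 in Australia/Sydney, collects the distinct UTC offsets, and compares the "larger offset means DST" flag against datetime.dst(). The variable names are illustrative.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

tz = ZoneInfo("Australia/Sydney")

# One sample per day of 2017, at local noon (well away from the 02:00/03:00 transitions).
start = datetime(2017, 1, 1, 12, tzinfo=tz)
samples = [start + timedelta(days=i) for i in range(365)]

# Distinct UTC offsets seen in the sample, smallest first.
offsets = sorted({s.utcoffset() for s in samples})
print(offsets)

# Flag derived from the offset (larger offset = daylight time) ...
flag_by_offset = [s.utcoffset() == offsets[-1] for s in samples]
# ... versus the flag derived from the tzinfo's own DST component.
flag_by_dst = [s.dst() != timedelta(0) for s in samples]

print(len(offsets) == 2 and flag_by_offset == flag_by_dst)
```

If the number of distinct offsets is not two, or the two flags disagree, the join-based approach would mislabel rows for that data.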
Answer 2
Score: 1
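With polars>=0.18.5 the following works:
df = df.with_columns((pl.col("timestamp").dt.dst_offset() != 0).cast(pl.Int32).alias("dst"))
dt.dst_offset() is non-zero while daylight time is in effect, so the column is 1 for daylight time and 0 for standard time, matching the apply version in the question.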