Trouble with strptime() conversion of duration time string

Question

I have some duration-type data (lap times) stored as pl.Utf8 that fails to convert using strptime, whereas regular datetimes work as expected.

Minutes (before :) and seconds (before .) are always padded to two digits; milliseconds are always three digits.

Lap times are always under 2 minutes.

import polars as pl

df = pl.DataFrame({
    "lap_time": ["01:14.007", "00:53.040", "01:00.123"]
})

df = df.with_columns(
    [
        # pl.col('release_date').str.strptime(pl.Date, fmt="%B %d, %Y"), # works
        pl.col('lap_time').str.strptime(pl.Time, fmt="%M:%S.%3f").cast(pl.Duration), # fails
    ]
)

So I used the chrono format specifier definitions from https://docs.rs/chrono/latest/chrono/format/strftime/index.html, which polars uses according to its strptime docs.

The second conversion (for lap_time) always fails, no matter whether I use .%f, .%3f, or %.3f. Apparently, strptime doesn't allow creating a pl.Duration directly, so I tried pl.Time instead, but it fails with this error:

ComputeError: strict conversion to dates failed, maybe set strict=False

Setting strict=False, however, yields null values for the whole Series.

Am I missing something, or is this some weird behavior on chrono's or python-polars' part?

Answer 1

Score: 3

General case

If your durations may exceed 24 hours, you can extract the components (minutes, seconds, and so on) from the string with a regex pattern. For example:

df = pl.DataFrame({
    "time": ["+01:14.007", "100:20.000", "-05:00.000"]
})

df.with_columns(
    # extract_all yields a list of length 3 per row: ["min", "sec", "ms"]
    pl.col("time").str.extract_all(r"([+-]?\d+)")
).with_columns(
    pl.duration(
        minutes=pl.col("time").arr.get(0),
        seconds=pl.col("time").arr.get(1),
        milliseconds=pl.col("time").arr.get(2)
    ).alias("time")
)
┌──────────────┐
│ time         │
│ ---          │
│ duration[ns] │
╞══════════════╡
│ 1m 14s 7ms   │
│ 1h 40m 20s   │
│ -5m          │
└──────────────┘

About pl.Time

To convert the data to pl.Time, you need to specify hours as well. If you prepend 00 hours to your time strings, the code works:

df = pl.DataFrame({"str_time": ["01:14.007", "01:18.880"]})

df.with_columns(
    duration = (pl.lit("00:") + pl.col("str_time"))\
        .str.strptime(pl.Time, fmt="%T%.3f")\
        .cast(pl.Duration)
)
┌───────────┬──────────────┐
│ str_time  ┆ duration     │
│ ---       ┆ ---          │
│ str       ┆ duration[μs] │
╞═══════════╪══════════════╡
│ 01:14.007 ┆ 1m 14s 7ms   │
│ 01:18.880 ┆ 1m 18s 880ms │
└───────────┴──────────────┘
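
As a cross-check outside Polars, the same zero-hour padding idea can be sketched with Python's standard library. This is only an illustration under assumptions: the helper name `lap_time_via_time_of_day` is hypothetical, and Python's `datetime.strptime` is a different parser from the chrono-based one Polars uses, so it says nothing about which format strings chrono accepts.

```python
from datetime import datetime, timedelta


def lap_time_via_time_of_day(text: str) -> timedelta:
    """Parse 'MM:SS.mmm' by prepending a zero hour, then measure from midnight.

    Hypothetical helper for illustration; not part of Polars.
    """
    # Pad the missing hour so the string is a complete time of day.
    parsed = datetime.strptime("00:" + text, "%H:%M:%S.%f")
    # The duration is the distance from midnight on the same (dummy) date.
    midnight = parsed.replace(hour=0, minute=0, second=0, microsecond=0)
    return parsed - midnight


if __name__ == "__main__":
    for raw in ["01:14.007", "01:18.880"]:
        print(raw, "->", lap_time_via_time_of_day(raw))
```

Because the string is parsed as a time of day, values of 60 minutes or more are rejected with a `ValueError` in this sketch, so this route only suits short durations such as lap times.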

Answer 2

Score: 1

Create your own parser: strptime works for datetime stamps only, not for time deltas. The accepted answer is bad practice, as it fails for reasonable inputs such as durations of 80 minutes or negative durations.

You can use pl.Series.str.extract() to build your own regex parser and extract the values you want before passing them into the Duration constructor.

As far as I'm aware, there is no "duration stamp" parser in Rust. It might be a good idea for a crate, if anyone is reading this. The syntax could be similar to strptime but handle cases such as negative durations and non-wrapping of the most significant "digit"/subunit: for a "minute duration stamp", you would wrap seconds at 60 but not minutes. In particular, 61 minutes must remain 61.
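
The behavior described above can be sketched in plain Python, without Polars, to make the rules concrete. This is a minimal sketch under assumptions, not an existing library API: the pattern `_DURATION_RE` and the function `parse_minute_duration` are hypothetical names, and the accepted format is assumed to be an optional sign, any number of minute digits, two-digit seconds below 60, and exactly three millisecond digits.

```python
import re
from datetime import timedelta

# Hypothetical "minute duration stamp": [sign]minutes:seconds.milliseconds
# Minutes are unbounded; seconds must stay below 60.
_DURATION_RE = re.compile(r"([+-]?)(\d+):([0-5]\d)\.(\d{3})")


def parse_minute_duration(text: str) -> timedelta:
    """Parse strings such as '01:14.007', '100:20.000' or '-05:30.000'."""
    match = _DURATION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a [sign]MM:SS.mmm duration: {text!r}")
    sign, minutes, seconds, millis = match.groups()
    magnitude = timedelta(
        minutes=int(minutes),  # not wrapped: 61 stays 61 minutes
        seconds=int(seconds),
        milliseconds=int(millis),
    )
    # Apply the sign to the whole duration, not only to the minutes field.
    return -magnitude if sign == "-" else magnitude


if __name__ == "__main__":
    for raw in ["01:14.007", "100:20.000", "-05:30.000", "61:00.000"]:
        print(raw, "->", parse_minute_duration(raw))
```

Applying the sign to the combined magnitude is a deliberate choice in this sketch: negating only the minutes field would turn "-05:30.000" into minus 5 minutes plus 30 seconds.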

Answer 3

Score: 0

Code adapted from glebcom's answer:

```python
df = df.with_columns(
    [
        # pl.col('release_date').str.strptime(pl.Date, fmt="%B %d, %Y"), # works
        pl.duration(
            minutes=pl.col("lap_time").str.slice(0, 2),
            seconds=pl.col("lap_time").str.slice(3, 2),
            milliseconds=pl.col("lap_time").str.slice(6, 3)
        ).alias('lap_time'),
    ]
)
```

This answer was posted as an edit to the question https://stackoverflow.com/questions/75654140/trouble-with-strptime-conversion-of-duration-time-string by the OP Dorian under CC BY-SA 4.0.

huangapple
  • Published on March 7, 2023, 01:47:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75654140.html