2023-04-17 16:57:52

How can I vectorize a for-loop running over a pandas Periodindex where I need to sort datetimes into appropriate period?

Question

I have a DataFrame "timeseries" which has datetimes as its index, and I have a PeriodIndex "on":

import numpy as np
import pandas as pd


timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
    data={
        "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
    },
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

I would like to add a column to "timeseries" that contains the period in "on" that each respective datetime is in:

                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaN
2000-01-03 20:00:00+00:00     5.0         NaN
2000-01-04 13:00:00+00:00     3.0  2000-01-04

So far I have achieved this with a for-loop:

timeseries["period"] = np.NaN
for period in on:
    datetimes_in_period = timeseries.index[
        (timeseries.index >= period.start_time.tz_localize("UTC"))
        & (timeseries.index <= period.end_time.tz_localize("UTC"))
    ]
    timeseries["period"].loc[datetimes_in_period] = period

For efficiency's sake I want to avoid loops in Python. How can I vectorize this code?

Answer 1

Score: 0

Note: This answer originally referred to a question which has since been edited to change the intent

Here's my solution:

import pandas as pd


timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
    data={
        "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
    },
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)


merge = (
    pd.merge_asof(
        timeseries.index.to_frame(),
        on.to_timestamp().to_frame(),
        right_index=True,
        left_index=True,
    )
    .drop('0_x', axis=1)
    .reset_index()
    .rename({'0_y': 'date', 'index': 'period'}, axis=1)
)

# extract from `on` any date that has no matching date in `timeseries`
unmatched_periods = on.to_timestamp().difference(merge.date).to_frame()
unmatched_periods[0] = pd.NaT

merge = merge.groupby('date').agg(func=lambda x: list(x))
unmatched_periods.columns = merge.columns
merge = pd.concat((merge, unmatched_periods))
merge

I have never used PeriodIndex before, and had to convert it to a DatetimeIndex with to_timestamp. From the documentation, it seems that PeriodIndex is intended for creating dates/periods programmatically (e.g. every X days between two dates), which doesn't seem to be quite how it's being used here.

Anyway, the core of the solution is merge_asof, which is like merge but, instead of requiring equal keys, looks for the closest key. By default it looks backwards, which is what we want (the closest date in on that is on or before the date in timeseries).
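
To make the backward lookup concrete, here is a minimal sketch on toy data. The frame names `events` and `starts` and their values are made up for illustration and are not part of the question:

```python
import pandas as pd

# Hypothetical toy frames; names and values are illustrative only.
events = pd.DataFrame(
    {"time": pd.to_datetime(
        ["2000-01-01 12:00", "2000-01-03 14:00", "2000-01-04 13:00"]
    )}
)
starts = pd.DataFrame(
    {"start": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04"])}
)

# Both frames must be sorted by their keys. With the default
# direction="backward", each event gets the last start that is <= its time.
matched = pd.merge_asof(events, starts, left_on="time", right_on="start")
print(matched)
```

Note that the 2000-01-03 event is paired with the 2000-01-02 start even though it falls outside that day: a backward match only finds the nearest earlier key, it does not check containment.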

Then we use groupby and agg to get the groups.
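
As a small illustration of that grouping step, the sketch below collects the timestamps matched to each date into a list. The frame `matches` is a hypothetical stand-in for the merged result, with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the merged frame: one row per timestamp,
# with the date it was matched to.
matches = pd.DataFrame(
    {
        "date": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-02"]),
        "period": pd.to_datetime(
            ["2000-01-01 12:00", "2000-01-01 13:00", "2000-01-02 13:00"]
        ),
    }
)

# One row per date; each value is a Python list of the matched timestamps.
grouped = matches.groupby("date")["period"].agg(list)
print(grouped)
```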

We also need to get the dates in on which did not have any match in timeseries (in this case 2000-01-05).

Note: you say you want to avoid loops for efficiency. In principle that's a good idea, but be aware that the result you're trying to achieve (lists as values in a column) is itself quite inefficient, and pandas groupby is also fairly demanding.

Answer 2

Score: 0

You can still use .merge_asof as @rorshan suggested.

If you create a dataframe of the start/end intervals:

df_on = pd.DataFrame({
   "period":     on,
   "start_time": on.start_time.tz_localize("UTC"),
   "end_time":   on.end_time.tz_localize("UTC"),
})

df = pd.merge_asof(timeseries, df_on, left_index=True, right_on="start_time")

# blank out period when not inside
df["period"] = df["period"].where((df.index >= df["start_time"]) & (df.index <= df["end_time"]))

>>> df[["value1", "period"]]
                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaT
2000-01-03 20:00:00+00:00     5.0         NaT
2000-01-04 13:00:00+00:00     3.0  2000-01-04
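
For comparison, a possible alternative that avoids the merge altogether is sketched below. It is not taken from either answer and rests on two assumptions: the periods in on are meant as UTC days, and the frequency "D" is hard-coded to match on. The sample data is shortened to four rows:

```python
import pandas as pd

timeseries = pd.DataFrame(
    {"value1": [6.0, 7.0, 4.0, 3.0]},
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)

# Drop the timezone (values stay in UTC wall time), then map every
# timestamp to the daily period that contains it.
candidates = timeseries.index.tz_localize(None).to_period("D")

# Keep a candidate only if it is one of the periods listed in `on`;
# everything else becomes NaT.
period = pd.Series(candidates, index=timeseries.index)
timeseries["period"] = period.where(candidates.isin(on))
print(timeseries)
```

This only works because every entry of a PeriodIndex shares one frequency, so each timestamp has exactly one candidate period to test for membership.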
