Question

I have a harvest dataframe and a weather dataframe.
I want to get the number of days above a temperature threshold in the x months before harvest, for every block.
Note that the harvest dataframe spans multiple years and the id is not one-to-one between the frames, i.e. two blocks in harvest_df can share an id that corresponds to a single location in weather_df.

My current (working) code is below, but it is very slow, taking minutes to run. I want to speed it up, but it is unclear to me how.

from pandas import DateOffset

def days_above_thresh(x, weather_df):
    # For one harvest row, count the weather rows with the same id that fall
    # within the two months before harvest and have a max temperature above 30.
    return weather_df.loc[
        (weather_df["id"] == x["id"]) &
        (weather_df["day"] >= x["harvest_date"] - DateOffset(months=2)) &
        (weather_df["day"] <= x["harvest_date"]) &
        (weather_df["temperature_max"] > 30),
        "temperature_max"].count()

# Runs the filter once per harvest row, which is why it is slow.
harvest_df["days_above_30"] = harvest_df.apply(days_above_thresh, args=(weather_df,), axis=1)

The dataframes look something like this:

weather_df
id      day      temperature_max
1    2020-01-01    30
1    2020-01-02    32
1    2020-01-03    28
1    2020-01-04    25 
         .
         .
         .
2    2020-01-01    10
2    2020-01-02    15
2    2020-01-03    17
2    2020-01-04    12
         .
         .
         .

harvest_df
id   farm_id  harvest_date
1       87    2020-01-02 
1       86    2020-01-03
2       13    2020-01-30

Answer 1

Score: 1

This could be sped up by merging the two frames on id, filtering the resulting frame using the boolean mask (that is constructed in the function you defined) and calling groupby.size on the result.

If your frames are large (but not too large), this cuts the runtime significantly: with a 10k-row harvest_df, it goes from 13.5 s to 15 ms. However, the merge creates a frame with one row per matching (harvest row, weather row) pair, so if harvest_df is very large (maybe millions of rows), you might run into memory issues.
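If memory does become a problem, one possible alternative that avoids the large merged frame is to sort each location's weather once and count with running totals and a binary search. This is only a sketch: the function name `count_hot_days` and its parameters are made up for illustration, it assumes `harvest_df` has a unique index, and it uses `pd.DateOffset` for the window start rather than the timedelta used in this answer.

```python
import numpy as np
import pandas as pd

def count_hot_days(harvest_df, weather_df, months=2, threshold=30):
    """Hypothetical helper: days above `threshold` in [harvest_date - months, harvest_date]."""
    result = pd.Series(0, index=harvest_df.index, dtype="int64")
    # Window start per harvest row (calendar-aware offset).
    start = harvest_df["harvest_date"] - pd.DateOffset(months=months)
    # Sort each location's weather by day once, so binary search is valid.
    weather_by_id = {key: grp.sort_values("day") for key, grp in weather_df.groupby("id")}
    for key, rows in harvest_df.groupby("id"):
        w = weather_by_id.get(key)
        if w is None:
            continue  # no weather for this id, counts stay at 0
        days = w["day"].to_numpy()
        hot = w["temperature_max"].to_numpy() > threshold
        # running[i] = number of hot days among the first i weather rows
        running = np.concatenate(([0], np.cumsum(hot)))
        lo = np.searchsorted(days, start.loc[rows.index].to_numpy(), side="left")
        hi = np.searchsorted(days, rows["harvest_date"].to_numpy(), side="right")
        result.loc[rows.index] = running[hi] - running[lo]
    return result
```

It would be used as `harvest_df["days_above_30"] = count_hot_days(harvest_df, weather_df)`. Peak memory stays roughly proportional to the two input frames, at the cost of a Python-level loop over the distinct ids.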

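Also, pd.DateOffset is not optimized for some reason; np.timedelta64 is, so replacing it improves speed even further. Be aware that np.timedelta64(2, 'M') is not calendar-aware the way DateOffset(months=2) is, so the start of the window can differ slightly, and recent pandas versions may reject month units in timedeltas.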
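To illustrate why a calendar offset and a fixed-length duration are not interchangeable, here is a small sketch. The dates and the 60-day duration are arbitrary choices for the illustration and are not taken from the question.

```python
import pandas as pd

harvest_dates = pd.Series(pd.to_datetime(["2020-04-30", "2020-01-30"]))

# Calendar-aware: steps back two calendar months, clipping to month end if needed.
calendar_start = harvest_dates - pd.DateOffset(months=2)

# Fixed-length: always exactly 60 days, regardless of month lengths.
fixed_start = harvest_dates - pd.Timedelta(days=60)

print(pd.DataFrame({"calendar": calendar_start, "fixed_60d": fixed_start}))
```

For 2020-04-30 the calendar offset lands on 2020-02-29, while 60 days earlier is 2020-03-01, so the two windows differ by a day. Whether that matters depends on how strictly "x months before harvest" needs to be interpreted.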

import numpy as np

# Pair every harvest row with all weather rows for the same id.
tmp = harvest_df.reset_index().merge(weather_df[['id', 'day', 'temperature_max']], on='id', how='left')
# Keep days inside the window with max temperature above 30.
# np.timedelta64(2, 'M') is not a calendar month; newer pandas versions may reject it.
msk = tmp['day'].between(tmp['harvest_date'].sub(np.timedelta64(2, 'M')).dt.floor('D'), tmp['harvest_date']) & tmp['temperature_max'].gt(30)
# Count matches per original harvest row; rows without matches get 0.
harvest_df["days_above_30"] = tmp[msk].groupby('index').size().reindex(harvest_df.index, fill_value=0)

It could also be written as a single chained expression:

harvest_df["days_above_30"] = (
    harvest_df.reset_index().merge(weather_df[['id', 'day', 'temperature_max']], on='id', how='left')
    .assign(two_month_prior=lambda x: x['harvest_date'].sub(np.timedelta64(2, 'M')).dt.floor('D'))
    .query("two_month_prior <= day <= harvest_date and temperature_max > 30")
    .groupby('index').size()
    .reindex(harvest_df.index, fill_value=0)
)
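For a version that can be run end to end, here is a minimal sketch of the same merge-then-filter idea on the sample data from the question. It departs from the code above in two ways, both my own assumptions rather than part of the original answer: the window start is computed with `pd.DateOffset` on `harvest_df` before merging, and the weather rows are pre-filtered to hot days so that the merged frame is smaller. The function name `days_above` and the helper columns `row` and `window_start` are made up for the example.

```python
import pandas as pd

days = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"] * 2)
weather_df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "day": days,
    "temperature_max": [30, 32, 28, 25, 10, 15, 17, 12],
})
harvest_df = pd.DataFrame({
    "id": [1, 1, 2],
    "farm_id": [87, 86, 13],
    "harvest_date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-30"]),
})

def days_above(harvest_df, weather_df, months=2, threshold=30):
    """Hypothetical helper: merge on id, keep days inside the window, count per harvest row."""
    left = harvest_df[["id", "harvest_date"]].copy()
    left["row"] = left.index  # remember the original row label
    left["window_start"] = left["harvest_date"] - pd.DateOffset(months=months)
    # Only hot days can contribute to the count, so drop the rest before merging.
    hot = weather_df.loc[weather_df["temperature_max"] > threshold, ["id", "day"]]
    tmp = left.merge(hot, on="id", how="inner")
    in_window = tmp["day"].between(tmp["window_start"], tmp["harvest_date"])
    counts = tmp[in_window].groupby("row").size()
    # Harvest rows with no hot day in their window get 0.
    return counts.reindex(harvest_df.index, fill_value=0)

harvest_df["days_above_30"] = days_above(harvest_df, weather_df)
print(harvest_df["days_above_30"].tolist())  # [1, 1, 0]
```

On the sample data only 2020-01-02 at location 1 exceeds 30, and it falls inside the window of both location-1 harvests, which gives the counts 1, 1 and 0.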
