从不同的数据框中获取聚合结果并根据条件将其添加到当前数据框中。

huangapple go评论60阅读模式
英文:

Get aggregates from different Dataframe to current Dataframe with conditions

问题

以下是您要求的代码部分的翻译:

我有一个收获数据框和一个天气数据框
我想要获取在收获前 x 个月内温度高于某个阈值的天数对于所有的区块
请注意收获数据框包括多年的数据并且id在两个数据框之间不是一一对应的也就是说收获数据框中的两个区块可以共享一个与天气数据框中的位置对应的ID

我目前的有效代码如下但非常慢需要几分钟的时间我希望加速它但不清楚如何做到

def days_above_thresh(x, weather_df):
return weather_df.loc[
(weather_df["id"]==x.id) &
(weather_df["day"]>=x['harvest_date']-DateOffset(months=2)) &
(weather_df["day"]<=x['harvest_date']) &
(weather_df["temperature_max"]>30),
"temperature_max"].count()

harvest_df["days_above_30"] = harvest_df.apply(days_above_thresh , args=(weather_df,), axis=1)


数据框的结构大致如下 -

```none
weather_df
id      day      temperature_max
1    2020-01-01    30
1    2020-01-02    32
1    2020-01-03    28
1    2020-01-04    25 
         .
         .
         .
2    2020-01-01    10
2    2020-01-02    15
2    2020-01-03    17
2    2020-01-04    12
         .
         .
         .

harvest_df
id   farm_id  harvest_date
1       87    2020-01-02 
1       86    2020-01-03
2       13    2020-01-30 
英文:

I have a harvest dataframe and a weather dataframe.
I want to get the number of days above a temp threshold for the previous x months before harvest for all blocks.
Note the harvest dataframe includes multiple years and the id is not 1-1 between frames, ie 2 blocks in harvest df can share an ID that correspond to a location in the weather frame.

My current (working) code is below, but it is VERY slow, on the order of minutes. I want to speed it up but unclear how.

def days_above_thresh(x, weather_df):
    return weather_df.loc[
            (weather_df[&quot;id&quot;]==x.id) &amp; \
            (weather_df[&quot;day&quot;]&gt;=x[&#39;harvest_date&#39;]-DateOffset(months=2)) &amp; \
            (weather_df[&quot;day&quot;]&lt;=x[&#39;harvest_date&#39;]) &amp; \
            (weather_df[&quot;temperature_max&quot;]&gt;30),
            &quot;temperature_max&quot;].count()

harvest_df[&quot;days_above_30&quot;] = harvest_df.apply(days_above_thresh , args=(weather_df,), axis=1)

The dataframes would look something like this -

weather_df
id      day      temperature_max
1    2020-01-01    30
1    2020-01-02    32
1    2020-01-03    28
1    2020-01-04    25 
         .
         .
         .
2    2020-01-01    10
2    2020-01-02    15
2    2020-01-03    17
2    2020-01-04    12
         .
         .
         .

harvest_df
id   farm_id  harvest_date
1       87    2020-01-02 
1       86    2020-01-03
2       13    2020-01-30 

答案1

得分: 1

这可以通过根据 id 合并两个框架、使用在您定义的函数中构建的布尔掩码来筛选结果框架,然后对结果调用 groupby.size 来加速。

如果您的框架较大(但不是太大),这将显著减少运行时间(如果 harvest_df 有1万行,运行时间将从13.5秒减少到15毫秒)。但是,如果 harvest_df 太大(可能有数百万行),因为它创建了一个更大的框架,您可能会遇到内存问题。

此外,pd.DateOffset 由于某种原因未经过优化;但 np.timedelta64 经过了优化,因此替换它可以进一步提高速度。

tmp = harvest_df.reset_index().merge(weather_df[['id', 'day', 'temperature_max']], on='id', how='left')
msk = tmp['day'].between(tmp['harvest_date'].sub(np.timedelta64(2, 'M')).dt.floor('D'), tmp['harvest_date']) & tmp['temperature_max'].gt(30)
harvest_df["days_above_30"] = tmp[msk].groupby('index').size().reindex(harvest_df.index, fill_value=0)

也可以将其写成一行代码:

harvest_df["days_above_30"] = (
    harvest_df.reset_index().merge(weather_df[['id', 'day', 'temperature_max']], on='id', how='left')
    .assign(two_month_prior=lambda x: x['harvest_date'].sub(np.timedelta64(2, 'M')).dt.floor('D'))
    .query("two_month_prior <= day <= harvest_date and temperature_max > 30")
    .groupby('index').size()
    .reindex(harvest_df.index, fill_value=0)
)
英文:

This could be sped up by merging the two frames on id, filtering the resulting frame using the boolean mask (that is constructed in the function you defined) and calling groupby.size on the result.

If your frames are large (but not too large), this will cut down the runtime significantly (if harvest_df is 10k rows, it cuts down runtime from 13.5sec to 15ms). However, if harvest_df is too large (maybe millions of rows), since it creates an even larger frame, you might run into memory issues.

Also, pd.DateOffset is not optimized for some reason; however, np.timedelta64 is, so replacing it improves speed even further.

tmp = harvest_df.reset_index().merge(weather_df[[&#39;id&#39;, &#39;day&#39;, &#39;temperature_max&#39;]], on=&#39;id&#39;, how=&#39;left&#39;)
msk = tmp[&#39;day&#39;].between(tmp[&#39;harvest_date&#39;].sub(np.timedelta(2, &#39;M&#39;)).dt.floor(&#39;D&#39;), tmp[&#39;harvest_date&#39;]) &amp; tmp[&#39;temperature_max&#39;].gt(30)
harvest_df[&quot;days_above_30&quot;] = tmp[msk].groupby(&#39;index&#39;).size().reindex(harvest_df.index, fill_value=0)

Could also write it as a one-liner:

harvest_df[&quot;days_above_30&quot;] = (
    harvest_df.reset_index().merge(weather_df[[&#39;id&#39;, &#39;day&#39;, &#39;temperature_max&#39;]], on=&#39;id&#39;, how=&#39;left&#39;)
    .assign(two_month_prior=lambda x: x[&#39;harvest_date&#39;].sub(np.timedelta64(2, &#39;M&#39;)).dt.floor(&#39;D&#39;))
    .query(&quot;two_month_prior &lt;= day &lt;= harvest_date and temperature_max &gt; 30&quot;)
    .groupby(&#39;index&#39;).size()
    .reindex(harvest_df.index, fill_value=0)
)

huangapple
  • 本文由 发表于 2023年3月7日 09:31:46
  • 转载请务必保留本文链接:https://go.coder-hub.com/75657297.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定