Get aggregates from a different DataFrame into the current DataFrame with conditions
Question
I have a harvest dataframe and a weather dataframe.
For every block, I want to count the days on which the maximum temperature exceeded a threshold during the x months before harvest.
Note that the harvest dataframe spans multiple years and that `id` is not one-to-one between the frames, i.e. two blocks in `harvest_df` can share an `id` that corresponds to a single location in `weather_df`.
My current (working) code is below, but it is very slow, on the order of minutes. I want to speed it up, but it is unclear to me how.

    from pandas import DateOffset

    def days_above_thresh(x, weather_df):
        # Count this location's days above 30 in the two months up to harvest.
        return weather_df.loc[
            (weather_df["id"] == x["id"])
            & (weather_df["day"] >= x["harvest_date"] - DateOffset(months=2))
            & (weather_df["day"] <= x["harvest_date"])
            & (weather_df["temperature_max"] > 30),
            "temperature_max",
        ].count()

    harvest_df["days_above_30"] = harvest_df.apply(days_above_thresh, args=(weather_df,), axis=1)
The dataframes look something like this:

    weather_df
    id  day         temperature_max
    1   2020-01-01  30
    1   2020-01-02  32
    1   2020-01-03  28
    1   2020-01-04  25
    ...
    2   2020-01-01  10
    2   2020-01-02  15
    2   2020-01-03  17
    2   2020-01-04  12
    ...

    harvest_df
    id  farm_id  harvest_date
    1   87       2020-01-02
    1   86       2020-01-03
    2   13       2020-01-30
Answer 1
Score: 1
This can be sped up by merging the two frames on `id`, filtering the merged frame with the same boolean mask that your function builds, and calling `groupby.size` on the result.
If your frames are large (but not too large), this cuts the runtime significantly: with a `harvest_df` of 10k rows, it drops from 13.5 s to 15 ms. However, the merge creates an even larger frame, so if `harvest_df` is very large (perhaps millions of rows), you might run into memory issues.
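One way to keep the memory use described above in check is to merge `harvest_df` in chunks and to drop the cool days from `weather_df` before merging. The sketch below is only an illustration under assumptions: the function name `days_above_chunked`, the helper column `row`, and the `chunk_size` default are hypothetical, and it uses `pd.DateOffset` so that the window matches the calendar months in the question.

```python
import numpy as np
import pandas as pd

def days_above_chunked(harvest_df, weather_df, months=2, threshold=30, chunk_size=50_000):
    """Count hot days in the window before each harvest, one chunk of rows at a time."""
    # Dropping cool days before the merge keeps every intermediate frame smaller.
    hot = weather_df.loc[weather_df["temperature_max"] > threshold, ["id", "day"]]
    matched_rows = []
    for start in range(0, len(harvest_df), chunk_size):
        chunk = harvest_df.iloc[start:start + chunk_size][["id", "harvest_date"]]
        # Tag rows by position so the result does not depend on the index labels.
        chunk = chunk.assign(row=np.arange(start, start + len(chunk)))
        tmp = chunk.merge(hot, on="id", how="inner")
        window_start = tmp["harvest_date"] - pd.DateOffset(months=months)
        in_window = tmp["day"].between(window_start, tmp["harvest_date"])
        matched_rows.append(tmp.loc[in_window, "row"].to_numpy(dtype=np.int64))
    rows = np.concatenate(matched_rows) if matched_rows else np.empty(0, dtype=np.int64)
    # bincount tallies how many hot days matched each harvest row (0 if none).
    counts = np.bincount(rows, minlength=len(harvest_df))
    return pd.Series(counts, index=harvest_df.index)
```

Usage would be `harvest_df["days_above_30"] = days_above_chunked(harvest_df, weather_df)`. A smaller `chunk_size` lowers peak memory at the cost of more merge calls.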
Also, `pd.DateOffset` is not optimized for some reason, whereas `np.timedelta64` is, so replacing it improves speed even further.

    import numpy as np
    tmp = harvest_df.reset_index().merge(weather_df[['id', 'day', 'temperature_max']], on='id', how='left')
    # np.timedelta64(2, 'M') is a fixed-length stand-in for two months (hence the floor); newer pandas versions may reject the 'M' unit.
    msk = tmp['day'].between(tmp['harvest_date'].sub(np.timedelta64(2, 'M')).dt.floor('D'), tmp['harvest_date']) & tmp['temperature_max'].gt(30)
    # Count matching days per original row; rows without any match get 0.
    harvest_df["days_above_30"] = tmp[msk].groupby('index').size().reindex(harvest_df.index, fill_value=0)
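Swapping a calendar offset for a fixed-length duration can move the start of the window by a day, so the counts may not always match the original `DateOffset(months=2)` version. The sketch below illustrates this with two made-up harvest dates. The 30.436875-day "average month" is an assumption chosen for illustration, not something stated in the answer.

```python
import pandas as pd

# Hypothetical harvest dates, picked so that month lengths matter.
harvest_date = pd.Series(pd.to_datetime(["2020-03-31", "2020-04-30"]))

# Calendar-aware: step back two calendar months, clipping to the month end.
calendar_start = harvest_date - pd.DateOffset(months=2)

# Fixed duration: two average-length months (assumed 30.436875 days each),
# floored to whole days as in the snippet above.
fixed_start = (harvest_date - pd.Timedelta(days=2 * 30.436875)).dt.floor("D")

comparison = pd.DataFrame({
    "harvest_date": harvest_date,
    "calendar_start": calendar_start,
    "fixed_start": fixed_start,
})
print(comparison)
```

With these dates the two window starts agree for the second row but differ by one day for the first, which is worth checking if exact calendar months matter for your analysis.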
It can also be written as a single chained expression:

    harvest_df["days_above_30"] = (
        harvest_df.reset_index().merge(weather_df[['id', 'day', 'temperature_max']], on='id', how='left')
        .assign(two_month_prior=lambda x: x['harvest_date'].sub(np.timedelta64(2, 'M')).dt.floor('D'))
        .query("two_month_prior <= day <= harvest_date and temperature_max > 30")
        .groupby('index').size()
        .reindex(harvest_df.index, fill_value=0)
    )
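For a quick sanity check, the merge, mask, and `groupby.size` steps can be run end to end on data shaped like the sample in the question. This is a minimal sketch, not the answer's exact code: the wrapper `days_above_merge` and its parameters are hypothetical, the sample frames hold only the rows shown in the question, and it uses `pd.DateOffset` so the window follows calendar months.

```python
import pandas as pd

# Sample data modelled on the frames shown in the question.
weather_df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "day": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"] * 2),
    "temperature_max": [30, 32, 28, 25, 10, 15, 17, 12],
})
harvest_df = pd.DataFrame({
    "id": [1, 1, 2],
    "farm_id": [87, 86, 13],
    "harvest_date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-30"]),
})

def days_above_merge(harvest_df, weather_df, months=2, threshold=30):
    """Count, per harvest row, the days above `threshold` in the window before harvest."""
    # reset_index() exposes the original row labels as an 'index' column
    # (this assumes the index of harvest_df is unnamed).
    tmp = harvest_df.reset_index().merge(
        weather_df[["id", "day", "temperature_max"]], on="id", how="left"
    )
    window_start = tmp["harvest_date"] - pd.DateOffset(months=months)
    msk = tmp["day"].between(window_start, tmp["harvest_date"]) & tmp["temperature_max"].gt(threshold)
    # Rows with no hot day in the window are missing after the filter, so fill them with 0.
    return tmp[msk].groupby("index").size().reindex(harvest_df.index, fill_value=0)

harvest_df["days_above_30"] = days_above_merge(harvest_df, weather_df)
print(harvest_df)
```

Comparing this column against the result of the original row-wise `apply` on a small slice of real data is a cheap way to confirm that both approaches agree before switching over.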