Python比.apply()更高效的方法

huangapple go评论68阅读模式
英文:

Python more efficient method than .apply()

问题

I have a large dataframe with projected data 60 months into the future, and I need to drop the projections for months that haven't happened yet. I have a functioning way to do this but it's throwing memory errors for a 16 million row dataframe (I have removed all unnecessary columns):

from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()

def add_months(start_date, delta_period):
  end_date = start_date + relativedelta(months=delta_period)
  return end_date

# Apply function on the dataframe using lambda operation.
snapshots["End_Date"] = snapshots.progress_apply(lambda row: add_months(row["startDate"], row["projectedMonth"]), axis = 1)

Then I would drop columns where end_date > today. I tried to import 'swifter' but my organisation's settings won't allow that. Is there a more efficient way to deal with this? I wondered about doing

snapshots['End_Date'] = snapshots['startDate'] + relativedelta(months=snapshots['projectedMonth'])

But get the error about relativedelta needing int not series. Thanks!

英文:

I have a large dataframe with projected data 60 months into the future, and I need to drop the projections for months that haven't happened yet. I have a functioning way to do this but it's throwing memory errors for a 16 million row dataframe (I have removed all unnecessary columns):

from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()

def add_months(start_date, delta_period):
  end_date = start_date + relativedelta(months=delta_period)
  return end_date

# Apply function on the dataframe using lambda operation.
snapshots["End_Date"] = snapshots.progress_apply(lambda row: add_months(row["startDate"], row["projectedMonth"]), axis = 1)

Then I would drop columns where end_date>today. I tried to import 'swifter' but my organisation's settings won't allow that. Is there a more efficient way to deal with this? I wondered about doing

snapshots['End_Date']=snapshots['startDate']+relativedelta(months=snapshots['projectedMonth'])

But get the error about relativedelta needing int not series. Thanks!

答案1

得分: 3

使用Lambda函数和矢量化操作,您可以实现这一点,对于您的情况,可以使用pd.DateOffset来直接添加月份到日期列。

import pandas as pd

data = {
    "startDate": pd.date_range(start="2020-01-01", periods=10, freq="MS"),
    "projectedMonth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
snapshots = pd.DataFrame(data)

snapshots["End_Date"] = snapshots.apply(lambda row: row["startDate"] + pd.DateOffset(months=row["projectedMonth"]), axis=1)

today = pd.Timestamp.today()
filtered_snapshots = snapshots[snapshots["End_Date"] <= today]

print(filtered_snapshots)
英文:

With a lambda function and instead using vectorized operations you could do that, for your case pd.DateOffset to add months directly to the date column would be good.

import pandas as pd

data = {
    &quot;startDate&quot;: pd.date_range(start=&quot;2020-01-01&quot;, periods=10, freq=&quot;MS&quot;),
    &quot;projectedMonth&quot;: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
snapshots = pd.DataFrame(data)

snapshots[&quot;End_Date&quot;] = snapshots.apply(lambda row: row[&quot;startDate&quot;] + pd.DateOffset(months=row[&quot;projectedMonth&quot;]), axis=1)

today = pd.Timestamp.today()
filtered_snapshots = snapshots[snapshots[&quot;End_Date&quot;] &lt;= today]

print(filtered_snapshots)

huangapple
  • 本文由 发表于 2023年4月19日 19:25:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76053926.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定