Python more efficient method than .apply()
Question
I have a large dataframe with projected data 60 months into the future, and I need to drop the projections for months that haven't happened yet. I have a functioning way to do this but it's throwing memory errors for a 16 million row dataframe (I have removed all unnecessary columns):
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
# Apply function on the dataframe using lambda operation.
snapshots["End_Date"] = snapshots.progress_apply(lambda row: add_months(row["startDate"], row["projectedMonth"]), axis = 1)
Then I would drop the rows where End_Date > today. I tried to import 'swifter' but my organisation's settings won't allow that. Is there a more efficient way to deal with this? I wondered about doing
snapshots['End_Date'] = snapshots['startDate'] + relativedelta(months=snapshots['projectedMonth'])
But I get an error saying relativedelta needs an int, not a Series. Thanks!
Answer 1
Score: 3
For your case you can use pd.DateOffset instead of relativedelta to add the months directly to the date column. Note that the example below still applies a lambda row by row (axis=1), so it is not a fully vectorized operation:
import pandas as pd

# Small sample frame standing in for the real 16 million row dataframe
data = {
    "startDate": pd.date_range(start="2020-01-01", periods=10, freq="MS"),
    "projectedMonth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
snapshots = pd.DataFrame(data)

# Add projectedMonth months to startDate, one row at a time
snapshots["End_Date"] = snapshots.apply(lambda row: row["startDate"] + pd.DateOffset(months=row["projectedMonth"]), axis=1)

# Keep only the projections whose end date has already happened
today = pd.Timestamp.today()
filtered_snapshots = snapshots[snapshots["End_Date"] <= today]
print(filtered_snapshots)
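Since projectedMonth only takes a small number of distinct values (around 60 here), one way to avoid the row-wise apply is to loop over the distinct month counts and add a single pd.DateOffset to each slice of the column. This is only a sketch, assuming startDate is a timezone-naive datetime column and projectedMonth holds integers with no missing values. The helper name add_months_by_group and the fixed cutoff date are made up for illustration and are not part of the original question or answer.

```python
import numpy as np
import pandas as pd


def add_months_by_group(df, date_col="startDate", months_col="projectedMonth"):
    """Shift date_col forward by months_col months, one vectorized add per distinct offset."""
    dates = df[date_col]
    months = df[months_col].to_numpy()
    # Preallocate the result so each slice can be written back by position
    end = np.empty(len(df), dtype="datetime64[ns]")
    for m in np.unique(months):
        mask = months == m
        # Series + DateOffset is applied to the whole slice in one call
        end[mask] = (dates[mask] + pd.DateOffset(months=int(m))).to_numpy()
    return pd.Series(end, index=df.index, name="End_Date")


snapshots = pd.DataFrame({
    "startDate": pd.to_datetime(["2020-01-01", "2020-01-31", "2020-06-15", "2020-01-01"]),
    "projectedMonth": [3, 1, 12, 3],
})
snapshots["End_Date"] = add_months_by_group(snapshots)

# Hypothetical fixed cutoff so the example is reproducible;
# in practice this would be pd.Timestamp.today()
cutoff = pd.Timestamp("2021-01-01")
kept = snapshots[snapshots["End_Date"] <= cutoff]
print(kept)
```

The loop runs once per distinct projectedMonth value rather than once per row, and month-end dates are clipped the same way relativedelta clips them (2020-01-31 plus one month gives 2020-02-29). Whether this fits in memory for 16 million rows depends on your machine, so treat it as a starting point rather than a guaranteed fix.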