Posted April 19, 2023, 19:25:40

Python more efficient method than .apply()

Question

I have a large dataframe with projected data 60 months into the future, and I need to drop the projections for months that haven't happened yet. I have a functioning way to do this but it's throwing memory errors for a 16 million row dataframe (I have removed all unnecessary columns):

from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
def add_months(start_date, delta_period):
  end_date = start_date + relativedelta(months=delta_period)
  return end_date
# Apply function on the dataframe using lambda operation.
snapshots["End_Date"] = snapshots.progress_apply(lambda row: add_months(row["startDate"], row["projectedMonth"]), axis = 1)

Then I would drop the rows where End_Date > today. I tried to import 'swifter', but my organisation's settings won't allow that. Is there a more efficient way to deal with this? I wondered about doing

snapshots['End_Date'] = snapshots['startDate'] + relativedelta(months=snapshots['projectedMonth'])

But I get an error about relativedelta needing an int, not a Series. Thanks!

Answer 1

Score: 3

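You can do this with pd.DateOffset, which adds a number of months directly to a date. The example below applies it row by row with a lambda function, then keeps only the rows whose End_Date is not in the future: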

import pandas as pd
data = {
    "startDate": pd.date_range(start="2020-01-01", periods=10, freq="MS"),
    "projectedMonth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
snapshots = pd.DataFrame(data)
snapshots["End_Date"] = snapshots.apply(lambda row: row["startDate"] + pd.DateOffset(months=row["projectedMonth"]), axis=1)
today = pd.Timestamp.today()
filtered_snapshots = snapshots[snapshots["End_Date"] <= today]
print(filtered_snapshots)
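
Note that the snippet above still calls .apply() once per row, so it may not be much lighter than the original on 16 million rows. One possible alternative, sketched here under some assumptions, is to loop over the distinct values of projectedMonth (at most about 60 of them) and do one vectorized addition per value. The helper name add_months_by_group is made up for illustration, and the sketch assumes a unique index, a datetime startDate column and an integer projectedMonth column.

```python
import pandas as pd


def add_months_by_group(df, start_col="startDate", months_col="projectedMonth"):
    """Shift start_col forward by months_col months.

    Hypothetical helper: one vectorized addition per distinct month offset
    instead of one Python call per row. Assumes df has a unique index.
    """
    # Start from NaT so rows with a missing offset stay empty.
    end = pd.Series(pd.NaT, index=df.index, dtype=df[start_col].dtype)
    # .groups maps each distinct offset to the index labels that carry it.
    for months, idx in df.groupby(months_col).groups.items():
        end.loc[idx] = df.loc[idx, start_col] + pd.DateOffset(months=int(months))
    return end


# Small made-up sample, including month-end dates.
snapshots = pd.DataFrame({
    "startDate": pd.to_datetime(["2020-01-31", "2020-01-15", "2021-06-01", "2020-01-31"]),
    "projectedMonth": [1, 2, 60, 1],
})
snapshots["End_Date"] = add_months_by_group(snapshots)

# Keep only the projections whose end date has already happened.
today = pd.Timestamp.today()
filtered_snapshots = snapshots[snapshots["End_Date"] <= today]
print(filtered_snapshots)
```

Like relativedelta, pd.DateOffset(months=...) clips to the last day of a shorter month, so the results should match the row-wise version. It is worth checking that on a small sample of your own data first.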
