How to avoid using a loop in pandas
Question
import numpy as np
import pandas as pd
from datetime import datetime

today = pd.Timestamp.today()
for x in range(len(df)):
    # trace back
    if df.loc[x, 'date_2022'] is pd.NaT and df.loc[x, 'date_2021'] is not pd.NaT:
        # extract month and day
        d1 = today.strftime('%m-%d')
        d2 = df.loc[x, 'date_2021'].strftime('%m-%d')
        # convert to datetime
        d1 = datetime.strptime(d1, '%m-%d')
        d2 = datetime.strptime(d2, '%m-%d')
        # get difference in days
        diff = d1 - d2
        days = diff.days
        # range 14 days
        if days > 14:
            df.loc[x, 'inspection'] = 'check'
        else:
            df.loc[x, 'inspection'] = np.nan
My aim is to add an inspection column. The condition is: the cell for 2022 is null (pd.NaT), last year's cell is not null, and more than 14 days have passed since last year's date (comparing month and day only). How can I write this without using a loop?
Answer 1
Score: 1
Use Timestamp.strftime and Series.dt.strftime with to_datetime to reduce both dates to month and day, test for missing values with Series.isna, chain that with a condition on the difference in days taken from Series.dt.days, and create the new column with numpy.where:
import numpy as np
import pandas as pd

d1 = pd.to_datetime(pd.Timestamp.today().strftime('%m-%d'), format='%m-%d')
d2 = pd.to_datetime(df['date_2021'].dt.strftime('%m-%d'), format='%m-%d')
m = df['date_2022'].isna()
# None instead of np.nan: next to 'check', np.where would coerce np.nan to the string 'nan'
df['inspection'] = np.where(((d1 - d2).dt.days > 14) & m, 'check', None)
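For a quick check of the vectorized approach on concrete data, the following is a minimal sketch, not the answerer's exact code. The function name flag_inspection, the sample frame, and the fixed today value are assumptions made for illustration. It also swaps in two details: month and day are reparsed into the leap year 2000 (an arbitrary choice) so that 29 February parses, and the column is filled through a boolean mask instead of numpy.where, so unflagged rows stay real NaN rather than the string 'nan'.

```python
import numpy as np
import pandas as pd


def flag_inspection(df: pd.DataFrame, today: pd.Timestamp) -> pd.Series:
    """Return 'check' where date_2022 is missing and the month-day of
    date_2021 lies more than 14 days before the month-day of `today`."""
    # Reduce both dates to month-day and reparse them in one common year.
    # The literal "2000" in the format string is an assumption of this sketch.
    d1 = pd.to_datetime(today.strftime("2000-%m-%d"), format="%Y-%m-%d")
    d2 = pd.to_datetime(
        df["date_2021"].dt.strftime("2000-%m-%d"),
        format="%Y-%m-%d",
        errors="coerce",  # missing dates stay NaT
    )
    # A NaT difference yields NaN days, which compares as False.
    overdue = (d1 - d2).dt.days > 14
    mask = df["date_2022"].isna() & overdue
    # Start from an all-NaN object column and fill only the flagged rows.
    out = pd.Series(np.nan, index=df.index, dtype=object)
    out[mask] = "check"
    return out


# Hypothetical sample data.
df = pd.DataFrame(
    {
        "date_2021": pd.to_datetime(["2021-03-01", "2021-06-10", None, "2021-01-05"]),
        "date_2022": pd.to_datetime([None, None, None, "2022-01-07"]),
    }
)
result = flag_inspection(df, pd.Timestamp("2022-06-15"))
df["inspection"] = result
print(df)
```

Passing today in as an argument keeps the result reproducible; in real use it would be pd.Timestamp.today(). Because the comparison runs inside a leap year, day counts that span the end of February can differ by one from the month-day parsing used above.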
Answer 2
Score: 0
Strictly speaking, you cannot get away from loops in pandas: even in the answer given by @jezrael, the looping is still there, abstracted behind pandas' built-in methods. A more explicit approach is to use pandas.DataFrame.apply and move all your code into a function, something like this.
Using your exact code:
First, I noticed you are using another DataFrame, so you might want to merge/join the two reports to get the columns into the same DataFrame. In the example below, though, I have kept your approach as the primary one.
import datetime
import numpy as np
import pandas as pd

def perform_inspection(row, today, esgReport2021):
    # trace back: 2022 date missing, 2021 date present
    if pd.isna(row['date_2022']) and pd.notna(esgReport2021.at[row.name, 'date_2021']):
        # get the modified date: last year's month and day in the current year
        old_month = esgReport2021.at[row.name, 'date_2021'].month
        old_day = esgReport2021.at[row.name, 'date_2021'].day
        old_modified_date = datetime.date(today.year, old_month, old_day)
        # get difference in days
        diff = today - old_modified_date
        days = diff.days
        # range 14 days
        if days > 14:
            row['inspection'] = 'check'
    return row

today = pd.Timestamp.today().date()
df["date_2022"] = pd.NaT  # Assuming this is the 0th day of your report processing.
df["inspection"] = np.nan
df = df.apply(perform_inspection, axis=1, args=(today, esgReport2021))
The apply method handles one or more row-level operations by passing the row itself as the first argument, as you can see in the function definition.
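The suggestion to merge or join the two reports could look roughly like the sketch below. The sample data, the index labels, and the name report_2022 are hypothetical; only esgReport2021 and the column names come from the post, and it is an assumption that both reports share the same index.

```python
import pandas as pd

# Hypothetical sample data: two reports that share the same index.
report_2022 = pd.DataFrame(
    {"date_2022": pd.to_datetime(["2022-02-01", None, None])},
    index=["a", "b", "c"],
)
esgReport2021 = pd.DataFrame(
    {"date_2021": pd.to_datetime(["2021-02-03", "2021-05-20", None])},
    index=["a", "b", "c"],
)

# Index-aligned left join: every 2022 row is kept and last year's date
# is added next to it as a new column.
df = report_2022.join(esgReport2021[["date_2021"]], how="left")
print(df)
```

With both columns in one DataFrame, the row function no longer needs the second report passed in through args, and the condition can also be written as a plain boolean mask.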