How to avoid using a loop in pandas

Question

import numpy as np
import pandas as pd
from datetime import datetime

today = pd.Timestamp.today()
for x in range(len(df)):
    # trace back: 2022 is missing but 2021 is present
    if df.loc[x, 'date_2022'] is pd.NaT and df.loc[x, 'date_2021'] is not pd.NaT:
        # extract month and day
        d1 = today.strftime('%m-%d')
        d2 = df.loc[x, 'date_2021'].strftime('%m-%d')

        # convert to datetime (both land in the same default year)
        d1 = datetime.strptime(d1, '%m-%d')
        d2 = datetime.strptime(d2, '%m-%d')

        # get difference in days
        diff = d1 - d2
        days = diff.days

        # range 14 days
        if days > 14:
            df.loc[x, 'inspection'] = 'check'
        else:
            df.loc[x, 'inspection'] = np.nan

My aim is to add an inspection column. The condition is: the 2022 cell is null (pd.NaT), last year's cell is not null, and more than 14 days have passed since last year's date (comparing month and day only). How can I write this without using a loop?

Answer 1

Score: 1

Use Timestamp.strftime and Series.dt.strftime with to_datetime to build the datetimes, test for missing values with Series.isna, chain that with a condition on the difference in days via Series.dt.days, and create the new column with numpy.where:

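# Reduce both dates to month and day; parsing '%m-%d' puts them in the same default year
d1 = pd.to_datetime(pd.Timestamp.today().strftime('%m-%d'), format='%m-%d')
d2 = pd.to_datetime(df['date_2021'].dt.strftime('%m-%d'), format='%m-%d')

# True where the 2022 date is missing
m = df['date_2022'].isna()

# None as the fallback keeps real missing values; mixing 'check' with np.nan
# makes NumPy coerce the missing value to the string 'nan'
df['inspection'] = np.where(((d1 - d2).dt.days > 14) & m, 'check', None)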
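
To try the vectorized approach end to end, here is a minimal self-contained sketch. The sample rows and the fixed reference date are assumptions made for illustration (they are not from the question); in real use `today` would come from `pd.Timestamp.today()`. Rows where `date_2021` is missing fall out on their own, because a comparison against a missing day count evaluates to False.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "date_2021": pd.to_datetime(["2021-03-01", "2021-05-10", None, "2021-04-20"]),
    "date_2022": pd.to_datetime([None, None, None, "2022-04-25"]),
})

# Fixed reference date so the example is reproducible (assumption).
today = pd.Timestamp("2023-05-15")

# Keep month and day only, so both sides are parsed into the same default year.
d1 = pd.to_datetime(today.strftime("%m-%d"), format="%m-%d")
d2 = pd.to_datetime(df["date_2021"].dt.strftime("%m-%d"), format="%m-%d")

overdue = (d1 - d2).dt.days > 14       # rows with a missing 2021 date compare as False
missing_2022 = df["date_2022"].isna()

# None as the fallback yields an object column with real missing values.
df["inspection"] = np.where(overdue & missing_2022, "check", None)

print(df)
```

Only the first row is flagged here: its 2022 date is missing and 1 March is more than 14 days before 15 May. The last row is also overdue by month and day, but it already has a 2022 date.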

Answer 2

Score: 0

Well, you cannot really get away from loops in pandas: even in the answer given by @jezrael the looping still happens, it is just abstracted away inside pandas' built-in methods. A more verbose alternative is to use pandas.DataFrame.apply and wrap all your code in a function, something like this.

Using your exact code:
First, I noticed you are using another DataFrame, so you might want to merge/join the two reports to get the columns into the same DataFrame. In the example below, though, I have kept your approach as the primary one.

import datetime

def perform_inspection(row, today, esgReport2021):
    # trace back: look up last year's date for the same index label
    old_date = esgReport2021.at[row.name, 'date_2021']
    if pd.isna(row['date_2022']) and pd.notna(old_date):
        # get the modified date: last year's month and day in the current year
        old_modified_date = datetime.date(today.year, old_date.month, old_date.day)
        # get difference in days
        diff = today - old_modified_date
        days = diff.days
        # range 14 days
        if days > 14:
            row['inspection'] = 'check'
    return row

today = pd.Timestamp.today().date()
df["date_2022"] = pd.NaT    # Assuming this is the 0th day of your report processing.
df["inspection"] = None     # placeholder for rows that need no check
df = df.apply(perform_inspection, axis=1, args=(today, esgReport2021))

The apply method takes care of one or more row-level operations by passing the row itself as the first argument, as you can see in the function definition.
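
The merge/join suggestion mentioned above could look roughly like the sketch below. It assumes the two reports share an index. The sample rows, the fixed `today`, and the helper name `inspection_flag` are hypothetical, and `esgReport2021` is just the name used in this answer. The function returns a value per row instead of mutating the row, which keeps the apply call easy to follow. One known gap: a 29 February date would raise a ValueError when moved into a non-leap year, and that case is not handled here.

```python
import datetime

import pandas as pd

# Hypothetical stand-ins for the two reports, assumed to share an index.
esgReport2021 = pd.DataFrame(
    {"date_2021": pd.to_datetime(["2021-03-01", "2021-05-10", None])},
    index=["a", "b", "c"],
)
df = pd.DataFrame(
    {"date_2022": pd.to_datetime([None, None, "2022-03-05"])},
    index=["a", "b", "c"],
)

# Fixed reference date so the example is reproducible (assumption).
today = datetime.date(2023, 5, 15)

# Bring both date columns into one frame, aligned on the index.
merged = df.join(esgReport2021[["date_2021"]])


def inspection_flag(row, today):
    """Return 'check' when the 2022 date is missing and last year's
    month/day lies more than 14 days before today; otherwise None."""
    if pd.notna(row["date_2022"]) or pd.isna(row["date_2021"]):
        return None
    old = row["date_2021"]
    shifted = datetime.date(today.year, old.month, old.day)
    return "check" if (today - shifted).days > 14 else None


merged["inspection"] = merged.apply(inspection_flag, axis=1, args=(today,))

print(merged)
```

With both columns in one frame, the row function no longer needs the second DataFrame passed in, and the same merged frame would also work with a vectorized condition.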

Posted by huangapple on May 15, 2023 at 14:20:11
Link to this article: https://go.coder-hub.com/76251340.html