使用 Pandas 中基于日期和分类列的滚动函数

huangapple go评论56阅读模式
英文:

Using a Rolling Function in Pandas based on Date and a Categorical Column

问题

  1. 最近五张由该行客户开具的发票的 DaysLate 的均值
df["CustomerDaysLate_lastfiveinvoices"] = df.groupby("customerID").rolling(window=5, min_periods=1).DaysLate.mean().reset_index().set_index("level_1").sort_index()["DaysLate"]
  1. 最近30天内所有发票的 DaysLate 的均值
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window='30d', on="InvoiceDate").DaysLate.mean()

请注意,第二个问题是关于如何按客户ID获取最近30天内的均值,代码中没有提供解决方案。如果需要按客户ID获取最近30天内的均值,你需要进一步修改代码以实现这个目标。

英文:

Im currently working on a dataset where I am using the rolling function in pandas to
create features.

The functions rely on three columns a DaysLate numeric column from which the mean is calculated from, an Invoice Date column from which the date is derived from and a customerID column which denotes the customer of a row.

Im trying to get a rolling mean of the DaysLate for the last 30 days limited to invoices raised to a specific customerID.

The following two functions are working.

  1. Mean of DaysLate for the last five invoices raised for the row's customer
df["CustomerDaysLate_lastfiveinvoices"] = df.groupby("customerID").rolling(window = 5,min_periods = 1).\
                              DaysLate.mean().reset_index().set_index("level_1").\
                              sort_index()["DaysLate"]
  1. Mean of DaysLate for all invoices raised in the last 30 days
df = df.sort_values('InvoiceDate')
df["GlobalDaysLate_30days"] = df.rolling(window = '30d', on = "InvoiceDate").DaysLate.mean()

Just cant seem to find the code get the mean of the last 30 days by CustomerID. Any help on above is greatly appreciated.

答案1

得分: 1

将日期列设置为索引,然后排序以确保升序,然后按客户ID对排序后的数据框进行分组,对每个分组计算30天滚动均值。

mean_30d = (
    df
    .set_index('InnvoiceDate')  # !important
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)

# 将滚动均值合并回原始数据框
result = df.merge(mean_30d)
英文:

Set the date column as index then sort to ensure ascending order then group the sorted dataframe by customer id and for each group calculate 30d rolling mean.

mean_30d = (
    df
    .set_index('InnvoiceDate') # !important
    .sort_index()
    .groupby('customerID')
    .rolling('30d')['DaysLate'].mean()
    .reset_index(name='GlobalDaysLate_30days')
)

# merge the rolling mean back to original dataframe
result = df.merge(mean_30d) 

huangapple
  • 本文由 发表于 2023年2月19日 16:07:17
  • 转载请务必保留本文链接:https://go.coder-hub.com/75498772.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定