有没有优化这个日期范围转换的方法?在 pandas 中进行条件合并?

huangapple go评论64阅读模式
英文:

Is there a way to optimize this date range transformation? Conditional merge in pandas?

问题

# 将销售数据转换为新的二进制格式
import pandas as pd

# 销售数据示例
sales_data = {
    'Shop ID': ['A', 'B'],
    'Special Offer Start': ['2022-01-01', '2022-01-09'],
    'Special Offer End': ['2022-01-03', '2022-01-11']
}

sales_df = pd.DataFrame(sales_data)

new_list = []
for i, row in sales_df.iterrows():
    df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"], end=row["Special Offer End"]), columns=['Date'])
    df['Shop ID'] = row['Shop ID']
    df["Special Offer?"] = 1
    new_list.append(df)

result = pd.concat(new_list).reset_index(drop=True)
英文:

I have Sales data like this as a DataFrame, the datatype of the columns is datetime[64] of pandas:

Shop ID Special Offer Start Special Offer End
A '2022-01-01' '2022-01-03'
B '2022-01-09' '2022-01-11'

etc.

I want to transform the data into a new binary format, that shows me the date in one column and the special offer information as 0 and 1.
The resulting table should look like this:

Shop ID Date Special Offer?
A '2022-01-01' 1
A '2022-01-02' 1
A '2022-01-03' 1
B '2022-01-09' 1
B '2022-01-10' 1
B '2022-01-11' 1

I wrote a function, which iterates every row and creates an DataFrame containing Pandas DateRange and the Special Offer information. These DataFrame are then concatenated. As you can imagine the code runs very slow.

I was thinking to append a Special Offer? Column to the Sales DataFrame and then joining it to a DataFrame containing all dates. Afterwards I could just fill the NaN with the dropna or fillna-function. But I couldn't find a function which lets me join on conditions in pandas.

See example below:

Shop ID Special Offer Start Special Offer End Special Offer ?
A '2022-01-01' '2022-01-03' 1
B '2022-01-09' '2022-01-11' 1

join with (the join condition being: if Date between Special Offer Start and Special Offer End):

Date
'2022-01-01'
'2022-01-02'
'2022-01-03'
'2022-01-04'
'2022-01-05'
'2022-01-06'
'2022-01-07'
'2022-01-08'
'2022-01-09'
'2022-01-10'
'2022-01-11'

creates:

Shop ID Date Special Offer?
A '2022-01-01' 1
A '2022-01-02' 1
A '2022-01-03' 1
A '2022-01-04' NaN
A '2022-01-05' NaN
A '2022-01-06' NaN
A '2022-01-07' NaN
A '2022-01-08' NaN
A '2022-01-09' NaN
A '2022-01-10' NaN
A '2022-01-11' NaN
B '2022-01-01' NaN
B '2022-01-02' NaN
B '2022-01-03' NaN
B '2022-01-04' NaN
B '2022-01-05' NaN
B '2022-01-06' NaN
B '2022-01-07' NaN
B '2022-01-08' NaN
B '2022-01-09' 1
B '2022-01-10' 1
B '2022-01-11' 1

EDIT:
here is the code I've written:

new_list = []
for i, row in sales_df.iterrows():
    df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"],end=row["Special Offer End"]), columns=['Date'])
    df['Shop ID'] = row['Shop ID']
    df["Special Offer?"] = 1
    new_list.append(df)

result = pd.concat(new_list ).reset_index(drop=True)

答案1

得分: 1

更新

商店ID列丢失

您可以使用date_range来扩展日期:

# 设置最小的可复制示例
data = [{'商店ID': 'A', '特别优惠开始': '2022-01-01', '特别优惠结束': '2022-01-03'},
        {'商店ID': 'B', '特别优惠开始': '2022-01-09', '特别优惠结束': '2022-01-11'}]
df = pd.DataFrame(data)

# 如果已经有DatetimeIndex,则不是必需的
df['特别优惠开始'] = pd.to_datetime(df['特别优惠开始'])
df['特别优惠结束'] = pd.to_datetime(df['特别优惠结束'])

# 创建完整的日期范围
start = df['特别优惠开始'].min()
end = df['特别优惠结束'].max()
dti = pd.date_range(start, end, freq='D', name='日期')

date_range = lambda x: pd.date_range(x['特别优惠开始'], x['特别优惠结束'])
out = (df.assign(优惠=df.apply(date_range, axis=1), 虚拟=1).explode('优惠')
         .pivot_table(index='优惠', columns='商店ID', values='虚拟', fill_value=0)
         .reindex(dti, fill_value=0).unstack().rename('特别优惠?').reset_index())
>>> out
   商店ID       日期  特别优惠
0        A 2022-01-01      1
1        A 2022-01-02      1
2        A 2022-01-03      1
3        A 2022-01-04      0
4        A 2022-01-05      0
5        A 2022-01-06      0
6        A 2022-01-07      0
7        A 2022-01-08      0
8        A 2022-01-09      0
9        A 2022-01-10      0
10       A 2022-01-11      0
11       B 2022-01-01      0
12       B 2022-01-02      0
13       B 2022-01-03      0
14       B 2022-01-04      0
15       B 2022-01-05      0
16       B 2022-01-06      0
17       B 2022-01-07      0
18       B 2022-01-08      0
19       B 2022-01-09      1
20       B 2022-01-10      1
21       B 2022-01-11      1
英文:

Update

> The Shop ID column is missing

You can use date_range to expand the dates:

# Setup minimal reproducible example
data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
        {'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
df = pd.DataFrame(data)

# Not mandatory if you have already DatetimeIndex
df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])

# create full date range
start = df['Special Offer Start'].min()
end = df['Special Offer End'].max()
dti = pd.date_range(start, end, freq='D', name='Date')

date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])
out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1).explode('Offer')
         .pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
         .reindex(dti, fill_value=0).unstack().rename('Special Offer?').reset_index())
>>> out
   Shop ID       Date  Special Offer?
0        A 2022-01-01               1
1        A 2022-01-02               1
2        A 2022-01-03               1
3        A 2022-01-04               0
4        A 2022-01-05               0
5        A 2022-01-06               0
6        A 2022-01-07               0
7        A 2022-01-08               0
8        A 2022-01-09               0
9        A 2022-01-10               0
10       A 2022-01-11               0
11       B 2022-01-01               0
12       B 2022-01-02               0
13       B 2022-01-03               0
14       B 2022-01-04               0
15       B 2022-01-05               0
16       B 2022-01-06               0
17       B 2022-01-07               0
18       B 2022-01-08               0
19       B 2022-01-09               1
20       B 2022-01-10               1
21       B 2022-01-11               1

huangapple
  • 本文由 发表于 2023年2月14日 00:19:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/75438540.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定