英文:
Is there a way to optimize this date range transformation? Conditional merge in pandas?
问题
# 将销售数据转换为新的二进制格式
import pandas as pd
# 销售数据示例
sales_data = {
'Shop ID': ['A', 'B'],
'Special Offer Start': ['2022-01-01', '2022-01-09'],
'Special Offer End': ['2022-01-03', '2022-01-11']
}
sales_df = pd.DataFrame(sales_data)
new_list = []
for i, row in sales_df.iterrows():
df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"], end=row["Special Offer End"]), columns=['Date'])
df['Shop ID'] = row['Shop ID']
df["Special Offer?"] = 1
new_list.append(df)
result = pd.concat(new_list).reset_index(drop=True)
英文:
I have Sales data like this as a DataFrame, the datatype of the columns is datetime[64] of pandas:
Shop ID | Special Offer Start | Special Offer End |
---|---|---|
A | '2022-01-01' | '2022-01-03' |
B | '2022-01-09' | '2022-01-11' |
etc.
I want to transform the data into a new binary format, that shows me the date in one column and the special offer information as 0 and 1.
The resulting table should look like this:
Shop ID | Date | Special Offer? |
---|---|---|
A | '2022-01-01' | 1 |
A | '2022-01-02' | 1 |
A | '2022-01-03' | 1 |
B | '2022-01-09' | 1 |
B | '2022-01-10' | 1 |
B | '2022-01-11' | 1 |
I wrote a function, which iterates every row and creates an DataFrame containing Pandas DateRange and the Special Offer information. These DataFrame are then concatenated. As you can imagine the code runs very slow.
I was thinking to append a Special Offer? Column to the Sales DataFrame and then joining it to a DataFrame containing all dates. Afterwards I could just fill the NaN with the dropna or fillna-function. But I couldn't find a function which lets me join on conditions in pandas.
See example below:
Shop ID | Special Offer Start | Special Offer End | Special Offer ? |
---|---|---|---|
A | '2022-01-01' | '2022-01-03' | 1 |
B | '2022-01-09' | '2022-01-11' | 1 |
join with (the join condition being: if Date between Special Offer Start and Special Offer End):
Date |
---|
'2022-01-01' |
'2022-01-02' |
'2022-01-03' |
'2022-01-04' |
'2022-01-05' |
'2022-01-06' |
'2022-01-07' |
'2022-01-08' |
'2022-01-09' |
'2022-01-10' |
'2022-01-11' |
creates:
Shop ID | Date | Special Offer? |
---|---|---|
A | '2022-01-01' | 1 |
A | '2022-01-02' | 1 |
A | '2022-01-03' | 1 |
A | '2022-01-04' | NaN |
A | '2022-01-05' | NaN |
A | '2022-01-06' | NaN |
A | '2022-01-07' | NaN |
A | '2022-01-08' | NaN |
A | '2022-01-09' | NaN |
A | '2022-01-10' | NaN |
A | '2022-01-11' | NaN |
B | '2022-01-01' | NaN |
B | '2022-01-02' | NaN |
B | '2022-01-03' | NaN |
B | '2022-01-04' | NaN |
B | '2022-01-05' | NaN |
B | '2022-01-06' | NaN |
B | '2022-01-07' | NaN |
B | '2022-01-08' | NaN |
B | '2022-01-09' | 1 |
B | '2022-01-10' | 1 |
B | '2022-01-11' | 1 |
EDIT:
here is the code I've written:
new_list = []
for i, row in sales_df.iterrows():
df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"],end=row["Special Offer End"]), columns=['Date'])
df['Shop ID'] = row['Shop ID']
df["Special Offer?"] = 1
new_list.append(df)
result = pd.concat(new_list ).reset_index(drop=True)
答案1
得分: 1
更新
商店ID列丢失
您可以使用date_range
来扩展日期:
# 设置最小的可复制示例
data = [{'商店ID': 'A', '特别优惠开始': '2022-01-01', '特别优惠结束': '2022-01-03'},
{'商店ID': 'B', '特别优惠开始': '2022-01-09', '特别优惠结束': '2022-01-11'}]
df = pd.DataFrame(data)
# 如果已经有DatetimeIndex,则不是必需的
df['特别优惠开始'] = pd.to_datetime(df['特别优惠开始'])
df['特别优惠结束'] = pd.to_datetime(df['特别优惠结束'])
# 创建完整的日期范围
start = df['特别优惠开始'].min()
end = df['特别优惠结束'].max()
dti = pd.date_range(start, end, freq='D', name='日期')
date_range = lambda x: pd.date_range(x['特别优惠开始'], x['特别优惠结束'])
out = (df.assign(优惠=df.apply(date_range, axis=1), 虚拟=1).explode('优惠')
.pivot_table(index='优惠', columns='商店ID', values='虚拟', fill_value=0)
.reindex(dti, fill_value=0).unstack().rename('特别优惠?').reset_index())
>>> out
商店ID 日期 特别优惠?
0 A 2022-01-01 1
1 A 2022-01-02 1
2 A 2022-01-03 1
3 A 2022-01-04 0
4 A 2022-01-05 0
5 A 2022-01-06 0
6 A 2022-01-07 0
7 A 2022-01-08 0
8 A 2022-01-09 0
9 A 2022-01-10 0
10 A 2022-01-11 0
11 B 2022-01-01 0
12 B 2022-01-02 0
13 B 2022-01-03 0
14 B 2022-01-04 0
15 B 2022-01-05 0
16 B 2022-01-06 0
17 B 2022-01-07 0
18 B 2022-01-08 0
19 B 2022-01-09 1
20 B 2022-01-10 1
21 B 2022-01-11 1
英文:
Update
> The Shop ID column is missing
You can use date_range
to expand the dates:
# Setup minimal reproducible example
data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
{'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
df = pd.DataFrame(data)
# Not mandatory if you have already DatetimeIndex
df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])
# create full date range
start = df['Special Offer Start'].min()
end = df['Special Offer End'].max()
dti = pd.date_range(start, end, freq='D', name='Date')
date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])
out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1).explode('Offer')
.pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
.reindex(dti, fill_value=0).unstack().rename('Special Offer?').reset_index())
>>> out
Shop ID Date Special Offer?
0 A 2022-01-01 1
1 A 2022-01-02 1
2 A 2022-01-03 1
3 A 2022-01-04 0
4 A 2022-01-05 0
5 A 2022-01-06 0
6 A 2022-01-07 0
7 A 2022-01-08 0
8 A 2022-01-09 0
9 A 2022-01-10 0
10 A 2022-01-11 0
11 B 2022-01-01 0
12 B 2022-01-02 0
13 B 2022-01-03 0
14 B 2022-01-04 0
15 B 2022-01-05 0
16 B 2022-01-06 0
17 B 2022-01-07 0
18 B 2022-01-08 0
19 B 2022-01-09 1
20 B 2022-01-10 1
21 B 2022-01-11 1
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论