有没有优化这个日期范围转换的方法?在 pandas 中进行条件合并?

huangapple go评论90阅读模式
英文:

Is there a way to optimize this date range transformation? Conditional merge in pandas?

问题

  1. # 将销售数据转换为新的二进制格式
  2. import pandas as pd
  3. # 销售数据示例
  4. sales_data = {
  5. 'Shop ID': ['A', 'B'],
  6. 'Special Offer Start': ['2022-01-01', '2022-01-09'],
  7. 'Special Offer End': ['2022-01-03', '2022-01-11']
  8. }
  9. sales_df = pd.DataFrame(sales_data)
  10. new_list = []
  11. for i, row in sales_df.iterrows():
  12. df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"], end=row["Special Offer End"]), columns=['Date'])
  13. df['Shop ID'] = row['Shop ID']
  14. df["Special Offer?"] = 1
  15. new_list.append(df)
  16. result = pd.concat(new_list).reset_index(drop=True)
英文:

I have Sales data like this as a DataFrame, the datatype of the columns is datetime[64] of pandas:

Shop ID Special Offer Start Special Offer End
A '2022-01-01' '2022-01-03'
B '2022-01-09' '2022-01-11'

etc.

I want to transform the data into a new binary format, that shows me the date in one column and the special offer information as 0 and 1.
The resulting table should look like this:

Shop ID Date Special Offer?
A '2022-01-01' 1
A '2022-01-02' 1
A '2022-01-03' 1
B '2022-01-09' 1
B '2022-01-10' 1
B '2022-01-11' 1

I wrote a function, which iterates every row and creates an DataFrame containing Pandas DateRange and the Special Offer information. These DataFrame are then concatenated. As you can imagine the code runs very slow.

I was thinking to append a Special Offer? Column to the Sales DataFrame and then joining it to a DataFrame containing all dates. Afterwards I could just fill the NaN with the dropna or fillna-function. But I couldn't find a function which lets me join on conditions in pandas.

See example below:

Shop ID Special Offer Start Special Offer End Special Offer ?
A '2022-01-01' '2022-01-03' 1
B '2022-01-09' '2022-01-11' 1

join with (the join condition being: if Date between Special Offer Start and Special Offer End):

Date
'2022-01-01'
'2022-01-02'
'2022-01-03'
'2022-01-04'
'2022-01-05'
'2022-01-06'
'2022-01-07'
'2022-01-08'
'2022-01-09'
'2022-01-10'
'2022-01-11'

creates:

Shop ID Date Special Offer?
A '2022-01-01' 1
A '2022-01-02' 1
A '2022-01-03' 1
A '2022-01-04' NaN
A '2022-01-05' NaN
A '2022-01-06' NaN
A '2022-01-07' NaN
A '2022-01-08' NaN
A '2022-01-09' NaN
A '2022-01-10' NaN
A '2022-01-11' NaN
B '2022-01-01' NaN
B '2022-01-02' NaN
B '2022-01-03' NaN
B '2022-01-04' NaN
B '2022-01-05' NaN
B '2022-01-06' NaN
B '2022-01-07' NaN
B '2022-01-08' NaN
B '2022-01-09' 1
B '2022-01-10' 1
B '2022-01-11' 1

EDIT:
here is the code I've written:

  1. new_list = []
  2. for i, row in sales_df.iterrows():
  3. df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"],end=row["Special Offer End"]), columns=['Date'])
  4. df['Shop ID'] = row['Shop ID']
  5. df["Special Offer?"] = 1
  6. new_list.append(df)
  7. result = pd.concat(new_list ).reset_index(drop=True)

答案1

得分: 1

更新

商店ID列丢失

您可以使用date_range来扩展日期:

  1. # 设置最小的可复制示例
  2. data = [{'商店ID': 'A', '特别优惠开始': '2022-01-01', '特别优惠结束': '2022-01-03'},
  3. {'商店ID': 'B', '特别优惠开始': '2022-01-09', '特别优惠结束': '2022-01-11'}]
  4. df = pd.DataFrame(data)
  5. # 如果已经有DatetimeIndex,则不是必需的
  6. df['特别优惠开始'] = pd.to_datetime(df['特别优惠开始'])
  7. df['特别优惠结束'] = pd.to_datetime(df['特别优惠结束'])
  8. # 创建完整的日期范围
  9. start = df['特别优惠开始'].min()
  10. end = df['特别优惠结束'].max()
  11. dti = pd.date_range(start, end, freq='D', name='日期')
  12. date_range = lambda x: pd.date_range(x['特别优惠开始'], x['特别优惠结束'])
  13. out = (df.assign(优惠=df.apply(date_range, axis=1), 虚拟=1).explode('优惠')
  14. .pivot_table(index='优惠', columns='商店ID', values='虚拟', fill_value=0)
  15. .reindex(dti, fill_value=0).unstack().rename('特别优惠?').reset_index())
  1. >>> out
  2. 商店ID 日期 特别优惠
  3. 0 A 2022-01-01 1
  4. 1 A 2022-01-02 1
  5. 2 A 2022-01-03 1
  6. 3 A 2022-01-04 0
  7. 4 A 2022-01-05 0
  8. 5 A 2022-01-06 0
  9. 6 A 2022-01-07 0
  10. 7 A 2022-01-08 0
  11. 8 A 2022-01-09 0
  12. 9 A 2022-01-10 0
  13. 10 A 2022-01-11 0
  14. 11 B 2022-01-01 0
  15. 12 B 2022-01-02 0
  16. 13 B 2022-01-03 0
  17. 14 B 2022-01-04 0
  18. 15 B 2022-01-05 0
  19. 16 B 2022-01-06 0
  20. 17 B 2022-01-07 0
  21. 18 B 2022-01-08 0
  22. 19 B 2022-01-09 1
  23. 20 B 2022-01-10 1
  24. 21 B 2022-01-11 1
英文:

Update

> The Shop ID column is missing

You can use date_range to expand the dates:

  1. # Setup minimal reproducible example
  2. data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
  3. {'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
  4. df = pd.DataFrame(data)
  5. # Not mandatory if you have already DatetimeIndex
  6. df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
  7. df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])
  8. # create full date range
  9. start = df['Special Offer Start'].min()
  10. end = df['Special Offer End'].max()
  11. dti = pd.date_range(start, end, freq='D', name='Date')
  12. date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])
  13. out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1).explode('Offer')
  14. .pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
  15. .reindex(dti, fill_value=0).unstack().rename('Special Offer?').reset_index())
  1. >>> out
  2. Shop ID Date Special Offer?
  3. 0 A 2022-01-01 1
  4. 1 A 2022-01-02 1
  5. 2 A 2022-01-03 1
  6. 3 A 2022-01-04 0
  7. 4 A 2022-01-05 0
  8. 5 A 2022-01-06 0
  9. 6 A 2022-01-07 0
  10. 7 A 2022-01-08 0
  11. 8 A 2022-01-09 0
  12. 9 A 2022-01-10 0
  13. 10 A 2022-01-11 0
  14. 11 B 2022-01-01 0
  15. 12 B 2022-01-02 0
  16. 13 B 2022-01-03 0
  17. 14 B 2022-01-04 0
  18. 15 B 2022-01-05 0
  19. 16 B 2022-01-06 0
  20. 17 B 2022-01-07 0
  21. 18 B 2022-01-08 0
  22. 19 B 2022-01-09 1
  23. 20 B 2022-01-10 1
  24. 21 B 2022-01-11 1

huangapple
  • 本文由 发表于 2023年2月14日 00:19:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/75438540.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定