2023-02-14 00:19:21


Is there a way to optimize this date range transformation? Conditional merge in pandas?

Question


I have sales data like this in a DataFrame; the date columns have the pandas dtype datetime64[ns]:

Shop ID	Special Offer Start	Special Offer End
A	'2022-01-01'	'2022-01-03'
B	'2022-01-09'	'2022-01-11'

etc.

I want to transform the data into a new binary format that shows the date in one column and the special offer information as 0 or 1.
The resulting table should look like this:

Shop ID	Date	Special Offer?
A	'2022-01-01'	1
A	'2022-01-02'	1
A	'2022-01-03'	1
B	'2022-01-09'	1
B	'2022-01-10'	1
B	'2022-01-11'	1

I wrote a function that iterates over every row and creates a DataFrame containing a pandas date range and the special offer information. These DataFrames are then concatenated. As you can imagine, the code runs very slowly.

I was thinking of adding a Special Offer? column to the sales DataFrame and then joining it to a DataFrame containing all dates. Afterwards I could handle the NaN values with dropna or fillna. But I couldn't find a function that lets me join on a condition in pandas.

See example below:

Shop ID	Special Offer Start	Special Offer End	Special Offer?
A	'2022-01-01'	'2022-01-03'	1
B	'2022-01-09'	'2022-01-11'	1

join with (the join condition being: if Date between Special Offer Start and Special Offer End):

Date
'2022-01-01'
'2022-01-02'
'2022-01-03'
'2022-01-04'
'2022-01-05'
'2022-01-06'
'2022-01-07'
'2022-01-08'
'2022-01-09'
'2022-01-10'
'2022-01-11'

creates:

Shop ID	Date	Special Offer?
A	'2022-01-01'	1
A	'2022-01-02'	1
A	'2022-01-03'	1
A	'2022-01-04'	NaN
A	'2022-01-05'	NaN
A	'2022-01-06'	NaN
A	'2022-01-07'	NaN
A	'2022-01-08'	NaN
A	'2022-01-09'	NaN
A	'2022-01-10'	NaN
A	'2022-01-11'	NaN
B	'2022-01-01'	NaN
B	'2022-01-02'	NaN
B	'2022-01-03'	NaN
B	'2022-01-04'	NaN
B	'2022-01-05'	NaN
B	'2022-01-06'	NaN
B	'2022-01-07'	NaN
B	'2022-01-08'	NaN
B	'2022-01-09'	1
B	'2022-01-10'	1
B	'2022-01-11'	1
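
The conditional join described above can be approximated with a cross join followed by a boolean mask. This is only a minimal sketch, assuming pandas 1.2 or later (for `how="cross"`), datetime64 columns, and at most one offer row per shop. The names `dates`, `grid`, and `in_offer` are illustrative and not part of the original post.

```python
import pandas as pd

# Sample data from the question, with real datetime columns.
sales_df = pd.DataFrame({
    "Shop ID": ["A", "B"],
    "Special Offer Start": pd.to_datetime(["2022-01-01", "2022-01-09"]),
    "Special Offer End": pd.to_datetime(["2022-01-03", "2022-01-11"]),
})

# Calendar covering every day from the earliest start to the latest end.
dates = pd.DataFrame({
    "Date": pd.date_range(sales_df["Special Offer Start"].min(),
                          sales_df["Special Offer End"].max(), freq="D")
})

# Cross join: every shop row is paired with every calendar date.
grid = sales_df.merge(dates, how="cross")

# The "join condition" becomes a row-wise mask (both endpoints inclusive).
in_offer = grid["Date"].between(grid["Special Offer Start"],
                                grid["Special Offer End"])
grid["Special Offer?"] = in_offer.astype(int)

result = grid[["Shop ID", "Date", "Special Offer?"]]
```

Because the flag is computed directly as 0 or 1, no fillna step is needed. The intermediate `grid` holds shops × dates rows, so this trades memory for simplicity. If a shop can have several offer periods, the result would contain one row per shop, date, and period, and would need an extra aggregation such as `groupby(["Shop ID", "Date"]).max()`.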

EDIT:
Here is the code I've written:

new_list = []
# Build one small DataFrame per shop row, then concatenate them all.
for i, row in sales_df.iterrows():
    # One row per day of the offer, both endpoints included.
    df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"], end=row["Special Offer End"]), columns=['Date'])
    df['Shop ID'] = row['Shop ID']
    df["Special Offer?"] = 1
    new_list.append(df)
result = pd.concat(new_list).reset_index(drop=True)

Answer 1

Score: 1


Update

> The Shop ID column is missing

You can use date_range to expand the dates:

import pandas as pd

# Setup minimal reproducible example
data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
        {'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
df = pd.DataFrame(data)
# Not needed if the columns already have a datetime64 dtype
df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])
# Create the full date range
start = df['Special Offer Start'].min()
end = df['Special Offer End'].max()
dti = pd.date_range(start, end, freq='D', name='Date')
# Expand each row into its own range of offer dates
date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])
out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1).explode('Offer')
         .pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
         .reindex(dti, fill_value=0).unstack().rename('Special Offer?').reset_index())

>>> out
   Shop ID       Date  Special Offer?
0        A 2022-01-01               1
1        A 2022-01-02               1
2        A 2022-01-03               1
3        A 2022-01-04               0
4        A 2022-01-05               0
5        A 2022-01-06               0
6        A 2022-01-07               0
7        A 2022-01-08               0
8        A 2022-01-09               0
9        A 2022-01-10               0
10       A 2022-01-11               0
11       B 2022-01-01               0
12       B 2022-01-02               0
13       B 2022-01-03               0
14       B 2022-01-04               0
15       B 2022-01-05               0
16       B 2022-01-06               0
17       B 2022-01-07               0
18       B 2022-01-08               0
19       B 2022-01-09               1
20       B 2022-01-10               1
21       B 2022-01-11               1
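
If only the offer days are needed (the compact table requested at the top of the question), the rows can be expanded without `iterrows` or `apply`. The following is a sketch of one possible approach rather than part of the answer above: it repeats each row once per offer day and adds a running day offset. It assumes datetime64 columns and that every end date is on or after its start date; `n_days`, `expanded`, and `offset` are illustrative names.

```python
import pandas as pd

# Sample data from the question, with real datetime columns.
sales_df = pd.DataFrame({
    "Shop ID": ["A", "B"],
    "Special Offer Start": pd.to_datetime(["2022-01-01", "2022-01-09"]),
    "Special Offer End": pd.to_datetime(["2022-01-03", "2022-01-11"]),
})

# Number of days in each offer, counting both endpoints.
n_days = (sales_df["Special Offer End"]
          - sales_df["Special Offer Start"]).dt.days + 1

# Repeat every row once per offer day.
expanded = sales_df.loc[sales_df.index.repeat(n_days)]

# 0, 1, 2, ... within each original row, used as a day offset.
offset = expanded.groupby(level=0).cumcount()

result = pd.DataFrame({
    "Shop ID": expanded["Shop ID"].to_numpy(),
    "Date": (expanded["Special Offer Start"]
             + pd.to_timedelta(offset, unit="D")).to_numpy(),
    "Special Offer?": 1,
})
```

All of the work happens in a handful of array operations, so the cost grows with the total number of offer days rather than with a Python-level loop over shops.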
