How to get a unique week number for start and end dates across multiple years - Pandas
Question
I have a dataframe where two of the columns represent the start and end dates of each data record. The data spans multiple years. My goal is to add a new column that gives the time step of the data record in each row. Since I also have a location column, some of these weeks will repeat.
import numpy as np
import pandas as pd
dates = pd.date_range(start='2021-11-11', periods=20, freq='W')
df = pd.DataFrame({
'start_date': np.repeat(dates, 5),
'end_date': np.repeat(dates + pd.DateOffset(days=6), 5),
'country': ['USA', 'Canada', 'UK', 'Australia', 'Russia'] * 20
})
df = df.sort_values("start_date")
start_date end_date country
0 2021-11-14 2021-11-20 USA
1 2021-11-14 2021-11-20 Canada
2 2021-11-14 2021-11-20 UK
3 2021-11-14 2021-11-20 Australia
4 2021-11-14 2021-11-20 Russia
I can get the week number using isocalendar().week, but that gives the week number within the corresponding year. For instance, if 2021-11-14 to 2021-11-20 is the first week in the data frame, it should get 1. The data may skip the next week and have another record starting from 2021-11-27. That time step should be the second week in the data frame.
Answer 1
Score: 2
IIUC, you can use groupby(...).ngroup():
df['week'] = df.groupby(df['start_date']).ngroup().add(1)
print(df)
# Output
start_date end_date country week
0 2021-11-14 2021-11-20 USA 1
1 2021-11-14 2021-11-20 Canada 1
2 2021-11-14 2021-11-20 UK 1
3 2021-11-14 2021-11-20 Australia 1
4 2021-11-14 2021-11-20 Russia 1
.. ... ... ... ...
98 2022-03-27 2022-04-02 Australia 20
95 2022-03-27 2022-04-02 USA 20
96 2022-03-27 2022-04-02 Canada 20
97 2022-03-27 2022-04-02 UK 20
99 2022-03-27 2022-04-02 Russia 20
[100 rows x 4 columns]
Alternative with pd.factorize (if the dataframe is already sorted by start_date):
df['week'] = pd.factorize(df['start_date'])[0] + 1
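If the rows are not guaranteed to be sorted, a variation on the same idea may help. The sketch below assumes a small made-up frame (not the asker's data) and uses the `sort=True` option of `pd.factorize`, with a dense rank shown as an equivalent alternative:

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample data for illustration only.
df = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2022-01-02", "2021-11-14", "2021-11-28", "2021-11-14"]
    ),
})

# sort=True numbers the unique dates in ascending order,
# regardless of the order in which the rows appear.
codes, uniques = pd.factorize(df["start_date"], sort=True)
df["week"] = codes + 1

# A dense rank yields the same consecutive numbering.
df["week_rank"] = df["start_date"].rank(method="dense").astype(int)

print(df)
```

Both columns number the distinct start dates 1, 2, 3, ... in chronological order, so a skipped calendar week does not leave a gap in the numbering.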
Answer 2
Score: 1
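You can use dt.isocalendar().week (the older dt.week accessor was removed in pandas 2.0) and subtract the minimum week from it. To ensure the first week is always labeled 1, add 1 to the result. Note that ISO week numbers restart every year, so this only gives sensible results when all dates fall within a single ISO year.
df['start_date'] = pd.to_datetime(df['start_date'])
week = df['start_date'].dt.isocalendar().week.astype(int)
df['time_step'] = week - week.min() + 1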
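To see how the week-difference approach behaves once the data crosses a year boundary, here is a small sketch on made-up dates (an assumption for illustration, not the asker's data). It compares the ISO-week difference with two alternatives: elapsed weeks since the first start date, and a dense rank of the start dates:

```python
import pandas as pd

# Hypothetical start dates spanning a year boundary, with some weeks skipped.
starts = pd.Series(pd.to_datetime(["2021-11-14", "2021-11-28", "2022-01-09"]))

# ISO week difference: the January date falls in ISO week 1 of 2022,
# so it becomes the minimum and the 2021 rows get large labels.
iso_week = starts.dt.isocalendar().week.astype(int)
by_iso_week = iso_week - iso_week.min() + 1

# Elapsed calendar weeks since the earliest start date (skipped weeks count).
elapsed = (starts - starts.min()).dt.days // 7 + 1

# Dense rank: consecutive numbering of distinct start dates (skipped weeks do not count).
dense = starts.rank(method="dense").astype(int)

print(pd.DataFrame({
    "start_date": starts,
    "by_iso_week": by_iso_week,
    "elapsed": elapsed,
    "dense": dense,
}))
```

Which of the last two is appropriate depends on whether a missing week should leave a gap in the numbering; the question as asked seems to want the dense version.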