2023-06-19 22:12:03

How to get a unique week number for start and end dates across multiple years - Pandas

Question

I have a dataframe in which two of the columns hold the start and end date of each data record. The data spans multiple years. My goal is to add a new column that gives the time step of the record in each row. Since I also have a location column, some of these weeks will repeat.

import numpy as np
import pandas as pd

# 20 weekly start dates (freq='W' anchors on Sundays), each repeated for 5 countries
dates = pd.date_range(start='2021-11-11', periods=20, freq='W')
df = pd.DataFrame({
    'start_date': np.repeat(dates, 5),
    'end_date': np.repeat(dates + pd.DateOffset(days=6), 5),
    'country': ['USA', 'Canada', 'UK', 'Australia', 'Russia'] * 20
})
df = df.sort_values("start_date")

   start_date   end_date    country
0  2021-11-14 2021-11-20        USA
1  2021-11-14 2021-11-20     Canada
2  2021-11-14 2021-11-20         UK
3  2021-11-14 2021-11-20  Australia
4  2021-11-14 2021-11-20     Russia

I can get the week number using isocalendar().week, but that gives the week number within the corresponding year. What I want is a number relative to the dataframe: if 2021-11-14 to 2021-11-20 is the first week in the dataframe, it should get 1. The data may skip the next week and have another record starting from 2021-11-27; that time step should then be week 2 in the dataframe.
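To make the problem concrete, here is a minimal sketch showing how the ISO week number restarts at a year boundary. The dates are hypothetical Sundays chosen for illustration, not the asker's data.

```python
import pandas as pd

# Hypothetical start dates straddling the 2021/2022 boundary (all Sundays).
starts = pd.Series(
    pd.to_datetime(["2021-12-19", "2021-12-26", "2022-01-02", "2022-01-09"])
)

# ISO week of the year for each date; it restarts after week 52.
iso_week = starts.dt.isocalendar().week
print(iso_week.tolist())  # [50, 51, 52, 1]
```

The last value drops back to 1, so the ISO week alone cannot serve as a running time step across years.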

Answer 1

Score: 2

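If I understand correctly, you can use groupby(...).ngroup(), which numbers each distinct start_date in sorted order starting from 0 (hence the add(1)):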

df['week'] = df.groupby(df['start_date']).ngroup().add(1)
print(df)
# Output
   start_date   end_date    country  week
0  2021-11-14 2021-11-20        USA     1
1  2021-11-14 2021-11-20     Canada     1
2  2021-11-14 2021-11-20         UK     1
3  2021-11-14 2021-11-20  Australia     1
4  2021-11-14 2021-11-20     Russia     1
..        ...        ...        ...   ...
98 2022-03-27 2022-04-02  Australia    20
95 2022-03-27 2022-04-02        USA    20
96 2022-03-27 2022-04-02     Canada    20
97 2022-03-27 2022-04-02         UK    20
99 2022-03-27 2022-04-02     Russia    20
[100 rows x 4 columns]

An alternative is pd.factorize, which numbers values in order of first appearance, so it only gives the same result if the dataframe is already sorted by start_date:

df['week'] = pd.factorize(df['start_date'])[0] + 1
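The difference between the two approaches shows up on unsorted data. The sketch below uses a small hypothetical dataframe (unsorted, with the week of 2021-11-21 missing) and also includes two order-independent variants, `sort=True` and a dense rank, which are additions for comparison rather than part of the answer.

```python
import pandas as pd

# Hypothetical records: unsorted, spanning two years, one calendar week skipped.
df = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2021-11-28", "2021-11-14", "2021-11-28", "2022-01-02", "2021-11-14"]
    ),
    "country": ["USA", "USA", "Canada", "USA", "Canada"],
})

# groupby sorts the keys by default, so ngroup follows date order.
df["week_ngroup"] = df.groupby("start_date").ngroup() + 1

# factorize follows order of first appearance unless sort=True is passed.
df["week_factorize"] = pd.factorize(df["start_date"])[0] + 1
df["week_factorize_sorted"] = pd.factorize(df["start_date"], sort=True)[0] + 1

# A dense rank also gives consecutive numbers regardless of row order.
df["week_rank"] = df["start_date"].rank(method="dense").astype(int)

print(df)
```

Here `week_ngroup`, `week_factorize_sorted`, and `week_rank` all come out as 2, 1, 2, 3, 1, while the plain `week_factorize` gives 1, 2, 1, 3, 2 because 2021-11-28 appears first.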

Answer 2

Score: 1

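You can take the week number of each start date and subtract the minimum week from it, adding 1 so that the first week is always labeled 1. Series.dt.week was removed in pandas 2.0, so use dt.isocalendar().week instead. Note that week numbers restart every year, so this subtraction only gives meaningful time steps when all dates fall within a single ISO year, and skipped weeks leave gaps in the numbering.

df['start_date'] = pd.to_datetime(df['start_date'])
week = df['start_date'].dt.isocalendar().week.astype(int)
df['time_step'] = week - week.min() + 1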
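Because week-of-year numbers restart each January, one variant that works across years is to count whole weeks elapsed since the earliest start date. This is a sketch with hypothetical dates, and it is a swapped-in technique rather than what the answer proposes. It keeps gaps for skipped calendar weeks, so it fits only if calendar spacing matters more than consecutive numbering.

```python
import pandas as pd

# Hypothetical weekly start dates crossing a year boundary;
# the week of 2022-01-02 is deliberately missing.
starts = pd.Series(pd.to_datetime(["2021-12-19", "2021-12-26", "2022-01-09"]))

# Whole weeks elapsed since the earliest start date, plus 1.
elapsed = (starts - starts.min()).dt.days // 7 + 1
print(elapsed.tolist())  # [1, 2, 4]
```

The skipped week shows up as the jump from 2 to 4. For consecutive numbering without gaps, a grouping or ranking approach on start_date is the better fit.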
