Pandas Sort Two Columns with Day of Year Wrap-Around to New Year

Question

I have data where, around the start of each year, the day_of_year sequence wraps around, and the "year" column needs to change to the new year when day_of_year == 1. It is a trick that I have not been able to figure out, and I am not sure how to start, so any help here is much appreciated. My data looks like this:

Here is my df1:

day_of_year	year	var_1
364	2017	17.71666667
364	2018	5.166666667
364	2019	2
364	2020	1.595833333
364	2021	3.75
364	2022	6.8875
365	2017	14.83333333
365	2018	2.758333333
365	2019	4.108333333
365	2020	5.766666667
365	2021	5.291666667
365	2022	10.58636364
1	2017	2.0125
1	2018	14.0125
1	2019	-0.504166667
1	2020	7.666666667
1	2021	5.520833333
1	2022	1.229166667
2	2017	1.7625
2	2018	15.10416667
2	2019	-0.391666667
2	2020	9.5
2	2021	7.645833333
2	2022	0.9125

After the reformatting, I need it to look like the sorted df below, with "n/a" for any data that is expected in a year but missing. Thank you again.

Final df:

day_of_year	year	var_1
364	2017	17.71666667
365	2017	14.83333333
1	2018	14.0125
2	2018	15.10416667
364	2018	5.166666667
365	2018	2.758333333
1	2019	-0.504166667
2	2019	-0.391666667
364	2019	2
365	2019	4.108333333
1	2020	7.666666667
2	2020	9.5
364	2020	1.595833333
365	2020	5.766666667
1	2021	5.520833333
2	2021	7.645833333
364	2021	3.75
365	2021	5.291666667
1	2022	1.229166667
2	2022	0.9125
364	2022	6.8875
365	2022	10.58636364
n/a	n/a	n/a
n/a	n/a	n/a

Answer 1

Score: 1

Why would you change the year based on the day? Just sort by the two columns:

df.sort_values(by=['year', 'day_of_year'])

Output:

    day_of_year  year      var_1
12            1  2017   2.012500
18            2  2017   1.762500
0           364  2017  17.716667
6           365  2017  14.833333
13            1  2018  14.012500
19            2  2018  15.104167
1           364  2018   5.166667
7           365  2018   2.758333
14            1  2019  -0.504167
20            2  2019  -0.391667
2           364  2019   2.000000
8           365  2019   4.108333
15            1  2020   7.666667
21            2  2020   9.500000
3           364  2020   1.595833
9           365  2020   5.766667
16            1  2021   5.520833
22            2  2021   7.645833
4           364  2021   3.750000
10          365  2021   5.291667
17            1  2022   1.229167
23            2  2022   0.912500
5           364  2022   6.887500
11          365  2022  10.586364
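
The plain sort does not create the "n/a" rows mentioned in the question. One possible way to get them, shown here only as a sketch, is to reindex against every expected (year, day_of_year) pair. The small frame and the list of expected days below are assumptions made up for illustration, not the question's data:

```python
import pandas as pd

# Hypothetical frame: day 365 of 2018 is deliberately missing.
df = pd.DataFrame({
    "day_of_year": [364, 365, 1, 2, 364],
    "year": [2017, 2017, 2018, 2018, 2018],
    "var_1": [17.7, 14.8, 14.0, 15.1, 5.2],
})

# Every (year, day_of_year) pair we expect to see (assumed for this sketch).
expected = pd.MultiIndex.from_product(
    [sorted(df["year"].unique()), [1, 2, 364, 365]],
    names=["year", "day_of_year"],
)

filled = (
    df.set_index(["year", "day_of_year"])
      .reindex(expected)   # pairs absent from df become NaN rows
      .reset_index()
)
print(filled.to_string(na_rep="n/a"))
```

The rows come back in the order of `expected`, so no separate sort is needed, and `na_rep` only changes how NaN is displayed; the underlying values stay NaN.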

If for some reason you really need to fix the year, use a conditional with mask:

(df.assign(year=df['year'].mask(df['day_of_year'].le(2), df['year'].add(1)))
   .sort_values(by=['year', 'day_of_year'])
)

Or, if you want to increment the year wherever day_of_year drops from one row to the next (for example from 365 to 1):

(df.assign(year=df['year'].add(df['day_of_year'].diff().lt(0).cumsum()))
   .sort_values(by=['year', 'day_of_year'])
)

Output:

    day_of_year  year      var_1
0           364  2017  17.716667
6           365  2017  14.833333
12            1  2018   2.012500
18            2  2018   1.762500
1           364  2018   5.166667
7           365  2018   2.758333
13            1  2019  14.012500
19            2  2019  15.104167
2           364  2019   2.000000
8           365  2019   4.108333
14            1  2020  -0.504167
20            2  2020  -0.391667
3           364  2020   1.595833
9           365  2020   5.766667
15            1  2021   7.666667
21            2  2021   9.500000
4           364  2021   3.750000
10          365  2021   5.291667
16            1  2022   5.520833
22            2  2022   7.645833
5           364  2022   6.887500
11          365  2022  10.586364
17            1  2023   1.229167
23            2  2023   0.912500
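
To see why the diff/cumsum variant works, here is a minimal sketch on a made-up series (the values and `base_year` are assumptions, not the question's data). It depends on row order: each time the day number falls, the running count goes up by one. In the question's df1 there is only one such drop, at the first day-1 row, so every later row gets +1, which is why both variants give the same output there.

```python
import pandas as pd

# Hypothetical series in chronological order where the year was never incremented.
days = pd.Series([364, 365, 1, 2, 364, 365, 1])
base_year = 2017

drops = days.diff().lt(0)   # True where the day number falls (a wrap-around)
offset = drops.cumsum()     # number of wrap-arounds seen so far
years = base_year + offset

print(pd.DataFrame({"day_of_year": days, "year": years}))
```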

Answer 2

Score: 0

I would convert everything to datetime first. Just run:

import pandas as pd
# Build a real date from the day of year (%j) and the year (%Y)
df['ymd'] = pd.to_datetime(df['day_of_year'].astype(str) + '-' + df['year'].astype(str),
                           format='%j-%Y')

I assign it to column ymd and sort, yielding the following:

>>> df.sort_values('ymd')
    day_of_year  year      var_1        ymd
12            1  2017   2.012500 2017-01-01
18            2  2017   1.762500 2017-01-02
0           364  2017  17.716667 2017-12-30
6           365  2017  14.833333 2017-12-31
13            1  2018  14.012500 2018-01-01
19            2  2018  15.104167 2018-01-02
1           364  2018   5.166667 2018-12-30
7           365  2018   2.758333 2018-12-31
14            1  2019  -0.504167 2019-01-01
20            2  2019  -0.391667 2019-01-02
2           364  2019   2.000000 2019-12-30
8           365  2019   4.108333 2019-12-31
15            1  2020   7.666667 2020-01-01
21            2  2020   9.500000 2020-01-02
3           364  2020   1.595833 2020-12-29
9           365  2020   5.766667 2020-12-30
16            1  2021   5.520833 2021-01-01
22            2  2021   7.645833 2021-01-02
4           364  2021   3.750000 2021-12-30
10          365  2021   5.291667 2021-12-31
17            1  2022   1.229167 2022-01-01
23            2  2022   0.912500 2022-01-02
5           364  2022   6.887500 2022-12-30
11          365  2022  10.586364 2022-12-31
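
A self-contained sketch of the same idea on a tiny made-up frame (the rows are assumptions for illustration). It also shows why day 364 of 2020 lands on December 29 in the output above: 2020 is a leap year.

```python
import pandas as pd

# Hypothetical rows, deliberately out of order.
df = pd.DataFrame({
    "day_of_year": [365, 1, 364, 2],
    "year": [2020, 2021, 2020, 2021],
})

# %j is the day of the year, %Y the four-digit year.
df["ymd"] = pd.to_datetime(
    df["day_of_year"].astype(str) + "-" + df["year"].astype(str),
    format="%j-%Y",
)
df = df.sort_values("ymd", ignore_index=True)

# Round trip: the parsed date maps back to the original day of year.
roundtrip_ok = bool((df["ymd"].dt.dayofyear == df["day_of_year"]).all())
print(df)
```

Sorting on a real date column avoids any special handling of the year boundary, at the cost of one extra column.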

huangapple
  • Posted on February 18, 2023 at 03:40:00
  • When reposting, please keep the link to this post: https://go.coder-hub.com/75488602.html