Efficient way for ordering values within group in Pandas
Question

I'm working with time series data stored in a long-format dataframe, something like this:
ACCOUNT | VAR1 | VAR2 | DAY
I'm interested in creating a new column `DAY_ORD`, which would give the ordinal rank of the `DAY` variable to each row within a group, over unique `(ACCOUNT, VAR1, VAR2)` triplets.
Here is a small example of what I want to achieve:
ACCOUNT | VAR1 | VAR2 | DAY | DAY_ORD |
---|---|---|---|---|
A | X | True | 2022-02-03 | 0 |
A | X | True | 2022-02-04 | 1 |
B | X | True | 2021-05-18 | 0 |
A | X | True | 2022-02-05 | 2 |
B | X | True | 2022-05-20 | 1 |
A | Y | True | 2022-02-05 | 0 |
A | X | True | 2022-03-12 | 3 |
Here is my current implementation:
```python
# initialize an empty 'DAY_ORD' column
df['DAY_ORD'] = None

# iterate over all triplets that appear in the data
for _, row in df[['ACCOUNT', 'VAR1', 'VAR2']].drop_duplicates().iterrows():
    acc, v1, v2 = row['ACCOUNT'], row['VAR1'], row['VAR2']
    # find the df slice that matches the considered triplet
    fdf = df.loc[(df.ACCOUNT == acc) & (df.VAR1 == v1) & (df.VAR2 == v2)].sort_values('DAY')
    # assign an ordinal rank
    fdf['DAY_ORD'] = range(len(fdf))
    # set the DAY_ORD values in the original dataframe
    for i in fdf.index:
        df.loc[i, 'DAY_ORD'] = fdf['DAY_ORD'][i]

df['DAY_ORD']
```
It seems to do the job, but it runs very slowly, at around 8 iterations per second. What is a clean way to make this faster?
Answer 1 (score: 1)

Use `GroupBy.rank` after converting the values to datetimes, then subtract 1 and convert to integers:
```python
df['DAY'] = pd.to_datetime(df['DAY'])
df['DAY_ORD'] = (df.groupby(['ACCOUNT', 'VAR1', 'VAR2'])['DAY']
                   .rank('dense').sub(1).astype(int))
print(df)
```

```
  ACCOUNT VAR1  VAR2        DAY  DAY_ORD
0       A    X  True 2022-02-03        0
1       A    X  True 2022-02-04        1
2       B    X  True 2021-05-18        0
3       A    X  True 2022-02-05        2
4       B    X  True 2022-05-20        1
5       A    Y  True 2022-02-05        0
6       A    X  True 2022-03-12        3
```
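An alternative sketch (not from the original answer): sorting by `DAY` and numbering rows within each group with `GroupBy.cumcount` gives the same ordinals, with the difference that tied dates receive distinct consecutive numbers rather than the equal ranks that `rank('dense')` produces. The assignment aligns back to the original row order by index:

```python
import pandas as pd

# Sample data mirroring the question's example
df = pd.DataFrame({
    'ACCOUNT': ['A', 'A', 'B', 'A', 'B', 'A', 'A'],
    'VAR1':    ['X', 'X', 'X', 'X', 'X', 'Y', 'X'],
    'VAR2':    [True] * 7,
    'DAY':     ['2022-02-03', '2022-02-04', '2021-05-18',
                '2022-02-05', '2022-05-20', '2022-02-05', '2022-03-12'],
})
df['DAY'] = pd.to_datetime(df['DAY'])

# Sort by DAY so rows within each group are in chronological order,
# then number them 0, 1, 2, ... per (ACCOUNT, VAR1, VAR2) group.
# The resulting Series keeps the original index labels, so plain
# column assignment realigns it to the unsorted frame.
df['DAY_ORD'] = (df.sort_values('DAY')
                   .groupby(['ACCOUNT', 'VAR1', 'VAR2'])
                   .cumcount())

print(df['DAY_ORD'].tolist())  # [0, 1, 0, 2, 1, 0, 3]
```

Both approaches replace the per-triplet Python loop with a single vectorized groupby, which is what removes the ~8 it/s bottleneck.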