在Pandas中对组内数值进行排序的高效方式:

huangapple go评论130阅读模式
英文:

Efficient way for ordering values within group in Pandas

问题

I'm working with time series data, which is packaged in a time long dataframe, something like this:

ACCOUNT | VAR1 | VAR2 | DAY

I'm interested in creating a new column DAY_ORD, which would give the ordinal rank of the DAY variable to each row within a group over unique (ACCOUNT, VAR1, VAR2) triplets.

Here is a small example of what I want to achieve:

ACCOUNT VAR1 VAR2 DAY DAY-ORD
A X True 2022-02-03 0
A X True 2022-02-04 1
B X True 2021-05-18 0
A X True 2022-02-05 2
B X True 2022-05-20 1
A Y True 2022-02-05 0
A X True 2022-03-12 3

Here is my current implementation:

#initialize an empty 'DAY_ORD' column
df['DAY_ORD'] = [None for i in range(len(df))]

#iterate over all triplets that appear in the data
for (_, row) in df[['ACCOUNT', 'VAR1', 'VAR2']].copy().drop_duplicates().iterrows():
    acc, v1, v2 = row[0], row[1], row[2]

    #find the df slice that adheres to the considered triplet
    fdf = df.loc[(df.ACCOUNT == acc) & (df.VAR1 == v1) & (df.VAR2 == v2)].sort_values('DAY')

    #assign them an ordinal rank
    fdf['DAY_ORD'] = [i for i in range(len(fdf))]

    #set the DAY_ORD values in the original dataframe
    for i in fdf.index:
        df.loc[i, 'DAY_ORD'] = fdf['DAY_ORD'][i]
df['DAY_ORD']

It seems like it will do the job, but it runs very slowly, at around 8 it/s. What is a clean way to make this faster?

英文:

I'm working with time series data, which is packaged in a time long dataframe, something like this:

ACCOUNT | VAR1 | VAR2 | DAY

I'm interested in creating a new column DAY_ORD, which would give the ordinal rank of the DAY variable to each row within a group over unique (ACCOUNT, VAR1, VAR2) triplets.

Here is a small example of what I want to achieve:

ACCOUNT VAR1 VAR2 DAY DAY-ORD
A X True 2022-02-03 0
A X True 2022-02-04 1
B X True 2021-05-18 0
A X True 2022-02-05 2
B X True 2022-05-20 1
A Y True 2022-02-05 0
A X True 2022-03-12 3

Here is my current implementation:

#initialize an empty 'DAY_ORD' column
df['DAY_ORD'] = [None for i in range(len(df))]

#iterate over all triplets that appear in the data
for (_, row) in fb_data[['ACCOUNT', 'VAR1', 'VAR2']].copy().drop_duplicates().iterrows()):
    acc, v1, v2 = row[0], row[1], row[2]

    #find the df slice that adheres to the considered triplet
    fdf = df.loc[(df.ACCOUNT== acc) & (fb_data.VAR1 == v1) & (fb_data.VAR2 == v2)].sort_values('DAY')

    #assign them an ordinal rank
    fdf['DAY_ORD'] = [i for i in range(len(fdf))]

    #set the DAY_ORD values in the original dataframe
    for i in fdf.index:
        df.loc[i, 'DAY_ORD'] = fdf['DAY_ORD'][i]
df['DAY_ORD']

It seems like it will do the job, but it runs very slowly, at around 8 it/s. What is a clean way to make this faster?

答案1

得分: 1

使用GroupBy.rank将值转换为日期时间,减去1并转换为整数:

df['DAY'] = pd.to_datetime(df['DAY'])

df['DAY-ORD'] = (df.groupby(['ACCOUNT', 'VAR1', 'VAR2'])['DAY']
                   .rank('dense').sub(1).astype(int))

print(df)
  ACCOUNT VAR1  VAR2        DAY  DAY-ORD
0       A    X  True 2022-02-03        0
1       A    X  True 2022-02-04        1
2       B    X  True 2021-05-18        0
3       A    X  True 2022-02-05        2
4       B    X  True 2022-05-20        1
5       A    Y  True 2022-02-05        0
6       A    X  True 2022-03-12        3
英文:

Use GroupBy.rank with convert values to datetimes, subtract 1 and convert to integers:

df['DAY'] = pd.to_datetime(df['DAY'])

df['DAY-ORD'] = (df.groupby(['ACCOUNT', 'VAR1', 'VAR2'])['DAY']
                   .rank('dense').sub(1).astype(int))

print (df)
  ACCOUNT VAR1  VAR2        DAY  DAY-ORD
0       A    X  True 2022-02-03        0
1       A    X  True 2022-02-04        1
2       B    X  True 2021-05-18        0
3       A    X  True 2022-02-05        2
4       B    X  True 2022-05-20        1
5       A    Y  True 2022-02-05        0
6       A    X  True 2022-03-12        3

huangapple
  • 本文由 发表于 2023年3月8日 19:33:56
  • 转载请务必保留本文链接:https://go.coder-hub.com/75672488.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定