Efficient way for ordering values within group in Pandas
Question

I'm working with time series data stored in a long-format dataframe, something like this:
ACCOUNT | VAR1 | VAR2 | DAY
I'm interested in creating a new column `DAY_ORD`, which would give the ordinal rank of the `DAY` variable to each row within a group, over unique `(ACCOUNT, VAR1, VAR2)` triplets.
Here is a small example of what I want to achieve:
ACCOUNT | VAR1 | VAR2 | DAY | DAY_ORD |
---|---|---|---|---|
A | X | True | 2022-02-03 | 0 |
A | X | True | 2022-02-04 | 1 |
B | X | True | 2021-05-18 | 0 |
A | X | True | 2022-02-05 | 2 |
B | X | True | 2022-05-20 | 1 |
A | Y | True | 2022-02-05 | 0 |
A | X | True | 2022-03-12 | 3 |
Here is my current implementation:
```python
# initialize an empty 'DAY_ORD' column
df['DAY_ORD'] = None

# iterate over all triplets that appear in the data
for _, row in df[['ACCOUNT', 'VAR1', 'VAR2']].drop_duplicates().iterrows():
    acc, v1, v2 = row['ACCOUNT'], row['VAR1'], row['VAR2']
    # find the df slice that matches the considered triplet
    fdf = df.loc[(df.ACCOUNT == acc) & (df.VAR1 == v1) & (df.VAR2 == v2)].sort_values('DAY')
    # assign an ordinal rank
    fdf['DAY_ORD'] = range(len(fdf))
    # set the DAY_ORD values in the original dataframe
    for i in fdf.index:
        df.loc[i, 'DAY_ORD'] = fdf['DAY_ORD'][i]

df['DAY_ORD']
```
It seems to do the job, but it runs very slowly, at around 8 iterations per second. What is a clean way to make this faster?
Answer 1 (score: 1)

Use `GroupBy.rank` after converting the values to datetimes, then subtract 1 and convert to integers:
```python
df['DAY'] = pd.to_datetime(df['DAY'])
df['DAY_ORD'] = (df.groupby(['ACCOUNT', 'VAR1', 'VAR2'])['DAY']
                   .rank('dense').sub(1).astype(int))
print(df)
```

```
  ACCOUNT VAR1  VAR2        DAY  DAY_ORD
0       A    X  True 2022-02-03        0
1       A    X  True 2022-02-04        1
2       B    X  True 2021-05-18        0
3       A    X  True 2022-02-05        2
4       B    X  True 2022-05-20        1
5       A    Y  True 2022-02-05        0
6       A    X  True 2022-03-12        3
```
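An alternative sketch (not from the original answer): sorting by `DAY` and numbering rows within each group with `GroupBy.cumcount` gives the same ordinals, with the difference that tied dates receive distinct consecutive numbers rather than the equal ranks that `rank('dense')` produces. The assignment aligns back to the original row order by index:

```python
import pandas as pd

# Sample data mirroring the question's example
df = pd.DataFrame({
    'ACCOUNT': ['A', 'A', 'B', 'A', 'B', 'A', 'A'],
    'VAR1':    ['X', 'X', 'X', 'X', 'X', 'Y', 'X'],
    'VAR2':    [True] * 7,
    'DAY':     ['2022-02-03', '2022-02-04', '2021-05-18',
                '2022-02-05', '2022-05-20', '2022-02-05', '2022-03-12'],
})
df['DAY'] = pd.to_datetime(df['DAY'])

# Sort by DAY so rows within each group are in chronological order,
# then number them 0, 1, 2, ... per (ACCOUNT, VAR1, VAR2) group.
# The resulting Series keeps the original index labels, so plain
# column assignment realigns it to the unsorted frame.
df['DAY_ORD'] = (df.sort_values('DAY')
                   .groupby(['ACCOUNT', 'VAR1', 'VAR2'])
                   .cumcount())

print(df['DAY_ORD'].tolist())  # [0, 1, 0, 2, 1, 0, 3]
```

Both approaches replace the per-triplet Python loop with a single vectorized groupby, which is what removes the ~8 it/s bottleneck.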