Manipulating timeseries data with frequency counts
Question
You can try the following code to achieve the result you want:
# Convert the starttime column to datetime
df['starttime'] = pd.to_datetime(df['starttime'])
# Extract the date part into a new column
df['date'] = df['starttime'].dt.date
# Extract the time part into a new column
df['time'] = df['starttime'].dt.time
# Group by date, then sort each group by the Count column descending and take the 3 largest values per day
result = df.groupby('date').apply(lambda x: x.nlargest(3, 'Count')).reset_index(drop=True)
# Print the result
print(result)
This code groups the data by date, finds the 3 largest values of the Count column within each date group, and combines the results into a new DataFrame.
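As a self-contained illustration of this approach, here is a minimal sketch with a small invented sample frame (the real dataset is not reproduced here):

```python
import pandas as pd

# Invented sample data: two days, a few hourly counts each
df = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2013-10-01 00:00:00", "2013-10-01 01:00:00",
        "2013-10-01 02:00:00", "2013-10-01 03:00:00",
        "2013-10-02 06:00:00", "2013-10-02 07:00:00",
        "2013-10-02 08:00:00", "2013-10-02 09:00:00",
    ]),
    "Count": [274, 140, 67, 37, 300, 250, 100, 90],
})

# Extract the date part into its own column
df["date"] = df["starttime"].dt.date

# Within each date group, keep the 3 largest Count values
result = (df.groupby("date", group_keys=False)
            .apply(lambda x: x.nlargest(3, "Count"))
            .reset_index(drop=True))
print(result)
```

Passing group_keys=False keeps the group labels from being prepended to the index, so the reset_index(drop=True) leaves a clean 0..n index.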
I have a dataframe -
starttime Count
0 2013-10-01 00:00:00 274
1 2013-10-01 01:00:00 140
2 2013-10-01 02:00:00 67
3 2013-10-01 03:00:00 37
4 2013-10-01 04:00:00 57
... ... ...
739 2013-10-31 19:00:00 1690
740 2013-10-31 20:00:00 1207
741 2013-10-31 21:00:00 1011
742 2013-10-31 22:00:00 730
743 2013-10-31 23:00:00 438
In the above dataset the starttime column has hourly data for each day of a month, with a frequency count corresponding to each hour.
I am trying to get the 3 largest or smallest counts with their starttime on a per-day basis; the resultant dataframe should look like this -
starttime Count
0 2013-10-01 00:00:00 274
1 2013-10-01 01:00:00 140
2 2013-10-01 02:00:00 67
3 2013-10-02 06:00:00 300
4 2013-10-02 08:00:00 250
5 2013-10-02 08:00:00 100
I have tried multiple manipulations to achieve this but failed.
I am not posting my attempts, as there were many hit-and-trial efforts.
I have tried splitting the date and time separately -
df2_start_timeseries['date'] = pd.to_datetime(df2_start_timeseries['starttime']).dt.date
df2_start_timeseries['time'] = pd.to_datetime(df2_start_timeseries['starttime']).dt.time
and tried grouping them and sorting by the counts column, but that didn't work either.
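For reference, that split-and-group attempt can be completed; a minimal sketch, assuming the column names from the question and inventing a small sample frame:

```python
import pandas as pd

# Invented sample data standing in for the question's frame
df2_start_timeseries = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2013-10-01 00:00:00", "2013-10-01 01:00:00",
        "2013-10-01 02:00:00", "2013-10-01 03:00:00",
        "2013-10-02 06:00:00", "2013-10-02 08:00:00",
        "2013-10-02 09:00:00", "2013-10-02 10:00:00",
    ]),
    "Count": [274, 140, 67, 37, 300, 250, 100, 90],
})

# Split off the date, as attempted in the question...
df2_start_timeseries["date"] = df2_start_timeseries["starttime"].dt.date

# ...then sort by Count descending and keep the first 3 rows per date
top3 = (df2_start_timeseries
        .sort_values("Count", ascending=False)
        .groupby("date")
        .head(3)
        .sort_values("starttime"))
print(top3)
```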
Answer 1
Score: 3
You could groupby date, and rank the count values (below I have used 'max'). Then return only the rows where the rank is '<=3':
import pandas as pd
import numpy as np
np.random.seed(4)
times = pd.date_range("2013-10-01", "2013-10-31", freq="H")
c = np.random.randint(0, 500, len(times))
df = pd.DataFrame({"starttime": times,
"count": c})
# groupby by date, rank the 'count' column and only return rows where rank is <= 3
df.loc[df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False).le(3), :]
If there might be duplicates that you would also want to keep, you could use 'dense', which will take the minimum values (see the pandas rank documentation for more).
Alternatively, you could sort the values then take the first 3 in each date group, as shown below as # option 2.
Why rank().le(3)
The question asks for the minimum 3, or maximum 3 values for each day, so the "count" column needs to be ordered somehow within each group.
# for maximum
# option 1: rank().le(3)
ranked = df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False)
ranked.head()
# Out[]:
# 0 19.0
# 1 16.0
# 2 3.0
# 3 15.0
# 4 6.0
# Name: count, dtype: float64
ranked = ranked.le(3) # returned boolean of True/False if less than or equal to 3
df.loc[ranked, :] # returns the filtered dataframe.
# Out[]:
# starttime count
# 2 2013-10-01 02:00:00 439
# 7 2013-10-01 07:00:00 456
# 16 2013-10-01 16:00:00 439
# 24 2013-10-02 00:00:00 490
# 33 2013-10-02 09:00:00 486
# .. ... ...
# 695 2013-10-29 23:00:00 476
# 709 2013-10-30 13:00:00 484
# 711 2013-10-30 15:00:00 444
# 715 2013-10-30 19:00:00 447
# 720 2013-10-31 00:00:00 495
# [89 rows x 2 columns]
# option 2: sort_values().head(3)
# sort the rows by count and start time, then groupby date and take the first 3 values
sorted_ = df.sort_values(["count", "starttime"], ascending=[False, True]).groupby(df["starttime"].dt.date).head(3)
# sort again by starttime so they are back in the correct order.
sorted_.sort_values("starttime")
# Out[]:
# starttime count
# 2 2013-10-01 02:00:00 439
# 7 2013-10-01 07:00:00 456
# 16 2013-10-01 16:00:00 439
# 24 2013-10-02 00:00:00 490
# 33 2013-10-02 09:00:00 486
# .. ... ...
# 695 2013-10-29 23:00:00 476
# 709 2013-10-30 13:00:00 484
# 711 2013-10-30 15:00:00 444
# 715 2013-10-30 19:00:00 447
# 720 2013-10-31 00:00:00 495
# [91 rows x 2 columns]
As you can see in the above, there is a slight difference in what is returned. The above rank("max", ascending=False) will set duplicates to the last of the duplicate rank, i.e. if sorted as [5, 4, 4, 3, 2], the rank would be [1, 3, 3, 4, 5]. Using "min" gives the first of the duplicate rank: [1, 2, 2, 4, 5], and using rank("dense", ascending=False) it would instead be [1, 2, 2, 3, 4] - both of which might return more than 3 rows of data, but all of those rows will carry the 3 highest counts.
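The three tie-handling methods can be verified directly on that example sequence:

```python
import pandas as pd

s = pd.Series([5, 4, 4, 3, 2])

# "max": tied values share the last (largest) rank of the tied block
print(s.rank(method="max", ascending=False).tolist())    # [1.0, 3.0, 3.0, 4.0, 5.0]
# "min": tied values share the first (smallest) rank of the tied block
print(s.rank(method="min", ascending=False).tolist())    # [1.0, 2.0, 2.0, 4.0, 5.0]
# "dense": like "min", but ranks step by 1 between distinct values
print(s.rank(method="dense", ascending=False).tolist())  # [1.0, 2.0, 2.0, 3.0, 4.0]
```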
Answer 2
Score: 2
You can group by date and then select the 3 largest/smallest from each group with nlargest/nsmallest:
top_3_large = df.groupby(df["starttime"].dt.date, group_keys=False)["count"].nlargest(3)
7 456
2 439
16 439
24 490
33 486
...
678 471
709 484
715 447
711 444
720 495
Name: count, Length: 91, dtype: int64
(You can change group_keys to True to see the groups clearly when printing)
Then join the result with the initial df's starttime column to recover the starttime of each instance:
pd.concat([df["starttime"], top_3_large.rename("3_largest")], join="inner", axis=1)
starttime 3_largest
2 2013-10-01 02:00:00 439
7 2013-10-01 07:00:00 456
16 2013-10-01 16:00:00 439
24 2013-10-02 00:00:00 490
33 2013-10-02 09:00:00 486
.. ... ...
695 2013-10-29 23:00:00 476
709 2013-10-30 13:00:00 484
711 2013-10-30 15:00:00 444
715 2013-10-30 19:00:00 447
720 2013-10-31 00:00:00 495
[91 rows x 2 columns]
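For the smallest 3 per day, the same pattern works with nsmallest; a sketch reusing the synthetic data setup from the first answer (shortened to three days, so the counts here are illustrative only):

```python
import pandas as pd
import numpy as np

# Synthetic hourly data, as in the first answer but over three days
np.random.seed(4)
times = pd.date_range("2013-10-01", "2013-10-03", freq="H")
df = pd.DataFrame({"starttime": times,
                   "count": np.random.randint(0, 500, len(times))})

# 3 smallest counts within each date group
bottom_3 = df.groupby(df["starttime"].dt.date, group_keys=False)["count"].nsmallest(3)

# Join back against starttime to recover the timestamps
result = pd.concat([df["starttime"], bottom_3.rename("3_smallest")],
                   join="inner", axis=1)
print(result)
```

Note that the last "day" in this range holds only the single midnight timestamp, so nsmallest(3) simply returns what is available for that group.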