操纵时间序列数据,使用频率计数。

huangapple go评论68阅读模式
英文:

manipulating timeseries data with frequency counts

问题

你可以尝试使用以下代码来实现你想要的结果:

# 将starttime列转换为日期时间类型
df['starttime'] = pd.to_datetime(df['starttime'])

# 提取日期部分并将其作为新的列
df['date'] = df['starttime'].dt.date

# 提取时间部分并将其作为新的列
df['time'] = df['starttime'].dt.time

# 使用groupby按日期分组,然后在每个组内按Count列降序排序,获取每天的前3个最大值
result = df.groupby('date').apply(lambda x: x.nlargest(3, 'Count')).reset_index(drop=True)

# 打印结果
print(result)

这段代码会将数据按日期分组,然后在每个日期组内找到Count列的前3个最大值,最后将结果合并成一个新的DataFrame。

英文:

I have a dataframe -

            starttime	Count
0	2013-10-01 00:00:00	274
1	2013-10-01 01:00:00	140
2	2013-10-01 02:00:00	67
3	2013-10-01 03:00:00	37
4	2013-10-01 04:00:00	57
...	...	...
739	2013-10-31 19:00:00	1690
740	2013-10-31 20:00:00	1207
741	2013-10-31 21:00:00	1011
742	2013-10-31 22:00:00	730
743	2013-10-31 23:00:00	438

In the above dataset the starttime column as hourly wise data for each day in a month with frequency count corresponding to each hour.
I am trying to get the 3 largest or smallest counts with their starttime per day basis, the resultant dataframe should look like this -

              starttime	    Count
0	2013-10-01 00:00:00	      274
1	2013-10-01 01:00:00	      140
2	2013-10-01 02:00:00	       67
3   2013-10-02 06:00:00       300
4   2013-10-02 08:00:00       250
5   2013-10-02 08:00:00       100

I have tried multiple manupulation to achieve this but failed.
I am not posting my attempts as there were many hit and trials to achieve it.

I have tried splitting the date and time separetely -

df2_start_timeseries['date'] = pd.to_datetime(df2_start_timeseries['starttime']).dt.date
df2_start_timeseries['time'] = pd.to_datetime(df2_start_timeseries['starttime']).dt.time

and tried grouping them and sort it by counts column, that too didnt work.

答案1

得分: 3

你可以按日期分组,对计数值进行排名(下面我使用了'max')。然后只返回排名为'<=3'的行:

df.loc[df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False).le(3), :]

如果可能存在重复,你也可以使用'dense',它将采用最小值。查看更多这里

或者,你可以对值进行排序,然后在每个日期组中取前3个,如下所示 # option 2

为什么使用 rank().le(3):

问题要求每天的最小3个或最大3个值,所以“count”列需要在每个组内以某种方式排序。

# 对于最大值
# 选项 1: rank().le(3)
ranked = df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False)
ranked.head()
# 返回布尔值True/False,表示是否小于或等于3
ranked = ranked.le(3)
df.loc[ranked, :]  # 返回过滤后的数据框
# 选项 2: sort_values().head(3)
# 按count和starttime排序行,然后按日期分组,取前3个值
sorted_ = df.sort_values(["count", "starttime"], ascending=[False, True]).groupby(df["starttime"].dt.date).head(3)
# 再次按starttime排序,以确保它们以正确的顺序返回。
sorted_.sort_values("starttime")

如上所示,返回的结果存在轻微差异。上述rank("max", ascending=False)将将重复项设置为重复排名的最后一个,即如果排序为[5, 4, 4, 3, 2],则排名为[1, 3, 3, 4, 5]。使用"min"将是重复排名的第一个:[1, 2, 2, 4, 5],使用rank("dense", ascending=False)将是[1, 2, 2, 3, 4],可能返回超过3行数据,但其中所有行都具有最高的3个计数。

英文:

You could groupby date, and rank the count values (below I have used 'max'). Then return only the rows where the rank is '<=3':

import pandas as pd
import numpy as np

np.random.seed(4)

times = pd.date_range(&quot;2013-10-01&quot;, &quot;2013-10-31&quot;, freq=&quot;H&quot;)
c = np.random.randint(0, 500, len(times))

df = pd.DataFrame({&quot;starttime&quot;: times,
                   &quot;count&quot;: c})

# groupby by date, rank the &#39;count&#39; column and only return rows where rank is &lt;= 3
df.loc[df.groupby(df[&quot;starttime&quot;].dt.date)[&quot;count&quot;].rank(&quot;max&quot;, ascending=False).le(3), :]

If there might be duplicates that you would also want to use, you could use 'dense' which will take the minimum values. See more here.

Alternatively, you could sort the values then take the first 3 in each date group, as shown below as # option 2.

Why rank().le(3)

The question asks for the minimum 3, or maximum 3 values for each day, so the "count" column needs to be ordered somehow within each group.

# for maximum
# option 1: rank().le(3)
ranked = df.groupby(df[&quot;starttime&quot;].dt.date)[&quot;count&quot;].rank(&quot;max&quot;, ascending=False)
ranked.head()
# Out[]: 
# 0    19.0
# 1    16.0
# 2     3.0
# 3    15.0
# 4     6.0
# Name: count, dtype: float64
ranked = ranked.le(3)  # returned boolean of True/False if less than or equal to 3
df.loc[ranked, :]  # returns the filtered dataframe.
# Out[]: 
#               starttime  count
# 2   2013-10-01 02:00:00    439
# 7   2013-10-01 07:00:00    456
# 16  2013-10-01 16:00:00    439
# 24  2013-10-02 00:00:00    490
# 33  2013-10-02 09:00:00    486
# ..                  ...    ...
# 695 2013-10-29 23:00:00    476
# 709 2013-10-30 13:00:00    484
# 711 2013-10-30 15:00:00    444
# 715 2013-10-30 19:00:00    447
# 720 2013-10-31 00:00:00    495

# [89 rows x 2 columns]

# option 2: sort_values().head(3)
# sort the rows by count and start time, then groupby date and take the first 3 values
sorted_ = df.sort_values([&quot;count&quot;, &quot;starttime&quot;], ascending=[False, True]).groupby(df[&quot;starttime&quot;].dt.date).head(3)
# sort again by starttime so they are back in the correct order.
sorted_.sort_values(&quot;starttime&quot;)
# Out[]: 
#               starttime  count
# 2   2013-10-01 02:00:00    439
# 7   2013-10-01 07:00:00    456
# 16  2013-10-01 16:00:00    439
# 24  2013-10-02 00:00:00    490
# 33  2013-10-02 09:00:00    486
# ..                  ...    ...
# 695 2013-10-29 23:00:00    476
# 709 2013-10-30 13:00:00    484
# 711 2013-10-30 15:00:00    444
# 715 2013-10-30 19:00:00    447
# 720 2013-10-31 00:00:00    495

# [91 rows x 2 columns]

As you can see in the above, there is a slight difference in what is returned. The above rank(&quot;max&quot;, ascending=False) will set duplicates to be the last of the duplicate rank, i.e. if sorted as [5, 4, 4, 3, 2], the rank would be [1, 3, 3, 4, 5]. Using &quot;min&quot; will be first of duplicate rank: [1, 2, 2, 4, 5], and using rank(&quot;dense&quot;, ascending=False), then it would instead be [1, 2, 2, 3, 4] - both of which might return more than 3 rows of data, but all of the rows with the 3 highest counts.

答案2

得分: 2

你可以按日期分组,然后使用nlargest / nsmallest选择每个组中的前3个最大/最小值:

top_3_large = df.groupby(df["starttime"].dt.date, group_keys=False)["count"].nlargest(3)
7      456
2      439
16     439
24     490
33     486
...
678    471
709    484
715    447
711    444
720    495
Name: count, Length: 91, dtype: int64

(如果要打印时清晰地看到组,请将group_keys更改为True

然后,将结果与初始df的starttime列连接,以恢复每个实例的starttime

pd.concat([df["starttime"], top_3_large.rename("3_largest")], join="inner", axis=1)
              starttime  3_largest
2   2013-10-01 02:00:00        439
7   2013-10-01 07:00:00        456
16  2013-10-01 16:00:00        439
24  2013-10-02 00:00:00        490
33  2013-10-02 09:00:00        486
...
695 2013-10-29 23:00:00        476
709 2013-10-30 13:00:00        484
711 2013-10-30 15:00:00        444
715 2013-10-30 19:00:00        447
720 2013-10-31 00:00:00        495

[91 rows x 2 columns]
英文:

You can group by date and then select the 3 largest/smallest from each group with nlargest/nsmallest:

top_3_large = df.groupby(df[&quot;starttime&quot;].dt.date, group_keys=False)[&quot;count&quot;].nlargest(3)
7      456
2      439
16     439
24     490
33     486
      ... 
678    471
709    484
715    447
711    444
720    495
Name: count, Length: 91, dtype: int64

(You can change group_keys to True to see the groups clearly when printing)

Then join the result with the initial df's starttime column to recover the starttime of each instance:

pd.concat([df[&quot;starttime&quot;], top_3_large.rename(&quot;3_largest&quot;)], join=&quot;inner&quot;, axis=1)
              starttime  3_largest
2   2013-10-01 02:00:00        439
7   2013-10-01 07:00:00        456
16  2013-10-01 16:00:00        439
24  2013-10-02 00:00:00        490
33  2013-10-02 09:00:00        486
..                  ...        ...
695 2013-10-29 23:00:00        476
709 2013-10-30 13:00:00        484
711 2013-10-30 15:00:00        444
715 2013-10-30 19:00:00        447
720 2013-10-31 00:00:00        495

[91 rows x 2 columns]

huangapple
  • 本文由 发表于 2023年5月15日 02:16:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/76249026.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定