Manipulating timeseries data with frequency counts
Question
You can try the following code to achieve the result you want:
# Convert the starttime column to datetime
df['starttime'] = pd.to_datetime(df['starttime'])
# Extract the date part into a new column
df['date'] = df['starttime'].dt.date
# Extract the time part into a new column
df['time'] = df['starttime'].dt.time
# Group by date, then sort each group by the Count column descending and take the 3 largest values per day
result = df.groupby('date').apply(lambda x: x.nlargest(3, 'Count')).reset_index(drop=True)
# Print the result
print(result)
This code groups the data by date, finds the 3 largest values of the Count column within each date group, and combines the results into a new DataFrame.
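As a self-contained illustration of this approach, here is a minimal sketch with a small invented sample frame (the real dataset is not reproduced here):

```python
import pandas as pd

# Invented sample data: two days, a few hourly counts each
df = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2013-10-01 00:00:00", "2013-10-01 01:00:00",
        "2013-10-01 02:00:00", "2013-10-01 03:00:00",
        "2013-10-02 06:00:00", "2013-10-02 07:00:00",
        "2013-10-02 08:00:00", "2013-10-02 09:00:00",
    ]),
    "Count": [274, 140, 67, 37, 300, 250, 100, 90],
})

# Extract the date part into its own column
df["date"] = df["starttime"].dt.date

# Within each date group, keep the 3 largest Count values
result = (df.groupby("date", group_keys=False)
            .apply(lambda x: x.nlargest(3, "Count"))
            .reset_index(drop=True))
print(result)
```

Passing group_keys=False keeps the group labels from being prepended to the index, so the reset_index(drop=True) leaves a clean 0..n index.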
I have a dataframe -
starttime Count
0 2013-10-01 00:00:00 274
1 2013-10-01 01:00:00 140
2 2013-10-01 02:00:00 67
3 2013-10-01 03:00:00 37
4 2013-10-01 04:00:00 57
... ... ...
739 2013-10-31 19:00:00 1690
740 2013-10-31 20:00:00 1207
741 2013-10-31 21:00:00 1011
742 2013-10-31 22:00:00 730
743 2013-10-31 23:00:00 438
In the above dataset the starttime column has hourly data for each day of a month, with a frequency count corresponding to each hour.
I am trying to get the 3 largest or smallest counts with their starttime on a per-day basis; the resultant dataframe should look like this -
starttime Count
0 2013-10-01 00:00:00 274
1 2013-10-01 01:00:00 140
2 2013-10-01 02:00:00 67
3 2013-10-02 06:00:00 300
4 2013-10-02 08:00:00 250
5 2013-10-02 08:00:00 100
I have tried multiple manipulations to achieve this but failed.
I am not posting my attempts, as there were many hit-and-trial efforts.
I have tried splitting the date and time separately -
df2_start_timeseries['date'] = pd.to_datetime(df2_start_timeseries['starttime']).dt.date
df2_start_timeseries['time'] = pd.to_datetime(df2_start_timeseries['starttime']).dt.time
and tried grouping them and sorting by the counts column, but that didn't work either.
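For reference, that split-and-group attempt can be completed; a minimal sketch, assuming the column names from the question and inventing a small sample frame:

```python
import pandas as pd

# Invented sample data standing in for the question's frame
df2_start_timeseries = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2013-10-01 00:00:00", "2013-10-01 01:00:00",
        "2013-10-01 02:00:00", "2013-10-01 03:00:00",
        "2013-10-02 06:00:00", "2013-10-02 08:00:00",
        "2013-10-02 09:00:00", "2013-10-02 10:00:00",
    ]),
    "Count": [274, 140, 67, 37, 300, 250, 100, 90],
})

# Split off the date, as attempted in the question...
df2_start_timeseries["date"] = df2_start_timeseries["starttime"].dt.date

# ...then sort by Count descending and keep the first 3 rows per date
top3 = (df2_start_timeseries
        .sort_values("Count", ascending=False)
        .groupby("date")
        .head(3)
        .sort_values("starttime"))
print(top3)
```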
Answer 1
Score: 3
You could groupby date, and rank the count values (below I have used 'max'). Then return only the rows where the rank is '<=3':
import pandas as pd
import numpy as np
np.random.seed(4)
times = pd.date_range("2013-10-01", "2013-10-31", freq="H")
c = np.random.randint(0, 500, len(times))
df = pd.DataFrame({"starttime": times,
"count": c})
# groupby by date, rank the 'count' column and only return rows where rank is <= 3
df.loc[df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False).le(3), :]
If there might be duplicates that you would also want to keep, you could use 'dense', which will take the minimum values (see the pandas rank documentation for more).
Alternatively, you could sort the values then take the first 3 in each date group, as shown below as # option 2.
Why rank().le(3)
The question asks for the minimum 3, or maximum 3 values for each day, so the "count" column needs to be ordered somehow within each group.
# for maximum
# option 1: rank().le(3)
ranked = df.groupby(df["starttime"].dt.date)["count"].rank("max", ascending=False)
ranked.head()
# Out[]:
# 0 19.0
# 1 16.0
# 2 3.0
# 3 15.0
# 4 6.0
# Name: count, dtype: float64
ranked = ranked.le(3) # returned boolean of True/False if less than or equal to 3
df.loc[ranked, :] # returns the filtered dataframe.
# Out[]:
# starttime count
# 2 2013-10-01 02:00:00 439
# 7 2013-10-01 07:00:00 456
# 16 2013-10-01 16:00:00 439
# 24 2013-10-02 00:00:00 490
# 33 2013-10-02 09:00:00 486
# .. ... ...
# 695 2013-10-29 23:00:00 476
# 709 2013-10-30 13:00:00 484
# 711 2013-10-30 15:00:00 444
# 715 2013-10-30 19:00:00 447
# 720 2013-10-31 00:00:00 495
# [89 rows x 2 columns]
# option 2: sort_values().head(3)
# sort the rows by count and start time, then groupby date and take the first 3 values
sorted_ = df.sort_values(["count", "starttime"], ascending=[False, True]).groupby(df["starttime"].dt.date).head(3)
# sort again by starttime so they are back in the correct order.
sorted_.sort_values("starttime")
# Out[]:
# starttime count
# 2 2013-10-01 02:00:00 439
# 7 2013-10-01 07:00:00 456
# 16 2013-10-01 16:00:00 439
# 24 2013-10-02 00:00:00 490
# 33 2013-10-02 09:00:00 486
# .. ... ...
# 695 2013-10-29 23:00:00 476
# 709 2013-10-30 13:00:00 484
# 711 2013-10-30 15:00:00 444
# 715 2013-10-30 19:00:00 447
# 720 2013-10-31 00:00:00 495
# [91 rows x 2 columns]
As you can see in the above, there is a slight difference in what is returned. The above rank("max", ascending=False) will set duplicates to the last of the duplicate rank, i.e. if sorted as [5, 4, 4, 3, 2], the rank would be [1, 3, 3, 4, 5]. Using "min" gives the first of the duplicate rank: [1, 2, 2, 4, 5], and using rank("dense", ascending=False) it would instead be [1, 2, 2, 3, 4] - both of which might return more than 3 rows of data, but all of those rows will carry the 3 highest counts.
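The three tie-handling methods can be verified directly on that example sequence:

```python
import pandas as pd

s = pd.Series([5, 4, 4, 3, 2])

# "max": tied values share the last (largest) rank of the tied block
print(s.rank(method="max", ascending=False).tolist())    # [1.0, 3.0, 3.0, 4.0, 5.0]
# "min": tied values share the first (smallest) rank of the tied block
print(s.rank(method="min", ascending=False).tolist())    # [1.0, 2.0, 2.0, 4.0, 5.0]
# "dense": like "min", but ranks step by 1 between distinct values
print(s.rank(method="dense", ascending=False).tolist())  # [1.0, 2.0, 2.0, 3.0, 4.0]
```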
Answer 2
Score: 2
You can group by date and then select the 3 largest/smallest from each group with nlargest/nsmallest:
top_3_large = df.groupby(df["starttime"].dt.date, group_keys=False)["count"].nlargest(3)
7 456
2 439
16 439
24 490
33 486
...
678 471
709 484
715 447
711 444
720 495
Name: count, Length: 91, dtype: int64
(You can change group_keys to True to see the groups clearly when printing)
Then join the result with the initial df's starttime column to recover the starttime of each instance:
pd.concat([df["starttime"], top_3_large.rename("3_largest")], join="inner", axis=1)
starttime 3_largest
2 2013-10-01 02:00:00 439
7 2013-10-01 07:00:00 456
16 2013-10-01 16:00:00 439
24 2013-10-02 00:00:00 490
33 2013-10-02 09:00:00 486
.. ... ...
695 2013-10-29 23:00:00 476
709 2013-10-30 13:00:00 484
711 2013-10-30 15:00:00 444
715 2013-10-30 19:00:00 447
720 2013-10-31 00:00:00 495
[91 rows x 2 columns]
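For the smallest 3 per day, the same pattern works with nsmallest; a sketch reusing the synthetic data setup from the first answer (shortened to three days, so the counts here are illustrative only):

```python
import pandas as pd
import numpy as np

# Synthetic hourly data, as in the first answer but over three days
np.random.seed(4)
times = pd.date_range("2013-10-01", "2013-10-03", freq="H")
df = pd.DataFrame({"starttime": times,
                   "count": np.random.randint(0, 500, len(times))})

# 3 smallest counts within each date group
bottom_3 = df.groupby(df["starttime"].dt.date, group_keys=False)["count"].nsmallest(3)

# Join back against starttime to recover the timestamps
result = pd.concat([df["starttime"], bottom_3.rename("3_smallest")],
                   join="inner", axis=1)
print(result)
```

Note that the last "day" in this range holds only the single midnight timestamp, so nsmallest(3) simply returns what is available for that group.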