How to make a new dataframe with output from Pandas apply function?
Question
I'm currently struggling with a problem that I'm trying to solve without for loops (even though a loop would be easier for me to understand), using the idiomatic pandas approach instead.
The problem I'm facing is that I have a big dataframe of logs, allLogs, like:
index  message   date_time            user_id
0      message1  2023-01-01 09:00:49  123
1      message2  2023-01-01 09:00:58  123
2      message3  2023-01-01 09:01:03  125
... etc
I'm doing analysis per user_id, for which I've written a function. This function needs a subset of the allLogs dataframe: all ids, messages and date_times for one user_id. In other words, I want to run the function once for each unique user_id.
This function calculates the time between consecutive messages and makes a Series of all those time deltas (time differences). I want to turn this into a separate dataframe that holds a list/series/array of time deltas for each unique user_id.
The current function looks like this:
def makeSeriesPerUser(df):
    df = df[['message','date_time']]
    df = df.drop_duplicates(['date_time','message'])
    df = df.sort_values(by='date_time', inplace = True)
    m1 = (df['message'] == df['message'].shift(-1))
    df = df[~(m1)]
    df = (df['date_time'].shift(-1) - df['date_time'])
    df = df.reset_index(drop=True)
    seconds = m1.astype('timedelta64[s]')
    return seconds
And I use allLogs.groupby('user_id').apply(lambda x: makeSeriesPerUser(x)) to apply it to my user_id groups.
How do I, instead of returning something and adding it to the existing dataframe, make a new dataframe that holds a series of these time deltas for each unique user_id (each user has a different number of logs)?
Answer 1
Score: 1
First off, you should use method chaining. It's much simpler to read.
Secondly, pd.DataFrame.groupby().apply can take the function itself. No lambda is required.
Your sort_values(inplace=True) returns None, so assigning its result back to df replaces the dataframe with None. Dropping inplace=True makes the call return the sorted DataFrame.
def makeSeriesPerUser(df):
    df = df[['message','date_time']]
    df = df.drop_duplicates(['date_time','message'])
    df = df.sort_values(by='date_time', inplace = True)
    m1 = (df['message'] == df['message'].shift(-1))
    df = df[~(m1)]
    df = (df['date_time'].shift(-1) - df['date_time'])
    df = df.reset_index(drop=True)
    seconds = m1.astype('timedelta64[s]')
    return seconds
Turns into:
import pandas as pd

def extract_timedelta(df_grouped_by_user: pd.DataFrame) -> pd.Series:
    selected_columns = ['message', 'date_time']
    time_delta = (
        df_grouped_by_user[selected_columns]
        .drop_duplicates(selected_columns)  # drop duplicate entries
        ['date_time']                       # select the date_time column
        .sort_values()                      # sort the timestamps
        .diff()                             # difference between consecutive timestamps
        .astype('timedelta64[s]')           # express the differences in seconds
        .reset_index(drop=True)
    )
    return time_delta

time_delta_df = allLogs.groupby('user_id').apply(extract_timedelta)
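This returns the time deltas grouped by user_id. Despite the variable name, the result is actually a Series rather than a DataFrame (at least when users have different numbers of logs), with a MultiIndex whose entries are (user_id, int) tuples: the user, and the position of the time delta within that user's logs.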
If you want a new dataframe with users as columns, you can do this:
data = {group_name: extract_timedelta(group_df) for group_name, group_df in allLogs.groupby('user_id')}
time_delta_df = pd.DataFrame(data)
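One thing to be aware of with the column-per-user layout: since each user has a different number of logs, the shorter columns get padded with missing values. A small sketch with invented numbers (the dict `deltas` is hypothetical and stands in for the `data` dict above):

```python
import pandas as pd

# Hypothetical per-user time differences in seconds, of different lengths.
deltas = {
    123: pd.Series([9.0, 12.0]),
    125: pd.Series([2.0]),
}

# Building a DataFrame from a dict of Series aligns them on their index;
# positions that a shorter Series lacks are filled with NaN.
wide = pd.DataFrame(deltas)
print(wide)

# Dropping the padding again recovers one user's actual values.
user_125 = wide[125].dropna()
```

If the padding gets in the way, the long format (one row per time delta) may be the more convenient shape.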
Answer 2
Score: 0
You should just create a dict where the keys are the user IDs and the values are the relevant DataFrames per user. There is no need to keep everything in one giant DataFrame, unless you have millions of users with only a few records apiece.
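A minimal sketch of what that could look like, assuming a log frame shaped like the one in the question (the names `logs` and `frames_by_user` are made up for this example):

```python
import pandas as pd

# Hypothetical sample logs, invented purely for illustration.
logs = pd.DataFrame({
    "message": ["message1", "message2", "message3"],
    "date_time": pd.to_datetime([
        "2023-01-01 09:00:49",
        "2023-01-01 09:00:58",
        "2023-01-01 09:01:03",
    ]),
    "user_id": [123, 123, 125],
})

# Iterating over a groupby yields (key, sub-DataFrame) pairs,
# which maps directly onto a dict keyed by user ID.
frames_by_user = {user_id: group for user_id, group in logs.groupby("user_id")}

# Each value is an ordinary DataFrame holding only that user's rows.
print(frames_by_user[123])
```

Any per-user function can then be called on `frames_by_user[user_id]` without touching the other users' rows.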