英文:
Can I use a boolean mask to find if a DateTime value falls between two other DateTime values in a different dataframe
问题
我想筛选我的数据点,直到只剩下参与者处于睡眠状态的数据点。我有一个包含日期时间值和我正在研究的值的数据框,还有一个不同的数据框,其中包含参与者开始睡觉和结束睡觉的时间。我想知道是否有一种方法可以通过迭代大数据框或开始和结束睡眠时间的数据框,或者是否有其他更好的方法,而不必手动输入175个晚上的时间来实现这一点。
开始/停止数据框如下,我为每个参与者都有一个:
df_sleep1:
日期 开始时间 结束时间
5/30/2023 5/29/2023 22:15 5/30/2023 7:22
5/31/2023 5/30/2023 23:19 5/31/2023 6:46
6/1/2023 6/1/2023 0:02 6/1/2023 8:31
包含所有数据的数据框如下,我想添加一个 "asleep" 列:
df:
日期时间 心率 参与者 睡着了
0 2023-05-29 23:44:00 76.0 1
1 2023-05-30 06:44:00 76.0 1
2 2023-05-30 20:45:00 84.0 1
3 2023-05-31 04:45:00 84.0 2
4 2023-06-1 20:46:00 81.0 2
我尝试过的方法:
dt = df['日期时间'].to_numpy()
start1 = df_sleep1['开始时间'].to_numpy()[:, None]
end1 = df_sleep1['结束时间'].to_numpy()[:, None]
mask1 = ((start1 <= dt) & (dt <= end1) & (df['参与者'] == 1))
df['睡着了'] = mask1.any(axis=0)
def sleepFunction(row):
if (df_sleep1['开始时间'] <= dt) & (dt <= df_sleep1['结束时间']) & (df['参与者'] == 1):
return True
else:
return False
df['睡眠状态'] = df.apply(lambda row: sleepFunction(row), axis = 1)
两者都会引发类似于数据框/数组形状不匹配的错误,这也不是我想要的方式。
英文:
I want to filter the datapoints I have, until I only have the datapoints were the participant was asleep left. I have my dataframe with DateTime values and the values I am researching, and a different dataframe that has when the participant started sleeping, and when they ended sleeping. Instead of having to write out every start and stop time in a boolean mask, I wanted to know if there is a way to do this by iterating over either the big dataframe or the dataframe with the start and end sleeping times, or any other better way than having to manually enter 175 nights.
The start/stop dataframe looks like this, I have one for every participant:
df_sleep1:
date start stop
5/30/2023 5/29/2023 22:15 5/30/2023 7:22
5/31/2023 5/30/2023 23:19 5/31/2023 6:46
6/1/2023 6/1/2023 0:02 6/1/2023 8:31
The dataframe with all the data looks like this, where I want to add an "asleep" column:
df:
DateTime HeartRate Participant Asleep
0 2023-05-29 23:44:00 76.0 1
1 2023-05-30 06:44:00 76.0 1
2 2023-05-30 20:45:00 84.0 1
3 2023-05-31 04:45:00 84.0 2
4 2023-06-1 20:46:00 81.0 2
What I have tried:
dt = df['DateTime'].to_numpy()
start1 = df_sleep1['Start'].to_numpy()[:, None]
end1 = df_sleep1['Stop'].to_numpy()[:, None]
mask1 = ((start1 <= dt) & (dt <= end1) & (df['Participant'] == 1))
df['Sleep'] = mask1.any(axis=0)
def sleepFunction(row):
if (df_sleep1['Start'] <= dt) & (dt <= df_sleep1['Stop']) & (df['Participant'] == 1):
return True
else:
return False
df['sleepState'] = df.apply(lambda row: sleepFunction(row), axis = 1)
Both give similar errors about the shapes of the dataframes/arrays not matching up, which is not something I want to do anyways.
答案1
得分: 0
如果时间间隔不重叠,一个高效的方法是使用merge_asof
:按参与者按开始日期合并,然后确保日期在结束日期之后。
# 将参与者ID映射到df_sleep DataFrame
all_sleep = pd.concat({1: df_sleep1}, names=['Participant']).reset_index(level=0)
# 确保具有日期时间类型
all_sleep[['start', 'stop']] = all_sleep[['start', 'stop']].apply(pd.to_datetime)
df['DateTime'] = pd.to_datetime(df['DateTime'])
# 按日期和参与者合并
df['Asleep'] = (
pd.merge_asof(df.sort_values(by='DateTime').reset_index(),
all_sleep.sort_values(by='start'),
left_on='DateTime', right_on='start',
by='Participant'
)
.assign(Asleep=lambda d: d['DateTime'].le(d['stop']))
.set_index('index')['Asleep']
)
输出:
DateTime HeartRate Participant Asleep
0 2023-05-29 23:44:00 76.0 1 True
1 2023-05-30 06:44:00 76.0 1 True
2 2023-05-30 20:45:00 84.0 1 False
3 2023-05-31 04:45:00 84.0 2 False
4 2023-06-01 20:46:00 81.0 2 False
英文:
If the intervals are non-overlapping, an efficient method would be to use a merge_asof
: merge on the starting date by participant, then ensure that the date is after the end.
# map the Participant ID to the df_sleep DataFrame
all_sleep = pd.concat({1: df_sleep1}, names=['Participant']).reset_index(level=0)
# ensure having datetime types
all_sleep[['start', 'stop']] = all_sleep[['start', 'stop']].apply(pd.to_datetime)
df['DateTime'] = pd.to_datetime(df['DateTime'])
# merge by date and participant
df['Asleep'] = (
pd.merge_asof(df.sort_values(by='DateTime').reset_index(),
all_sleep.sort_values(by='start'),
left_on='DateTime', right_on='start',
by='Participant'
)
.assign(Asleep=lambda d: d['DateTime'].le(d['stop']))
.set_index('index')['Asleep']
)
Output:
DateTime HeartRate Participant Asleep
0 2023-05-29 23:44:00 76.0 1 True
1 2023-05-30 06:44:00 76.0 1 True
2 2023-05-30 20:45:00 84.0 1 False
3 2023-05-31 04:45:00 84.0 2 False
4 2023-06-01 20:46:00 81.0 2 False
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论