Can I use a boolean mask to find if a DateTime value falls between two other DateTime values in a different dataframe?


Question


I want to filter my datapoints until only those recorded while the participant was asleep remain. I have one dataframe with DateTime values and the values I am researching, and a different dataframe that records when the participant started sleeping and when they stopped. Instead of writing out every start and stop time in a boolean mask, I would like to know whether this can be done by iterating over either the big dataframe or the dataframe with the start and end sleeping times, or whether there is a better way than manually entering 175 nights.

The start/stop dataframe looks like this; I have one for every participant:

df_sleep1:

date       start            stop
5/30/2023  5/29/2023 22:15  5/30/2023 7:22
5/31/2023  5/30/2023 23:19  5/31/2023 6:46
6/1/2023   6/1/2023 0:02    6/1/2023 8:31

The dataframe with all the data looks like this, where I want to add an "asleep" column:

df:

   DateTime             HeartRate  Participant  Asleep
0  2023-05-29 23:44:00  76.0       1
1  2023-05-30 06:44:00  76.0       1
2  2023-05-30 20:45:00  84.0       1
3  2023-05-31 04:45:00  84.0       2
4  2023-06-01 20:46:00  81.0       2

What I have tried:

dt = df['DateTime'].to_numpy()

start1 = df_sleep1['Start'].to_numpy()[:, None]
end1 = df_sleep1['Stop'].to_numpy()[:, None]

mask1 = ((start1 <= dt) & (dt <= end1) & (df['Participant'] == 1))
df['Sleep'] = mask1.any(axis=0)

def sleepFunction(row):
    if (df_sleep1['Start'] <= dt) & (dt <= df_sleep1['Stop']) & (df['Participant'] == 1):
        return True
    else:
        return False

df['sleepState'] = df.apply(lambda row: sleepFunction(row), axis = 1)

Both attempts raise similar errors about the shapes of the dataframes/arrays not matching up, and neither is really the approach I want to take anyway.

Answer 1

Score: 0


If the intervals are non-overlapping, an efficient method is to use merge_asof: match each row to the most recent start time for the same participant, then check that the row's DateTime is not later than the matching stop time.
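The non-overlap condition can be checked before merging. The following is only a sketch under stated assumptions: `has_overlap` is a hypothetical helper name (not part of pandas), the sample rows are copied from the question's df_sleep1, and the columns are assumed to be named `start` and `stop` as in the table shown there.

```python
import pandas as pd

# Sleep intervals for one participant, copied from the question's df_sleep1.
df_sleep1 = pd.DataFrame({
    "start": pd.to_datetime(["2023-05-29 22:15", "2023-05-30 23:19", "2023-06-01 00:02"]),
    "stop": pd.to_datetime(["2023-05-30 07:22", "2023-05-31 06:46", "2023-06-01 08:31"]),
})


def has_overlap(sleep: pd.DataFrame) -> bool:
    """Return True if any interval starts before the previous one has ended.

    Hypothetical helper: after sorting by start time, comparing each start
    with the preceding stop is enough to detect whether any overlap exists.
    """
    ordered = sleep.sort_values("start")
    previous_stop = ordered["stop"].shift()  # NaT for the first row
    # Comparisons against NaT evaluate to False, so the first row never flags.
    return bool((ordered["start"] < previous_stop).any())


print(has_overlap(df_sleep1))  # False
```

The check applies per participant, so it would be run once on each participant's sleep table. If it returns False everywhere, the merge_asof approach applies as written.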

import pandas as pd

# map the Participant ID to the df_sleep DataFrame
# (one dict entry per participant; only participant 1 is shown here)
all_sleep = pd.concat({1: df_sleep1}, names=['Participant']).reset_index(level=0)

# ensure having datetime types
all_sleep[['start', 'stop']] = all_sleep[['start', 'stop']].apply(pd.to_datetime)
df['DateTime'] = pd.to_datetime(df['DateTime'])

# merge by date and participant
df['Asleep'] = (
 pd.merge_asof(df.sort_values(by='DateTime').reset_index(),
               all_sleep.sort_values(by='start'),
               left_on='DateTime', right_on='start',
               by='Participant'
              )
   .assign(Asleep=lambda d: d['DateTime'].le(d['stop']))
   .set_index('index')['Asleep']
)

Output:

             DateTime  HeartRate  Participant  Asleep
0 2023-05-29 23:44:00       76.0            1    True
1 2023-05-30 06:44:00       76.0            1    True
2 2023-05-30 20:45:00       84.0            1   False
3 2023-05-31 04:45:00       84.0            2   False
4 2023-06-01 20:46:00       81.0            2   False
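
For a modest number of nights, the broadcasting idea from the question can also be made to work. A likely cause of the shape error there is that the 2-D (nights x rows) grid was combined with the 1-D Participant Series before being reduced over the nights. The sketch below is an illustration, not the answer's method: it assumes the sleep columns are named `start` and `stop` as in the question's table, rebuilds the sample data from the question, and reduces the grid first.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2023-05-29 23:44:00", "2023-05-30 06:44:00", "2023-05-30 20:45:00",
        "2023-05-31 04:45:00", "2023-06-01 20:46:00",
    ]),
    "HeartRate": [76.0, 76.0, 84.0, 84.0, 81.0],
    "Participant": [1, 1, 1, 2, 2],
})
df_sleep1 = pd.DataFrame({
    "start": pd.to_datetime(["2023-05-29 22:15", "2023-05-30 23:19", "2023-06-01 00:02"]),
    "stop": pd.to_datetime(["2023-05-30 07:22", "2023-05-31 06:46", "2023-06-01 08:31"]),
})

dt = df["DateTime"].to_numpy()                   # shape (n_rows,)
start1 = df_sleep1["start"].to_numpy()[:, None]  # shape (n_nights, 1)
stop1 = df_sleep1["stop"].to_numpy()[:, None]    # shape (n_nights, 1)

# Broadcasting yields an (n_nights, n_rows) boolean grid;
# collapse it over the nights axis first.
in_any_night = ((start1 <= dt) & (dt <= stop1)).any(axis=0)

# Only then combine with the per-row participant filter, as a plain array.
is_participant_1 = df["Participant"].to_numpy() == 1
df["Asleep"] = in_any_night & is_participant_1

print(df)
```

On the sample data this marks rows 0 and 1 as asleep, matching the output above. The grid holds n_nights x n_rows booleans, so merge_asof is likely the better choice for large inputs, while this version does not depend on the intervals being non-overlapping.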

huangapple
  • Posted on June 19, 2023, 21:14:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76506994.html