How to add a column to a dataset based on a certain criterion

Question

I have a dataset containing the times at which specific vehicles pass a specific point. I need to add a column indicating how many times each vehicle has passed that point so far. In addition, the count must reset whenever the time between two consecutive passes of the same vehicle exceeds a certain threshold.

For example:

Vehicle   Time    Number of passes
A         00:15   1
B         00:20   1
C         00:25   1
C         00:45   2
A         00:59   2
A         01:56   3
B         22:55   1   (time gap above the threshold, so reset)
A         23:49   1   (time gap above the threshold, so reset)

df['period'] = pd.to_datetime(df['date_time'])
df['Number'] = df.groupby(['Vehicle']).cumcount().add(1)

I think this just counts every pass without resetting when the gap exceeds the threshold, and I have no idea how to implement that part.

Answer 1

Score: 0

My first idea is to simply split df into parts and then compute the result for each part separately.

This is not perfect, but it appears to work:

import pandas as pd

# Add an "epoch" column: the number of gaps between consecutive rows
# that have exceeded the threshold so far.
# The result is computed separately for each epoch.
df['epoch'] = (
    pd.to_datetime(df['Time']).diff()
    > pd.Timedelta('01:00:00')  # your threshold
).cumsum()

# Cumulative count per vehicle, taken from your code
def get_cumcount(part):
    return part.groupby('Vehicle').cumcount().add(1).values

# For each epoch, compute the count separately
df['result'] = None
for i in df['epoch'].unique():
    mask = df['epoch'] == i
    df.loc[mask, 'result'] = get_cumcount(df[mask])
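
The loop over epochs can probably be avoided, since grouping on both the epoch and the vehicle yields the same per-epoch count in one step. This is only a minimal sketch under some assumptions: the column names Time and Vehicle come from the example table, times are strings in HH:MM format, the threshold is one hour, and the helper name count_by_epoch is made up for illustration.

```python
import pandas as pd


def count_by_epoch(df, threshold="1h"):
    """Count passes per vehicle, restarting whenever two consecutive
    rows are more than `threshold` apart (hypothetical helper)."""
    out = df.copy()
    # Assumption: times are HH:MM strings on the same day, already sorted
    times = pd.to_datetime(out["Time"], format="%H:%M")
    # Epoch = number of over-threshold gaps between consecutive rows so far
    out["epoch"] = (times.diff() > pd.Timedelta(threshold)).cumsum()
    # Running count within each (epoch, vehicle) pair, starting at 1
    out["result"] = out.groupby(["epoch", "Vehicle"]).cumcount().add(1)
    return out


df = pd.DataFrame({
    "Vehicle": list("ABCCAABA"),
    "Time": ["00:15", "00:20", "00:25", "00:45",
             "00:59", "01:56", "22:55", "23:49"],
})
print(count_by_epoch(df)["result"].tolist())  # [1, 1, 1, 2, 2, 3, 1, 1]
```

Note that the gap here is still measured between neighbouring rows of the whole table, not between passes of the same vehicle, so it keeps the behaviour of the loop version.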

I also tried doing this with groupby and transform, but got errors.
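
One likely reason the epoch approach is "not perfect" is that it compares each row with the previous row of the whole table, whereas the question asks for the gap between consecutive passes of the same vehicle. A possible groupby-only variant is sketched below. It assumes the rows are sorted by time, the columns are named Vehicle and Time as in the example, times are HH:MM strings, and the threshold is one hour; count_passes, run and passes are hypothetical names.

```python
import pandas as pd


def count_passes(df, threshold="1h"):
    """Per-vehicle pass counter that resets when the gap since that
    vehicle's previous pass exceeds `threshold` (hypothetical helper)."""
    out = df.copy()
    out["_t"] = pd.to_datetime(out["Time"], format="%H:%M")
    # Gap since the previous pass of the same vehicle (NaT on its first pass)
    gap = out.groupby("Vehicle")["_t"].diff()
    # 1 where a new run starts, 0 otherwise; NaT compares as False
    resets = (gap > pd.Timedelta(threshold)).astype(int)
    # Run id per vehicle = number of resets that vehicle has had so far
    out["run"] = resets.groupby(out["Vehicle"]).cumsum()
    # Running count within each (vehicle, run) pair, starting at 1
    out["passes"] = out.groupby(["Vehicle", "run"]).cumcount().add(1)
    return out.drop(columns="_t")


df = pd.DataFrame({
    "Vehicle": list("ABCCAABA"),
    "Time": ["00:15", "00:20", "00:25", "00:45",
             "00:59", "01:56", "22:55", "23:49"],
})
print(count_passes(df)["passes"].tolist())  # [1, 1, 1, 2, 2, 3, 1, 1]
```

On the example data both approaches agree. They differ when another vehicle passes in between: for A at 00:15, B at 00:50 and A at 01:40, no row-to-row gap exceeds one hour, but A's own gap is 85 minutes, so this variant resets A's count while the epoch version does not.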

huangapple
  • Posted on May 13, 2023 at 15:52:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76241647.html