How to generate pandas dataframe timedelta column grouped by id and date (YYYY-MM-DD)?

Question

Suppose I have a dataframe with id and datetime columns:

    import pandas as pd

    df = pd.DataFrame({
        "id": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2", "a3", "a3", "a3", "a3"],
        "datetime": ["2016-01-01 00:01:00.156",
                     "2016-01-01 12:00:00.425",
                     "2016-01-02 00:59:00.123",
                     "2016-01-02 14:16:00.548",
                     "2016-01-01 12:00:00.147",
                     "2016-01-01 13:59:00.123",
                     "2016-01-02 08:01:00.147",
                     "2016-01-02 18:49:00.123",
                     "2016-02-01 12:00:00.147",
                     "2016-02-01 13:59:00.123",
                     "2016-02-02 08:01:00.147",
                     "2016-02-02 18:49:00.123"]})
    # Parse the strings into datetime64 values
    df["datetime"] = pd.to_datetime(df["datetime"])
    df

Here is the dataframe:

        id                datetime
    0   a1 2016-01-01 00:01:00.156
    1   a1 2016-01-01 12:00:00.425
    2   a1 2016-01-02 00:59:00.123
    3   a1 2016-01-02 14:16:00.548
    4   a2 2016-01-01 12:00:00.147
    5   a2 2016-01-01 13:59:00.123
    6   a2 2016-01-02 08:01:00.147
    7   a2 2016-01-02 18:49:00.123
    8   a3 2016-02-01 12:00:00.147
    9   a3 2016-02-01 13:59:00.123
    10  a3 2016-02-02 08:01:00.147
    11  a3 2016-02-02 18:49:00.123

I want to add a timedelta column that holds the time elapsed, in minutes, since a per-group baseline. This is the output I expect to get:

        id                datetime       datetime_baseline  timedelta
    0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156          0
    1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156        719
    2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123          0
    3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123        797
    4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147          0
    5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147        119
    6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147          0
    7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147        648
    8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147          0
    9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147        119
    10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147          0
    11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147        648

Here is how the timedelta values should be calculated: 1) identify the first datetime within the same id and date ('YYYY-MM-DD'), and 2) use it as the baseline (datetime_baseline) to compute the timedelta, in minutes, of every other datetime with the same id and date. For id='a1' and date='2016-01-01', datetime_baseline='2016-01-01 00:01:00.156'. At index=0, timedelta=0 because '2016-01-01 00:01:00.156' - datetime_baseline = 0. At index=1, timedelta=719 because '2016-01-01 12:00:00.425' - datetime_baseline = 719 (minutes). At index=2, the id is unchanged but the date is now '2016-01-02', so a new baseline applies: '2016-01-02 00:59:00.123', and timedelta = '2016-01-02 00:59:00.123' - datetime_baseline = 0. At index=3, timedelta = '2016-01-02 14:16:00.548' - datetime_baseline = 797.
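
The arithmetic for index=1 can be checked in isolation. This is only an illustrative sketch: the variable names are made up here, and rounding to the nearest whole minute is an assumption, since the raw difference is slightly more than 719 minutes.

```python
import pandas as pd

# Baseline and second timestamp for id='a1' on 2016-01-01
baseline = pd.Timestamp("2016-01-01 00:01:00.156")
later = pd.Timestamp("2016-01-01 12:00:00.425")

delta = later - baseline              # a pandas Timedelta (11 h 59 min 0.269 s)
minutes = delta.total_seconds() / 60  # about 719.0045
rounded = round(minutes)              # assumption: round to the nearest minute

print(rounded)  # 719
```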

Although I see how the timedelta values should be calculated (timedelta = datetime - datetime_baseline), I don't know how to identify the baseline values (i.e. how to generate the datetime_baseline column). Please let me know if you need any further explanation.

P.S. The actual dataframe has more than 500 thousand rows.

Answer 1

Score: 2

Use GroupBy.transform to build the baseline:

    # First datetime of each (id, calendar date) group, broadcast back to every row
    df["datetime_baseline"] = (df.groupby(["id", df["datetime"].dt.date])
                               ["datetime"].transform("first"))

Then use dt.total_seconds to compute the timedelta in minutes:

    # Difference in seconds -> minutes, rounded to the nearest integer
    df["timedelta"] = ((df["datetime"].sub(df["datetime_baseline"]))
                       .dt.total_seconds().div(60).round(0).astype(int))

Output:

    print(df)
        id                datetime       datetime_baseline  timedelta
    0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156          0
    1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156        719
    2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123          0
    3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123        797
    4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147          0
    5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147        119
    6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147          0
    7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147        648
    8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147          0
    9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147        119
    10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147          0
    11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147        648
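
Since the question mentions more than 500 thousand rows, one possible variation is to group on `dt.normalize()` instead of `dt.date`. `dt.date` yields Python `date` objects in an object-dtype column, while `dt.normalize()` floors each timestamp to midnight and stays `datetime64`. Whether this is noticeably faster on a given frame is an assumption that was not benchmarked here. The sketch uses only the a1 rows from the question, and the name `day` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a1", "a1", "a1", "a1"],
    "datetime": pd.to_datetime([
        "2016-01-01 00:01:00.156",
        "2016-01-01 12:00:00.425",
        "2016-01-02 00:59:00.123",
        "2016-01-02 14:16:00.548",
    ]),
})

# Midnight of each row's calendar day; dtype stays datetime64
day = df["datetime"].dt.normalize()

# Same grouping idea as above: first datetime per (id, day)
df["datetime_baseline"] = df.groupby(["id", day])["datetime"].transform("first")

# Minutes since the baseline, rounded to the nearest integer
df["timedelta"] = (
    (df["datetime"] - df["datetime_baseline"])
    .dt.total_seconds().div(60).round().astype(int)
)
print(df["timedelta"].tolist())  # [0, 719, 0, 797]
```

Note that "first" takes the first row of each group as stored, so it relies on the rows already being in chronological order within each id and date, as they are in the question's data.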

Answer 2

Score: 1

Try:

    import numpy as np

    # 'min' picks the earliest datetime per (id, date) group, regardless of row order
    df['datetime_baseline'] = df.groupby(['id', df['datetime'].dt.date])["datetime"].transform('min')
    df['timedelta'] = np.round((df['datetime'] - df['datetime_baseline']).dt.seconds / 60)
    print(df)

Prints:

        id                datetime       datetime_baseline  timedelta
    0   a1 2016-01-01 00:01:00.156 2016-01-01 00:01:00.156        0.0
    1   a1 2016-01-01 12:00:00.425 2016-01-01 00:01:00.156      719.0
    2   a1 2016-01-02 00:59:00.123 2016-01-02 00:59:00.123        0.0
    3   a1 2016-01-02 14:16:00.548 2016-01-02 00:59:00.123      797.0
    4   a2 2016-01-01 12:00:00.147 2016-01-01 12:00:00.147        0.0
    5   a2 2016-01-01 13:59:00.123 2016-01-01 12:00:00.147      119.0
    6   a2 2016-01-02 08:01:00.147 2016-01-02 08:01:00.147        0.0
    7   a2 2016-01-02 18:49:00.123 2016-01-02 08:01:00.147      648.0
    8   a3 2016-02-01 12:00:00.147 2016-02-01 12:00:00.147        0.0
    9   a3 2016-02-01 13:59:00.123 2016-02-01 12:00:00.147      119.0
    10  a3 2016-02-02 08:01:00.147 2016-02-02 08:01:00.147        0.0
    11  a3 2016-02-02 18:49:00.123 2016-02-02 08:01:00.147      648.0
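
One detail worth knowing when choosing between the two answers: `.dt.seconds` returns only the whole-seconds component of each difference (ignoring both the days part and any fraction of a second), while `.dt.total_seconds()` returns the full duration. With same-day groups the days part is always zero, so both give the same result on the question's data, but they can diverge near a rounding boundary. The sketch below uses two made-up durations to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical durations chosen to expose the difference
deltas = pd.Series(pd.to_timedelta(["0 days 00:00:30.5", "1 days 00:01:00"]))

# Whole-seconds component only: the fraction and the day are dropped
print(deltas.dt.seconds.tolist())          # [30, 60]

# Full duration in seconds, including days and fractions
print(deltas.dt.total_seconds().tolist())  # [30.5, 86460.0]

minutes_from_seconds = np.round(deltas.dt.seconds / 60)
minutes_from_total = np.round(deltas.dt.total_seconds() / 60)

print(minutes_from_seconds.tolist())  # [0.0, 1.0]
print(minutes_from_total.tolist())    # [1.0, 1441.0]
```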

huangapple
  • Posted on April 11, 2023, 01:14:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75979159.html