Compute difference based on matching observations over time

Question

Suppose we have the following dataframe:

              Date  Type  Country  Value
     0  2016-04-30     A       NL      1
     1  2016-04-30     A       BE      2
     2  2016-04-30     B       NL      3
     3  2016-04-30     B       BE      4
     4  2016-04-30     C       NL      5
     5  2016-04-30     C       BE      6
     6  2016-04-30     C       FR      7
     7  2016-04-30     C       UK      8
     8  2016-05-31     A       NL      9
     9  2016-05-31     A       BE     10
    10  2016-05-31     A       FR     11
    11  2016-05-31     B       NL     12
    12  2016-05-31     B       BE     13
    13  2016-05-31     B       FR     14
    14  2016-05-31     C       NL     15
    15  2016-05-31     C       BE     16
    16  2016-05-31     C       UK     17
    17  2016-05-31     C       SL     18
    18  2016-06-30     A       NL     19
    19  2016-06-30     B       FR     20
    20  2016-06-30     B       UK     21
    21  2016-06-30     B       SL     22
    22  2016-06-30     C       NL     23
    23  2016-06-30     C       BE     24

It can be created with the following code:

    import pandas as pd

    # One row per (Date, Type, Country) observation
    df = pd.DataFrame(
        [['2016-04-30', 'A', 'NL', 1], ['2016-04-30', 'A', 'BE', 2],
         ['2016-04-30', 'B', 'NL', 3], ['2016-04-30', 'B', 'BE', 4],
         ['2016-04-30', 'C', 'NL', 5], ['2016-04-30', 'C', 'BE', 6],
         ['2016-04-30', 'C', 'FR', 7], ['2016-04-30', 'C', 'UK', 8],
         ['2016-05-31', 'A', 'NL', 9], ['2016-05-31', 'A', 'BE', 10],
         ['2016-05-31', 'A', 'FR', 11], ['2016-05-31', 'B', 'NL', 12],
         ['2016-05-31', 'B', 'BE', 13], ['2016-05-31', 'B', 'FR', 14],
         ['2016-05-31', 'C', 'NL', 15], ['2016-05-31', 'C', 'BE', 16],
         ['2016-05-31', 'C', 'UK', 17], ['2016-05-31', 'C', 'SL', 18],
         ['2016-06-30', 'A', 'NL', 19], ['2016-06-30', 'B', 'FR', 20],
         ['2016-06-30', 'B', 'UK', 21], ['2016-06-30', 'B', 'SL', 22],
         ['2016-06-30', 'C', 'NL', 23], ['2016-06-30', 'C', 'BE', 24]],
        columns=['Date', 'Type', 'Country', 'Value'])

I want to add an extra column 'ValueDiff' that holds the difference between each observation and the observation with the same Type and Country in the previous period. For example, for the observation 'Date: 2016-05-31, Type: B, Country: BE', 'ValueDiff' should be 13 - 4 = 9. If the observation is not available in the previous period, 'ValueDiff' should be NaN.

Expected df:

              Date  Type  Country  Value  ValueDiff
     0  2016-04-30     A       NL      1        NaN
     1  2016-04-30     A       BE      2        NaN
     2  2016-04-30     B       NL      3        NaN
     3  2016-04-30     B       BE      4        NaN
     4  2016-04-30     C       NL      5        NaN
     5  2016-04-30     C       BE      6        NaN
     6  2016-04-30     C       FR      7        NaN
     7  2016-04-30     C       UK      8        NaN
     8  2016-05-31     A       NL      9          8
     9  2016-05-31     A       BE     10          8
    10  2016-05-31     A       FR     11        NaN
    11  2016-05-31     B       NL     12          9
    12  2016-05-31     B       BE     13          9
    13  2016-05-31     B       FR     14        NaN
    14  2016-05-31     C       NL     15         10
    15  2016-05-31     C       BE     16         10
    16  2016-05-31     C       UK     17          9
    17  2016-05-31     C       SL     18        NaN
    18  2016-06-30     A       NL     19         10
    19  2016-06-30     B       FR     20          6
    20  2016-06-30     B       UK     21        NaN
    21  2016-06-30     B       SL     22        NaN
    22  2016-06-30     C       NL     23          8
    23  2016-06-30     C       BE     24          8

Is there an efficient way to do this?

Answer 1

Score: 2

If each Type and Country pair occurs at most once per Date, and the rows are sorted by Date, you can use DataFrameGroupBy.diff:

    df['ValueDiff'] = df.groupby(['Type', 'Country'])['Value'].diff()
    print(df)

              Date  Type  Country  Value  ValueDiff
     0  2016-04-30     A       NL      1        NaN
     1  2016-04-30     A       BE      2        NaN
     2  2016-04-30     B       NL      3        NaN
     3  2016-04-30     B       BE      4        NaN
     4  2016-04-30     C       NL      5        NaN
     5  2016-04-30     C       BE      6        NaN
     6  2016-04-30     C       FR      7        NaN
     7  2016-04-30     C       UK      8        NaN
     8  2016-05-31     A       NL      9        8.0
     9  2016-05-31     A       BE     10        8.0
    10  2016-05-31     A       FR     11        NaN
    11  2016-05-31     B       NL     12        9.0
    12  2016-05-31     B       BE     13        9.0
    13  2016-05-31     B       FR     14        NaN
    14  2016-05-31     C       NL     15       10.0
    15  2016-05-31     C       BE     16       10.0
    16  2016-05-31     C       UK     17        9.0
    17  2016-05-31     C       SL     18        NaN
    18  2016-06-30     A       NL     19       10.0
    19  2016-06-30     B       FR     20        6.0
    20  2016-06-30     B       UK     21        NaN
    21  2016-06-30     B       SL     22        NaN
    22  2016-06-30     C       NL     23        8.0
    23  2016-06-30     C       BE     24        8.0
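
One caveat worth sketching: a grouped diff subtracts the previous row of the same Type and Country group, whichever period that row belongs to. The sample data has no pair that disappears and later comes back, so the result matches the expected output, but with such a gap the diff would span two periods instead of giving NaN. The snippet below is a minimal sketch of a stricter variant, not part of the original answer. The helper name `diff_vs_previous_period`, the temporary columns `_period` and `_prev`, and the `demo` frame are all made up for illustration, and it assumes that "previous period" means the preceding distinct Date present in the data.

```python
import pandas as pd


def diff_vs_previous_period(df, keys=("Type", "Country"),
                            date_col="Date", value_col="Value"):
    """Difference from the same keys in the immediately preceding period.

    Hypothetical helper for illustration. Assumes "previous period" is the
    preceding distinct date in the data and that each key combination
    appears at most once per date.
    """
    keys = list(keys)
    out = df.copy()
    # Number the distinct dates 0, 1, 2, ... in chronological order.
    out["_period"] = pd.factorize(pd.to_datetime(out[date_col]), sort=True)[0]

    # Move every observation one period forward so it lines up with its successor.
    prev = out[keys + ["_period", value_col]].rename(columns={value_col: "_prev"})
    prev["_period"] += 1

    # A left merge keeps the row order of `out`; validate raises on duplicate keys.
    merged = out.merge(prev, on=keys + ["_period"], how="left",
                       validate="one_to_one")
    out["ValueDiff"] = (merged[value_col] - merged["_prev"]).to_numpy()
    return out.drop(columns="_period")


# Made-up data: A/NL is missing on 2016-05-31 and reappears on 2016-06-30.
demo = pd.DataFrame(
    [["2016-04-30", "A", "NL", 1],
     ["2016-05-31", "A", "BE", 10],
     ["2016-06-30", "A", "NL", 19],
     ["2016-06-30", "A", "BE", 24]],
    columns=["Date", "Type", "Country", "Value"],
)
result = diff_vs_previous_period(demo)
print(result)
```

In this made-up frame the grouped diff would report 19 - 1 = 18 for the last A/NL row, while the sketch leaves it as NaN because A/NL has no observation on 2016-05-31. If the data is known to have no gaps, the one-line groupby version is simpler and should be preferred.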

huangapple
  • Posted on January 6, 2020, 17:14:22
  • Please keep the link to this article when reposting: https://go.coder-hub.com/59609376.html