How to interpolate missing years within `pd.groupby()`

Question

Problem:

I have a dataframe whose entries are spaced at 5-year intervals. I need to group the entries by the 'id' column and interpolate values between the first and last item in each group. I understand that it has to be some combination of groupby(), set_index() and interpolate(), but I am unable to make it work for the whole input dataframe.

Sample df:

    import pandas as pd

    data = {
        'id': ['a', 'b', 'a', 'b'],
        'year': [2005, 2005, 2010, 2010],
        'val': [0, 0, 100, 100],
    }
    df = pd.DataFrame.from_dict(data)

Example input df:

      id  year  val
    0  a  2005    0
    1  a  2010  100
    2  b  2005    0
    3  b  2010  100

Expected output df:

       id  year  val          type
    0   a  2005    0      original
    1   a  2006   20  interpolated
    2   a  2007   40  interpolated
    3   a  2008   60  interpolated
    4   a  2009   80  interpolated
    5   a  2010  100      original
    6   b  2005    0      original
    7   b  2006   20  interpolated
    8   b  2007   40  interpolated
    9   b  2008   60  interpolated
    10  b  2009   80  interpolated
    11  b  2010  100      original

The 'type' column is not necessary; it is only there for illustration.

Question:

How can I add missing years to the groupby() view and interpolate() their corresponding values?

Thank you!

Answer 1

Score: 1

This solution fills in the years between the minimum and maximum year of each group independently:

First, add the missing years to each group with Series.reindex, using the group's minimum and maximum year. Then fill the resulting NaN values with Series.interpolate. Finally, merge with the original DataFrame to mark in a new column which rows were already present:

    import numpy as np

    # Index by year so each group can be reindexed over its own year range.
    df = (df.set_index('year')
            .groupby('id')['val']
            # Add the missing years as NaN, then fill them by linear interpolation.
            .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1)).interpolate())
            .reset_index()
            # Rows that also exist in the original df are flagged 'both'.
            .merge(df, how='left', indicator=True)
            .assign(type=lambda x: np.where(x.pop('_merge').eq('both'),
                                            'original',
                                            'interpolated')))
    print(df)

Output:

       id  year    val          type
    0   a  2005    0.0      original
    1   a  2006   20.0  interpolated
    2   a  2007   40.0  interpolated
    3   a  2008   60.0  interpolated
    4   a  2009   80.0  interpolated
    5   a  2010  100.0      original
    6   b  2005    0.0      original
    7   b  2006   20.0  interpolated
    8   b  2007   40.0  interpolated
    9   b  2008   60.0  interpolated
    10  b  2009   80.0  interpolated
    11  b  2010  100.0      original
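
To see what the lambda does inside each group, it may help to run the two steps on a single group in isolation. The sketch below is only an illustration: it builds one Series by hand with the same numbers as group 'a', and the names s, full and filled are made up for the example.

```python
import pandas as pd

# One group's values indexed by year (same numbers as group 'a' above).
s = pd.Series([0, 100], index=pd.Index([2005, 2010], name="year"), name="val")

# Step 1: reindex over every year in the group's span; new years become NaN.
full = s.reindex(range(s.index.min(), s.index.max() + 1))

# Step 2: linear interpolation fills the NaN rows between the known values.
filled = full.interpolate()

print(full.isna().sum())   # number of years that were added
print(filled.tolist())
```

Because the reindex range is computed from the group's own index, each group is extended only between its own first and last year.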

Answer 2

Score: 1

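Temporarily reshape with pivot, use reindex and interpolate to add the missing years, then unstack back to long format. Note that this reindexes over the overall minimum and maximum year rather than each group's own range: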

    out = (df
       .pivot(index='year', columns='id', values='val')
       .reindex(range(df['year'].min(), df['year'].max()+1))
       .interpolate('index')
       .unstack(-1).reset_index(name='val')
    )

Output:

       id  year    val
    0   a  2005    0.0
    1   a  2006   20.0
    2   a  2007   40.0
    3   a  2008   60.0
    4   a  2009   80.0
    5   a  2010  100.0
    6   b  2005    0.0
    7   b  2006   20.0
    8   b  2007   40.0
    9   b  2008   60.0
    10  b  2009   80.0
    11  b  2010  100.0
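
If the groups do not all cover the same years, the pivot approach reindexes every group over the same overall year range. The following is a rough sketch of one way to keep each group within its own first and last year. It is not part of the original answer: the sample data is hypothetical, and it assumes the limit_area="inside" option of interpolate followed by dropna is acceptable for your data.

```python
import pandas as pd

# Hypothetical data: group 'b' ends in 2007, earlier than group 'a'.
df = pd.DataFrame({
    "id": ["a", "a", "b", "b"],
    "year": [2005, 2010, 2005, 2007],
    "val": [0, 100, 0, 10],
})

years = range(df["year"].min(), df["year"].max() + 1)

out = (
    df.pivot(index="year", columns="id", values="val")
      .reindex(years)
      # limit_area="inside" fills only gaps between known values,
      # so years after a group's last observation stay NaN.
      .interpolate("index", limit_area="inside")
      .unstack()
      # Drop the years that lie outside a group's own range.
      .dropna()
      .reset_index(name="val")
)
print(out)
```

With this variant, group 'b' should end at 2007 instead of being extended to 2010. Like the pivot call in the answer, it requires each (id, year) pair to be unique.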

huangapple
  • Posted on February 6, 2023, 19:59:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75361034.html