Posted: 2023-02-06 19:59:26

How to interpolate missing years within pd.groupby()

Question

Problem:

I have a dataframe that contains entries at 5-year intervals. I need to group the entries by the 'id' column and interpolate the values between the first and last item in each group. I understand that it has to be some combination of groupby(), set_index() and interpolate(), but I am unable to make it work for the whole input dataframe.

Sample df:

import pandas as pd
data = {
    'id': ['a', 'b', 'a', 'b'],
    'year': [2005, 2005, 2010, 2010],
    'val': [0, 0, 100, 100],
}
df = pd.DataFrame.from_dict(data)

Example input df:

_    id  year  val
0     a  2005    0
1     a  2010  100
2     b  2005    0
3     b  2010  100

Expected output df:

_     id  year  val type
0      a  2005    0 original
1      a  2006   20 interpolated
2      a  2007   40 interpolated
3      a  2008   60 interpolated
4      a  2009   80 interpolated
5      a  2010  100 original
6      b  2005    0 original
7      b  2006   20 interpolated
8      b  2007   40 interpolated
9      b  2008   60 interpolated
10     b  2009   80 interpolated
11     b  2010  100 original

The 'type' column is not necessary; it is just for illustration purposes.

Question:

How can I add missing years to the groupby() view and interpolate() their corresponding values?

Thank you!

Answer 1

Score: 1

A solution that creates the missing years between the minimum and maximum year of each group independently:

First create the missing rows per group with Series.reindex, from the group's minimum to its maximum year, then fill them with Series.interpolate. Finally, merge with the original DataFrame to mark in a new column which rows come from the original data:

import numpy as np

df = (df.set_index('year')
        .groupby('id')['val']
        # add each group's missing years as NaN rows, then fill them linearly
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1)).interpolate())
        .reset_index()
        # rows that also exist in the original df are flagged 'both' in '_merge'
        .merge(df, how='left', indicator=True)
        .assign(type=lambda x: np.where(x.pop('_merge').eq('both'),
                                        'original',
                                        'interpolated')))
print(df)
   id  year    val          type
0   a  2005    0.0      original
1   a  2006   20.0  interpolated
2   a  2007   40.0  interpolated
3   a  2008   60.0  interpolated
4   a  2009   80.0  interpolated
5   a  2010  100.0      original
6   b  2005    0.0      original
7   b  2006   20.0  interpolated
8   b  2007   40.0  interpolated
9   b  2008   60.0  interpolated
10  b  2009   80.0  interpolated
11  b  2010  100.0      original
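
The same reindex-and-interpolate idea can also be wrapped in a function. The following is only a minimal sketch: the helper name `interpolate_years` and the sample data (two groups covering different year ranges) are illustrative assumptions, not part of the answer above. It restricts the merge to the 'id' and 'year' keys, so the flag does not depend on comparing the interpolated float 'val' column with the original integer values, and it assumes each (id, year) pair occurs at most once.

```python
import numpy as np
import pandas as pd


def interpolate_years(df):
    """Fill in missing years per id and linearly interpolate 'val' (hypothetical helper)."""

    def fill_group(s):
        # Build this group's full year range; naming the index keeps the 'year' label.
        years = pd.RangeIndex(s.index.min(), s.index.max() + 1, name="year")
        # Missing years become NaN rows, which interpolate() fills linearly.
        return s.reindex(years).interpolate()

    out = (df.set_index("year")
             .groupby("id")["val"]
             .apply(fill_group)
             .reset_index())

    # Flag rows whose (id, year) key is present in the original frame.
    merged = out.merge(df[["id", "year"]], on=["id", "year"],
                       how="left", indicator=True)
    out["type"] = np.where(merged["_merge"] == "both", "original", "interpolated")
    return out


# Illustrative data: group 'a' spans 2005-2010, group 'b' spans 2000-2002.
sample = pd.DataFrame({
    "id": ["a", "b", "a", "b"],
    "year": [2005, 2000, 2010, 2002],
    "val": [0, 10, 100, 30],
})
result = interpolate_years(sample)
print(result)
```

Because each group is reindexed over its own range, group 'b' here should only gain the single year 2001, while group 'a' gains 2006 through 2009.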

Answer 2

Score: 1

Temporarily reshape with pivot, add the missing years with reindex and interpolate, then reshape back with unstack:

out = (df
   .pivot(index='year', columns='id', values='val')
   .reindex(range(df['year'].min(), df['year'].max()+1))
   .interpolate('index')
   .unstack(-1).reset_index(name='val')
)

Output:

   id  year    val
0   a  2005    0.0
1   a  2006   20.0
2   a  2007   40.0
3   a  2008   60.0
4   a  2009   80.0
5   a  2010  100.0
6   b  2005    0.0
7   b  2006   20.0
8   b  2007   40.0
9   b  2008   60.0
10  b  2009   80.0
11  b  2010  100.0
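
Note that the pivot-based version reindexes every id over the overall minimum and maximum year, which matches the sample data because both groups span 2005 to 2010. If the groups can cover different ranges, one possible adjustment is sketched below. It is an assumption, not something stated in the answer: pass limit_area='inside' to interpolate so that values are only filled between existing observations, then drop the rows that are still NaN. The sample data is illustrative.

```python
import pandas as pd

# Illustrative data: group 'a' spans 2005-2010, group 'b' spans 2000-2002.
df = pd.DataFrame({
    "id": ["a", "b", "a", "b"],
    "year": [2005, 2000, 2010, 2002],
    "val": [0, 10, 100, 30],
})

# Full year range across all groups; the name keeps the 'year' label after reindexing.
all_years = pd.RangeIndex(df["year"].min(), df["year"].max() + 1, name="year")

result = (df
    .pivot(index="year", columns="id", values="val")
    .reindex(all_years)
    # fill only gaps that lie between two known values of the same id
    .interpolate(method="index", limit_area="inside")
    .unstack()
    # years outside an id's own range are still NaN and are removed here
    .dropna()
    .reset_index(name="val")
)
print(result)
```

With this data, 'a' should keep only 2005 through 2010 and 'b' only 2000 through 2002, rather than both being stretched over 2000 through 2010.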
