April 13, 2023, 17:04:40

How to insert and fill the rows with calculated value in pandas?

Question

I have a pandas DataFrame with missing theta steps, as below:

index  name theta r
1      wind 0     10
2      wind 30    17
3      wind 60    19
4      wind 90    14
5      wind 120   17
6      wind 210   18
7      wind 240   17
8      wind 270   11
9      wind 300   13
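
For anyone who wants to reproduce this, here is a minimal sketch that builds the frame above and lists the theta steps it lacks. The variable names `df` and `missing` are my own choices, and the 0 to 330 grid with a step of 30 is assumed from the expected output.

```python
import pandas as pd

# Rebuild the sample frame; the "index" column is used as the row index.
df = pd.DataFrame(
    {
        "name": ["wind"] * 9,
        "theta": [0, 30, 60, 90, 120, 210, 240, 270, 300],
        "r": [10, 17, 19, 14, 17, 18, 17, 11, 13],
    },
    index=pd.RangeIndex(1, 10, name="index"),
)

# Theta steps that a full 0..330 grid (step 30) contains but the frame lacks.
missing = sorted(set(range(0, 360, 30)) - set(df["theta"]))
print(missing)
```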

I need to add rows for the missing theta values,

index  name theta r
1      wind 0     10
2      wind 30    17
3      wind 60    19
4      wind 90    14
5      wind 120   17
6      wind 150   null
7      wind 180   null
8      wind 210   18
9      wind 240   17
10     wind 270   11
11     wind 300   13
12     wind 330   null

and then fill the null values by linear interpolation. For simplicity, we can take the average of the previous and next available values:

index  name theta r
1      wind 0     10
2      wind 30    17
3      wind 60    19
4      wind 90    14
5      wind 120   17
6      wind 150   17.5 #(17 + 18)/2
7      wind 180   17.5 #(17 + 18)/2
8      wind 210   18
9      wind 240   17
10     wind 270   11
11     wind 300   13
12     wind 330   11.5 #(13 + 10)/2

How can I do this?

Answer 1

Score: 3

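Reindex on the full range of theta values to insert the missing rows, then use interpolate to fill r and ffill to carry name into the new rows: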

out = (
 df.set_index('theta').reindex(range(0, 330+1, 30))
   .interpolate().ffill().reset_index()[df.columns]
)

Output:

    name  theta          r
0   wind      0  10.000000
1   wind     30  17.000000
2   wind     60  19.000000
3   wind     90  14.000000
4   wind    120  17.000000
5   wind    150  17.333333
6   wind    180  17.666667
7   wind    210  18.000000
8   wind    240  17.000000
9   wind    270  11.000000
10  wind    300  13.000000
11  wind    330  13.000000

For a circular interpolation, fill only the inner values with limit_area='inside', then use fillna with the mean of the first and last valid rows:

out = (
 df.set_index('theta').reindex(range(0, 330+1, 30))
   .interpolate(method='linear', limit_area='inside')
   .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
   .ffill().reset_index()[df.columns]
)

Output:

    name  theta          r
0   wind      0  10.000000
1   wind     30  17.000000
2   wind     60  19.000000
3   wind     90  14.000000
4   wind    120  17.000000
5   wind    150  17.333333
6   wind    180  17.666667
7   wind    210  18.000000
8   wind    240  17.000000
9   wind    270  11.000000
10  wind    300  13.000000
11  wind    330  11.500000
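
As a cross-check on the circular result, here is a sketch using a different technique from the one above: NumPy's `np.interp` with its `period` argument, which treats theta as an angle that wraps at 360. This is my substitution, not part of the answer's pandas chain, and the sample frame is rebuilt inline so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["wind"] * 9,
    "theta": [0, 30, 60, 90, 120, 210, 240, 270, 300],
    "r": [10, 17, 19, 14, 17, 18, 17, 11, 13],
})

grid = np.arange(0, 360, 30)  # full set of theta steps
out = pd.DataFrame({
    "name": "wind",
    "theta": grid,
    # period=360 makes theta wrap around, so 330 is interpolated
    # between 300 and 360 (which is the same angle as 0).
    "r": np.interp(grid, df["theta"], df["r"], period=360),
})
print(out)
```

On this data it should agree with the table above (17.33 and 17.67 for the inner gap, 11.5 at 330), and it weights by the actual angular distance rather than by row position.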

If you really want the same value for consecutive missing rows, another option is to compute the mean yourself (with ffill/bfill):

tmp = df.set_index('theta').reindex(range(0, 330+1, 30))
tmp2 = tmp.ffill()
out = ((tmp2+tmp.bfill().fillna(df.iloc[0]))
       .select_dtypes('number').div(2)
       .combine_first(tmp2).reset_index()[df.columns]
      )

Output:

    name  theta     r
0   wind      0  10.0
1   wind     30  17.0
2   wind     60  19.0
3   wind     90  14.0
4   wind    120  17.0
5   wind    150  17.5  # same values
6   wind    180  17.5  #
7   wind    210  18.0
8   wind    240  17.0
9   wind    270  11.0
10  wind    300  13.0
11  wind    330  11.5

NB. these approaches should work with any number of numeric columns (not just 'r').
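
The ffill/bfill idea can also be wrapped in a small helper that works on a single Series. This is only a sketch: `neighbour_mean` is a hypothetical name of mine, and it assumes the Series is already reindexed on the full theta grid and has at least one valid value.

```python
import pandas as pd

def neighbour_mean(s: pd.Series) -> pd.Series:
    """Fill each NaN with the mean of the nearest valid values on either side, wrapping around."""
    valid = s.dropna()
    prev = s.ffill().fillna(valid.iloc[-1])  # leading gaps wrap to the last valid value
    nxt = s.bfill().fillna(valid.iloc[0])    # trailing gaps wrap to the first valid value
    return s.fillna((prev + nxt) / 2)

r = pd.Series(
    [10, 17, 19, 14, 17, 18, 17, 11, 13],
    index=pd.Index([0, 30, 60, 90, 120, 210, 240, 270, 300], name="theta"),
    name="r",
)
filled = neighbour_mean(r.reindex(range(0, 360, 30)))
print(filled)
```

Because both ends wrap, the same helper should also cope with a missing first row, which the `df.iloc[0]` version above does not handle.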

Working with groups

One simple approach is to use a function and groupby.apply:

def interp(df):
    return (
        df.set_index('theta').reindex(range(0, 330+1, 30))
          .interpolate(method='linear', limit_area='inside')
          .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
          .ffill().reset_index()[df.columns]
    )

out = df.groupby('name', group_keys=False).apply(interp)

Or, first pivot your data:

out = (
 df.pivot(index='theta', columns='name')
   .reindex(range(0, 330+1, 30))
   .interpolate(method='linear', limit_area='inside')
   .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
   .ffill().stack().reset_index()[df.columns]
)

Example output (# shows the initially missing values):

    name  theta           r
0   turb      0  100.000000
1   turb     30  170.000000
2   turb     60  190.000000
3   turb     90  140.000000
4   turb    120  170.000000
5   turb    150  173.333333  #
6   turb    180  176.666667  #
7   turb    210  180.000000
8   turb    240  170.000000
9   turb    270  110.000000
10  turb    300  130.000000
11  turb    330  115.000000  #
0   wind      0   10.000000
1   wind     30   17.000000
2   wind     60   19.000000
3   wind     90   14.000000
4   wind    120   17.000000
5   wind    150   17.333333  #
6   wind    180   17.666667  #
7   wind    210   18.000000
8   wind    240   17.000000
9   wind    270   11.000000
10  wind    300   13.000000
11  wind    330   11.500000  #
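
The two-group input is not shown, so here is a sketch that assumes "turb" is simply ten times "wind" (which is what the output above suggests) and fills each group on the full grid. It uses `np.interp` with `period=360` per group, which is my substitution rather than the groupby.apply or pivot chains above.

```python
import numpy as np
import pandas as pd

theta = [0, 30, 60, 90, 120, 210, 240, 270, 300]
r = [10, 17, 19, 14, 17, 18, 17, 11, 13]
# Assumed input: "turb" is taken to be ten times "wind".
df = pd.concat([
    pd.DataFrame({"name": "wind", "theta": theta, "r": r}),
    pd.DataFrame({"name": "turb", "theta": theta, "r": [10 * v for v in r]}),
], ignore_index=True)

grid = np.arange(0, 360, 30)
out = pd.concat(
    [
        pd.DataFrame({
            "name": name,
            "theta": grid,
            # Interpolate this group's r on the full grid, wrapping at 360.
            "r": np.interp(grid, g["theta"], g["r"], period=360),
        })
        for name, g in df.groupby("name")
    ],
    ignore_index=True,
)
print(out)
```

Each group is interpolated only from its own points, so the wrap-around value at 330 uses that group's first and last rows.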

With bfill/ffill:

import pandas as pd

tmp = (df.set_index(['name', 'theta'])
         .reindex(pd.MultiIndex.from_product([df['name'].unique(), range(0, 330+1, 30)],
                                             names=['name', 'theta']
                                            ))
      )
grp = tmp.groupby(level='name')
tmp2 = grp.ffill()
# wrap around within each group: trailing gaps take that group's first value
# (df.iloc[0] would apply the first group's value to every group)
out = ((tmp2+grp.bfill().fillna(grp.transform('first')))
       .select_dtypes('number').div(2)
       .combine_first(tmp2).reset_index()[df.columns]
      )

Answer 2

Score: 2

If the name column holds a single value, you can use DataFrame.reindex with a range, then average the forward-filled and back-filled values, replacing the trailing missing values in s2 with the first value of s1:

df1 = df.set_index('theta').reindex(range(0, 360, 30))
s1 = df1['r'].ffill()
s2 = df1['r'].bfill().fillna(s1.iat[0])
df = s1.add(s2).div(2).reset_index().assign(name='wind')[df.columns]
print(df)
    name  theta     r
0   wind      0  10.0
1   wind     30  17.0
2   wind     60  19.0
3   wind     90  14.0
4   wind    120  17.0
5   wind    150  17.5
6   wind    180  17.5
7   wind    210  18.0
8   wind    240  17.0
9   wind    270  11.0
10  wind    300  13.0
11  wind    330  11.5

A solution using DataFrame.interpolate, with a helper row appended that holds the first (back-filled) value of r so the last gap wraps around:

import pandas as pd

df1 = df.set_index('theta').reindex(range(0, 360, 30))
df = (pd.concat([df1, df1[['r']].bfill().iloc[[0]]])
        .interpolate().reset_index().iloc[:-1].assign(name='wind')[df.columns])
print(df)
    name  theta          r
0   wind      0  10.000000
1   wind     30  17.000000
2   wind     60  19.000000
3   wind     90  14.000000
4   wind    120  17.000000
5   wind    150  17.333333
6   wind    180  17.666667
7   wind    210  18.000000
8   wind    240  17.000000
9   wind    270  11.000000
10  wind    300  13.000000
11  wind    330  11.500000

If the first row may also be missing:

print (df)
   name  theta   r
2  wind     30  17
3  wind     60  19
4  wind     90  14
5  wind    120  17
6  wind    210  18
7  wind    240  17
8  wind    270  11
9  wind    300  13
df1 = df.set_index('theta').reindex(range(0, 360, 30))
df = (pd.concat([df1[['r']].ffill().iloc[[-1]],
                 df1,
                 df1[['r']].bfill().iloc[[0]]])
        .interpolate().reset_index().iloc[1:-1].assign(name='wind')[df.columns])
print(df)
    name  theta          r
1   wind      0  15.000000
2   wind     30  17.000000
3   wind     60  19.000000
4   wind     90  14.000000
5   wind    120  17.000000
6   wind    150  17.333333
7   wind    180  17.666667
8   wind    210  18.000000
9   wind    240  17.000000
10  wind    270  11.000000
11  wind    300  13.000000
12  wind    330  15.000000
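
One caveat worth checking: the helper-row approach gives 15.0 at both 0 and 330 because pandas' default linear method treats rows as equally spaced, and each helper row sits one row from the edge. If the two edge values should instead lie on a single line running from 300 around to 30, a sketch with `np.interp` and `period=360` (my substitution, not this answer's method) would be:

```python
import numpy as np

# Known points when both theta = 0 and theta = 330 are missing.
theta = np.array([30, 60, 90, 120, 210, 240, 270, 300])
r = np.array([17, 19, 14, 17, 18, 17, 11, 13], dtype=float)

grid = np.arange(0, 360, 30)
# period=360 wraps the axis, so 330 and 0 both fall on the segment
# running from theta=300 to theta=390 (the same angle as 30).
filled = np.interp(grid, theta, r, period=360)
print(dict(zip(grid.tolist(), filled.round(3).tolist())))
```

Here 330 comes out at about 14.33 and 0 at about 15.67, since they sit one third and two thirds of the way along the 90-degree gap. Which of the two results is wanted depends on whether "linear" is meant by row or by angle.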
