如何在 pandas 中插入并填充带有计算值的行?

huangapple go评论75阅读模式
英文:

How to insert and fill the rows with calculated value in pandas?

问题

我有一个带有缺失theta步骤的pandas数据帧,如下所示,

我需要添加缺失的theta值,然后用线性插值填充空值。为了简单起见,我们可以考虑前一个和后一个可用值的平均值。如何做到这一点?

英文:

I have a pandas dataframe with missing theta steps as below,

index  name theta r
1      wind 0     10
2      wind 30    17
3      wind 60    19
4      wind 90    14
5      wind 120   17
6      wind 210   18
7      wind 240   17
8      wind 270   11
9      wind 300   13

I need to add the missing theta with values,

index  name theta r
1      wind 0     10
2      wind 30    17
3      wind 60    19
4      wind 90    14
5      wind 120   17
6      wind 150   null
7      wind 180   null
8      wind 210   18
9      wind 240   17
10     wind 270   11
11     wind 300   13
12     wind 330   null

And then fill the null values with linear interpolation. For simplicity here we can consider average of previous and next available value,

index  name theta r
1      wind 0     10
2      wind 30    17
3      wind 60    19
4      wind 90    14
5      wind 120   17
6      wind 150   17.5 #(17 + 18)/2
7      wind 180   17.5 #(17 + 18)/2
8      wind 210   18
9      wind 240   17
10     wind 270   11
11     wind 300   13
12     wind 330   11.5 #(13 + 10)/2

How can I do this?

答案1

得分: 3

你可以使用interpolateffill来执行插值操作:

out = (
 df.set_index('theta').reindex(range(0, 330+1, 30))
   .interpolate().ffill().reset_index()[df.columns]
)

输出:

    name  theta          r
0   wind      0  10.000000
1   wind     30  17.000000
2   wind     60  19.000000
3   wind     90  14.000000
4   wind    120  17.000000
5   wind    150  17.333333
6   wind    180  17.666667
7   wind    210  18.000000
8   wind    240  17.000000
9   wind    270  11.000000
10  wind    300  13.000000
11  wind    330  13.000000

执行圆形插值时,你可以仅使用limit_area='inside'填充内部值,然后使用fillna填充第一行和最后一行的均值:

out = (
 df.set_index('theta').reindex(range(0, 330+1, 30))
   .interpolate(method='linear', limit_area='inside')
   .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
   .ffill().reset_index()[df.columns]
)

输出:

    name  theta          r
0   wind      0  10.000000
1   wind     30  17.000000
2   wind     60  19.000000
3   wind     90  14.000000
4   wind    120  17.000000
5   wind    150  17.333333
6   wind    180  17.666667
7   wind    210  18.000000
8   wind    240  17.000000
9   wind    270  11.000000
10  wind    300  13.000000
11  wind    330  11.500000

如果你希望多个中间值具有相同的值,另一种选择是自己计算均值(使用ffill/bfill):

tmp = df.set_index('theta').reindex(range(0, 330+1, 30))
tmp2 = tmp.ffill()

out = ((tmp2+tmp.bfill().fillna(df.iloc[0]))
       .select_dtypes('number').div(2)
       .combine_first(tmp2).reset_index()[df.columns]
      )

输出:

    name  theta     r
0   wind      0  10.0
1   wind     30  17.0
2   wind     60  19.0
3   wind     90  14.0
4   wind    120  17.0
5   wind    150  17.5  # 相同的值
6   wind    180  17.5  #
7   wind    210  18.0
8   wind    240  17.0
9   wind    270  11.0
10  wind    300  13.0
11  wind    330  11.5

这些方法适用于任何数量的数值列(不仅仅是'r'列)。

英文:

You can use interpolate and ffill:

out = (
 df.set_index('theta').reindex(range(0, 330+1, 30))
   .interpolate().ffill().reset_index()[df.columns]
)

Output:

    name  theta          r
0   wind      0  10.000000
1   wind     30  17.000000
2   wind     60  19.000000
3   wind     90  14.000000
4   wind    120  17.000000
5   wind    150  17.333333
6   wind    180  17.666667
7   wind    210  18.000000
8   wind    240  17.000000
9   wind    270  11.000000
10  wind    300  13.000000
11  wind    330  13.000000

Performing a circular interpolation, you can only fill the inner values with limit_area='inside', then fillna with the mean of the first and last valid rows:

out = (
 df.set_index('theta').reindex(range(0, 330+1, 30))
   .interpolate(method='linear', limit_area='inside')
   .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
   .ffill().reset_index()[df.columns]
)

Output:

    name  theta          r
0   wind      0  10.000000
1   wind     30  17.000000
2   wind     60  19.000000
3   wind     90  14.000000
4   wind    120  17.000000
5   wind    150  17.333333
6   wind    180  17.666667
7   wind    210  18.000000
8   wind    240  17.000000
9   wind    270  11.000000
10  wind    300  13.000000
11  wind    330  11.500000

If you really want the same values for multiple intermediates, another option could be to compute yourself the mean (with ffill/bfill):

tmp = df.set_index('theta').reindex(range(0, 330+1, 30))
tmp2 = tmp.ffill()

out = ((tmp2+tmp.bfill().fillna(df.iloc[0]))
       .select_dtypes('number').div(2)
       .combine_first(tmp2).reset_index()[df.columns]
      )

Output:

    name  theta     r
0   wind      0  10.0
1   wind     30  17.0
2   wind     60  19.0
3   wind     90  14.0
4   wind    120  17.0
5   wind    150  17.5  # same values
6   wind    180  17.5  #
7   wind    210  18.0
8   wind    240  17.0
9   wind    270  11.0
10  wind    300  13.0
11  wind    330  11.5

NB. these approaches should work with any number of numeric columns (not just 'r').

working with groups

One simple approach is to use a function and groupby.apply:

def interp(df):
    return  (
 df.set_index('theta').reindex(range(0, 330+1, 30))
   .interpolate(method='linear', limit_area='inside')
   .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
   .ffill().reset_index()[df.columns]
)

out = df.groupby('name', group_keys=False).apply(interp)

Or, first pivot your data:

out = (
 df.pivot(index='theta', columns='name')
   .reindex(range(0, 330+1, 30))
   .interpolate(method='linear', limit_area='inside')
   .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
   .ffill().stack().reset_index()[df.columns]
)

Example output (# shows the initially missing values):

    name  theta           r
0   turb      0  100.000000
1   turb     30  170.000000
2   turb     60  190.000000
3   turb     90  140.000000
4   turb    120  170.000000
5   turb    150  173.333333  #
6   turb    180  176.666667  #
7   turb    210  180.000000
8   turb    240  170.000000
9   turb    270  110.000000
10  turb    300  130.000000
11  turb    330  115.000000  #
0   wind      0   10.000000
1   wind     30   17.000000
2   wind     60   19.000000
3   wind     90   14.000000
4   wind    120   17.000000  #
5   wind    150   17.333333  #
6   wind    180   17.666667
7   wind    210   18.000000
8   wind    240   17.000000
9   wind    270   11.000000
10  wind    300   13.000000
11  wind    330   11.500000  #

With bfill/ffill:

tmp = (df.set_index(['name', 'theta'])
         .reindex(pd.MultiIndex.from_product([df['name'].unique(), range(0, 330+1, 30)],
                                             names=['name', 'theta']
                                            ))
      )
tmp2 = tmp.groupby(level='name').ffill()

out = ((tmp2+tmp.groupby(level='name').bfill().fillna(df.iloc[0]))
       .select_dtypes('number').div(2)
       .combine_first(tmp2).reset_index()[df.columns]
      )

答案2

得分: 2

Here is the translated code part:

如果在name列中有相同的值,您可以使用DataFrame.reindex,通过range进行前向和后向填充值,将s2中的最后缺失值替换为s1的第一个值:

df1 = df.set_index('theta').reindex(range(0, 360, 30))

s1 = df1['r'].ffill()
s2 = df1['r'].bfill().fillna(s1.iat[0])

df = s1.add(s2).div(2).reset_index().assign(name='wind')[df.columns]
print(df)

使用DataFrame.interpolater的后向填充值辅助行的插值解决方案:

df1 = df.set_index('theta').reindex(range(0, 360, 30))

df = (pd.concat([df1, df1[['r']].bfill().iloc[[0]]])
        .interpolate().reset_index().iloc[:-1].assign(name='wind')[df.columns])

print(df)

如果可能缺失第一行:

print(df)

df1 = df.set_index('theta').reindex(range(0, 360, 30))

df = (pd.concat([df1[['r']].ffill().iloc[[-1]], 
                 df1, 
                 df1[['r']].bfill().iloc[[0]]])
        .interpolate().reset_index().iloc[1:-1].assign(name='wind')[df.columns])

print(df)

Please note that this code is in Python, and I have translated it for you.

英文:

If there is same value in name column you can use DataFrame.reindex by range with divide forwar and back filling values with replace last missing values in s2 by first value of s1:

df1 = df.set_index('theta').reindex(range(0, 360, 30))

s1 = df1['r'].ffill()  
s2 = df1['r'].bfill().fillna(s1.iat[0])  

df = s1.add(s2).div(2).reset_index().assign(name = 'wind')[df.columns]
print (df)
    name  theta     r
0   wind      0  10.0
1   wind     30  17.0
2   wind     60  19.0
3   wind     90  14.0
4   wind    120  17.0
5   wind    150  17.5
6   wind    180  17.5
7   wind    210  18.0
8   wind    240  17.0
9   wind    270  11.0
10  wind    300  13.0
11  wind    330  11.5

Solution with interpolation with DataFrame.interpolate and helper row by backfilled values of r:

df1 = df.set_index('theta').reindex(range(0, 360, 30))

df = (pd.concat([df1, df1[['r']].bfill().iloc[[0]]])
        .interpolate().reset_index().iloc[:-1].assign(name='wind')[df.columns])

print (df)
    name  theta          r
0   wind      0  10.000000
1   wind     30  17.000000
2   wind     60  19.000000
3   wind     90  14.000000
4   wind    120  17.000000
5   wind    150  17.333333
6   wind    180  17.666667
7   wind    210  18.000000
8   wind    240  17.000000
9   wind    270  11.000000
10  wind    300  13.000000
11  wind    330  11.500000

If possible missing first row:

print (df)
   name  theta   r
2  wind     30  17
3  wind     60  19
4  wind     90  14
5  wind    120  17
6  wind    210  18
7  wind    240  17
8  wind    270  11
9  wind    300  13


df1 = df.set_index('theta').reindex(range(0, 360, 30))

df = (pd.concat([df1[['r']].ffill().iloc[[-1]], 
                 df1, 
                 df1[['r']].bfill().iloc[[0]]])
        .interpolate().reset_index().iloc[1:-1].assign(name='wind')[df.columns])

print (df)
    name  theta          r
1   wind      0  15.000000
2   wind     30  17.000000
3   wind     60  19.000000
4   wind     90  14.000000
5   wind    120  17.000000
6   wind    150  17.333333
7   wind    180  17.666667
8   wind    210  18.000000
9   wind    240  17.000000
10  wind    270  11.000000
11  wind    300  13.000000
12  wind    330  15.000000

huangapple
  • 本文由 发表于 2023年4月13日 17:04:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76003623.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定