英文:
How to insert and fill the rows with calculated value in pandas?
问题
我有一个带有缺失theta步骤的pandas数据帧,如下所示,
我需要添加缺失的theta值,然后用线性插值填充空值。为了简单起见,我们可以考虑前一个和后一个可用值的平均值。如何做到这一点?
英文:
I have a pandas dataframe with missing theta steps as below,
index name theta r
1 wind 0 10
2 wind 30 17
3 wind 60 19
4 wind 90 14
5 wind 120 17
6 wind 210 18
7 wind 240 17
8 wind 270 11
9 wind 300 13
I need to add the missing theta with values,
index name theta r
1 wind 0 10
2 wind 30 17
3 wind 60 19
4 wind 90 14
5 wind 120 17
6 wind 150 null
7 wind 180 null
8 wind 210 18
9 wind 240 17
10 wind 270 11
11 wind 300 13
12 wind 330 null
And then fill the null values with linear interpolation. For simplicity here we can consider average of previous and next available value,
index name theta r
1 wind 0 10
2 wind 30 17
3 wind 60 19
4 wind 90 14
5 wind 120 17
6 wind 150 17.5 #(17 + 18)/2
7 wind 180 17.5 #(17 + 18)/2
8 wind 210 18
9 wind 240 17
10 wind 270 11
11 wind 300 13
12 wind 330 11.5 #(13 + 10)/2
How can I do this?
答案1
得分: 3
你可以使用interpolate
和ffill
来执行插值操作:
out = (
df.set_index('theta').reindex(range(0, 330+1, 30))
.interpolate().ffill().reset_index()[df.columns]
)
输出:
name theta r
0 wind 0 10.000000
1 wind 30 17.000000
2 wind 60 19.000000
3 wind 90 14.000000
4 wind 120 17.000000
5 wind 150 17.333333
6 wind 180 17.666667
7 wind 210 18.000000
8 wind 240 17.000000
9 wind 270 11.000000
10 wind 300 13.000000
11 wind 330 13.000000
执行圆形插值时,你可以仅使用limit_area='inside'
填充内部值,然后使用fillna
填充第一行和最后一行的均值:
out = (
df.set_index('theta').reindex(range(0, 330+1, 30))
.interpolate(method='linear', limit_area='inside')
.pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
.ffill().reset_index()[df.columns]
)
输出:
name theta r
0 wind 0 10.000000
1 wind 30 17.000000
2 wind 60 19.000000
3 wind 90 14.000000
4 wind 120 17.000000
5 wind 150 17.333333
6 wind 180 17.666667
7 wind 210 18.000000
8 wind 240 17.000000
9 wind 270 11.000000
10 wind 300 13.000000
11 wind 330 11.500000
如果你希望多个中间值具有相同的值,另一种选择是自己计算均值(使用ffill
/bfill
):
tmp = df.set_index('theta').reindex(range(0, 330+1, 30))
tmp2 = tmp.ffill()
out = ((tmp2+tmp.bfill().fillna(df.iloc[0]))
.select_dtypes('number').div(2)
.combine_first(tmp2).reset_index()[df.columns]
)
输出:
name theta r
0 wind 0 10.0
1 wind 30 17.0
2 wind 60 19.0
3 wind 90 14.0
4 wind 120 17.0
5 wind 150 17.5 # 相同的值
6 wind 180 17.5 #
7 wind 210 18.0
8 wind 240 17.0
9 wind 270 11.0
10 wind 300 13.0
11 wind 330 11.5
这些方法适用于任何数量的数值列(不仅仅是'r'
列)。
英文:
You can use interpolate
and ffill
:
out = (
df.set_index('theta').reindex(range(0, 330+1, 30))
.interpolate().ffill().reset_index()[df.columns]
)
Output:
name theta r
0 wind 0 10.000000
1 wind 30 17.000000
2 wind 60 19.000000
3 wind 90 14.000000
4 wind 120 17.000000
5 wind 150 17.333333
6 wind 180 17.666667
7 wind 210 18.000000
8 wind 240 17.000000
9 wind 270 11.000000
10 wind 300 13.000000
11 wind 330 13.000000
Performing a circular interpolation, you can only fill the inner values with limit_area='inside'
, then fillna
with the mean of the first and last valid rows:
out = (
df.set_index('theta').reindex(range(0, 330+1, 30))
.interpolate(method='linear', limit_area='inside')
.pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
.ffill().reset_index()[df.columns]
)
Output:
name theta r
0 wind 0 10.000000
1 wind 30 17.000000
2 wind 60 19.000000
3 wind 90 14.000000
4 wind 120 17.000000
5 wind 150 17.333333
6 wind 180 17.666667
7 wind 210 18.000000
8 wind 240 17.000000
9 wind 270 11.000000
10 wind 300 13.000000
11 wind 330 11.500000
If you really want the same values for multiple intermediates, another option could be to compute yourself the mean (with ffill
/bfill
):
tmp = df.set_index('theta').reindex(range(0, 330+1, 30))
tmp2 = tmp.ffill()
out = ((tmp2+tmp.bfill().fillna(df.iloc[0]))
.select_dtypes('number').div(2)
.combine_first(tmp2).reset_index()[df.columns]
)
Output:
name theta r
0 wind 0 10.0
1 wind 30 17.0
2 wind 60 19.0
3 wind 90 14.0
4 wind 120 17.0
5 wind 150 17.5 # same values
6 wind 180 17.5 #
7 wind 210 18.0
8 wind 240 17.0
9 wind 270 11.0
10 wind 300 13.0
11 wind 330 11.5
NB. these approaches should work with any number of numeric columns (not just 'r'
).
working with groups
One simple approach is to use a function and groupby.apply
:
def interp(df):
return (
df.set_index('theta').reindex(range(0, 330+1, 30))
.interpolate(method='linear', limit_area='inside')
.pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
.ffill().reset_index()[df.columns]
)
out = df.groupby('name', group_keys=False).apply(interp)
Or, first pivot
your data:
out = (
df.pivot(index='theta', columns='name')
.reindex(range(0, 330+1, 30))
.interpolate(method='linear', limit_area='inside')
.pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
.ffill().stack().reset_index()[df.columns]
)
Example output (#
shows the initially missing values):
name theta r
0 turb 0 100.000000
1 turb 30 170.000000
2 turb 60 190.000000
3 turb 90 140.000000
4 turb 120 170.000000
5 turb 150 173.333333 #
6 turb 180 176.666667 #
7 turb 210 180.000000
8 turb 240 170.000000
9 turb 270 110.000000
10 turb 300 130.000000
11 turb 330 115.000000 #
0 wind 0 10.000000
1 wind 30 17.000000
2 wind 60 19.000000
3 wind 90 14.000000
4 wind 120 17.000000 #
5 wind 150 17.333333 #
6 wind 180 17.666667
7 wind 210 18.000000
8 wind 240 17.000000
9 wind 270 11.000000
10 wind 300 13.000000
11 wind 330 11.500000 #
With bfill
/ffill
:
tmp = (df.set_index(['name', 'theta'])
.reindex(pd.MultiIndex.from_product([df['name'].unique(), range(0, 330+1, 30)],
names=['name', 'theta']
))
)
tmp2 = tmp.groupby(level='name').ffill()
out = ((tmp2+tmp.groupby(level='name').bfill().fillna(df.iloc[0]))
.select_dtypes('number').div(2)
.combine_first(tmp2).reset_index()[df.columns]
)
答案2
得分: 2
Here is the translated code part:
如果在name
列中有相同的值,您可以使用DataFrame.reindex
,通过range
进行前向和后向填充值,将s2
中的最后缺失值替换为s1
的第一个值:
df1 = df.set_index('theta').reindex(range(0, 360, 30))
s1 = df1['r'].ffill()
s2 = df1['r'].bfill().fillna(s1.iat[0])
df = s1.add(s2).div(2).reset_index().assign(name='wind')[df.columns]
print(df)
使用DataFrame.interpolate
和r
的后向填充值辅助行的插值解决方案:
df1 = df.set_index('theta').reindex(range(0, 360, 30))
df = (pd.concat([df1, df1[['r']].bfill().iloc[[0]]])
.interpolate().reset_index().iloc[:-1].assign(name='wind')[df.columns])
print(df)
如果可能缺失第一行:
print(df)
df1 = df.set_index('theta').reindex(range(0, 360, 30))
df = (pd.concat([df1[['r']].ffill().iloc[[-1]],
df1,
df1[['r']].bfill().iloc[[0]]])
.interpolate().reset_index().iloc[1:-1].assign(name='wind')[df.columns])
print(df)
Please note that this code is in Python, and I have translated it for you.
英文:
If there is same value in name
column you can use DataFrame.reindex
by range
with divide forwar and back filling values with replace last missing values in s2
by first value of s1
:
df1 = df.set_index('theta').reindex(range(0, 360, 30))
s1 = df1['r'].ffill()
s2 = df1['r'].bfill().fillna(s1.iat[0])
df = s1.add(s2).div(2).reset_index().assign(name = 'wind')[df.columns]
print (df)
name theta r
0 wind 0 10.0
1 wind 30 17.0
2 wind 60 19.0
3 wind 90 14.0
4 wind 120 17.0
5 wind 150 17.5
6 wind 180 17.5
7 wind 210 18.0
8 wind 240 17.0
9 wind 270 11.0
10 wind 300 13.0
11 wind 330 11.5
Solution with interpolation with DataFrame.interpolate
and helper row by backfilled values of r
:
df1 = df.set_index('theta').reindex(range(0, 360, 30))
df = (pd.concat([df1, df1[['r']].bfill().iloc[[0]]])
.interpolate().reset_index().iloc[:-1].assign(name='wind')[df.columns])
print (df)
name theta r
0 wind 0 10.000000
1 wind 30 17.000000
2 wind 60 19.000000
3 wind 90 14.000000
4 wind 120 17.000000
5 wind 150 17.333333
6 wind 180 17.666667
7 wind 210 18.000000
8 wind 240 17.000000
9 wind 270 11.000000
10 wind 300 13.000000
11 wind 330 11.500000
If possible missing first row:
print (df)
name theta r
2 wind 30 17
3 wind 60 19
4 wind 90 14
5 wind 120 17
6 wind 210 18
7 wind 240 17
8 wind 270 11
9 wind 300 13
df1 = df.set_index('theta').reindex(range(0, 360, 30))
df = (pd.concat([df1[['r']].ffill().iloc[[-1]],
df1,
df1[['r']].bfill().iloc[[0]]])
.interpolate().reset_index().iloc[1:-1].assign(name='wind')[df.columns])
print (df)
name theta r
1 wind 0 15.000000
2 wind 30 17.000000
3 wind 60 19.000000
4 wind 90 14.000000
5 wind 120 17.000000
6 wind 150 17.333333
7 wind 180 17.666667
8 wind 210 18.000000
9 wind 240 17.000000
10 wind 270 11.000000
11 wind 300 13.000000
12 wind 330 15.000000
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论