如何在 pandas 中插入并填充带有计算值的行?

huangapple go评论98阅读模式
英文:

How to insert and fill the rows with calculated value in pandas?

问题

我有一个带有缺失theta步骤的pandas数据帧,如下所示,

我需要添加缺失的theta值,然后用线性插值填充空值。为了简单起见,我们可以考虑前一个和后一个可用值的平均值。如何做到这一点?

英文:

I have a pandas dataframe with missing theta steps as below,

  1. index name theta r
  2. 1 wind 0 10
  3. 2 wind 30 17
  4. 3 wind 60 19
  5. 4 wind 90 14
  6. 5 wind 120 17
  7. 6 wind 210 18
  8. 7 wind 240 17
  9. 8 wind 270 11
  10. 9 wind 300 13

I need to add the missing theta with values,

  1. index name theta r
  2. 1 wind 0 10
  3. 2 wind 30 17
  4. 3 wind 60 19
  5. 4 wind 90 14
  6. 5 wind 120 17
  7. 6 wind 150 null
  8. 7 wind 180 null
  9. 8 wind 210 18
  10. 9 wind 240 17
  11. 10 wind 270 11
  12. 11 wind 300 13
  13. 12 wind 330 null

And then fill the null values with linear interpolation. For simplicity here we can consider average of previous and next available value,

  1. index name theta r
  2. 1 wind 0 10
  3. 2 wind 30 17
  4. 3 wind 60 19
  5. 4 wind 90 14
  6. 5 wind 120 17
  7. 6 wind 150 17.5 #(17 + 18)/2
  8. 7 wind 180 17.5 #(17 + 18)/2
  9. 8 wind 210 18
  10. 9 wind 240 17
  11. 10 wind 270 11
  12. 11 wind 300 13
  13. 12 wind 330 11.5 #(13 + 10)/2

How can I do this?

答案1

得分: 3

你可以使用interpolateffill来执行插值操作:

  1. out = (
  2. df.set_index('theta').reindex(range(0, 330+1, 30))
  3. .interpolate().ffill().reset_index()[df.columns]
  4. )

输出:

  1. name theta r
  2. 0 wind 0 10.000000
  3. 1 wind 30 17.000000
  4. 2 wind 60 19.000000
  5. 3 wind 90 14.000000
  6. 4 wind 120 17.000000
  7. 5 wind 150 17.333333
  8. 6 wind 180 17.666667
  9. 7 wind 210 18.000000
  10. 8 wind 240 17.000000
  11. 9 wind 270 11.000000
  12. 10 wind 300 13.000000
  13. 11 wind 330 13.000000

执行圆形插值时,你可以仅使用limit_area='inside'填充内部值,然后使用fillna填充第一行和最后一行的均值:

  1. out = (
  2. df.set_index('theta').reindex(range(0, 330+1, 30))
  3. .interpolate(method='linear', limit_area='inside')
  4. .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
  5. .ffill().reset_index()[df.columns]
  6. )

输出:

  1. name theta r
  2. 0 wind 0 10.000000
  3. 1 wind 30 17.000000
  4. 2 wind 60 19.000000
  5. 3 wind 90 14.000000
  6. 4 wind 120 17.000000
  7. 5 wind 150 17.333333
  8. 6 wind 180 17.666667
  9. 7 wind 210 18.000000
  10. 8 wind 240 17.000000
  11. 9 wind 270 11.000000
  12. 10 wind 300 13.000000
  13. 11 wind 330 11.500000

如果你希望多个中间值具有相同的值,另一种选择是自己计算均值(使用ffill/bfill):

  1. tmp = df.set_index('theta').reindex(range(0, 330+1, 30))
  2. tmp2 = tmp.ffill()
  3. out = ((tmp2+tmp.bfill().fillna(df.iloc[0]))
  4. .select_dtypes('number').div(2)
  5. .combine_first(tmp2).reset_index()[df.columns]
  6. )

输出:

  1. name theta r
  2. 0 wind 0 10.0
  3. 1 wind 30 17.0
  4. 2 wind 60 19.0
  5. 3 wind 90 14.0
  6. 4 wind 120 17.0
  7. 5 wind 150 17.5 # 相同的值
  8. 6 wind 180 17.5 #
  9. 7 wind 210 18.0
  10. 8 wind 240 17.0
  11. 9 wind 270 11.0
  12. 10 wind 300 13.0
  13. 11 wind 330 11.5

这些方法适用于任何数量的数值列(不仅仅是'r'列)。

英文:

You can use interpolate and ffill:

  1. out = (
  2. df.set_index('theta').reindex(range(0, 330+1, 30))
  3. .interpolate().ffill().reset_index()[df.columns]
  4. )

Output:

  1. name theta r
  2. 0 wind 0 10.000000
  3. 1 wind 30 17.000000
  4. 2 wind 60 19.000000
  5. 3 wind 90 14.000000
  6. 4 wind 120 17.000000
  7. 5 wind 150 17.333333
  8. 6 wind 180 17.666667
  9. 7 wind 210 18.000000
  10. 8 wind 240 17.000000
  11. 9 wind 270 11.000000
  12. 10 wind 300 13.000000
  13. 11 wind 330 13.000000

Performing a circular interpolation, you can only fill the inner values with limit_area='inside', then fillna with the mean of the first and last valid rows:

  1. out = (
  2. df.set_index('theta').reindex(range(0, 330+1, 30))
  3. .interpolate(method='linear', limit_area='inside')
  4. .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
  5. .ffill().reset_index()[df.columns]
  6. )

Output:

  1. name theta r
  2. 0 wind 0 10.000000
  3. 1 wind 30 17.000000
  4. 2 wind 60 19.000000
  5. 3 wind 90 14.000000
  6. 4 wind 120 17.000000
  7. 5 wind 150 17.333333
  8. 6 wind 180 17.666667
  9. 7 wind 210 18.000000
  10. 8 wind 240 17.000000
  11. 9 wind 270 11.000000
  12. 10 wind 300 13.000000
  13. 11 wind 330 11.500000

If you really want the same values for multiple intermediates, another option could be to compute yourself the mean (with ffill/bfill):

  1. tmp = df.set_index('theta').reindex(range(0, 330+1, 30))
  2. tmp2 = tmp.ffill()
  3. out = ((tmp2+tmp.bfill().fillna(df.iloc[0]))
  4. .select_dtypes('number').div(2)
  5. .combine_first(tmp2).reset_index()[df.columns]
  6. )

Output:

  1. name theta r
  2. 0 wind 0 10.0
  3. 1 wind 30 17.0
  4. 2 wind 60 19.0
  5. 3 wind 90 14.0
  6. 4 wind 120 17.0
  7. 5 wind 150 17.5 # same values
  8. 6 wind 180 17.5 #
  9. 7 wind 210 18.0
  10. 8 wind 240 17.0
  11. 9 wind 270 11.0
  12. 10 wind 300 13.0
  13. 11 wind 330 11.5

NB. these approaches should work with any number of numeric columns (not just 'r').

working with groups

One simple approach is to use a function and groupby.apply:

  1. def interp(df):
  2. return (
  3. df.set_index('theta').reindex(range(0, 330+1, 30))
  4. .interpolate(method='linear', limit_area='inside')
  5. .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
  6. .ffill().reset_index()[df.columns]
  7. )
  8. out = df.groupby('name', group_keys=False).apply(interp)

Or, first pivot your data:

  1. out = (
  2. df.pivot(index='theta', columns='name')
  3. .reindex(range(0, 330+1, 30))
  4. .interpolate(method='linear', limit_area='inside')
  5. .pipe(lambda d: d.fillna(d.dropna().iloc[[0, -1]].select_dtypes('number').mean()))
  6. .ffill().stack().reset_index()[df.columns]
  7. )

Example output (# shows the initially missing values):

  1. name theta r
  2. 0 turb 0 100.000000
  3. 1 turb 30 170.000000
  4. 2 turb 60 190.000000
  5. 3 turb 90 140.000000
  6. 4 turb 120 170.000000
  7. 5 turb 150 173.333333 #
  8. 6 turb 180 176.666667 #
  9. 7 turb 210 180.000000
  10. 8 turb 240 170.000000
  11. 9 turb 270 110.000000
  12. 10 turb 300 130.000000
  13. 11 turb 330 115.000000 #
  14. 0 wind 0 10.000000
  15. 1 wind 30 17.000000
  16. 2 wind 60 19.000000
  17. 3 wind 90 14.000000
  18. 4 wind 120 17.000000 #
  19. 5 wind 150 17.333333 #
  20. 6 wind 180 17.666667
  21. 7 wind 210 18.000000
  22. 8 wind 240 17.000000
  23. 9 wind 270 11.000000
  24. 10 wind 300 13.000000
  25. 11 wind 330 11.500000 #

With bfill/ffill:

  1. tmp = (df.set_index(['name', 'theta'])
  2. .reindex(pd.MultiIndex.from_product([df['name'].unique(), range(0, 330+1, 30)],
  3. names=['name', 'theta']
  4. ))
  5. )
  6. tmp2 = tmp.groupby(level='name').ffill()
  7. out = ((tmp2+tmp.groupby(level='name').bfill().fillna(df.iloc[0]))
  8. .select_dtypes('number').div(2)
  9. .combine_first(tmp2).reset_index()[df.columns]
  10. )

答案2

得分: 2

Here is the translated code part:

如果在name列中有相同的值,您可以使用DataFrame.reindex,通过range进行前向和后向填充值,将s2中的最后缺失值替换为s1的第一个值:

  1. df1 = df.set_index('theta').reindex(range(0, 360, 30))
  2. s1 = df1['r'].ffill()
  3. s2 = df1['r'].bfill().fillna(s1.iat[0])
  4. df = s1.add(s2).div(2).reset_index().assign(name='wind')[df.columns]
  5. print(df)

使用DataFrame.interpolater的后向填充值辅助行的插值解决方案:

  1. df1 = df.set_index('theta').reindex(range(0, 360, 30))
  2. df = (pd.concat([df1, df1[['r']].bfill().iloc[[0]]])
  3. .interpolate().reset_index().iloc[:-1].assign(name='wind')[df.columns])
  4. print(df)

如果可能缺失第一行:

  1. print(df)
  2. df1 = df.set_index('theta').reindex(range(0, 360, 30))
  3. df = (pd.concat([df1[['r']].ffill().iloc[[-1]],
  4. df1,
  5. df1[['r']].bfill().iloc[[0]]])
  6. .interpolate().reset_index().iloc[1:-1].assign(name='wind')[df.columns])
  7. print(df)

Please note that this code is in Python, and I have translated it for you.

英文:

If there is same value in name column you can use DataFrame.reindex by range with divide forwar and back filling values with replace last missing values in s2 by first value of s1:

  1. df1 = df.set_index('theta').reindex(range(0, 360, 30))
  2. s1 = df1['r'].ffill()
  3. s2 = df1['r'].bfill().fillna(s1.iat[0])
  4. df = s1.add(s2).div(2).reset_index().assign(name = 'wind')[df.columns]
  5. print (df)
  6. name theta r
  7. 0 wind 0 10.0
  8. 1 wind 30 17.0
  9. 2 wind 60 19.0
  10. 3 wind 90 14.0
  11. 4 wind 120 17.0
  12. 5 wind 150 17.5
  13. 6 wind 180 17.5
  14. 7 wind 210 18.0
  15. 8 wind 240 17.0
  16. 9 wind 270 11.0
  17. 10 wind 300 13.0
  18. 11 wind 330 11.5

Solution with interpolation with DataFrame.interpolate and helper row by backfilled values of r:

  1. df1 = df.set_index('theta').reindex(range(0, 360, 30))
  2. df = (pd.concat([df1, df1[['r']].bfill().iloc[[0]]])
  3. .interpolate().reset_index().iloc[:-1].assign(name='wind')[df.columns])
  4. print (df)
  5. name theta r
  6. 0 wind 0 10.000000
  7. 1 wind 30 17.000000
  8. 2 wind 60 19.000000
  9. 3 wind 90 14.000000
  10. 4 wind 120 17.000000
  11. 5 wind 150 17.333333
  12. 6 wind 180 17.666667
  13. 7 wind 210 18.000000
  14. 8 wind 240 17.000000
  15. 9 wind 270 11.000000
  16. 10 wind 300 13.000000
  17. 11 wind 330 11.500000

If possible missing first row:

  1. print (df)
  2. name theta r
  3. 2 wind 30 17
  4. 3 wind 60 19
  5. 4 wind 90 14
  6. 5 wind 120 17
  7. 6 wind 210 18
  8. 7 wind 240 17
  9. 8 wind 270 11
  10. 9 wind 300 13
  11. df1 = df.set_index('theta').reindex(range(0, 360, 30))
  12. df = (pd.concat([df1[['r']].ffill().iloc[[-1]],
  13. df1,
  14. df1[['r']].bfill().iloc[[0]]])
  15. .interpolate().reset_index().iloc[1:-1].assign(name='wind')[df.columns])
  16. print (df)
  17. name theta r
  18. 1 wind 0 15.000000
  19. 2 wind 30 17.000000
  20. 3 wind 60 19.000000
  21. 4 wind 90 14.000000
  22. 5 wind 120 17.000000
  23. 6 wind 150 17.333333
  24. 7 wind 180 17.666667
  25. 8 wind 210 18.000000
  26. 9 wind 240 17.000000
  27. 10 wind 270 11.000000
  28. 11 wind 300 13.000000
  29. 12 wind 330 15.000000

huangapple
  • 本文由 发表于 2023年4月13日 17:04:40
  • 转载请务必保留本文链接:https://go.coder-hub.com/76003623.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定