Interpolate based on datetimes

Question

In pandas, I can interpolate based on datetimes like this:

from datetime import datetime

import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, np.nan, np.nan, 3],
    }
)
df1.set_index('ts').interpolate(method='index')

Outputs:

                        value
ts
2020-01-01 00:00:00  1.000000
2020-01-03 00:00:12  2.333426
2020-01-03 00:01:35  2.334066
2020-01-04 00:00:00  3.000000

Is there a similar method in polars? Say, starting with:

df1 = pl.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, None, None, 3],
    }
)

shape: (4, 2)
┌─────────────────────┬───────┐
│ ts                  ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2020-01-01 00:00:00 ┆ 1     │
│ 2020-01-03 00:00:12 ┆ null  │
│ 2020-01-03 00:01:35 ┆ null  │
│ 2020-01-04 00:00:00 ┆ 3     │
└─────────────────────┴───────┘

EDIT: I've updated the example to make it a bit more "irregular", so that upsample can't be used as a solution, and to make it clear that we need something more generic.

Answer 1

Score: 1

It seems pandas upsamples before interpolating. We can do the same thing in Polars: upsample, interpolate, and then join the result back to the original frame so that only the datetimes you had initially are kept:

```python
(df1
        .sort('ts')
        # cast to float first; an integer column would give integer interpolation
        .with_columns(pl.col('value').cast(pl.Float64))
        .upsample(time_column='ts', every='1d')
        .interpolate()
        # keep only the timestamps present in the original frame
        .join(
            df1.select('ts'), on='ts'
            )
        )
```

You also need to take care of the column dtype: it should be a float, otherwise you get integer interpolation.

ts (datetime[μs])     value (f64)
2020-01-01 00:00:00   1.0
2020-01-03 00:00:00   2.333333
2020-01-04 00:00:00   3.0

Answer 2

Score: 1

Not sure how useful this is, but it looks like pandas calls `np.interp()` to do this:

- https://github.com/pandas-dev/pandas/blob/main/pandas/core/missing.py#L481

```python
# Written against a 2023 Polars release; newer releases rename Expr.map to Expr.map_batches.
# Timestamps of the rows to fill, and timestamps/values of the rows that are already known.
invalid = pl.when(pl.col('value').is_null()).then(pl.col('ts')).alias('invalid')
valid = pl.when(pl.col('value').is_not_null()).then(pl.col('ts')).alias('valid')
values = pl.when(pl.col('value').is_not_null()).then(pl.col('value')).alias('values')

df1.select(
    pl.struct(invalid, valid, values)
    .map(lambda args:
        np.interp(
            args.struct['invalid'].drop_nulls().dt.timestamp().to_numpy(zero_copy_only=True),
            args.struct['valid'].drop_nulls().dt.timestamp().to_numpy(zero_copy_only=True),
            args.struct['values'].drop_nulls().to_numpy(zero_copy_only=True)
        )
    )
    .flatten()
)
```

shape: (2, 1)
┌──────────┐
│ invalid  │
│ ---      │
│ f64      │
╞══════════╡
│ 2.333426 │
│ 2.334066 │
└──────────┘

Although there does seem to be a lot of other stuff going on.



Answer 3

Score: 1

Here's a solution that uses `scipy`. Conversion to `numpy` should be zero-copy for these values, so I think it should be efficient.

```python
from datetime import datetime

import polars as pl
from scipy import interpolate

df1 = pl.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, None, None, 3],
    }
)
# x, y: timestamps and values of the rows that are already known
x = (
    df1.filter(pl.col("value").is_not_null())["ts"]
    .dt.timestamp()
    .to_numpy(zero_copy_only=True)
)
y = df1.filter(pl.col("value").is_not_null())[
    "value"
].to_numpy(zero_copy_only=True)
# xnew: every timestamp, including the rows to fill
xnew = (
    df1["ts"].dt.timestamp().to_numpy(zero_copy_only=True)
)
# interp1d is linear by default
ynew = interpolate.interp1d(x, y)(xnew)
df1 = df1.with_columns(pl.Series(ynew).alias("value"))
```

The result is

In [6]: df1
Out[6]:
shape: (4, 2)
┌─────────────────────┬──────────┐
│ ts                  ┆ value    │
│ ---                 ┆ ---      │
│ datetime[μs]        ┆ f64      │
╞═════════════════════╪══════════╡
│ 2020-01-01 00:00:00 ┆ 1.0      │
│ 2020-01-03 00:00:12 ┆ 2.333426 │
│ 2020-01-03 00:01:35 ┆ 2.334066 │
│ 2020-01-04 00:00:00 ┆ 3.0      │
└─────────────────────┴──────────┘

huangapple
  • Published on June 26, 2023, 22:45:50
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76557773.html