# Interpolate based on datetimes

## Question
In pandas, I can interpolate based on datetimes like this:
```python
from datetime import datetime

import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, np.nan, np.nan, 3],
    }
)
df1.set_index('ts').interpolate(method='index')
```
Outputs:

```
                        value
ts
2020-01-01 00:00:00  1.000000
2020-01-03 00:00:12  2.333426
2020-01-03 00:01:35  2.334066
2020-01-04 00:00:00  3.000000
```
Is there a similar method in polars? Say, starting with:
```python
from datetime import datetime

import polars as pl

df1 = pl.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, None, None, 3],
    }
)
```
```
shape: (4, 2)
┌─────────────────────┬───────┐
│ ts                  ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2020-01-01 00:00:00 ┆ 1     │
│ 2020-01-03 00:00:12 ┆ null  │
│ 2020-01-03 00:01:35 ┆ null  │
│ 2020-01-04 00:00:00 ┆ 3     │
└─────────────────────┴───────┘
```
EDIT: I've updated the example to make it a bit more "irregular", so that `upsample` can't be used as a solution and to make it clear that we need something more generic.
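For reference, the target numbers are just a linear interpolation of `value` against the epoch timestamps, which can be verified with plain `np.interp` independently of either library (a minimal sketch):

```python
from datetime import datetime

import numpy as np

ts = [
    datetime(2020, 1, 1),
    datetime(2020, 1, 3, 0, 0, 12),
    datetime(2020, 1, 3, 0, 1, 35),
    datetime(2020, 1, 4),
]
# Seconds since the epoch; a constant offset doesn't affect linear interpolation.
x = np.array([t.timestamp() for t in ts])
# Interpolate at every timestamp, using only the two non-null points (rows 0 and 3).
ynew = np.interp(x, x[[0, 3]], [1.0, 3.0])
print([round(v, 6) for v in ynew])  # [1.0, 2.333426, 2.334066, 3.0]
```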
## Answer 1 (score: 1)

It seems pandas upsamples first before doing interpolation. So we can do the same thing in Polars by upsampling, then interpolating, and then joining back to itself so we only keep the datetimes we had initially:
```python
(df1
    .sort('ts')
    .with_columns(pl.col('value').cast(pl.Float64))
    .upsample(time_column='ts', every='1d')
    .interpolate()
    .join(df1.select('ts'), on='ts')
)
```

You also need to take care of the column dtype: it should be a float, otherwise you get integer interpolation.
| ts (datetime[μs]) | value (f64) |
|---|---|
| 2020-01-01 00:00:00 | 1.0 |
| 2020-01-03 00:00:00 | 2.333333 |
| 2020-01-04 00:00:00 | 3.0 |
## Answer 2 (score: 1)

Not sure how useful this is, but it looks like pandas calls `np.interp()` to do this:

- https://github.com/pandas-dev/pandas/blob/main/pandas/core/missing.py#L481
```python
import numpy as np
import polars as pl

invalid = pl.when(pl.col('value').is_null()).then(pl.col('ts')).alias('invalid')
valid = pl.when(pl.col('value').is_not_null()).then(pl.col('ts')).alias('valid')
values = pl.when(pl.col('value').is_not_null()).then(pl.col('value')).alias('values')

df1.select(
    pl.struct(invalid, valid, values)
    .map(lambda args:
        np.interp(
            args.struct['invalid'].drop_nulls().dt.timestamp().to_numpy(zero_copy_only=True),
            args.struct['valid'].drop_nulls().dt.timestamp().to_numpy(zero_copy_only=True),
            args.struct['values'].drop_nulls().to_numpy(zero_copy_only=True)
        )
    )
    .flatten()
)
```
```
shape: (2, 1)
┌──────────┐
│ invalid  │
│ ---      │
│ f64      │
╞══════════╡
│ 2.333426 │
│ 2.334066 │
└──────────┘
```
Although there does seem to be a lot of other stuff going on.
## Answer 3 (score: 1)
Here's a solution which uses `scipy`. Conversion to `numpy` should be zero-copy for these values, so I think it should be efficient.
```python
from datetime import datetime

import polars as pl
from scipy import interpolate

df1 = pl.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, None, None, 3],
    }
)

# x / y: timestamps and values of the non-null rows; xnew: all timestamps.
x = (
    df1.filter(pl.col("value").is_not_null())["ts"]
    .dt.timestamp()
    .to_numpy(zero_copy_only=True)
)
y = df1.filter(pl.col("value").is_not_null())["value"].to_numpy(zero_copy_only=True)
xnew = df1["ts"].dt.timestamp().to_numpy(zero_copy_only=True)

ynew = interpolate.interp1d(x, y)(xnew)
df1 = df1.with_columns(pl.Series(ynew).alias("value"))
```
The result is:

```
shape: (4, 2)
┌─────────────────────┬──────────┐
│ ts                  ┆ value    │
│ ---                 ┆ ---      │
│ datetime[μs]        ┆ f64      │
╞═════════════════════╪══════════╡
│ 2020-01-01 00:00:00 ┆ 1.0      │
│ 2020-01-03 00:00:12 ┆ 2.333426 │
│ 2020-01-03 00:01:35 ┆ 2.334066 │
│ 2020-01-04 00:00:00 ┆ 3.0      │
└─────────────────────┴──────────┘
```