# Interpolate based on datetimes

## Question
In pandas, I can interpolate based on datetimes like this:
```python
from datetime import datetime

import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, np.nan, np.nan, 3],
    }
)
df1.set_index('ts').interpolate(method='index')
```
Outputs:

```
                        value
ts
2020-01-01 00:00:00  1.000000
2020-01-03 00:00:12  2.333426
2020-01-03 00:01:35  2.334066
2020-01-04 00:00:00  3.000000
```
Is there a similar method in polars? Say, starting with:
```python
from datetime import datetime

import polars as pl

df1 = pl.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, None, None, 3],
    }
)
```
```
shape: (4, 2)
┌─────────────────────┬───────┐
│ ts                  ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2020-01-01 00:00:00 ┆ 1     │
│ 2020-01-03 00:00:12 ┆ null  │
│ 2020-01-03 00:01:35 ┆ null  │
│ 2020-01-04 00:00:00 ┆ 3     │
└─────────────────────┴───────┘
```
EDIT: I've updated the example to make it a bit more "irregular", so that `upsample` can't be used as a solution and to make it clear that we need something more generic.
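For reference, the target numbers are just a linear interpolation of `value` against the epoch timestamps, which can be verified with plain `np.interp` independently of either library (a minimal sketch):

```python
from datetime import datetime

import numpy as np

ts = [
    datetime(2020, 1, 1),
    datetime(2020, 1, 3, 0, 0, 12),
    datetime(2020, 1, 3, 0, 1, 35),
    datetime(2020, 1, 4),
]
# Seconds since the epoch; a constant offset doesn't affect linear interpolation.
x = np.array([t.timestamp() for t in ts])
# Interpolate at every timestamp, using only the two non-null points (rows 0 and 3).
ynew = np.interp(x, x[[0, 3]], [1.0, 3.0])
print([round(v, 6) for v in ynew])  # [1.0, 2.333426, 2.334066, 3.0]
```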
## Answer 1 (score: 1)

It seems pandas upsamples first before doing interpolation. So we can do the same thing in Polars by upsampling, then interpolating, and then joining back to itself so we only keep the datetimes we had initially:
```python
(df1
    .sort('ts')
    .with_columns(pl.col('value').cast(pl.Float64))
    .upsample(time_column='ts', every='1d')
    .interpolate()
    .join(df1.select('ts'), on='ts')
)
```

You also need to take care of the column dtype: it should be a float, otherwise you get integer interpolation.
| ts (datetime[μs]) | value (f64) |
|---|---|
| 2020-01-01 00:00:00 | 1.0 |
| 2020-01-03 00:00:00 | 2.333333 |
| 2020-01-04 00:00:00 | 3.0 |
## Answer 2 (score: 1)

Not sure how useful this is, but it looks like pandas calls `np.interp()` to do this:

- https://github.com/pandas-dev/pandas/blob/main/pandas/core/missing.py#L481
```python
import numpy as np
import polars as pl

invalid = pl.when(pl.col('value').is_null()).then(pl.col('ts')).alias('invalid')
valid = pl.when(pl.col('value').is_not_null()).then(pl.col('ts')).alias('valid')
values = pl.when(pl.col('value').is_not_null()).then(pl.col('value')).alias('values')

df1.select(
    pl.struct(invalid, valid, values)
    .map(lambda args:
        np.interp(
            args.struct['invalid'].drop_nulls().dt.timestamp().to_numpy(zero_copy_only=True),
            args.struct['valid'].drop_nulls().dt.timestamp().to_numpy(zero_copy_only=True),
            args.struct['values'].drop_nulls().to_numpy(zero_copy_only=True)
        )
    )
    .flatten()
)
```
```
shape: (2, 1)
┌──────────┐
│ invalid  │
│ ---      │
│ f64      │
╞══════════╡
│ 2.333426 │
│ 2.334066 │
└──────────┘
```
Although there does seem to be a lot of other stuff going on.
## Answer 3 (score: 1)
Here's a solution which uses `scipy`. Conversion to `numpy` should be zero-copy for these values, so I think it should be efficient.
```python
from datetime import datetime

import polars as pl
from scipy import interpolate

df1 = pl.DataFrame(
    {
        "ts": [
            datetime(2020, 1, 1),
            datetime(2020, 1, 3, 0, 0, 12),
            datetime(2020, 1, 3, 0, 1, 35),
            datetime(2020, 1, 4),
        ],
        "value": [1, None, None, 3],
    }
)

# x / y: timestamps and values of the non-null rows; xnew: all timestamps.
x = (
    df1.filter(pl.col("value").is_not_null())["ts"]
    .dt.timestamp()
    .to_numpy(zero_copy_only=True)
)
y = df1.filter(pl.col("value").is_not_null())["value"].to_numpy(zero_copy_only=True)
xnew = df1["ts"].dt.timestamp().to_numpy(zero_copy_only=True)

ynew = interpolate.interp1d(x, y)(xnew)
df1 = df1.with_columns(pl.Series(ynew).alias("value"))
```
The result is:

```
shape: (4, 2)
┌─────────────────────┬──────────┐
│ ts                  ┆ value    │
│ ---                 ┆ ---      │
│ datetime[μs]        ┆ f64      │
╞═════════════════════╪══════════╡
│ 2020-01-01 00:00:00 ┆ 1.0      │
│ 2020-01-03 00:00:12 ┆ 2.333426 │
│ 2020-01-03 00:01:35 ┆ 2.334066 │
│ 2020-01-04 00:00:00 ┆ 3.0      │
└─────────────────────┴──────────┘
```