Polars arr.to_struct() throws "pyo3_runtime.PanicException: not implemented for dtype Unknown" exception


Question


This is a follow-up to <https://stackoverflow.com/questions/75516576/how-to-return-multiple-stats-as-multiple-columns-in-polars-grouby-context> and <https://stackoverflow.com/questions/75595957/how-to-flatten-split-a-tuple-of-arrays-and-calculate-column-means-in-polars-data/75596769#75596769>.

Basically, the problem/issue can be easily illustrated by the example below:

from functools import partial

import polars as pl
import statsmodels.api as sm


def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    return pl.Series(values=(reg.params, reg.tvalues), nan_to_null=True)


df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
).lazy()

res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)

res.with_columns(
    pl.col("params").arr.eval(pl.element().arr.explode()).arr.to_struct()
).unnest("params").collect()

Running the code above raises the following error:

pyo3_runtime.PanicException: not implemented for dtype Unknown

However, when .lazy() and .collect() are removed from the code above, it works as intended. Below is the result (the expected behavior) when running in eager mode.

shape: (2, 5)
┌─────┬──────────┬──────────┬──────────┬───────────┐
│ day ┆ field_0  ┆ field_1  ┆ field_2  ┆ field_3   │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64       │
╞═════╪══════════╪══════════╪══════════╪═══════════╡
│ 2   ┆ 0.466089 ┆ 0.503127 ┆ 0.916982 ┆ 1.451151  │
│ 1   ┆ 1.008659 ┆ -0.03324 ┆ 3.204266 ┆ -0.124422 │
└─────┴──────────┴──────────┴──────────┴───────────┘

So, what is the problem and how am I supposed to resolve it?

Answer 1

Score: 1


Return a dict from ols_stats() instead of a Series, and it should work. This is also semantically better, because the struct you show at the end is a mess: the first two fields are params and the last two fields are tvalues. Try this instead:

def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    return {"params": reg.params.tolist(), "tvalues": reg.tvalues.tolist()}

Polars automatically turns the dict[list[f64]] into a struct[2]. I had to play around a bit to figure this out but it seems to work.

This way you end up with semantically meaningful results:

shape: (3, 3)
┌─────┬─────────────────────────────────┬────────────────────────────────┐
│ day ┆ params                          ┆ tvalues                        │
│ --- ┆ ---                             ┆ ---                            │
│ i64 ┆ list[f64]                       ┆ list[f64]                      │
╞═════╪═════════════════════════════════╪════════════════════════════════╡
│ 1   ┆ [4.866232, 0.640294, -0.659869] ┆ [1.547251, 1.81586, -1.430613] │
│ 3   ┆ [0.5, 0.5]                      ┆ [0.0, 0.0]                     │
│ 2   ┆ [2.0462, 0.223971, 0.336793]    ┆ [1.524834, 0.495378, 1.091109] │
└─────┴─────────────────────────────────┴────────────────────────────────┘

Now it works lazily:

res = df.lazy().groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
).unnest("params").collect()

If you want things unnested, why not return them unnested immediately as:

def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    param_dict = {f"param_{i}": v for i, v in enumerate(reg.params.tolist())}
    tvalues_dict = {f"tvalue_{i}": v for i, v in enumerate(reg.tvalues.tolist())}
    return (param_dict | tvalues_dict)

df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
).lazy()

res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("results")
).unnest("results").collect()
print(res)

Returns:

shape: (2, 5)
┌─────┬──────────┬──────────┬──────────┬───────────┐
│ day ┆ param_0  ┆ param_1  ┆ tvalue_0 ┆ tvalue_1  │
│ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64       │
╞═════╪══════════╪══════════╪══════════╪═══════════╡
│ 1   ┆ 1.008659 ┆ -0.03324 ┆ 3.204266 ┆ -0.124422 │
│ 2   ┆ 0.466089 ┆ 0.503127 ┆ 0.916982 ┆ 1.451151  │
└─────┴──────────┴──────────┴──────────┴───────────┘
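The flattening step itself does not depend on statsmodels or Polars, so it can be checked on its own. A small sketch, where `flatten_stats` is a hypothetical helper name and the inputs are the day 1 values from the table above:

```python
def flatten_stats(params, tvalues):
    """Merge two sequences into one flat dict of named scalars."""
    param_dict = {f"param_{i}": v for i, v in enumerate(params)}
    tvalue_dict = {f"tvalue_{i}": v for i, v in enumerate(tvalues)}
    # The dict union operator `|` requires Python 3.9 or newer.
    return param_dict | tvalue_dict


# Values copied from the day 1 row of the table above.
flat = flatten_stats([1.008659, -0.03324], [3.204266, -0.124422])
print(flat)
```

Each key becomes one struct field, which is why unnest() produces the columns param_0, param_1, tvalue_0 and tvalue_1.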

Answer 2

Score: 0

As @jqurious points out in the comments, this may well be a bug if there's a difference in behaviour between lazy and eager.

However, the documentation of `arr.to_struct()` hints at what may be going on. In particular, its `upper_bound` parameter points to a special requirement of a `LazyFrame`: the schema must be known at all times.

If we look at the intermediate output of your query (running in eager mode), after this point:

```python
res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)
```

res looks like this:

shape: (3, 2)
day	params
i64	list[list[f64]]
1	[[4.866232, 0.640294, -0.659869], [1.547251, 1.81586, -1.430613]]
3	[[0.5, 0.5], [0.0, 0.0]]
2	[[2.0462, 0.223971, 0.336793], [1.524834, 0.495378, 1.091109]]

Note that the sublists of day 3 have only length 2 in contrast to the sublists of day 1 and 2.

This is almost definitely something you don't want to happen, so Polars throwing an error is maybe not such a bad thing.
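A quick way to spot this kind of raggedness is to count how many scalars each group would contribute once flattened. The sketch below uses plain Python on the nested lists copied from the output above; the variable names are illustrative only:

```python
# Nested lists copied from the intermediate eager output shown above.
params_by_day = {
    1: [[4.866232, 0.640294, -0.659869], [1.547251, 1.81586, -1.430613]],
    3: [[0.5, 0.5], [0.0, 0.0]],
    2: [[2.0462, 0.223971, 0.336793], [1.524834, 0.495378, 1.091109]],
}

# Total number of scalars each day would contribute after flattening.
widths = {
    day: sum(len(sub) for sub in nested)
    for day, nested in params_by_day.items()
}
print(widths)

# Days narrower than the widest row would need null padding in a struct.
max_width = max(widths.values())
ragged_days = sorted(day for day, w in widths.items() if w != max_width)
print(ragged_days)
```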

In the last step, you want to turn these nested lists into a struct. But the result depends on which day comes first. In fact, if you run the first part multiple times, day 3 will sometimes come first. In that case, to_struct will produce only 4 fields with the default settings.

Try it yourself. If day 3 comes first, like this:

shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ day ┆ params                              │
│ --- ┆ ---                                 │
│ i64 ┆ list[list[f64]]                     │
╞═════╪═════════════════════════════════════╡
│ 3   ┆ [[0.5, 0.5], [0.0, 0.0]]            │
│ 1   ┆ [[4.866232, 0.640294, -0.659869]... │
│ 2   ┆ [[2.0462, 0.223971, 0.336793], [... │
└─────┴─────────────────────────────────────┘

the last step will result in the following struct:

shape: (3, 5)
┌─────┬──────────┬──────────┬───────────┬──────────┐
│ day ┆ field_0  ┆ field_1  ┆ field_2   ┆ field_3  │
│ --- ┆ ---      ┆ ---      ┆ ---       ┆ ---      │
│ i64 ┆ f64      ┆ f64      ┆ f64       ┆ f64      │
╞═════╪══════════╪══════════╪═══════════╪══════════╡
│ 3   ┆ 0.5      ┆ 0.5      ┆ 0.0       ┆ 0.0      │
│ 2   ┆ 2.0462   ┆ 0.223971 ┆ 0.336793  ┆ 1.524834 │
│ 1   ┆ 4.866232 ┆ 0.640294 ┆ -0.659869 ┆ 1.547251 │
└─────┴──────────┴──────────┴───────────┴──────────┘

By chance, you can get a different order after aggregation:

shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ day ┆ params                              │
│ --- ┆ ---                                 │
│ i64 ┆ list[list[f64]]                     │
╞═════╪═════════════════════════════════════╡
│ 2   ┆ [[2.0462, 0.223971, 0.336793], [... │
│ 1   ┆ [[4.866232, 0.640294, -0.659869]... │
│ 3   ┆ [[0.5, 0.5], [0.0, 0.0]]            │
└─────┴─────────────────────────────────────┘

This will lead to the following struct:

shape: (3, 7)
┌─────┬──────────┬──────────┬───────────┬──────────┬──────────┬───────────┐
│ day ┆ field_0  ┆ field_1  ┆ field_2   ┆ field_3  ┆ field_4  ┆ field_5   │
│ --- ┆ ---      ┆ ---      ┆ ---       ┆ ---      ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ f64      ┆ f64       ┆ f64      ┆ f64      ┆ f64       │
╞═════╪══════════╪══════════╪═══════════╪══════════╪══════════╪═══════════╡
│ 2   ┆ 2.0462   ┆ 0.223971 ┆ 0.336793  ┆ 1.524834 ┆ 0.495378 ┆ 1.091109  │
│ 1   ┆ 4.866232 ┆ 0.640294 ┆ -0.659869 ┆ 1.547251 ┆ 1.81586  ┆ -1.430613 │
│ 3   ┆ 0.5      ┆ 0.5      ┆ 0.0       ┆ 0.0      ┆ null     ┆ null      │
└─────┴──────────┴──────────┴───────────┴──────────┴──────────┴───────────┘

So I think this non-deterministic nature of your query may be at the root of it not working in lazy mode. But do open an issue.
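The order dependence shown in the two tables above can be modelled in a few lines of plain Python. This is only a sketch of the behaviour described in this answer, not Polars' implementation: `to_fields` and its `width` argument are hypothetical, with `width` loosely playing the role of fixing the number of fields up front.

```python
def to_fields(rows, width=None):
    """Model: the field count comes from the first row unless width is given."""
    if width is None:
        width = len(rows[0])
    # Truncate longer rows and pad shorter ones with None (shown as null).
    return [(row + [None] * width)[:width] for row in rows]


# Flattened rows copied from the tables above.
day1 = [4.866232, 0.640294, -0.659869, 1.547251, 1.81586, -1.430613]
day2 = [2.0462, 0.223971, 0.336793, 1.524834, 0.495378, 1.091109]
day3 = [0.5, 0.5, 0.0, 0.0]

short_first = to_fields([day3, day2, day1])      # 4 fields, longer rows cut
long_first = to_fields([day2, day1, day3])       # 6 fields, day 3 padded
fixed = to_fields([day3, day2, day1], width=6)   # row order no longer matters

print(len(short_first[0]), len(long_first[0]), len(fixed[0]))
```

With a fixed width, the number of fields is the same for every row order, which is the property a lazy query needs in order to know its schema before execution.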

huangapple
  • Published on 2023-03-04 04:43:24
  • Source: https://go.coder-hub.com/75631703.html