Polars arr.to_struct() throws "pyo3_runtime.PanicException: not implemented for dtype Unknown" exception
Question
This is a follow-up to <https://stackoverflow.com/questions/75516576/how-to-return-multiple-stats-as-multiple-columns-in-polars-grouby-context> and <https://stackoverflow.com/questions/75595957/how-to-flatten-split-a-tuple-of-arrays-and-calculate-column-means-in-polars-data/75596769#75596769>.
The problem can be illustrated by the example below:
from functools import partial
import polars as pl
import statsmodels.api as sm
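def ols_stats(s, yvar, xvars):
    # Unpack the struct Series into a DataFrame with one column per field.
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    # Return the coefficients and their t-values as a Series of two lists.
    return pl.Series(values=(reg.params, reg.tvalues), nan_to_null=True)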
df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
).lazy()
res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)
res.with_columns(
    pl.col("params").arr.eval(pl.element().arr.explode()).arr.to_struct()
).unnest("params").collect()
After running the code above, I get the following error:
pyo3_runtime.PanicException: not implemented for dtype Unknown
But when `.lazy()` and `.collect()` are removed from the code above, it works as intended. Below is the result in eager mode (the expected behavior):
shape: (2, 5)
┌─────┬──────────┬──────────┬──────────┬───────────┐
│ day ┆ field_0 ┆ field_1 ┆ field_2 ┆ field_3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╪══════════╪═══════════╡
│ 2 ┆ 0.466089 ┆ 0.503127 ┆ 0.916982 ┆ 1.451151 │
│ 1 ┆ 1.008659 ┆ -0.03324 ┆ 3.204266 ┆ -0.124422 │
└─────┴──────────┴──────────┴──────────┴───────────┘
So, what is the problem and how am I supposed to resolve it?
Answer 1
Score: 1
Don't return a `Series` from `ols_stats()` but a `dict`; then it should work. This is also semantically better, as the struct you show at the end is a mess: the first two fields are `params` and the last two are `tvalues`. Try this instead:
def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    # A dict return value becomes a struct with one named field per key.
    return {"params": reg.params.tolist(), "tvalues": reg.tvalues.tolist()}
Polars automatically turns the `dict[list[f64]]` into a `struct[2]`. I had to play around a bit to figure this out, but it seems to work.
This way you end up with semantically meaningful results:
shape: (3, 3)
┌─────┬─────────────────────────────────┬────────────────────────────────┐
│ day ┆ params ┆ tvalues │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[f64] ┆ list[f64] │
╞═════╪═════════════════════════════════╪════════════════════════════════╡
│ 1 ┆ [4.866232, 0.640294, -0.659869] ┆ [1.547251, 1.81586, -1.430613] │
│ 3 ┆ [0.5, 0.5] ┆ [0.0, 0.0] │
│ 2 ┆ [2.0462, 0.223971, 0.336793] ┆ [1.524834, 0.495378, 1.091109] │
└─────┴─────────────────────────────────┴────────────────────────────────┘
Now it works lazily:
res = df.lazy().groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
).unnest("params").collect()
If you want things unnested, why not return them unnested right away:
def ols_stats(s, yvar, xvars):
    df = s.struct.unnest()
    reg = sm.OLS(df[yvar].to_numpy(), df[xvars].to_numpy(), missing="drop").fit()
    # One key per coefficient and per t-value, so the struct unnests into flat columns.
    param_dict = {f"param_{i}": v for i, v in enumerate(reg.params.tolist())}
    tvalues_dict = {f"tvalue_{i}": v for i, v in enumerate(reg.tvalues.tolist())}
    return param_dict | tvalues_dict
df = pl.DataFrame(
    {
        "day": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "y": [1, 6, 3, 2, 8, 4, 5, 2, 7, 3],
        "x1": [1, 8, 2, 3, 5, 2, 1, 2, 7, 3],
        "x2": [8, 5, 3, 6, 3, 7, 3, 2, 9, 1],
    }
).lazy()
res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("results")
).unnest("results").collect()
print(res)
Returns:
shape: (2, 5)
┌─────┬──────────┬──────────┬──────────┬───────────┐
│ day ┆ param_0 ┆ param_1 ┆ tvalue_0 ┆ tvalue_1 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╪══════════╪═══════════╡
│ 1 ┆ 1.008659 ┆ -0.03324 ┆ 3.204266 ┆ -0.124422 │
│ 2 ┆ 0.466089 ┆ 0.503127 ┆ 0.916982 ┆ 1.451151 │
└─────┴──────────┴──────────┴──────────┴───────────┘
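The flattening step in the last `ols_stats` can be tried in isolation, without Polars or statsmodels. The sketch below uses plain Python lists in place of `reg.params.tolist()` and `reg.tvalues.tolist()`; the helper name `flatten_stats` is hypothetical, and the numbers are simply the day 1 values from the output above.

```python
def flatten_stats(params, tvalues):
    """Build one flat dict with a separate key per coefficient and t-value."""
    param_dict = {f"param_{i}": v for i, v in enumerate(params)}
    tvalue_dict = {f"tvalue_{i}": v for i, v in enumerate(tvalues)}
    # The `|` dict union needs Python 3.9+; {**a, **b} also works on older versions.
    return {**param_dict, **tvalue_dict}


# Day 1 values copied from the result table above (illustrative input only).
row = flatten_stats([1.008659, -0.03324], [3.204266, -0.124422])
print(list(row))  # ['param_0', 'param_1', 'tvalue_0', 'tvalue_1']
```

Because the keys depend only on the number of regressors, every group yields the same field names as long as each regression has the same number of coefficients.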
Answer 2
Score: 0
As @jqurious points out in the comments, this may well be a bug if there's a difference in behaviour between lazy and eager.
However, there is a hint at what may be going on in the documentation of `arr.to_struct()`:
[Image: screenshot of the `arr.to_struct()` parameter documentation]
In particular, `upper_bound` points to a special requirement of a `LazyFrame`: it needs to know the schema at all times.
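If we look at the intermediate output of your query (running in eager mode) after this point:
```python
res = df.groupby("day").agg(
    pl.struct(["y", "x1", "x2"])
    .apply(partial(ols_stats, yvar="y", xvars=["x1", "x2"]))
    .alias("params")
)
res
```
it looks like this:
shape: (3, 2)
┌─────┬───────────────────────────────────────────────────────────────────┐
│ day ┆ params                                                            │
│ --- ┆ ---                                                               │
│ i64 ┆ list[list[f64]]                                                   │
╞═════╪═══════════════════════════════════════════════════════════════════╡
│ 1   ┆ [[4.866232, 0.640294, -0.659869], [1.547251, 1.81586, -1.430613]] │
│ 3   ┆ [[0.5, 0.5], [0.0, 0.0]]                                          │
│ 2   ┆ [[2.0462, 0.223971, 0.336793], [1.524834, 0.495378, 1.091109]]    │
└─────┴───────────────────────────────────────────────────────────────────┘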
Note that the sublists for day 3 have length 2, in contrast to the sublists for days 1 and 2, which have length 3.
This is almost certainly not something you want, so Polars throwing an error is maybe not such a bad thing.
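One way to catch such ragged results early is to check the sublist lengths per group before building a struct. The following is a minimal plain-Python sketch, not Polars code: the helper `check_uniform_width` and the `results` dict are hypothetical stand-ins for the `list[list[f64]]` column shown above.

```python
def check_uniform_width(results):
    """Raise ValueError unless every group has sublists of the same lengths.

    `results` maps a group key (here: day) to a list of sublists, mirroring
    the list[list[f64]] column from the intermediate output.
    """
    widths = {key: tuple(len(sub) for sub in subs) for key, subs in results.items()}
    if len(set(widths.values())) > 1:
        raise ValueError(f"ragged results per group: {widths}")
    return next(iter(widths.values()), ())


# Values copied from the intermediate output above.
results = {
    1: [[4.866232, 0.640294, -0.659869], [1.547251, 1.81586, -1.430613]],
    3: [[0.5, 0.5], [0.0, 0.0]],
    2: [[2.0462, 0.223971, 0.336793], [1.524834, 0.495378, 1.091109]],
}

message = None
try:
    check_uniform_width(results)
except ValueError as exc:
    message = str(exc)
    print(message)
```

With the data above the check fails, because day 3 has two values per sublist while days 1 and 2 have three.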
In the last step, you want to turn these nested lists into a struct. But depending on which day comes first, you will get different results. In fact, if you run the first part multiple times, day 3 will sometimes come first. With default settings, `to_struct` takes the number of fields from the first non-null row, so it will then produce only 4 fields.
Try it yourself. If day 3 comes first, like this:
shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ day ┆ params │
│ --- ┆ --- │
│ i64 ┆ list[list[f64]] │
╞═════╪═════════════════════════════════════╡
│ 3 ┆ [[0.5, 0.5], [0.0, 0.0]] │
│ 1 ┆ [[4.866232, 0.640294, -0.659869]... │
│ 2 ┆ [[2.0462, 0.223971, 0.336793], [... │
└─────┴─────────────────────────────────────┘
the last step will result in the following struct:
shape: (3, 5)
┌─────┬──────────┬──────────┬───────────┬──────────┐
│ day ┆ field_0 ┆ field_1 ┆ field_2 ┆ field_3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╪═══════════╪══════════╡
│ 3 ┆ 0.5 ┆ 0.5 ┆ 0.0 ┆ 0.0 │
│ 2 ┆ 2.0462 ┆ 0.223971 ┆ 0.336793 ┆ 1.524834 │
│ 1 ┆ 4.866232 ┆ 0.640294 ┆ -0.659869 ┆ 1.547251 │
└─────┴──────────┴──────────┴───────────┴──────────┘
By chance, you can get a different order after aggregation:
shape: (3, 2)
┌─────┬─────────────────────────────────────┐
│ day ┆ params │
│ --- ┆ --- │
│ i64 ┆ list[list[f64]] │
╞═════╪═════════════════════════════════════╡
│ 2 ┆ [[2.0462, 0.223971, 0.336793], [... │
│ 1 ┆ [[4.866232, 0.640294, -0.659869]... │
│ 3 ┆ [[0.5, 0.5], [0.0, 0.0]] │
└─────┴─────────────────────────────────────┘
This will lead to the following struct:
shape: (3, 7)
┌─────┬──────────┬──────────┬───────────┬──────────┬──────────┬───────────┐
│ day ┆ field_0 ┆ field_1 ┆ field_2 ┆ field_3 ┆ field_4 ┆ field_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╪═══════════╪══════════╪══════════╪═══════════╡
│ 2 ┆ 2.0462 ┆ 0.223971 ┆ 0.336793 ┆ 1.524834 ┆ 0.495378 ┆ 1.091109 │
│ 1 ┆ 4.866232 ┆ 0.640294 ┆ -0.659869 ┆ 1.547251 ┆ 1.81586 ┆ -1.430613 │
│ 3 ┆ 0.5 ┆ 0.5 ┆ 0.0 ┆ 0.0 ┆ null ┆ null │
└─────┴──────────┴──────────┴───────────┴──────────┴──────────┴───────────┘
So I think this non-deterministic nature of your query may be at the root of it not working in lazy mode. But do open an issue.
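The order dependence described above can be mimicked in plain Python. This is an illustration only, not the Polars implementation: the function `to_struct_rows` is hypothetical, and the strategy names are borrowed from the `n_field_strategy` options of `to_struct` as I understand them from its documentation. The input rows are the already flattened lists for each day.

```python
def to_struct_rows(rows, strategy="first_non_null"):
    """Pick a field count for list -> struct conversion and fit rows to it.

    "first_non_null" takes the width from the first non-null row,
    "max_width" takes the width of the longest row.
    """
    non_null = [r for r in rows if r is not None]
    if strategy == "first_non_null":
        width = len(non_null[0]) if non_null else 0
    elif strategy == "max_width":
        width = max((len(r) for r in non_null), default=0)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Truncate longer rows and pad shorter rows with None to the chosen width.
    return [
        None if r is None else list(r[:width]) + [None] * (width - len(r))
        for r in rows
    ]


day1 = [4.866232, 0.640294, -0.659869, 1.547251, 1.81586, -1.430613]
day2 = [2.0462, 0.223971, 0.336793, 1.524834, 0.495378, 1.091109]
day3 = [0.5, 0.5, 0.0, 0.0]

day3_first = to_struct_rows([day3, day2, day1])
day2_first = to_struct_rows([day2, day1, day3])
stable = to_struct_rows([day3, day2, day1], strategy="max_width")
print(len(day3_first[0]), len(day2_first[0]), len(stable[0]))  # 4 6 6
```

The field count follows the row order under the first strategy, which is why the same query can yield 4 or 6 fields. A lazy query has to fix the schema before any data is seen, so an order-dependent field count cannot be planned in advance.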