Polars: how to turn a column of type list[list[...]] into an ndarray
Question
I know I can turn a normal polars Series into a NumPy array via .to_numpy().
import polars as pl
s = pl.Series("a", [1,2,3])
s
shape: (3,)
Series: 'a' [i64]
[
1
2
3
]
s.to_numpy()
[1 2 3]
However, that does not work with a list type. What would be the way to turn such a construct into a 2-D array?
And, more generally, is there a way to turn a series of list[list[whatever]] into a 3-D array, and so on?
s = pl.Series("a", [[1,1,1],[1,2,3],[1,0,1]])
s
shape: (3,)
Series: 'a' [list[i64]]
[
[1, 1, 1]
[1, 2, 3]
[1, 0, 1]
]
s.to_numpy() # exceptions.ComputeError: 'to_numpy' not supported for dtype: List(Int64)
Desired output would be:
[[1, 1, 1],
[1, 2, 3],
[1, 0, 1]]
Or, one step further:
s = pl.Series("a", [[[1,1],[1,2]],[[1,1],[1,1]]])
s
shape: (2,)
Series: 'a' [list[list[i64]]]
[
[[1, 1], [1, 2]]
[[1, 1], [1, 1]]
]
s.to_numpy() # exceptions.ComputeError: 'to_numpy' not supported for dtype: List(Int64)
[[[1 1]
[1 2]]
[[1 1]
[1 1]]]
Answer 1
Score: 1
You could explode the series and then reshape the resulting NumPy array. That is probably the only way for now, since the ComputeError says to_numpy is not supported for this dtype in polars. Rows of a list dtype can have different lengths, which would break any computation like this, so it makes sense that it is not supported.
That said, if you know your list column has a uniform length in every row, this operation can be written generally for any arbitrary nesting of the list type. It just involves keeping track of how the length changes with each explode, and then calculating the proper new dimensions:
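from itertools import pairwise  # Python 3.10+

import polars as pl

def multidimensional_to_numpy(s):
    # Record the series length before and after each explode.
    dimensions = [1, len(s)]
    while s.dtype == pl.List:
        s = s.explode()
        dimensions.append(len(s))
    # The ratio of consecutive lengths is the size of each axis.
    dimensions = [p[1] // p[0] for p in pairwise(dimensions)]
    return s.to_numpy().reshape(dimensions)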
multidimensional_to_numpy(pl.Series("a", [1,2,3]))
array([1, 2, 3], dtype=int64)
multidimensional_to_numpy(pl.Series("a", [[1,1,1],[1,2,3],[1,0,1]]))
array([[1, 1, 1],
[1, 2, 3],
[1, 0, 1]], dtype=int64)
multidimensional_to_numpy(pl.Series("a", [[[1,1],[1,2]], [[1,1],[1,1]]]))
array([[[1, 1],
[1, 2]],
[[1, 1],
[1, 1]]], dtype=int64)
Note that with the soon-to-be-released Array dtype, which guarantees same-length arrays throughout the column (at which point the current arr namespace will become list), this answer could be improved upon in due time (maybe with direct to_numpy support there?). In particular, the dimension calculation above should simplify to tracking the dtype.width of each inner array dtype.