Polars: how to turn a column of type list[list[...]] into an ndarray

Question

I know I can turn a normal Polars Series into a NumPy array via .to_numpy().

    import polars as pl
    s = pl.Series("a", [1,2,3])
    s
    shape: (3,)
    Series: 'a' [i64]
    [
    	1
    	2
    	3
    ]
    s.to_numpy()
    [1 2 3]

However, that does not work with a list type. What would be the way to turn such a construct into a 2-D array?

More generally, is there a way to turn a Series of list[list[whatever]] into a 3-D array, and so on?

    s = pl.Series("a", [[1,1,1],[1,2,3],[1,0,1]])
    s
    shape: (3,)
    Series: 'a' [list[i64]]
    [
    	[1, 1, 1]
    	[1, 2, 3]
    	[1, 0, 1]
    ]
    s.to_numpy()  # exceptions.ComputeError: 'to_numpy' not supported for dtype: List(Int64)

Desired output would be:

    [[1, 1, 1],
    [1, 2, 3],
    [1, 0, 1]]

Or, one step further:

    s = pl.Series("a", [[[1,1],[1,2]],[[1,1],[1,1]]])
    s
    shape: (2,)
    Series: 'a' [list[list[i64]]]
    [
    	[[1, 1], [1, 2]]
    	[[1, 1], [1, 1]]
    ]
    s.to_numpy()  # exceptions.ComputeError: 'to_numpy' not supported for dtype: List(Int64)

    [[[1 1]
      [1 2]]
    
     [[1 1]
      [1 1]]]

Answer 1

Score: 1

You can explode the Series and then reshape the resulting NumPy array. Given that the ComputeError says to_numpy is not supported for list dtypes, that is probably the only way at present. A list dtype can hold lists of different lengths from row to row, which would break any conversion like this, so it makes sense that it is not supported.

That said, if you know the lists have the same length in every row, this operation can be written generally for arbitrarily nested list types. It just involves tracking how the Series length changes with each explode, and then computing the new dimensions from those lengths:

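    from itertools import pairwise  # requires Python 3.10+

    import polars as pl

    def multidimensional_to_numpy(s):
        # Record the Series length before and after each explode; the leading 1
        # makes the first ratio equal to the outermost dimension, len(s).
        dimensions = [1, len(s)]
        while s.dtype == pl.List:
            s = s.explode()
            dimensions.append(len(s))
        # Each axis size is the ratio of consecutive lengths.
        dimensions = [p[1] // p[0] for p in pairwise(dimensions)]
        return s.to_numpy().reshape(dimensions)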

    multidimensional_to_numpy(pl.Series("a", [1,2,3]))
    array([1, 2, 3], dtype=int64)

    multidimensional_to_numpy(pl.Series("a", [[1,1,1],[1,2,3],[1,0,1]]))
    array([[1, 1, 1],
           [1, 2, 3],
           [1, 0, 1]], dtype=int64)

    multidimensional_to_numpy(pl.Series("a", [[[1,1],[1,2]], [[1,1],[1,1]]]))
    array([[[1, 1],
            [1, 2]],

           [[1, 1],
            [1, 1]]], dtype=int64)

Note that the soon-to-be-released Array dtype guarantees same-length arrays throughout the column (and the current arr will become list), so this answer could be improved in due time (perhaps with direct to_numpy support there). In particular, the dimension calculation above should simplify to tracking the dtype.width of each inner Array dtype.

huangapple
  • Published on May 29, 2023, 00:19:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76352445.html