Resize a numpy array of lists so that the lists all have the same length and the dtype of the numpy array can be inferred correctly


Question


I currently have the following dataframe:

import numpy as np
import pandas as pd

data = {'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'], ['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']],
        'col_b': [[1, 3], [1, 0, 0], [4], [1, 1, 2, 0], [0, 0, 5], [3, 1, 2, 5]]}
df = pd.DataFrame(data)

Suppose I work with col_a. I want to resize the lists in col_a in a vectorized manner so that every sublist has the length of the longest list, and I want to fill the empty positions with 'None' in the case of col_a. I want the final output to look as follows:

                   col_a               col_b
0     [a, b, None, None]    [1, 3, nan, nan]
1        [a, b, c, None]      [1, 0, 0, nan]
2  [a, None, None, None]  [4, nan, nan, nan]
3           [a, b, c, d]        [1, 1, 2, 0]
4        [a, b, c, None]      [0, 0, 5, nan]
5           [a, b, c, d]        [3, 1, 2, 5]

So far I have done the following

# Convert the column to a NumPy array with object dtype
col_np = df['col_a'].to_numpy()

# Find the maximum length of the lists using NumPy operations
max_length = np.max(np.frompyfunc(len, 1, 1)(col_np))

# Create a mask for padding
mask = np.arange(max_length) < np.frompyfunc(len, 1, 1)(col_np)[:, None]

# Pad the lists with None where necessary
result = np.where(mask, col_np, 'None')

This results in the following error
ValueError: operands could not be broadcast together with shapes (6,4) (6,) ()

I feel like I'm close but there's something that I'm missing here. Please note that only vectorized solutions will be marked as the answer.

Answer 1

Score: 2


"Only vectorized solutions will be marked as the answer." -> That's too bad, because no (true) vectorized approach is possible with an array of lists. In that respect, np.frompyfunc is certainly not truly vectorized: it still calls the Python function once per element.
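As a rough illustration of that point, the sketch below counts how often Python-level code runs when np.frompyfunc is applied to an object array. The wrapper name counted_len and the three-row sample are made up for this example.

```python
import numpy as np
import pandas as pd

calls = 0

def counted_len(item):
    """len() wrapper that records every Python-level call (hypothetical helper)."""
    global calls
    calls += 1
    return len(item)

# An object-dtype array of ragged lists, built the same way as in the question.
arr = pd.Series([['a', 'b'], ['a', 'b', 'c'], ['a']]).to_numpy()

lengths = np.frompyfunc(counted_len, 1, 1)(arr)
print(lengths)  # object array holding the three lengths
print(calls)    # one Python call per element, so no C-level speedup
```

The ufunc wrapper gives broadcasting and array output, but the per-element work is still an ordinary Python function call.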

If by "vectorized" you mean without an explicit Python loop, you could use:

df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())
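The one-liner is easier to follow when split into its steps. The sketch below does that on the numeric column, with a shortened sample frame; the names ragged, wide and out_b are only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'col_b': [[1, 3], [1, 0, 0], [4], [1, 1, 2, 0]]})

# Step 1: a plain Python list of ragged lists.
ragged = df['col_b'].to_numpy().tolist()

# Step 2: the DataFrame constructor lays the rows out as a rectangle and
# marks the cells that have no value as missing.
wide = pd.DataFrame(ragged)

# Step 3: back to one list per row, now all of equal length.
df['out_b'] = pd.Series(wide.to_numpy().tolist())
print(wide)
print(df)
```

Note that pd.Series(...) gets a fresh 0..n-1 index here, and column assignment aligns on the index. If df has a different index, passing index=df.index to pd.Series should keep the rows matched.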

An alternative with an explicit loop would be:

size = df['col_a'].str.len().max()

df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

Output:

          col_a         col_b                  out_a
0        [a, b]        [1, 3]     [a, b, None, None]
1     [a, b, c]     [1, 0, 0]        [a, b, c, None]
2           [a]           [4]  [a, None, None, None]
3  [a, b, c, d]  [1, 1, 2, 0]           [a, b, c, d]
4     [a, b, c]     [0, 0, 5]        [a, b, c, None]
5  [a, b, c, d]  [3, 1, 2, 5]           [a, b, c, d]
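The loop version generalizes to other fill values, for instance the string 'None' for col_a and nan for col_b as requested in the question. A minimal sketch, where pad_lists and its fill parameter are hypothetical names and the sample frame is shortened:

```python
import numpy as np
import pandas as pd

def pad_lists(column, fill=None):
    """Right-pad every list in a Series to the length of the longest one."""
    size = int(column.str.len().max())
    return [row + [fill] * (size - len(row)) for row in column]

df = pd.DataFrame({
    'col_a': [['a', 'b'], ['a', 'b', 'c', 'd'], ['a']],
    'col_b': [[1, 3], [1, 1, 2, 0], [4]],
})
df['out_a'] = pad_lists(df['col_a'], fill='None')
df['out_b'] = pad_lists(df['col_b'], fill=np.nan)
print(df[['out_a', 'out_b']])
```

Each padded row is a new list, so the original lists in col_a and col_b are left untouched.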

timings

For small lists, the "vectorized" and loop solutions have very similar timings.

Here for lists with 1 to 10 items:

[Figure: perfplot timings for lists with 1 to 10 items]

However, as the size of the lists increases, the Python loop becomes more efficient.

For lists with 0 to 50 items:

[Figure: perfplot timings for lists with 0 to 50 items]

0 to 200 items:

[Figure: perfplot timings for lists with 0 to 200 items]

0 to 2000 items:

[Figure: perfplot timings for lists with 0 to 2000 items]

Code used for the timings:

import pandas as pd
import perfplot
import numpy as np

def pandas_vectorized(df):
    df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())

def python_loop(df):
    size = df['col_a'].str.len().max()
    df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

MAX_LIST_SIZE = 2000

perfplot.show(
    # n rows, each a list of random length below MAX_LIST_SIZE
    setup=lambda n: pd.DataFrame({'col_a': [['x']*k for k in np.random.randint(0, MAX_LIST_SIZE, size=n)]}),
    kernels=[pandas_vectorized, python_loop],
    n_range=[2**k for k in range(1, 18)],  # upper bound was 22 for small lists
    xlabel="len(df)",
    equality_check=None,
    max_time=10,
)

Answer 2

Score: 1


This is basically a padding question: expanding lists to a matching length by adding fill values. This has come up a number of times.

itertools has a variant on zip that is useful for this:

In [273]: from itertools import zip_longest

Your column:

In [274]: x=df.col_a.to_numpy()
In [275]: x
Out[275]: 
array([list(['a', 'b']), list(['a', 'b', 'c']), list(['a']),
       list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c']),
       list(['a', 'b', 'c', 'd'])], dtype=object)

This pads, but it's the transpose of what we want:

In [276]: list(zip_longest(*x))
Out[276]: 
[('a', 'a', 'a', 'a', 'a', 'a'),
 ('b', 'b', None, 'b', 'b', 'b'),
 (None, 'c', None, 'c', 'c', 'c'),
 (None, None, None, 'd', None, 'd')]

We could use an array transpose, but there's a list idiom that does the same:

In [277]: list(zip(*(zip_longest(*x))))
Out[277]: 
[('a', 'b', None, None),
 ('a', 'b', 'c', None),
 ('a', None, None, None),
 ('a', 'b', 'c', 'd'),
 ('a', 'b', 'c', None),
 ('a', 'b', 'c', 'd')]

And the same thing applied to the numeric column, with nan fill:

In [281]: list(zip(*(zip_longest(*df.col_b, fillvalue=np.nan))))
Out[281]: 
[(1, 3, nan, nan),
 (1, 0, 0, nan),
 (4, nan, nan, nan),
 (1, 1, 2, 0),
 (0, 0, 5, nan),
 (3, 1, 2, 5)]

For large lists of lists there are faster padding methods, but this zip_longest is one of the easier ones to remember and use.
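The idiom above returns tuples. If lists are wanted, as in the question's expected output, it can be wrapped in a small helper. This is only a sketch; pad_with_zip is an invented name and the sample frame is shortened.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

def pad_with_zip(column, fillvalue=None):
    """Pad ragged lists with a double transpose; returns a list of lists."""
    padded_columns = zip_longest(*column, fillvalue=fillvalue)
    # zip(*...) transposes back; every row arrives as a tuple, so convert it.
    return [list(row) for row in zip(*padded_columns)]

df = pd.DataFrame({
    'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a']],
    'col_b': [[1, 3], [1, 0, 0], [4]],
})
print(pad_with_zip(df['col_a'], fillvalue='None'))
print(pad_with_zip(df['col_b'], fillvalue=np.nan))
```

Because every row is unpacked into a separate argument of zip_longest, this stays pure Python, which fits the remark that faster methods exist for large inputs.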

edit

With your code, mask is a good-looking array: True marks the slots that receive list values, and False marks where we want the fill values:

In [285]: mask
Out[285]: 
array([[ True,  True, False, False],
       [ True,  True,  True, False],
       [ True, False, False, False],
       [ True,  True,  True,  True],
       [ True,  True,  True, False],
       [ True,  True,  True,  True]])

It's (6,4). To use it in np.where, we have to expand col_np to (6,1):

In [286]: result = np.where(mask, col_np[:,None], 'None')

But the result is hardly what we want:

In [287]: result
Out[287]: 
array([[list(['a', 'b']), list(['a', 'b']), 'None', 'None'],
       [list(['a', 'b', 'c']), list(['a', 'b', 'c']),
        list(['a', 'b', 'c']), 'None'],
       [list(['a']), 'None', 'None', 'None'],
       [list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd']),
        list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd'])],
       [list(['a', 'b', 'c']), list(['a', 'b', 'c']),
        list(['a', 'b', 'c']), 'None'],
       [list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd']),
        list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd'])]],
      dtype=object)
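Both the original error and this odd result come down to broadcasting. The sketch below checks the shapes with np.broadcast_shapes (available in NumPy 1.20 and later) and shows that broadcasting repeats each list object along the new axis instead of looking inside it. The two-row array col is made up for the example.

```python
import numpy as np

# Shapes from the question: mask is (6, 4), col_np is (6,), the fill is a scalar.
try:
    np.broadcast_shapes((6, 4), (6,), ())
except ValueError as exc:
    print("fails:", exc)

# A trailing axis makes col_np (6, 1), which broadcasts across the 4 columns.
print(np.broadcast_shapes((6, 4), (6, 1), ()))

# Broadcasting repeats the same list object in every column of a row.
col = np.empty(2, dtype=object)
col[0], col[1] = ['a', 'b'], ['c']
stretched = np.broadcast_to(col[:, None], (2, 3))
print(stretched[0, 0] is stretched[0, 2])
```

So np.where can only choose between a whole list and the fill value for each cell; it has no way to pick individual list elements.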

However, there is a way of using that mask to fill an array. First make the target with the right shape and dtype, then assign through the mask:

In [288]: result = np.empty((6,4),object)
In [289]: result
Out[289]: 
array([[None, None, None, None],
       [None, None, None, None],
       [None, None, None, None],
       [None, None, None, None],
       [None, None, None, None],
       [None, None, None, None]], dtype=object)
In [290]: result[mask] = np.hstack(col_np)
In [291]: result
Out[291]: 
array([['a', 'b', None, None],
       ['a', 'b', 'c', None],
       ['a', None, None, None],
       ['a', 'b', 'c', 'd'],
       ['a', 'b', 'c', None],
       ['a', 'b', 'c', 'd']], dtype=object)

This fill works because result[mask] is a flat array of values:

In [292]: result[mask]
Out[292]: 
array(['a', 'b', 'a', 'b', 'c', 'a', 'a', 'b', 'c', 'd', 'a', 'b', 'c',
       'a', 'b', 'c', 'd'], dtype=object)

Overall this approach is quite close to one of the more clever, and faster, padding methods that have been proposed. I believe Warren Weckesser was the original source.
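A small standalone sketch of the masked assignment may help. It uses a made-up 3x3 mask and fills the target with the string 'None' (as the question asked) instead of the None object that np.empty gives for object dtype.

```python
import numpy as np

mask = np.array([[True, True, False],
                 [True, False, False],
                 [True, True, True]])

# One value per True cell, already flattened in row order.
flat_values = np.array(['a', 'b', 'c', 'd', 'e', 'f'], dtype=object)

target = np.full(mask.shape, 'None', dtype=object)
# Values are consumed left to right, top to bottom, one per True cell.
target[mask] = flat_values
print(target)
```

The assignment only works when the number of True cells equals the number of flattened values, which holds here because the mask is built from the list lengths.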

And the same for the numeric column, with a nan-filled float target:

In [300]: result = np.full((6,4),np.nan)
In [301]: result
Out[301]: 
array([[nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])
In [302]: result[mask] = np.hstack(df.col_b)
In [303]: result
Out[303]: 
array([[ 1.,  3., nan, nan],
       [ 1.,  0.,  0., nan],
       [ 4., nan, nan, nan],
       [ 1.,  1.,  2.,  0.],
       [ 0.,  0.,  5., nan],
       [ 3.,  1.,  2.,  5.]])
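The two fills differ only in the fill value and the dtype of the target, so they can be folded into one function. A sketch under that assumption; pad_masked and its parameters are illustrative names, and the sample frame is shortened.

```python
import numpy as np
import pandas as pd

def pad_masked(column, fill, dtype):
    """Pad a Series of lists into a 2-D array using a boolean mask."""
    lens = np.array([len(item) for item in column])
    mask = np.arange(lens.max()) < lens[:, None]
    result = np.full(mask.shape, fill, dtype=dtype)
    # Flatten all lists in row order and drop them into the True cells.
    result[mask] = np.hstack(list(column))
    return result

df = pd.DataFrame({
    'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a']],
    'col_b': [[1, 3], [1, 0, 0], [4]],
})
padded_a = pad_masked(df['col_a'], 'None', object)
padded_b = pad_masked(df['col_b'], np.nan, float)
print(padded_a)
print(padded_b)
```

The numeric result is a regular float array, so its dtype is inferred without falling back to object.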

The mask can be created with only one list comprehension - to get the length of each list.

In [318]: lens = np.array([len(i) for i in x])
In [319]: np.max(lens)
Out[319]: 4
In [320]: mask1 = np.arange(np.max(lens))<lens[:,None]
In [321]: mask1
Out[321]: 
array([[ True,  True, False, False],
       [ True,  True,  True, False],
       [ True, False, False, False],
       [ True,  True,  True,  True],
       [ True,  True,  True, False],
       [ True,  True,  True,  True]])
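For comparison, the lengths can also come from the pandas .str.len() accessor that the loop-based answer uses. The sketch below checks, on a made-up three-row Series, that both routes give the same lengths and that the mask encodes them.

```python
import numpy as np
import pandas as pd

col = pd.Series([['a', 'b'], ['a', 'b', 'c'], ['a']])

lens_loop = np.array([len(item) for item in col])
lens_str = col.str.len().to_numpy()  # pandas accessor, same numbers

mask = np.arange(lens_loop.max()) < lens_loop[:, None]
print(np.array_equal(lens_loop, lens_str))
print(mask.sum(axis=1))  # True cells per row, i.e. the list lengths again
```

Either way the lengths are computed element by element; only the mask comparison itself is a vectorized NumPy operation.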

timings

With Mozway's dataframe expansion:

In [325]: timeit pd.DataFrame(df['col_a'].to_numpy().tolist())
296 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [326]: timeit pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())
567 µs ± 52.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Packaging the mask approach into a function:

def foo(df):
    x = df.col_a.to_numpy()
    lens = np.array([len(i) for i in x])
    mask = np.arange(np.max(lens))<lens[:,None]
    result = np.empty(mask.shape, dtype=object)
    result[mask] = np.hstack(x)
    return result

Time to get the padded array:

In [337]: timeit foo(df)
74.6 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

and converting that array to a Series:

In [341]: timeit pd.Series(foo(df).tolist())
299 µs ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

and the zip_longest approach (fastest for this size)

In [344]: pd.Series(list(zip(*(zip_longest(*df.col_a)))))
Out[344]: 
0       (a, b, None, None)
1          (a, b, c, None)
2    (a, None, None, None)
3             (a, b, c, d)
4          (a, b, c, None)
5             (a, b, c, d)
dtype: object
In [345]: timeit list(zip(*(zip_longest(*df.col_a))))
19.7 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [346]: timeit pd.Series(list(zip(*(zip_longest(*df.col_a)))))
165 µs ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
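The timings above use the IPython timeit magic. Outside IPython, a comparable measurement could be sketched with the standard timeit module as below. The function names via_dataframe, via_mask and via_zip are invented for this sketch, and the numbers it prints depend on the machine and library versions.

```python
import timeit
from itertools import zip_longest

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'],
                             ['a', 'b', 'c', 'd'], ['a', 'b', 'c'],
                             ['a', 'b', 'c', 'd']]})

def via_dataframe(df):
    return pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist()

def via_mask(df):
    x = df['col_a'].to_numpy()
    lens = np.array([len(i) for i in x])
    mask = np.arange(lens.max()) < lens[:, None]
    result = np.empty(mask.shape, dtype=object)
    result[mask] = np.hstack(x)
    return result.tolist()

def via_zip(df):
    return list(zip(*zip_longest(*df['col_a'])))

for func in (via_dataframe, via_mask, via_zip):
    # Best of 5 repeats of 200 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(df), number=200, repeat=5))
    print(f"{func.__name__}: {best / 200 * 1e6:.1f} µs per call")
```

As the perfplot results in the first answer suggest, the ranking on a six-row frame need not hold for long lists, so it is worth rerunning with data of realistic size.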

huangapple
  • Published on July 20, 2023 at 19:09:48
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76729250.html