July 20, 2023 19:09:48

Resize a numpy array of lists so that the lists all have the same length and the dtype of the numpy array can be inferred correctly

Question

I currently have the following dataframe

import numpy as np
import pandas as pd

data = {'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'], ['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']],
        'col_b': [[1, 3], [1, 0, 0], [4], [1, 1, 2, 0], [0, 0, 5], [3, 1, 2, 5]]}
df = pd.DataFrame(data)

Working with col_a, I want to resize its lists in a vectorized manner so that every sublist has the length of the longest list, filling the empty positions with 'None' (and with nan for col_b). I want the final output to look as follows:

                   col_a               col_b
0     [a, b, None, None]    [1, 3, nan, nan]
1        [a, b, c, None]      [1, 0, 0, nan]
2  [a, None, None, None]  [4, nan, nan, nan]
3           [a, b, c, d]        [1, 1, 2, 0]
4        [a, b, c, None]      [0, 0, 5, nan]
5           [a, b, c, d]        [3, 1, 2, 5]

So far I have done the following

# Convert the column to a NumPy array with object dtype
col_np = df['col_a'].to_numpy()

# Find the maximum length of the lists using NumPy operations
max_length = np.max(np.frompyfunc(len, 1, 1)(col_np))

# Create a mask for padding
mask = np.arange(max_length) < np.frompyfunc(len, 1, 1)(col_np)[:, None]

# Pad the lists with None where necessary
result = np.where(mask, col_np, 'None')

This results in the following error
ValueError: operands could not be broadcast together with shapes (6,4) (6,) ()

I feel like I'm close but there's something that I'm missing here. Please note that only vectorized solutions will be marked as the answer.

Answer 1

Score: 2

"Only vectorized solutions will be marked as the answer." That's too bad, because no truly vectorized approach is possible with an array of lists. In that sense, np.frompyfunc is not truly vectorized either: it still calls the Python function once per element.

If by "vectorized" you mean without an explicit Python loop, you could use:

df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())

An alternative with an explicit loop would be:

size = df['col_a'].str.len().max()

df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

Output:

          col_a         col_b                  out_a
0        [a, b]        [1, 3]     [a, b, None, None]
1     [a, b, c]     [1, 0, 0]        [a, b, c, None]
2           [a]           [4]  [a, None, None, None]
3  [a, b, c, d]  [1, 1, 2, 0]           [a, b, c, d]
4     [a, b, c]     [0, 0, 5]        [a, b, c, None]
5  [a, b, c, d]  [3, 1, 2, 5]           [a, b, c, d]
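To make the one-liner above easier to follow, here is a minimal sketch that splits it into its three steps. The intermediate names (`ragged`, `wide`, `padded`) and the shortened sample frame are only illustrative, and passing `index=df.index` is an addition of mine so that the assignment still aligns if the frame has a non-default index.

```python
import pandas as pd

df = pd.DataFrame({'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'], ['a', 'b', 'c', 'd']]})

# Step 1: a plain Python list of lists (still ragged).
ragged = df['col_a'].to_numpy().tolist()

# Step 2: the DataFrame constructor aligns every row to the longest list
# and marks the missing cells as null.
wide = pd.DataFrame(ragged)

# Step 3: back to a list of equal-length lists, one per row.
padded = wide.to_numpy().tolist()

df['out_a'] = pd.Series(padded, index=df.index)
print(df)
```

The padding itself happens in step 2; steps 1 and 3 only convert between containers, which is where most of the overhead of this approach comes from.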

timings

For small lists, the "vectorized" and loop solutions have very similar timings.

[Figure: timings for lists with 1 to 10 items]

However, as the lists grow, the Python loop becomes more efficient.

[Figure: timings for lists with 0 to 50 items]

[Figure: timings for lists with 0 to 200 items]

[Figure: timings for lists with 0 to 2000 items]

Code used for the timings:

import pandas as pd
import perfplot
import numpy as np

def pandas_vectorized(df):
    df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())

def python_loop(df):
    size = df['col_a'].str.len().max()
    df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

MAX_LIST_SIZE = 2000

perfplot.show(
    # n rows, each a list of random length k in [0, MAX_LIST_SIZE)
    setup=lambda n: pd.DataFrame({'col_a': [['x']*k for k in np.random.randint(0, MAX_LIST_SIZE, size=n)]}),
    kernels=[pandas_vectorized, python_loop],
    n_range=[2**k for k in range(1, 18)],  # upper bound was 22 for small lists
    xlabel="len(df)",
    equality_check=None,
    max_time=10,
)

Answer 2

Score: 1

This is basically a padding question: expanding lists to a matching length by adding fill values. This has come up a number of times.

itertools has a variant on zip that is useful for this:

In [273]: from itertools import zip_longest

Your column:

In [274]: x=df.col_a.to_numpy()
In [275]: x
Out[275]: 
array([list(['a', 'b']), list(['a', 'b', 'c']), list(['a']),
       list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c']),
       list(['a', 'b', 'c', 'd'])], dtype=object)

This pads, but it's the transpose of what we want:

In [276]: list(zip_longest(*x))
Out[276]: 
[('a', 'a', 'a', 'a', 'a', 'a'),
 ('b', 'b', None, 'b', 'b', 'b'),
 (None, 'c', None, 'c', 'c', 'c'),
 (None, None, None, 'd', None, 'd')]

We could use an array transpose, but there's a list idiom that does the same:

In [277]: list(zip(*(zip_longest(*x))))
Out[277]: 
[('a', 'b', None, None),
 ('a', 'b', 'c', None),
 ('a', None, None, None),
 ('a', 'b', 'c', 'd'),
 ('a', 'b', 'c', None),
 ('a', 'b', 'c', 'd')]

And the same thing applied to the numeric column, with nan fill:

In [281]: list(zip(*(zip_longest(*df.col_b, fillvalue=np.nan))))
Out[281]: 
[(1, 3, nan, nan),
 (1, 0, 0, nan),
 (4, nan, nan, nan),
 (1, 1, 2, 0),
 (0, 0, 5, nan),
 (3, 1, 2, 5)]

For large lists of lists there are faster padding methods, but this zip_longest is one of the easier ones to remember and use.
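The zip_longest idiom above can be wrapped so that it writes straight back into the dataframe. The following is only a sketch: `pad_column` is a hypothetical helper name, and it assumes the column holds at least one non-empty list (otherwise zip_longest yields no rows and the lengths no longer match the index).

```python
from itertools import zip_longest

import numpy as np
import pandas as pd


def pad_column(series, fillvalue=None):
    """Pad every list in `series` to the length of the longest one."""
    # zip_longest pads but transposes; the outer zip transposes back.
    padded = zip(*zip_longest(*series, fillvalue=fillvalue))
    # Turn the tuples back into lists so the cells keep the original type.
    return pd.Series([list(row) for row in padded], index=series.index)


df = pd.DataFrame({'col_a': [['a', 'b'], ['a']], 'col_b': [[1, 3], [4]]})
out_a = pad_column(df['col_a'])                    # pads with None
out_b = pad_column(df['col_b'], fillvalue=np.nan)  # pads with nan
```

Converting the tuples to lists is optional; it is done here only so the result looks like the output requested in the question.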

edit

With your code, mask is a good-looking array: True marks the positions that hold real values, and False marks where we want the fill values:

In [285]: mask
Out[285]: 
array([[ True,  True, False, False],
       [ True,  True,  True, False],
       [ True, False, False, False],
       [ True,  True,  True,  True],
       [ True,  True,  True, False],
       [ True,  True,  True,  True]])

Its shape is (6,4). To use it in np.where we have to expand col_np to (6,1):

In [286]: result = np.where(mask, col_np[:,None], 'None')

But the result is hardly what we want:

In [287]: result
Out[287]: 
array([[list(['a', 'b']), list(['a', 'b']), 'None', 'None'],
       [list(['a', 'b', 'c']), list(['a', 'b', 'c']),
        list(['a', 'b', 'c']), 'None'],
       [list(['a']), 'None', 'None', 'None'],
       [list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd']),
        list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd'])],
       [list(['a', 'b', 'c']), list(['a', 'b', 'c']),
        list(['a', 'b', 'c']), 'None'],
       [list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd']),
        list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd'])]],
      dtype=object)
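The shape mismatch behind the question's ValueError can be checked without building any arrays. This small sketch uses np.broadcast_shapes, which assumes NumPy 1.20 or later; the shapes are the ones from the error message.

```python
import numpy as np

# Shapes from the question: mask is (6, 4), col_np is (6,), 'None' is a scalar ().
try:
    np.broadcast_shapes((6, 4), (6,), ())
    error = None
except ValueError as exc:
    # Broadcasting aligns trailing axes, and 4 does not match 6.
    error = str(exc)

# With a trailing axis added to col_np, the shapes are compatible.
compatible = np.broadcast_shapes((6, 4), (6, 1), ())
print(error)
print(compatible)
```

The shapes now agree, but each cell of col_np is still a whole list, so np.where can only repeat that list across the row; it never looks inside it. That is why the output above contains repeated lists rather than padded ones.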

However there is a way of using that mask to fill an array. First make the target with the right shape and dtype:

In [288]: result = np.empty((6,4),object)
In [289]: result
Out[289]: 
array([[None, None, None, None],
       [None, None, None, None],
       [None, None, None, None],
       [None, None, None, None],
       [None, None, None, None],
       [None, None, None, None]], dtype=object)
In [290]: result[mask] = np.hstack(col_np)
In [291]: result
Out[291]: 
array([['a', 'b', None, None],
       ['a', 'b', 'c', None],
       ['a', None, None, None],
       ['a', 'b', 'c', 'd'],
       ['a', 'b', 'c', None],
       ['a', 'b', 'c', 'd']], dtype=object)

This fill works because result[mask] is a flat array of values:

In [292]: result[mask]
Out[292]: 
array(['a', 'b', 'a', 'b', 'c', 'a', 'a', 'b', 'c', 'd', 'a', 'b', 'c',
       'a', 'b', 'c', 'd'], dtype=object)

Overall this approach is quite close to one of the more clever, and faster, padding methods that have been proposed. I believe Warren Weckesser was the original source.

And the same for the numeric column, starting from an array full of nan:

In [300]: result = np.full((6,4),np.nan)
In [301]: result
Out[301]: 
array([[nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])
In [302]: result[mask] = np.hstack(df.col_b)
In [303]: result
Out[303]: 
array([[ 1.,  3., nan, nan],
       [ 1.,  0.,  0., nan],
       [ 4., nan, nan, nan],
       [ 1.,  1.,  2.,  0.],
       [ 0.,  0.,  5., nan],
       [ 3.,  1.,  2.,  5.]])

The mask can be created with only one list comprehension, which gets the length of each list:

In [318]: lens = np.array([len(i) for i in x])
In [319]: np.max(lens)
Out[319]: 4
In [320]: mask1 = np.arange(np.max(lens))<lens[:,None]
In [321]: mask1
Out[321]: 
array([[ True,  True, False, False],
       [ True,  True,  True, False],
       [ True, False, False, False],
       [ True,  True,  True,  True],
       [ True,  True,  True, False],
       [ True,  True,  True,  True]])

timings

With Mozway's dataframe expansion:

In [325]: timeit pd.DataFrame(df['col_a'].to_numpy().tolist())
296 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [326]: timeit pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())
567 µs ± 52.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Packaging the mask approach into a function:

def foo(df):
    x = df.col_a.to_numpy()
    lens = np.array([len(i) for i in x])
    mask = np.arange(np.max(lens))<lens[:,None]
    result = np.empty(mask.shape, dtype=object)
    result[mask] = np.hstack(x)
    return result

Time to get the padded array:

In [337]: timeit foo(df)
74.6 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

and converting that array to a Series:

In [341]: timeit pd.Series(foo(df).tolist())
299 µs ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

and the zip_longest approach (fastest for this size)

In [344]: pd.Series(list(zip(*(zip_longest(*df.col_a)))))
Out[344]: 
0       (a, b, None, None)
1          (a, b, c, None)
2    (a, None, None, None)
3             (a, b, c, d)
4          (a, b, c, None)
5             (a, b, c, d)
dtype: object
In [345]: timeit list(zip(*(zip_longest(*df.col_a))))
19.7 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [346]: timeit pd.Series(list(zip(*(zip_longest(*df.col_a)))))
165 µs ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
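The `timeit` lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The function names `via_dataframe` and `via_zip_longest` and the repeat counts are my own choices, and the numbers will differ from the ones quoted above depending on machine and library versions.

```python
import timeit
from itertools import zip_longest

import pandas as pd

df = pd.DataFrame({'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'],
                             ['a', 'b', 'c', 'd'], ['a', 'b', 'c'],
                             ['a', 'b', 'c', 'd']]})


def via_dataframe(df):
    # Pad by round-tripping through the DataFrame constructor.
    return pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist()


def via_zip_longest(df):
    # Pad with zip_longest, then transpose back with zip.
    return list(zip(*zip_longest(*df['col_a'])))


timings = {}
for func in (via_dataframe, via_zip_longest):
    # Best of 5 repeats of 200 calls, reported per call in microseconds.
    best = min(timeit.repeat(lambda: func(df), number=200, repeat=5))
    timings[func.__name__] = best / 200 * 1e6

for name, micros in timings.items():
    print(f"{name}: {micros:.1f} µs per call")
```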
