Resize a numpy array of lists so that the lists all have the same length and the dtype of the numpy array can be inferred correctly
Question
I currently have the following dataframe:
import numpy as np
import pandas as pd
data = {'col_a': [['a', 'b'], ['a', 'b', 'c'], ['a'], ['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']],
        'col_b': [[1, 3], [1, 0, 0], [4], [1, 1, 2, 0], [0, 0, 5], [3, 1, 2, 5]]}
df = pd.DataFrame(data)
Suppose I work with col_a. I want to resize the lists in col_a in a vectorized manner so that every sublist has the length of the longest list, and I want to fill the empty positions with 'None' in the case of col_a. I want the final output to look as follows:
col_a col_b
0 [a, b, None, None] [1, 3, nan, nan]
1 [a, b, c, None] [1, 0, 0, nan]
2 [a, None, None, None] [4, nan, nan, nan]
3 [a, b, c, d] [1, 1, 2, 0]
4 [a, b, c, None] [0, 0, 5, nan]
5 [a, b, c, d] [3, 1, 2, 5]
So far I have done the following
# Convert the column to a NumPy array with object dtype
col_np = df['col_a'].to_numpy()
# Find the maximum length of the lists using NumPy operations
max_length = np.max(np.frompyfunc(len, 1, 1)(col_np))
# Create a mask for padding
mask = np.arange(max_length) < np.frompyfunc(len, 1, 1)(col_np)[:, None]
# Pad the lists with None where necessary
result = np.where(mask, col_np, 'None')
This results in the following error
ValueError: operands could not be broadcast together with shapes (6,4) (6,) ()
I feel like I'm close but there's something that I'm missing here. Please note that only vectorized solutions will be marked as the answer.
Answer 1
Score: 2
"Only vectorized solutions will be marked as the answer." That's too bad, because no (truly) vectorized approach is possible with an array of lists. In that sense, np.frompyfunc is certainly not truly vectorized.
If by "vectorized" you mean without explicit python loop, you could use:
df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())
An alternative with an explicit loop would be:
size = df['col_a'].str.len().max()
df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]
Output:
col_a col_b out_a
0 [a, b] [1, 3] [a, b, None, None]
1 [a, b, c] [1, 0, 0] [a, b, c, None]
2 [a] [4] [a, None, None, None]
3 [a, b, c, d] [1, 1, 2, 0] [a, b, c, d]
4 [a, b, c] [0, 0, 5] [a, b, c, None]
5 [a, b, c, d] [3, 1, 2, 5] [a, b, c, d]
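The explicit-loop idea above can also be sketched without pandas. This is only an illustration: the helper name `pad_lists` and its `fill` parameter are hypothetical, not part of the answer, and it returns new lists rather than modifying the inputs.

```python
def pad_lists(rows, fill=None):
    """Return new lists, each padded on the right with `fill` to the longest length."""
    rows = list(rows)
    # default=0 keeps the helper usable on an empty input
    size = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (size - len(row)) for row in rows]


col_a = [['a', 'b'], ['a', 'b', 'c'], ['a'], ['a', 'b', 'c', 'd']]
padded = pad_lists(col_a)
print(padded)
```

With a dataframe, something like `df['out_a'] = pad_lists(df['col_a'])` should behave like the list comprehension shown above.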
Timings
For small lists, the "vectorized" and loop solutions have very similar timings.
[Plot: lists with 1 to 10 items]
However, as the size of the lists increases, the Python loop becomes more efficient.
[Plots: lists with 0 to 50 items, 0 to 200 items, and 0 to 2000 items]
Code used for the timings:
import pandas as pd
import perfplot
import numpy as np
def pandas_vectorized(df):
    df['out_a'] = pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())

def python_loop(df):
    size = df['col_a'].str.len().max()
    df['out_a'] = [l+[None]*(size-len(l)) for l in df['col_a']]

MAX_LIST_SIZE = 2000

perfplot.show(
    setup=lambda n: pd.DataFrame({'col_a': [['x']*k for k in np.random.randint(0, MAX_LIST_SIZE, size=n)]}),
    kernels=[pandas_vectorized, python_loop],
    n_range=[2**k for k in range(1, 18)],  # upper bound was 22 for small lists
    xlabel="len(df)",
    equality_check=None,
    max_time=10,
)
Answer 2
Score: 1
This is basically a padding question: expanding lists to a matching length by adding fill values. This has come up a number of times.
itertools has a variant on zip that is useful for this:
In [273]: from itertools import zip_longest
Your column:
In [274]: x=df.col_a.to_numpy()
In [275]: x
Out[275]:
array([list(['a', 'b']), list(['a', 'b', 'c']), list(['a']),
list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c']),
list(['a', 'b', 'c', 'd'])], dtype=object)
This pads, but it's the transpose of what we want:
In [276]: list(zip_longest(*x))
Out[276]:
[('a', 'a', 'a', 'a', 'a', 'a'),
('b', 'b', None, 'b', 'b', 'b'),
(None, 'c', None, 'c', 'c', 'c'),
(None, None, None, 'd', None, 'd')]
We could use an array transpose, but there's a list idiom that does the same:
In [277]: list(zip(*(zip_longest(*x))))
Out[277]:
[('a', 'b', None, None),
('a', 'b', 'c', None),
('a', None, None, None),
('a', 'b', 'c', 'd'),
('a', 'b', 'c', None),
('a', 'b', 'c', 'd')]
And the same thing applied to the numeric column, with nan fill:
In [281]: list(zip(*(zip_longest(*df.col_b, fillvalue=np.nan))))
Out[281]:
[(1, 3, nan, nan),
(1, 0, 0, nan),
(4, nan, nan, nan),
(1, 1, 2, 0),
(0, 0, 5, nan),
(3, 1, 2, 5)]
For large lists of lists there are faster padding methods, but zip_longest is one of the easier ones to remember and use.
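The pad-then-transpose idiom above can be wrapped in a small function. This is a minimal sketch; the name `pad_with_zip_longest` is made up for illustration, and only `itertools.zip_longest` and the built-in `zip` are used.

```python
from itertools import zip_longest


def pad_with_zip_longest(rows, fillvalue=None):
    # zip_longest pads the short rows but yields columns (the transpose);
    # the outer zip(*...) transposes back so each tuple is one padded row.
    return list(zip(*zip_longest(*rows, fillvalue=fillvalue)))


rows = [[1, 3], [1, 0, 0], [4]]
padded = pad_with_zip_longest(rows, fillvalue=0)
print(padded)
```

One caveat of this idiom: if every input row is empty, `zip_longest` yields nothing, so the result is an empty list rather than one empty tuple per row.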
Edit
With your code, mask is a good-looking array: True marks the positions that hold real values, and False marks where we want the fill values:
In [285]: mask
Out[285]:
array([[ True, True, False, False],
[ True, True, True, False],
[ True, False, False, False],
[ True, True, True, True],
[ True, True, True, False],
[ True, True, True, True]])
Its shape is (6,4). To be used in where, we have to expand col_np to (6,1):
In [286]: result = np.where(mask, col_np[:,None], 'None')
But the result is hardly what we want:
In [287]: result
Out[287]:
array([[list(['a', 'b']), list(['a', 'b']), 'None', 'None'],
[list(['a', 'b', 'c']), list(['a', 'b', 'c']),
list(['a', 'b', 'c']), 'None'],
[list(['a']), 'None', 'None', 'None'],
[list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd']),
list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd'])],
[list(['a', 'b', 'c']), list(['a', 'b', 'c']),
list(['a', 'b', 'c']), 'None'],
[list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd']),
list(['a', 'b', 'c', 'd']), list(['a', 'b', 'c', 'd'])]],
dtype=object)
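To make the shape reasoning concrete, here is a small sketch that reproduces both behaviours on the question's data. The variable names `broadcast_failed` and `repeated` are introduced only for this illustration.

```python
import numpy as np

rows = [['a', 'b'], ['a', 'b', 'c'], ['a'],
        ['a', 'b', 'c', 'd'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]

# Build a 1-D object array whose elements are the lists themselves.
col_np = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    col_np[i] = row

lens = np.array([len(row) for row in rows])
mask = np.arange(lens.max()) < lens[:, None]  # shape (6, 4)

# (6, 4) against (6,): the trailing axes (4 and 6) do not match.
try:
    np.where(mask, col_np, 'None')
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# (6, 4) against (6, 1): the single column is stretched across 4 columns,
# so each row's whole list is repeated instead of being split into elements.
repeated = np.where(mask, col_np[:, None], 'None')
print(broadcast_failed, repeated.shape)
```

Broadcasting aligns shapes from the last axis backwards, which is why adding the trailing axis with `[:, None]` removes the error but still cannot unpack the lists.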
However there is a way of using that mask to fill an array. First make the target with the right shape and dtype:
In [288]: result = np.empty((6,4),object)
In [289]: result
Out[289]:
array([[None, None, None, None],
[None, None, None, None],
[None, None, None, None],
[None, None, None, None],
[None, None, None, None],
[None, None, None, None]], dtype=object)
In [290]: result[mask] = np.hstack(col_np)
In [291]: result
Out[291]:
array([['a', 'b', None, None],
['a', 'b', 'c', None],
['a', None, None, None],
['a', 'b', 'c', 'd'],
['a', 'b', 'c', None],
['a', 'b', 'c', 'd']], dtype=object)
This fill works because result[mask] is a flat array of values:
In [292]: result[mask]
Out[292]:
array(['a', 'b', 'a', 'b', 'c', 'a', 'a', 'b', 'c', 'd', 'a', 'b', 'c',
'a', 'b', 'c', 'd'], dtype=object)
Overall this approach is quite close to one of the more clever, and faster, padding methods that have been proposed. I believe Warren Weckesser was the original source.
And the same for the numeric column:
In [300]: result = np.full((6,4),np.nan)
In [301]: result
Out[301]:
array([[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]])
In [302]: result[mask] = np.hstack(df.col_b)
In [303]: result
Out[303]:
array([[ 1., 3., nan, nan],
[ 1., 0., 0., nan],
[ 4., nan, nan, nan],
[ 1., 1., 2., 0.],
[ 0., 0., 5., nan],
[ 3., 1., 2., 5.]])
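Both mask fills follow the same pattern, so they can be folded into one helper. This is a sketch under a few assumptions: the name `pad_to_array` and its parameters are hypothetical, the input is assumed to be non-empty, and the values are flattened with a comprehension instead of `np.hstack`.

```python
import numpy as np


def pad_to_array(rows, fill, dtype):
    """Pad ragged `rows` into a 2-D array, filling the gaps with `fill`."""
    rows = list(rows)
    lens = np.array([len(row) for row in rows])
    mask = np.arange(lens.max()) < lens[:, None]
    result = np.full(mask.shape, fill, dtype=dtype)
    # Flatten in row-major order, which is the order result[mask] selects.
    flat = np.empty(int(lens.sum()), dtype=dtype)
    flat[:] = [item for row in rows for item in row]
    result[mask] = flat
    return result


padded_a = pad_to_array([['a', 'b'], ['a', 'b', 'c'], ['a']], None, object)
padded_b = pad_to_array([[1, 3], [1, 0, 0], [4]], np.nan, float)
print(padded_a)
print(padded_b)
```

Passing `dtype=float` is what lets the numeric column hold `nan`; an integer dtype has no representation for it.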
The mask can be created with only one list comprehension, to get the length of each list:
In [318]: lens = np.array([len(i) for i in x])
In [319]: np.max(lens)
Out[319]: 4
In [320]: mask1 = np.arange(np.max(lens))<lens[:,None]
In [321]: mask1
Out[321]:
array([[ True, True, False, False],
[ True, True, True, False],
[ True, False, False, False],
[ True, True, True, True],
[ True, True, True, False],
[ True, True, True, True]])
timings
With Mozway's dataframe expansion:
In [325]: timeit pd.DataFrame(df['col_a'].to_numpy().tolist())
296 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [326]: timeit pd.Series(pd.DataFrame(df['col_a'].to_numpy().tolist()).to_numpy().tolist())
567 µs ± 52.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Packaging the mask approach into a function:
def foo(df):
    x = df.col_a.to_numpy()
    lens = np.array([len(i) for i in x])
    mask = np.arange(np.max(lens))<lens[:,None]
    result = np.empty(mask.shape, dtype=object)
    result[mask] = np.hstack(x)
    return result
Time to get the padded array:
In [337]: timeit foo(df)
74.6 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
and converting that array to a Series:
In [341]: timeit pd.Series(foo(df).tolist())
299 µs ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
and the zip_longest approach (fastest for this size)
In [344]: pd.Series(list(zip(*(zip_longest(*df.col_a)))))
Out[344]:
0 (a, b, None, None)
1 (a, b, c, None)
2 (a, None, None, None)
3 (a, b, c, d)
4 (a, b, c, None)
5 (a, b, c, d)
dtype: object
In [345]: timeit list(zip(*(zip_longest(*df.col_a))))
19.7 µs ± 151 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [346]: timeit pd.Series(list(zip(*(zip_longest(*df.col_a)))))
165 µs ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
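The timings above were taken interactively on the six-row dataframe. To repeat the comparison on larger inputs outside IPython, a sketch using the standard-library `timeit` module might look like the following. The data sizes are arbitrary, the function names are hypothetical, and no particular result is implied, since timings depend on the machine and the data.

```python
import timeit
from itertools import zip_longest


def python_loop(rows):
    size = max(len(row) for row in rows)
    return [row + [None] * (size - len(row)) for row in rows]


def zip_longest_pad(rows):
    return list(zip(*zip_longest(*rows)))


# 10,000 rows with lengths cycling from 1 to 50 (an arbitrary choice).
rows = [['x'] * (i % 50 + 1) for i in range(10_000)]

timings = {}
for func in (python_loop, zip_longest_pad):
    # Average over a few calls; the lambda is called within this iteration.
    seconds = timeit.timeit(lambda: func(rows), number=5) / 5
    timings[func.__name__] = seconds
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```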