July 13, 2023 20:41:37

How to set multiple values in Pandas in column of dtype np.array?

# Question

I have a column of NumPy arrays in Pandas, something like:

```text
   col1 col2    col3
0     1    a    None
1     2    b  [2, 4]
2     3    c    None
```

The `[2, 4]` is really `np.array([2, 4])`. Now I need to impute the missing values, and I have a list of arrays for that. For example:

```python
vals_to_impute = [np.array([1, 2]), np.array([1, 4])]
```

I tried:

```python
mask = df["col3"].isna()
df.loc[mask, "col3"] = vals_to_impute
```

This results in an error:

```text
ValueError: Must have equal len keys and value when setting with an ndarray
```

I tried converting to a NumPy array, extracting the column, and so on, but nothing worked. Is it actually possible to set this in a vectorized operation, or do I have to write a manual loop?
# Answer 1

**Score**: 3

I managed to do it by using a `pd.Series` instead of a list. I also had to give this Series an index so that the values are inserted into the right rows. There may be a simpler way.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "col3": [None, np.array([2, 4]), None]
})
mask = df["col3"].isna()
# Index the replacements by the labels of the missing rows
# so that .loc can align each array with its target row.
vals_to_impute = pd.Series(
    [np.array([1, 2]), np.array([1, 4])],
    index=mask[mask].index
)
df.loc[mask, "col3"] = vals_to_impute
print(df)
```

Output:

```text
   col1 col2    col3
0     1    a  [1, 2]
1     2    b  [2, 4]
2     3    c  [1, 4]
```
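
The approach above can be wrapped in a small helper. This is only a sketch: the function name `impute_arrays` and its length check are assumptions added for illustration and are not part of the original answer. It works on a copy so that the input frame is left untouched.

```python
import numpy as np
import pandas as pd


def impute_arrays(df, column, values):
    """Return a copy of df with the missing cells of `column` filled from `values`.

    Hypothetical helper: values are consumed in row order, one per missing cell.
    """
    mask = df[column].isna()
    missing_labels = df.index[mask]
    if len(values) != len(missing_labels):
        raise ValueError(
            f"expected {len(missing_labels)} replacement values, got {len(values)}"
        )
    # A Series indexed by the missing labels lets .loc align value and row.
    replacements = pd.Series(list(values), index=missing_labels)
    out = df.copy()
    out.loc[mask, column] = replacements
    return out


df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "col3": [None, np.array([2, 4]), None],
})
filled = impute_arrays(df, "col3", [np.array([1, 2]), np.array([1, 4])])
print(filled)
```

The explicit length check is a design choice: it fails early with a readable message instead of relying on whatever error the assignment itself would raise.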

# Answer 2

**Score**: 2

One option is a short loop:

```python
mask = df['col3'].isna()
vals = iter(vals_to_impute)
# Fill each missing cell in row order; falls back to None if the values run out.
for idx in df.index[mask]:
    df.at[idx, 'col3'] = next(vals, None)
```

This can also be solved with an approach similar to the one presented here (https://stackoverflow.com/a/76641469/16343464), taking advantage of the memory shared with the underlying NumPy array:

```python
mask = df['col3'].isna()
arr = df['col3'].to_numpy()
arr[np.where(mask)] = vals_to_impute
```

Modified DataFrame:

```text
   col1 col2    col3
0     1    a  [1, 2]
1     2    b  [2, 4]
2     3    c  [1, 4]
```

Used input:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c'],
                   'col3': [None, np.array([2, 4]), None]})
vals_to_impute = [np.array([1, 2]), np.array([1, 4])]
```
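
The shared-memory variant relies on `to_numpy()` handing back a writable view of the column. As far as I understand, that is not guaranteed, and with pandas Copy-on-Write enabled the returned array is read-only. A more conservative sketch, assuming the same input as above, fills an explicit copy and assigns it back to the column. The names `filled` and `positions` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c'],
                   'col3': [None, np.array([2, 4]), None]})
vals_to_impute = [np.array([1, 2]), np.array([1, 4])]

mask = df['col3'].isna()
# Work on an explicit copy of the object array instead of a shared view.
filled = df['col3'].to_numpy(copy=True)
positions = np.flatnonzero(mask.to_numpy())
# Assign element by element so each array is stored as a single cell value.
for pos, val in zip(positions, vals_to_impute):
    filled[pos] = val
# Write the whole column back in one step.
df['col3'] = filled
print(df)
```

This still loops over the missing positions, but it does not depend on whether pandas exposes the column's memory for in-place modification.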

# Answer 3

**Score**: 1

I'm not sure where your problem actually is, since you show only part of your code.

For me, this works just as expected:

```python
import pandas as pd
import numpy as np

data = [None, None, np.array([2, 2]), None, None]
df = pd.DataFrame(dict(data=data))
mask = df["data"].isna()
# One replacement array per missing row, built from the row label.
maskdata = [np.array([i, i]) for i in mask[mask].index]
df.loc[mask, "data"] = maskdata
# print(df)
# >>>     data
# >>> 0  [0, 0]
# >>> 1  [1, 1]
# >>> 2  [2, 2]
# >>> 3  [3, 3]
# >>> 4  [4, 4]
```
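
One possible reason the same list assignment fails in the question's three-column frame is that a list of equal-length arrays looks two-dimensional to NumPy, while a Series wrapping the same arrays is one-dimensional. This is my own reading, not something stated in the thread or checked against pandas internals, and the behaviour may differ between pandas versions. The sketch below only demonstrates the dimensionality difference itself.

```python
import numpy as np
import pandas as pd

vals = [np.array([1, 2]), np.array([1, 4])]

# NumPy sees the list as a 2 x 2 block of numbers.
list_ndim = np.ndim(vals)

# Wrapped in a Series, the same data is a 1-D sequence of two array objects.
series_ndim = np.ndim(pd.Series(vals))

print(list_ndim, series_ndim)
```

If that reading is right, it would also explain why the answers that wrap the values in a `pd.Series` avoid the error.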

# Answer 4

**Score**: 1

From what I understand, the problem arises because `loc` assigns values based on index matching.

Taking your example:

```python
df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": ["a", "b", "c"],
                   "col3": [None, np.array([2, 4]), None]})
```

If you look at:

```python
mask = df.col3.isna()
mask
```

You actually have:

```text
0     True
1    False
2     True
Name: col3, dtype: bool
```

So, although you are selecting only the rows marked `True`, the mask's index still has a length of 3. Furthermore, your list `vals_to_impute` has no index, so Pandas does not know which value to assign to which row.

A quick fix could be:

```python
# Use the labels of the missing rows as the index of the replacement Series.
mask = df[df.col3.isna()].index
vals_to_impute = pd.Series([np.array([1, 2]), np.array([1, 4])], index=mask)
df.loc[mask, "col3"] = vals_to_impute
```

Notes:

- There might be a more idiomatic Pandas way to do this.
- Storing NumPy arrays (`np.ndarray`) in a DataFrame column is rather uncommon to my knowledge, so be aware that you might run into other problems later.
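
The index matching described above can be seen in isolation with plain floats. In this minimal sketch (the column name `x` and the values are made up for illustration), the replacement Series lists its labels in reverse order, and `.loc` still places each value by label rather than by position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [np.nan, 5.0, np.nan]})
mask = df["x"].isna()

# Replacement values deliberately listed in reverse label order.
replacements = pd.Series([20.0, 10.0], index=[2, 0])

# .loc aligns on the index: label 0 receives 10.0 and label 2 receives 20.0.
df.loc[mask, "x"] = replacements
result = df["x"].tolist()
print(result)
```

This is why the replacement Series in the answers is built with the labels of the missing rows as its index.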
