How to set multiple values in a Pandas column of dtype np.array?

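# Question

I have a column of NumPy arrays in Pandas, something like:

```
   col1 col2    col3
0     1    a    None
1     2    b  [2, 4]
2     3    c    None
```

The `[2, 4]` is really `np.array([2, 4])`. Now I need to impute the missing values, and I have a list of arrays for that. For example:

```python
vals_to_impute = [np.array([1, 2]), np.array([1, 4])]
```

I tried:

```python
mask = col3.isna()
df.loc[mask, "col3"] = vals_to_impute
```

This results in an error:

```
ValueError: Must have equal len keys and value when setting with an ndarray
```

I tried converting to a NumPy array, extracting the column, and so on, but nothing worked. Is it actually possible to set this in a vectorized operation, or do I have to write a manual loop?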




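# Answer 1
**Score**: 3

I managed to do it using a `pd.Series` instead of a list. I also had to give this Series an index so that the values are inserted into the right rows. There may be a simpler way.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "col3": [None, np.array([2, 4]), None]
})

mask = df["col3"].isna()
# Index the replacement values by the labels of the missing rows
vals_to_impute = pd.Series(
    [np.array([1, 2]), np.array([1, 4])],
    index=mask[mask].index
)

df.loc[mask, "col3"] = vals_to_impute

print(df)
```

Output:

```
   col1 col2    col3
0     1    a  [1, 2]
1     2    b  [2, 4]
2     3    c  [1, 4]
```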
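
The Series-based approach in this answer could be wrapped in a small helper, sketched below. The function name `fill_missing_arrays` and the length check are assumptions added for illustration and are not part of the original answer; the sketch also assumes the DataFrame index has no duplicate labels.

```python
import numpy as np
import pandas as pd


def fill_missing_arrays(df, column, values):
    """Fill the missing cells of `column` with `values`, in row order (hypothetical helper)."""
    mask = df[column].isna()
    target_index = df.index[mask]
    if len(values) != len(target_index):
        raise ValueError(
            f"expected {len(target_index)} replacement values, got {len(values)}"
        )
    # Give the Series the labels of the missing rows so .loc aligns label by label.
    replacement = pd.Series(list(values), index=target_index, dtype=object)
    df.loc[mask, column] = replacement
    return df


df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "col3": [None, np.array([2, 4]), None],
})
fill_missing_arrays(df, "col3", [np.array([1, 2]), np.array([1, 4])])
print(df)
```

Checking the lengths up front turns a silent misalignment into an explicit error, which seems preferable when the replacement list is produced elsewhere.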

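# Answer 2
**Score**: 2

One option using a short loop:

```python
mask = df['col3'].isna()
vals = iter(vals_to_impute)

# Fill each missing row in order; falls back to None if the values run out
for idx in df.index[mask]:
    df.at[idx, 'col3'] = next(vals, None)
```

This can also be solved using an approach similar to the one presented here (https://stackoverflow.com/a/76641469/16343464), taking advantage of the shared memory when using the underlying NumPy array:

```python
mask = df['col3'].isna()
arr = df['col3'].to_numpy()
arr[np.where(mask)] = vals_to_impute
```

Modified DataFrame:

```
   col1 col2    col3
0     1    a  [1, 2]
1     2    b  [2, 4]
2     3    c  [1, 4]
```

Used input:

```python
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c'],
                   'col3': [None, np.array([2, 4]), None]})

vals_to_impute = [np.array([1, 2]), np.array([1, 4])]
```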
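
The shared-memory variant depends on `to_numpy()` returning a writable view of the column, and the pandas documentation notes that `copy=False` does not guarantee a no-copy result. The following is a minimal sketch of a variant that does not rely on shared memory: it works on an explicit copy and assigns the column back. The length check and the element-wise fill are assumptions added here, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c'],
                   'col3': [None, np.array([2, 4]), None]})
vals_to_impute = [np.array([1, 2]), np.array([1, 4])]

mask = df['col3'].isna().to_numpy()
# Explicit copy: writing to it never depends on whether pandas shares memory.
arr = df['col3'].to_numpy(dtype=object, copy=True)

positions = np.flatnonzero(mask)
if len(positions) != len(vals_to_impute):
    raise ValueError("number of replacement arrays must match number of missing rows")

# Integer indexing stores each replacement array as a single object element.
for pos, val in zip(positions, vals_to_impute):
    arr[pos] = val

# Write the filled object array back as the column.
df['col3'] = arr
print(df)
```

This still loops over the missing positions, so it trades the vectorized assignment for behaviour that does not depend on how pandas exposes the underlying buffer.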

# Answer 3
**Score**: 1

I'm not sure where your problem actually is, since you show only part of your code.

For me, this works just as expected:

```python
import pandas as pd
import numpy as np

data = [None, None, np.array([2, 2]), None, None]
df = pd.DataFrame(dict(data=data))

mask = df["data"].isna()
# One replacement array per missing row, built from the row's index label
maskdata = [np.array([i, i]) for i in mask[mask].index]

df.loc[mask, "data"] = maskdata

# print(df)
# >>>     data
# >>> 0  [0, 0]
# >>> 1  [1, 1]
# >>> 2  [2, 2]
# >>> 3  [3, 3]
# >>> 4  [4, 4]
```
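
Since this answer uses a single-column frame while the question uses a three-column one, one way to compare the two is to run the same plain-list assignment on both and record what happens. The sketch below does that; the helper name `try_list_assignment` is hypothetical, and no particular outcome is claimed here because the result may depend on the installed pandas and NumPy versions.

```python
import numpy as np
import pandas as pd


def try_list_assignment(df, column, values):
    """Assign a plain list to the missing rows of `column` and report the outcome."""
    out = df.copy()  # leave the caller's frame untouched
    mask = out[column].isna()
    try:
        out.loc[mask, column] = values
    except (ValueError, TypeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return "ok"


# Single-column frame from this answer.
single = pd.DataFrame({"data": [None, None, np.array([2, 2]), None, None]})
single_vals = [np.array([i, i]) for i in single.index[single["data"].isna()]]

# Three-column frame from the question.
multi = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "col3": [None, np.array([2, 4]), None],
})
multi_vals = [np.array([1, 2]), np.array([1, 4])]

single_result = try_list_assignment(single, "data", single_vals)
multi_result = try_list_assignment(multi, "col3", multi_vals)
print("single-column frame:", single_result)
print("three-column frame: ", multi_result)
```

If the two lines differ, the difference lies in the shape of the DataFrame rather than in the assignment statement itself.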


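# Answer 4
**Score**: 1

From what I understand, the problem arises because `loc` assigns values based on index matching.

Taking your example:

```python
df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": ["a", "b", "c"],
                   "col3": [None, np.array([2, 4]), None]})
```

If you look at:

```python
mask = df.col3.isna()
mask
```

you actually get:

```
0     True
1    False
2     True
Name: col3, dtype: bool
```

So, although you are selecting only the rows marked `True`, the mask still has a length of 3. Furthermore, your list `vals_to_impute` has no index, so Pandas does not know which value goes to which row.

A quick fix could be:

```python
# Index labels of the rows where col3 is missing
mask = df[df.col3.isna()].index
vals_to_impute = pd.Series([np.array([1, 2]), np.array([1, 4])], index=mask)
df.loc[mask, "col3"] = vals_to_impute
```

Note:

  • There might be a more idiomatic Pandas way to do this.
  • Storing NumPy arrays (`np.ndarray`) in a DataFrame column is rather uncommon, to my knowledge; be aware that you might run into other problems later.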
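
To illustrate the index matching described in this answer, here is a small sketch of what happens when the replacement Series keeps its default index instead of the labels of the missing rows. The variable names `unaligned`, `row0`, and `row2` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": ["a", "b", "c"],
                   "col3": [None, np.array([2, 4]), None]})
mask = df["col3"].isna()

# Default index: labels 0 and 1, but the missing rows are labelled 0 and 2.
unaligned = pd.Series([np.array([1, 2]), np.array([1, 4])])
df.loc[mask, "col3"] = unaligned

row0 = df.at[0, "col3"]  # label 0 exists in `unaligned`, so this row is filled
row2 = df.at[2, "col3"]  # label 2 does not exist there, so this row is still missing
print(row0)
print(row2)
```

The second array is dropped because its label `1` does not match any selected row, which is why the quick fix builds the Series with the labels of the missing rows.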

huangapple
  • Posted on July 13, 2023, 20:41:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76679508.html