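How to set multiple values in Pandas in column of dtype np.array?
Question
I have a column of NumPy arrays in Pandas, something like:
```python
   col1 col2    col3
0     1    a    None
1     2    b  [2, 4]
2     3    c    None
```
The `[2, 4]` is really `np.array([2, 4])`. Now I need to impute the missing values, and I have a list of arrays for that. For example:
```python
vals_to_impute = [np.array([1, 2]), np.array([1, 4])]
```
I tried:
```python
mask = df["col3"].isna()
df.loc[mask, "col3"] = vals_to_impute
```
This results in an error:
```
ValueError: Must have equal len keys and value when setting with an ndarray
```
I tried converting to a NumPy array, extracting the column, etc., but nothing worked. Is it actually possible to set this in a vectorized operation, or do I have to do a manual loop?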
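A possible reading of the error message, offered as an assumption rather than something confirmed in this thread: a list of equal-length arrays looks like a single 2-D block to NumPy, so the two replacement arrays may be interpreted as a 2x2 table instead of two separate cell values. The sketch below only shows the NumPy side of that; the names `as_object` and `container` are illustrative.

```python
import numpy as np

vals_to_impute = [np.array([1, 2]), np.array([1, 4])]

# NumPy sees a list of equal-length arrays as one 2-D block.
print(np.ndim(vals_to_impute))            # 2
as_object = np.array(vals_to_impute, dtype=object)
print(as_object.shape)                    # (2, 2)

# Filling an object array element by element keeps each array as one item.
container = np.empty(len(vals_to_impute), dtype=object)
for i, value in enumerate(vals_to_impute):
    container[i] = value
print(container.shape)                    # (2,)
```

The second form, a 1-D container holding one array per slot, is the shape a column of arrays needs.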
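# Answer 1
**Score**: 3
I managed to do it using `pd.Series` instead of a list. I also had to give this Series an index so that the values are inserted into the right rows. There may be a simpler way.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "col3": [None, np.array([2, 4]), None]
})
# Boolean mask of the rows whose col3 is missing
mask = df["col3"].isna()
# Give the replacement values the labels of the missing rows
vals_to_impute = pd.Series(
    [np.array([1, 2]), np.array([1, 4])],
    index=mask[mask].index
)
df.loc[mask, "col3"] = vals_to_impute
print(df)
```
Output:
```
   col1 col2    col3
0     1    a  [1, 2]
1     2    b  [2, 4]
2     3    c  [1, 4]
```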
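The same idea can be wrapped in a small function. This is a minimal sketch, assuming the replacement values are given in the same order as the missing rows; `fill_missing_with_arrays` is a hypothetical helper name, not a pandas API, and the length check is an addition that is not part of the answer above.

```python
import numpy as np
import pandas as pd


def fill_missing_with_arrays(df, column, values):
    """Fill missing cells of `column` with `values`, in row order (hypothetical helper)."""
    mask = df[column].isna()
    missing_labels = df.index[mask]
    if len(values) != len(missing_labels):
        raise ValueError(
            f"expected {len(missing_labels)} values, got {len(values)}"
        )
    # Labelling the values with the missing rows lets .loc align them by index.
    replacements = pd.Series(list(values), index=missing_labels, dtype=object)
    df.loc[mask, column] = replacements
    return df


df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": ["a", "b", "c"],
    "col3": [None, np.array([2, 4]), None],
})
fill_missing_with_arrays(df, "col3", [np.array([1, 2]), np.array([1, 4])])
print(df)
```

Raising on a length mismatch is a design choice: it surfaces a wrong number of replacement arrays immediately instead of leaving some rows unfilled.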
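# Answer 2
**Score**: 2
One option is a short loop:
```python
mask = df['col3'].isna()
vals = iter(vals_to_impute)
for idx in df.index[mask]:
    # .at sets a single cell, so each array is stored as one object
    df.at[idx, 'col3'] = next(vals, None)
```
This can also be solved with an approach similar to the one presented here (https://stackoverflow.com/a/76641469/16343464), taking advantage of the memory shared with the underlying NumPy array. Note that this relies on `to_numpy()` returning a writable view of the column, which is not guaranteed (for example under Copy-on-Write), and NumPy can treat a list of equal-length arrays as a single 2-D block when assigning:
```python
mask = df['col3'].isna()
arr = df['col3'].to_numpy()
arr[np.where(mask)] = vals_to_impute
```
Modified DataFrame:
```
   col1 col2    col3
0     1    a  [1, 2]
1     2    b  [2, 4]
2     3    c  [1, 4]
```
Used input:
```python
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c'],
                   'col3': [None, np.array([2, 4]), None]})
vals_to_impute = [np.array([1, 2]), np.array([1, 4])]
```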
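Whether `to_numpy()` hands back a writable view of the column depends on the pandas version and settings, so it may be safer not to rely on shared memory. The following is a sketch of a variant that works on an explicit copy and writes the column back; it is an alternative under that assumption, not the method from the linked answer, and it still loops over the missing positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c'],
                   'col3': [None, np.array([2, 4]), None]})
vals_to_impute = [np.array([1, 2]), np.array([1, 4])]

mask = df['col3'].isna().to_numpy()
# Explicit object-dtype copy, so nothing depends on shared memory.
arr = df['col3'].to_numpy(dtype=object, copy=True)
# Integer positions of the missing rows, paired with the replacements.
for pos, value in zip(np.flatnonzero(mask), vals_to_impute):
    arr[pos] = value  # scalar index: the array is stored as a single object
# Write the filled copy back as the new column.
df['col3'] = pd.Series(arr, index=df.index)
print(df)
```

Assigning one position at a time avoids NumPy trying to broadcast the replacement arrays as a 2-D block.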
# Answer 3
**Score**: 1
I'm not sure where your problem actually is, since you show only part of your code.
For me, this works just as expected:
```python
import pandas as pd
import numpy as np

data = [None, None, np.array([2, 2]), None, None]
df = pd.DataFrame(dict(data=data))
mask = df["data"].isna()
# One replacement array per missing row, built from the row label
maskdata = [np.array([i, i]) for i in mask[mask].index]
df.loc[mask, "data"] = maskdata
# print(df)
# >>>      data
# >>> 0  [0, 0]
# >>> 1  [1, 1]
# >>> 2  [2, 2]
# >>> 3  [3, 3]
# >>> 4  [4, 4]
```
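# Answer 4
**Score**: 1
From what I understand, the problem arises because `loc` assigns values based on index matching.
Taking your example:
```python
df = pd.DataFrame({"col1": [1, 2, 3],
                   "col2": ["a", "b", "c"],
                   "col3": [None, np.array([2, 4]), None]})
```
If you look at:
```python
mask = df.col3.isna()
mask
```
You actually have:
```
0     True
1    False
2     True
Name: col3, dtype: bool
```
So, although you are selecting only the rows with `True`, the index still has a length of 3. Furthermore, your list `vals_to_impute` is not indexed, so Pandas does not know which row each value belongs to.
A quick fix could be:
```python
mask = df[df.col3.isna()].index
vals_to_impute = pd.Series([np.array([1, 2]), np.array([1, 4])], index=mask)
df.loc[mask, "col3"] = vals_to_impute
```
Note:
- There might be a more proper Pandas way to do this.
- Using ndarrays in a DataFrame column is rather uncommon to my knowledge, so be aware that you might run into other problems later.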
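To illustrate the index matching described above, here is a small sketch that uses `reindex` as a stand-in for the label alignment that happens when a Series is assigned through `.loc` (an approximation for illustration; the variable names are made up). A Series with the default index carries labels 0 and 1, so the row labelled 2 finds no matching value.

```python
import pandas as pd

mask = pd.Series([True, False, True])
missing_labels = mask[mask].index          # labels 0 and 2

# Default index: labels 0 and 1, which do not match the missing rows.
default_index = pd.Series(["first", "second"])
# Explicit index: labels 0 and 2, one value per missing row.
labelled = pd.Series(["first", "second"], index=missing_labels)

aligned_default = default_index.reindex(missing_labels)
aligned_labelled = labelled.reindex(missing_labels)

print(aligned_default.isna().tolist())   # [False, True]
print(aligned_labelled.tolist())         # ['first', 'second']
```

This is why the quick fix builds the Series with `index=mask`: each replacement then carries the label of the row it belongs to.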