"np.nan" isn't converted properly but "None" is
# Question
In the following code I generate some data containing the value `np.nan`:

```python
import pandas as pd
import numpy as np

n = 20
df = pd.DataFrame({"x": np.random.choice(["dog", "cat", np.nan], n), "y": range(0, n)})
```

I then check for missing values with `pd.notnull`, which does not report any:

```python
pd.notnull(df["x"])
```

The reason is that the `np.nan` used in the creation was somehow converted into the string `"nan"`. But why? If I substitute `None` for `np.nan`, i.e. create the data via `np.random.choice(["dog", "cat", None], n)`, everything works.

Can someone explain why `np.nan` isn't converted properly? And in general: how do I create random missing data for a string column without using `np.nan` or the `None` object?
# Answer 1
**Score**: 2
`np.random.choice` creates a NumPy array, which can hold only one type of data. Given a mix of strings and a float, NumPy coerces every element to a string, so `np.nan` becomes `"nan"`. With `None` in the list, NumPy falls back to an object array instead, which is why that variant works.

You can try to set the data type manually with `dtype=float` (`nan` is a float), but that does not work with the string values:

```python
options = np.array(["dog", "cat", np.nan], dtype=float)  # ValueError: could not convert string to float: 'dog'
df = pd.DataFrame({"x": np.random.choice(options, n), "y": range(0, n)})
```

Edit: you can set `dtype` to `object`, and then the code works:

```python
import pandas as pd
import numpy as np

n = 20
# dtype=object lets each element keep its own type, so np.nan stays a float
options = np.array(["dog", "cat", np.nan], dtype=object)
print(options)
df = pd.DataFrame({"x": np.random.choice(options, n), "y": range(0, n)})
print(df)
```
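To see the coercion directly, here is a minimal sketch that compares the default array with the `dtype=object` one. The names `as_str` and `as_obj` are made up for illustration and do not appear in the answer above.

```python
import numpy as np
import pandas as pd

# Without an explicit dtype, NumPy picks one common type for all elements,
# so the float nan ends up stored as text.
as_str = np.array(["dog", "cat", np.nan])
print(as_str.dtype.kind)       # "U" denotes a unicode string dtype
print(as_str[2] == "nan")      # the third element compares equal to the string "nan"

# With dtype=object each element keeps its own Python type.
as_obj = np.array(["dog", "cat", np.nan], dtype=object)
print(type(as_obj[2]))         # the third element is still a float

# Compare what pandas reports as missing in each case.
print(pd.isnull(pd.Series(as_str)).tolist())
print(pd.isnull(pd.Series(as_obj)).tolist())
```

Only the object array should show a missing value in the last position, which matches the behaviour described in the question.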
# Answer 2
**Score**: 1
As for creating random missing data for a string column, you can use `.mask()`:

```python
import pandas as pd
import numpy as np

n = 20
df = pd.DataFrame({"x": np.random.choice(["dog", "cat"], n), "y": range(0, n)})
mask = pd.Series(np.random.rand(n) < 0.33)  # change to any fraction of missing values
df['x'] = df['x'].mask(mask)  # entries where mask is True become missing
```
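A quick way to confirm that the masked entries are really treated as missing is sketched below. It assumes a seeded `np.random.default_rng` generator (the seed `0` and the name `missing` are arbitrary choices for illustration) so that the result is reproducible.

```python
import numpy as np
import pandas as pd

n = 20
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

df = pd.DataFrame({"x": rng.choice(["dog", "cat"], n), "y": range(n)})
missing = rng.random(n) < 0.33       # True where the value should be removed
df["x"] = df["x"].mask(missing)

# The positions pandas reports as missing should be exactly the masked ones.
print(df["x"].isna().sum(), "of", n, "values are missing")
print((df["x"].isna().to_numpy() == missing).all())
```

Because the column is built from strings only and the gaps are introduced afterwards, nothing is ever coerced to the text `"nan"`.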