"np.nan" isn't converted properly but "None" is

# Question

In the following code I generate some data containing the value `np.nan`:

```python
import pandas as pd
import numpy as np

n = 20
df = pd.DataFrame({"x": np.random.choice(["dog", "cat", np.nan], n), "y": range(0, n)})
```

I then check for missing values with `pd.notnull`, which does not report any:

```python
pd.notnull(df["x"])
```

OK, the reason is that the `np.nan` used during creation somehow got converted into the string `"nan"`. But why? For instance, if I substitute `None` for `np.nan`, i.e. if I create the data via `np.random.choice(["dog", "cat", None], n)`, then everything works.

Can someone explain why `np.nan` isn't converted properly? And in general: how do I create random missing data for a string column without using `np.nan` or the `None` object?

# Answer 1

**Score**: 2

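`np.random.choice` first converts its input to a NumPy array, and a NumPy array holds a single dtype. For a list that mixes strings with the float `np.nan`, NumPy picks a string dtype, so `np.nan` is stored as the text `"nan"`; with `None` it falls back to `object` dtype and `None` survives unchanged. You can try to set the dtype manually with `dtype=float` (`nan` is a float), but that does not work with the string values:

```python
options = np.array(["dog", "cat", np.nan], dtype=float)  # ValueError: could not convert string to float: 'dog'
df = pd.DataFrame({"x": np.random.choice(options, n), "y": range(0, n)})
```

Edit: you can set `dtype` to `object`, and then the code works:

```python
import pandas as pd
import numpy as np

n = 20
# dtype=object keeps np.nan as a float instead of coercing it to the string "nan"
options = np.array(["dog", "cat", np.nan], dtype=object)
print(options)
df = pd.DataFrame({"x": np.random.choice(options, n), "y": range(0, n)})
print(df)
```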
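
The dtype coercion behind this can be checked directly. The following is a minimal sketch rather than part of the original answer; the variable names `mixed_nan` and `mixed_none` are illustrative only.

```python
import numpy as np
import pandas as pd

# Strings mixed with a float: NumPy settles on a fixed-width Unicode string dtype,
# so np.nan is stored as the text "nan".
mixed_nan = np.array(["dog", "cat", np.nan])
print(mixed_nan.dtype.kind)  # "U" means Unicode string
print(repr(mixed_nan[2]))

# Strings mixed with None: NumPy falls back to object dtype and keeps None as-is.
mixed_none = np.array(["dog", "cat", None])
print(mixed_none.dtype)
print(repr(mixed_none[2]))

# pandas only treats the second array as containing a missing value.
print(pd.isnull(pd.Series(mixed_nan)).any())
print(pd.isnull(pd.Series(mixed_none)).any())
```

This is why `pd.notnull` reports no missing values in the question's DataFrame: by the time pandas sees the data, the `np.nan` is already an ordinary string.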




# Answer 2
**Score**: 1

As for creating random missing data for a string column, you can use `.mask()`:

```python
import numpy as np
import pandas as pd

n = 20
df = pd.DataFrame({"x": np.random.choice(["dog", "cat"], n), "y": range(0, n)})
# True marks a value to blank out; change 0.33 to any expected fraction of missing values
mask = pd.Series(np.random.rand(n) < 0.33)
df["x"] = df["x"].mask(mask)
```
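
If an exact number of missing values is wanted rather than an expected fraction, one possible variant is to draw row positions without replacement and mask those. This is only a sketch: the seed, `k`, and `missing_idx` are illustrative names that do not appear in the original answer, and it assumes the DataFrame has the default `RangeIndex`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 20
k = 5  # hypothetical: how many entries should be missing
df = pd.DataFrame({"x": rng.choice(["dog", "cat"], n), "y": range(n)})

# Draw k distinct row positions; with the default RangeIndex these equal the index labels.
missing_idx = rng.choice(n, size=k, replace=False)

# mask() replaces values where the condition is True with a missing value.
df["x"] = df["x"].mask(df.index.isin(missing_idx))

print(df["x"].isna().sum())  # number of masked entries
```

As in the `.mask()` answer, the strings are generated first and the missing values are introduced afterwards, so NumPy never has to coerce `np.nan` into a string array.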

huangapple
  • Posted on June 29, 2023 at 21:27:23
  • Please keep the link to this article when reposting: https://go.coder-hub.com/76581516.html