Convert strings in a NumPy ndarray to NaN (or, even better, replace them with interpolated values)

Question


I have an Excel sheet that contains a matrix. Unfortunately, some random cells contain arbitrary strings. The first row and the first column are the axes (independent variables). An example:

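  |     | 10 | 20 | ... | 90 | 100 |
  | -   | -- | -- | --- | -- | --- |
  | 1   | 3  | 9  | ... | blob | 27 |
  | 3   | -1 | 10 | ... | blib | 12 |
  | ... | ... | ... | ... | ... | ... |
  | 15  | 0  | blub | ... | bleb | 17 |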

I read this in with pandas and convert it to a NumPy ndarray in order to work with it.

tmp = pd.read_excel('path')
y = tmp.iloc[1:,0].values
x = tmp.iloc[0,1:].values
mat = tmp.iloc[1:,1:].values

My first hope was that NumPy would automatically convert the strings to some numerical representation. But I had to learn that an ndarray can hold mixed types.

Sorry, I am not even able to present a minimal example of what mat looks like.

import numpy as np
mat = np.ndarray(((3,9,'blob',27),(-1,10,'blib',12),(0,'blub','bleb',17))) # raises an error
mat = np.asarray(((3,9,'blob',27),(-1,10,'blib',12),(0,'blub','bleb',17))) # fills the ndarray with strings only (even the numbers become '3', '9' etc.)

The Excel import above yields numbers in some cells but strings in others.
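Editor's sketch, not part of the original post: a mixed-type array like the one described can probably be reproduced by passing dtype=object, which keeps each cell as its original Python object. The values below are taken from the example table; the names is_str and clean are hypothetical. Note that np.vectorize still visits every cell in a Python-level loop, so this only illustrates the structure of mat and is not the efficient solution asked for.

```python
import numpy as np

# dtype=object stops NumPy from coercing every cell to a common string dtype.
mat = np.array(
    [[3, 9, "blob", 27], [-1, 10, "blib", 12], [0, "blub", "bleb", 17]],
    dtype=object,
)

# Element-wise type check (hypothetical helper; loops over all cells internally).
is_str = np.vectorize(lambda v: isinstance(v, str))(mat)

# Replace the string cells with NaN and convert the result to a float array.
clean = np.where(is_str, np.nan, mat).astype(float)
```

After this, clean is a regular float64 array in which the former string cells are NaN.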

Is there an efficient way to replace all those strings in mat with np.nan? I mean, without scanning through all cells and replacing based on the type? I tried to do this via the dtype argument of pd.read_excel(), but without much success.

What I ultimately want to achieve is to replace those original strings with linearly interpolated values. So maybe there is an even more efficient way to interpolate directly instead of first converting the strings to numbers. As you can see, the strings can also fill a complete column, so 2D interpolation is required.

Thank you for your ideas!

Answer 1

Score: 1


You can use pandas' replace on the whole DataFrame, combined with a regular expression, to turn the strings into NaN, and then use pandas' interpolate to fill in the missing values.

Here is a naive example based on your post, where you seem to deal only with signed integers and simple strings (otherwise, a different regex pattern might be necessary).

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, "a", 2], "col2": [-9, 7, "b"]})

print(df)
# Output
#   col1 col2
# 0    1   -9
# 1    a    7
# 2    2    b

df = df.replace(regex=r"\w+", value=np.nan).interpolate(method="linear")

print(df)
# Output
#    col1  col2
# 0   1.0  -9.0
# 1   1.5   7.0
# 2   2.0   7.0
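
Editor's sketch, not part of the original answer: the example above interpolates down each column only, so a column that consists entirely of strings would stay NaN. One possible variant, assuming the axes are used as the DataFrame's index and columns, is to coerce with pd.to_numeric(errors="coerce") and then interpolate along the rows first and along the columns second. The frame below is a hypothetical stand-in built from the values in the question. This is two 1D passes, not a true 2D (bilinear) interpolation.

```python
import pandas as pd

# Hypothetical stand-in for the imported sheet: index = y axis, columns = x axis.
df = pd.DataFrame(
    [[3, 9, "blob", 27], [-1, 10, "blib", 12], [0, "blub", "bleb", 17]],
    index=[1, 3, 15],
    columns=[10, 20, 90, 100],
)

# Anything that cannot be parsed as a number becomes NaN; force float64 throughout.
numeric = df.apply(pd.to_numeric, errors="coerce").astype(float)

# Pass 1: interpolate along each row. Transposing puts the x labels into the
# index, and method="index" uses those labels as the x coordinates.
rows_filled = numeric.T.interpolate(method="index").T

# Pass 2: interpolate down the columns (using the y labels) for anything left.
filled = rows_filled.interpolate(method="index")
```

With method="index" the spacing of the axis values (10, 20, 90, 100) is taken into account, whereas method="linear" treats the cells as equally spaced. Gaps at the leading edge of a row or column are not filled by default.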

huangapple
  • Posted on May 25, 2023 at 23:08:37
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76333811.html