Convert strings in numpy ndarray to nan (or even better replace by interpolated values)
Question
I have an Excel sheet that contains a matrix. Unfortunately, some random cells contain arbitrary strings. The first row and the first column are the axes (independent variables). An example:
| | 10 | 20 | ... | 90 | 100 |
|---|---|---|---|---|---|
| 1 | 3 | 9 | ... | blob | 27 |
| 3 | -1 | 10 | ... | blib | 12 |
| ... | ... | ... | ... | ... | ... |
| 15 | 0 | blub | ... | bleb | 17 |
I read this in with pandas and convert it to a numpy ndarray in order to work with it.
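import pandas as pd
tmp = pd.read_excel('path')
y = tmp.iloc[1:, 0].values     # first column: y axis
x = tmp.iloc[0, 1:].values     # first row: x axis
mat = tmp.iloc[1:, 1:].values  # remaining cells: the matrix itself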
My first hope was that numpy would convert the strings automatically to some numerical representation, but I had to learn that an ndarray can hold mixed types.
Sorry, I am not even able to present a minimal example of what mat looks like:
import numpy as np
mat = np.ndarray(((3,9,'blob',27),(-1,10,'blib',12),(0,'blub','bleb',17))) # raises an error
mat = np.asarray(((3,9,'blob',27),(-1,10,'blib',12),(0,'blub','bleb',17))) # fills the ndarray with strings only (even the numbers become '3', '9' etc.)
The Excel import above yields numbers in some cells but strings in others.
Is there an efficient way to replace all those strings in mat with np.nan? I mean, without scanning through all cells and replacing them based on their type? I tried to do this with the dtype argument of pd.read_excel(), but without much success.
What I finally want to achieve is to replace those original strings with linearly interpolated values. So maybe there is an even more efficient way to interpolate directly instead of first converting the strings to numbers. As you can see, the strings can also fill a complete column, so 2D interpolation is required.
Thank you for your ideas!
Answer 1
Score: 1
You can use pandas replace on a whole dataframe in combination with a regular expression to get rid of the strings, and then use pandas interpolate to fill in the missing values.
Here is a naive example based on your post, where you seem to deal only with signed integers and simple strings (otherwise, a different regex pattern might be necessary).
import numpy as np
import pandas as pd
df = pd.DataFrame({"col1": [1, "a", 2], "col2": [-9, 7, "b"]})
print(df)
# Output
#   col1 col2
# 0    1   -9
# 1    a    7
# 2    2    b
df = df.replace(regex=r"\w+", value=np.nan).interpolate(method="linear")
print(df)
# Output
#    col1  col2
# 0   1.0  -9.0
# 1   1.5   7.0
# 2   2.0   7.0
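Note that interpolate works down each column by default, so a column that consists entirely of strings would stay NaN. As a rough sketch of one way around that (not part of the original answer), the strings can be coerced to NaN with pd.to_numeric and errors="coerce", and the gaps can then be filled along both axes with np.interp, using the axis values as coordinates. The names raw, num, interp_1d and filled are made up for illustration, and the small frame is a hypothetical stand-in for the imported sheet. Averaging the row-wise and column-wise estimates is an assumption of this sketch, not true bilinear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imported sheet:
# index = y axis, columns = x axis, strings scattered in the data.
raw = pd.DataFrame(
    [[3, 9, "blob", 27],
     [-1, 10, "blib", 12],
     [0, "blub", "bleb", 17]],
    index=[1, 3, 15],
    columns=[10, 20, 90, 100],
)

# Every cell that cannot be parsed as a number becomes NaN.
num = raw.apply(pd.to_numeric, errors="coerce")


def interp_1d(values, coords):
    """Fill NaNs in a 1-D array by linear interpolation over coords."""
    values = np.asarray(values, dtype=float)
    known = ~np.isnan(values)
    if not known.any():
        return values  # nothing to interpolate from, keep the NaNs
    # np.interp holds the edge values constant outside the known range.
    return np.interp(coords, coords[known], values[known])


xc = num.columns.to_numpy(dtype=float)  # x coordinates (first row of the sheet)
yc = num.index.to_numpy(dtype=float)    # y coordinates (first column of the sheet)
arr = num.to_numpy(dtype=float)

by_x = np.array([interp_1d(row, xc) for row in arr])      # along each row
by_y = np.array([interp_1d(col, yc) for col in arr.T]).T  # along each column

# Average both estimates where both exist, otherwise take whichever is available.
filled = np.where(
    np.isnan(by_x), by_y,
    np.where(np.isnan(by_y), by_x, (by_x + by_y) / 2),
)
print(filled)
```

Cells that already held numbers are left unchanged, since both estimates reproduce the known value there. If the data really calls for scattered 2D interpolation, scipy.interpolate.griddata with method="linear" may be worth a look instead.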